© 2006 American Thoracic Society doi: 10.1164/rccm.200602-197ST
An Official ATS Statement: Grading the Quality of Evidence and Strength of Recommendations in ATS Guidelines and Recommendations
This official statement of the American Thoracic Society (ATS) was adopted by the ATS Board of Directors, December 2005.
Grading the strength of recommendations and the quality of underlying evidence enhances the usefulness of clinical practice guidelines. Professional societies and other organizations, including the American Thoracic Society (ATS), should reach consensus about whether they will use one common grading system and which of the numerous grading systems they would apply across all guidelines. The profusion of guideline grading systems confuses consumers of guidelines, and undermines the value of the grading exercise in conveying a transparent message. In response to this dilemma, the international GRADE working group has developed an approach that is useful for many guideline contexts, and that several national and international organizations have adopted. The GRADE system classifies recommendations as strong or weak, according to the balance of the benefits and downsides (harms, burden, and cost) after considering the quality of evidence. The quality of evidence reflects the confidence in estimates of the true effects of an intervention, and the system classifies quality of evidence as high, moderate, low, or very low according to factors that include the study methodology, the consistency and precision of the results, and the directness of the evidence. On recommendation of the ATS Documents Development and Implementation Committee, the ATS adopted the GRADE approach for its guidelines in line with many other organizations that have recently chosen the GRADE approach. This article informs ATS guideline developers, investigators, and those interpreting future ATS guidelines that follow the GRADE approach about the methodology and applicability of ATS guidelines and recommendations.
Clinical practice guidelines (CPGs) offer recommendations for the management of typical patients. These management decisions involve balancing the expected benefits and downsides (harms, burden, and costs). To make evidence-based medical decisions, clinicians also need to integrate recommendations with their own clinical judgment, and with individual patient circumstances, values, and preferences (1). A systematic approach to grading the strength of management recommendations can minimize bias and aid interpretation (2, 3). Most guideline developers, including the American Thoracic Society (ATS), recognize the need for grading, and journals are increasingly demanding such systems for publication of guidelines and recommendations. The ATS Documents Development and Implementation Committee was charged with developing, adapting or identifying, and adopting a grading system that will guide ATS panels in the development of recommendations and help clinicians interpret the recommended actions (4–6). The Grades of Recommendation, Assessment, Development, and Evaluation (GRADE) working group has conducted a review of existing grading systems and developed a system for grading the quality of evidence and strength of recommendations of CPGs that addresses disadvantages of prior systems (2, 7, 8). These disadvantages include the lack of separation between quality of evidence and strength of recommendation, the lack of transparency about judgments, and the lack of explicit acknowledgment of values and preferences (2, 7, 9). One aim of the independent GRADE group is to reduce the confusion among guideline panels and users that results from the existence of many, often scientifically outdated, grading systems.
Following the comprehensive assessment, development, and dissemination of the work of the GRADE group, several organizations and guideline developers, including the World Health Organization, the American College of Chest Physicians (ACCP), the American Endocrine Society, and UpToDate, have adopted the GRADE system in its original format or with relatively minor modifications. The GRADE system is based on a sequential assessment of the quality of evidence, followed by assessment of the balance between benefits and downsides and subsequent judgment about the strength of recommendations. Because frontline consumers of recommendations will be most interested in the best course of action, the GRADE system places the strength of the recommendation first, followed by the quality of the evidence. Separating the judgments regarding the quality of evidence from judgments about the strength of recommendations is a critical and defining feature of this new grading system. The newly formed standing ATS Documents Development and Implementation Committee agreed to adopt the GRADE approach developed by the GRADE working group based on these issues of methodology, practicality, and applicability. The ATS leadership has selected several members of the GRADE working group who are involved in disseminating the approach and who have collaborated with other organizations, including the ACCP, to serve on this committee (4–6, 9, 10). The first project of this committee is described in this document, which informs ATS guideline developers, investigators, and those interpreting future ATS guidelines that follow the GRADE approach in greater detail than prior documents (9). Specifically, this document describes the GRADE approach and factors that influence the process of grading based on several examples. This document does not describe the way consensus is reached by a guideline panel during a guideline development process.
Guideline developers make recommendations to administer, or not administer, an intervention on the basis of tradeoffs between benefits on the one hand, and downsides (harms, burden, and cost) on the other. If benefits outweigh downsides, guideline panels will recommend that clinicians offer a treatment to appropriately chosen patients. Conversely, if downsides outweigh benefits, the guidelines will recommend against the implementation of such a treatment. The strength of a recommendation reflects the degree of confidence that the desirable effects of adherence to a recommendation outweigh the undesirable effects. Desirable effects can include beneficial health outcomes, less burden, and savings. Undesirable effects can include harms, more burden, and costs. Burdens are the demands of adhering to a recommendation that patients or caregivers (e.g., family) may dislike, such as having to take medication or the inconvenience of going to the doctor's office. Although the degree of confidence is a continuum, the GRADE approach classifies recommendations for or against treatments into two grades, strong and weak. If guideline developers are confident that the desirable effects of adherence to a recommendation outweigh the undesirable effects, they will make a strong recommendation within the context of a described intervention. This confidence arises in several ways. High-quality evidence should provide precise estimates of both benefits and downsides, and the balance should be clear (recommendations to quit smoking to prevent adverse consequences of tobacco smoke exposure or recommendation for bronchodilators in patients with known chronic obstructive pulmonary disease [COPD]). A weak recommendation is one for which a guideline panel concludes that the desirable effects of adherence to a recommendation probably outweigh the undesirable effects, but the panel is not confident. 
Thus, if guideline developers believe that benefits and downsides are finely balanced, or appreciable uncertainty exists about the magnitude of benefits and/or downsides, they offer a weak recommendation.
CPGs are intended for typical patients, but clinicians are becoming increasingly aware of the importance of patient values and preferences in individualized clinical decision making. One way to interpret strong and weak recommendations is in relation to patient values and preferences. For decisions in which it is clear that benefits far outweigh downsides, or downsides far outweigh benefits, almost all patients will make the same choice, and guideline developers can offer a strong recommendation (see Box 1).
Thus, another way for clinicians to interpret strong recommendations is, for typical patients, that they should "just take the recommended action" and offer the intervention to their patients. On the other hand, when clinicians face weak recommendations, they should more carefully consider the benefits, harms, and burden in the context of the patient before them. These situations arise when benefits and downsides are closely balanced, or because of uncertainty in benefits and/or downsides, in which appreciable numbers of patients, because of variability in values and preferences, will make different choices. In such situations, guideline developers will offer weak recommendations (Box 2).
Individualization of clinical decision making in weak recommendations remains a challenge. Although clinicians always should consider patients' preferences and values, when they face weak recommendations they may have a more detailed conversation with patients than for strong recommendations to ensure that the ultimate decision is consistent with the patient's values. For patients who are interested, a decision aid that presents patients with both benefits and downsides of therapy is likely to improve knowledge and decrease decision-making conflict, and it may promote a decision most consistent with underlying values and preferences (13). Because of time constraints and because decision aids are not universally available, clinicians cannot use decision aids in all patients and, for strong recommendations, the use of decision aids is inefficient. Other ways of interpreting strong and weak recommendations relate to performance or quality indicators. Strong recommendations are candidate performance indicators. For weak recommendations, performance could be measured by monitoring whether clinicians have discussed recommended actions with patients or their surrogates or carefully documented the evaluation of benefits and downsides in the patient's chart. Similar interpretations follow for public policy derived from guidelines. Strong recommendations require less debate than weaker recommendations. Table 1 summarizes several ways that developers and consumers of guidelines can interpret strong and weak recommendations.
Clinicians, patients, third-party payers, institutional review committees, other stakeholders, or the courts should never view recommendations as dictates. Even strong recommendations based on high-quality evidence will not apply to all circumstances and all patients. Consumers of CPGs may reasonably conclude that following some strong recommendations based on high-quality evidence will be a mistake for some patients. No CPGs or recommendations can take into account all of the often-compelling unique features of individual clinical circumstances. Thus, nobody charged with evaluating clinicians' actions should attempt to apply recommendations in rote or blanket fashion.
Factors that Influence the Strength of a Recommendation
Guideline panels should, in general, make stronger recommendations for interventions that decrease adverse outcomes with high patient importance (14) (those to which, on average, patients assign greater values and preferences) than those that decrease outcomes of lesser patient importance (Box 3).
Returning to the first example in Box 2, the initial choice made by the patient to accept adjusted-dose warfarin for 1 year versus shorter periods (< 3 mo) for the prevention of DVT recurrence or other adverse outcomes in patients with initial DVT illustrates several of the factors that will influence the strength of a recommendation (Box 4).
A patient's baseline risk of the adverse outcome (sometimes called control event risk) that treatment is expected to prevent may prove a key consideration (Table 2 and Box 5).
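The role of baseline risk can be illustrated with a short calculation. The following is a minimal sketch, not part of the GRADE method itself: it assumes that the relative risk reduction is constant across risk strata, and the 25% relative risk reduction, the two baseline risks, and the function name `absolute_benefit` are all hypothetical values chosen for illustration, not data from this statement or from any trial.

```python
from typing import Tuple


def absolute_benefit(baseline_risk: float, relative_risk_reduction: float) -> Tuple[float, float]:
    """Return (absolute risk reduction, number needed to treat).

    baseline_risk: risk of the adverse outcome without treatment (control event risk).
    relative_risk_reduction: proportional reduction in risk with treatment,
        assumed here to be the same in every risk stratum.
    """
    if not 0 < baseline_risk <= 1:
        raise ValueError("baseline_risk must be in (0, 1]")
    if not 0 < relative_risk_reduction <= 1:
        raise ValueError("relative_risk_reduction must be in (0, 1]")
    arr = baseline_risk * relative_risk_reduction  # absolute risk reduction
    nnt = 1 / arr                                  # number needed to treat
    return arr, nnt


# Hypothetical strata: the same relative effect yields very different absolute benefit.
for label, risk in [("higher baseline risk", 0.40), ("lower baseline risk", 0.04)]:
    arr, nnt = absolute_benefit(risk, 0.25)
    print(f"{label}: ARR = {arr:.3f}, NNT = {nnt:.0f}")
```

With these invented inputs, a tenfold difference in baseline risk produces a tenfold difference in the number needed to treat, which is why the balance of benefits and downsides, and hence the strength of a recommendation, may differ between risk groups.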
Another way of dealing with different baseline risks is to offer specific recommendations for several risk strata. For example, in the example above regarding COPD exacerbation, a guideline panel could offer a recommendation for patients with higher baseline risk and one for patients with lower baseline risk. Offering specific recommendations can help users of guidelines select the appropriate recommendations. Data about patient preferences and values are often limited. Although it is ideal for clinicians to elicit patient preferences and values directly from patients, and for guideline panels to obtain values and preference estimates from population-based studies, such studies are often unavailable. When value or preference judgments are particularly important for the interpretation of recommendations, authors should describe the key values they have attributed in making recommendations. For example, providing a recommendation for use of inhaled corticosteroids in mild COPD would require a statement about the higher value assigned to fewer exacerbations, the possible, but uncertain, slower rate of FEV1 decline, and the questionable mortality reduction compared with avoiding the harms from thrush, reduced bone mineral density, increased fracture risk, the burden of using inhalers, and the cost associated with therapy. For a guideline panel to offer a strong recommendation, it has to be quite certain about the various factors that influence the strength of a recommendation and have at hand the relevant information that supports a clear balance toward either the benefits (to recommend an action) or the downsides (to recommend against an action). In situations when a guideline panel is uncertain whether the balance is clear or when the relevant information is not available, a guideline panel should be more cautious and, in most instances, opt to make a weak recommendation.
To achieve a balanced view when formulating recommendations, a multidisciplinary panel with broad representation, including clinicians, methodologists, generalists, patient representatives, and experienced guideline developers, should be assembled and proper group processes for reaching consensus on guidelines should be followed (20–22).
Guideline developers should offer clinicians as many indicators as possible for understanding and interpreting the strength of recommendations. For strong recommendations, the GRADE working group has suggested adopting terminology such as "We recommend . . ." or "Clinicians should . . ." When panels make a weak recommendation, they should use less definitive wording, such as "We suggest . . ." or "Clinicians might . . ." Furthermore, when they offer recommendations, guideline panels should describe the population (defined by the disease and other identifying factors) and the intervention (in as much detail as feasible) as specifically as possible.
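The wording convention described above can be sketched as a simple lookup. This is an illustrative sketch only: the function name `recommendation_lead_in` is hypothetical, and forming the wording for recommendations against an intervention by appending "against" is an assumption, because the text quotes only the phrases for strong and weak recommendations.

```python
def recommendation_lead_in(strength: str, direction: str = "for") -> str:
    """Return the opening phrase for a recommendation of the given strength."""
    # Phrases quoted in the text for strong and weak recommendations.
    phrases = {"strong": "We recommend", "weak": "We suggest"}
    if strength not in phrases:
        raise ValueError("strength must be 'strong' or 'weak'")
    if direction not in ("for", "against"):
        raise ValueError("direction must be 'for' or 'against'")
    phrase = phrases[strength]
    # Assumption: a recommendation against an intervention appends "against".
    return phrase if direction == "for" else phrase + " against"


print(recommendation_lead_in("strong"))
print(recommendation_lead_in("weak", "against"))
```

Keeping the phrase tied to the strength category means a reader can infer the grade from the wording alone, even when the explicit grade is not displayed.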
Before grading the quality of evidence, guideline developers and other groups making recommendations should conduct or identify a well-done systematic review and produce a transparent evidence summary on which to base judgments. One advance of the GRADE system is that, if justified by the available evidence, the judgments allow for strong recommendations in the setting of evidence from observational studies. At the same time, the GRADE system exemplifies how high-quality evidence should allow for weak recommendations (Box 7). In previous grading systems, grading primarily depended and focused on the quality of the underlying evidence, including the number of available studies. The severe infection examples and the lung cancer examples suggest that a separation of the strength of a recommendation from the quality of evidence (i.e., RCTs or observational studies) is important for making recommendations. However, the basic study design remains crucial in determining our confidence in estimates of beneficial and detrimental intervention effects. In the GRADE system, the highest quality evidence comes from one or more well-designed and well-executed RCTs yielding consistent and directly applicable results. High-quality evidence can also come, under unusual circumstances, from well-done observational studies (e.g., well-conducted and controlled cohort studies) yielding very large effects. RCTs with important limitations and well-done observational studies yielding large effects constitute the moderate-quality category. Well-done observational studies and, on occasion, RCTs with very serious limitations will be rated as low-quality evidence. The very-low-quality category includes poorly controlled observational studies and unsystematic clinical observations (e.g., case series or case reports). This grading follows the principle that all relevant clinical studies and observations provide evidence, the quality of which varies. 
However, the system also clarifies that expert opinion is not a category of evidence. Expert opinion represents an interpretation of evidence, including evidence ranging from observations in an expert's own practice (uncontrolled observations) to the interpretation of RCTs and meta-analyses known to the expert in the context of other experiences and knowledge. The ATS adopted the GRADE four-category system of quality of evidence (high, moderate, low, and very low quality; Table 3) where the quality of evidence reflects our confidence that estimates of an intervention's benefits and downsides generated from research are accurate.
Factors that Decrease the Quality of Evidence
The following limitations may decrease the quality of evidence supporting a recommendation (Table 4).
The factors influencing the quality of evidence may be additive such that the presence of several of these factors, if judged important, would lower the quality of evidence by more than one category. Each of these factors (methodologic limitations, indirectness, heterogeneity, and imprecision) may also decrease the quality of evidence associated with observational studies (moving the categorization of such evidence from low to very low quality).
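The rating logic described in this section can be sketched in a few lines. This is a simplified illustration under stated assumptions, not an official algorithm: the numeric encoding of categories, the function name `rate_quality`, and the convention of expressing each limitation as a number of categories to subtract (1 for a serious limitation, 2 for a very serious one) are all assumptions made for the example. In practice each step is a judgment by the guideline panel.

```python
LEVELS = ("very low", "low", "moderate", "high")

# Starting category by basic study design, as described in the text.
START_LEVEL = {"rct": 3, "observational": 1, "unsystematic": 0}

# Factors named in the text that may lower the quality of evidence.
DOWNGRADE_FACTORS = {
    "methodologic limitations",
    "indirectness",
    "heterogeneity",
    "imprecision",
}

# Observational studies may move up when effects are large or very large.
EFFECT_UPGRADE = {"none": 0, "large": 1, "very large": 2}


def rate_quality(design, downgrades=None, effect="none"):
    """Return the quality category for one body of evidence.

    downgrades maps a factor name to the number of categories to subtract
    (assumed encoding: 1 = serious, 2 = very serious).
    """
    downgrades = downgrades or {}
    unknown = set(downgrades) - DOWNGRADE_FACTORS
    if unknown:
        raise ValueError(f"unknown downgrade factors: {sorted(unknown)}")
    level = START_LEVEL[design] - sum(downgrades.values())
    if design == "observational":
        level += EFFECT_UPGRADE[effect]
    # Clamp to the four available categories.
    return LEVELS[max(0, min(level, len(LEVELS) - 1))]


print(rate_quality("rct", {"methodologic limitations": 1}))
print(rate_quality("observational", effect="very large"))
print(rate_quality("observational", {"imprecision": 1}))
```

The sketch makes the additive nature of the factors explicit: several limitations together can lower the rating by more than one category, and the result never falls below very low quality.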
Factors that Increase the Quality of Evidence
Guideline panels usually provide a single rating of quality of evidence for every recommendation. Recommendations, however, depend on evidence regarding a number of outcomes. Thus, it may be necessary to report a single evidence grade when the quality of evidence differs across important outcomes. Guideline panels should determine the quality of evidence for each outcome, but in terms of overall quality of evidence, the lowest quality of data available for any one of the critical outcomes determines the overall quality of evidence (Box 7).
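The rule that the lowest-quality critical outcome determines the overall grade can be expressed compactly. The sketch below is illustrative: the function name `overall_quality`, the data layout, and the example outcomes with their ratings are hypothetical and are not taken from any guideline.

```python
ORDER = {"very low": 0, "low": 1, "moderate": 2, "high": 3}


def overall_quality(outcomes):
    """Return the overall quality of evidence for one recommendation.

    outcomes maps an outcome name to a (quality, is_critical) pair.
    Only outcomes judged critical influence the overall grade.
    """
    critical = [quality for quality, is_critical in outcomes.values() if is_critical]
    if not critical:
        raise ValueError("at least one critical outcome is required")
    # The lowest quality among the critical outcomes determines the result.
    return min(critical, key=ORDER.__getitem__)


# Hypothetical ratings for illustration only.
example = {
    "mortality": ("high", True),
    "exacerbations": ("moderate", True),
    "oral thrush": ("low", False),  # rated, but not judged critical
}
print(overall_quality(example))
```

In this invented example the low rating for a noncritical outcome does not lower the overall grade, whereas the moderate rating for a critical outcome does.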
Guideline panels may refer to the checklist provided in Table 5 while developing and grading recommendations. The example (Box 8) from the management of impending sepsis shows how panelists might work through the issues.
The ATS has produced numerous guidelines, many of them in collaboration with other guideline developers or organizations. Although widely recognized, the guidelines have been variable in the extent to which they have adhered to methodologic standards (28), and they have applied a variety of approaches to grading the quality of evidence. For example, some collaborative efforts involve grading systems that rate the quality of evidence but do not provide a grade for the strength of a recommendation. The following example provides an additional reason for a new, sensible grading system for the ATS. The Global Initiative for Chronic Obstructive Lung Disease (GOLD) guidelines state (29): "The most common causes of an exacerbation are infection of the tracheobronchial tree and air pollution, but the cause of about one-third of severe exacerbations cannot be identified (Evidence B)." Grading the evidence for etiologic questions in guidelines presents challenges because clinicians typically do not make recommendations about prognostic or etiologic factors (information about etiology) and the evidence does not come from randomized comparisons of one risk factor versus another. As a result, randomized designs do not provide higher quality evidence or information about etiologic factors than observational studies and generic grading systems therefore do not apply to such statements. Thus, grading of etiologic information is irrelevant for guidelines and recommendations because action follows from knowing that modifying etiologic factors influences outcomes. Grading in guidelines therefore should be restricted to recommended actions. Because of the need to clarify methodologic issues around grading the quality of evidence and recommendations and to unify and improve the existing grading methodology applied by ATS guideline developers, we proposed the use of the GRADE approach. 
The framework summarized in Table 6 generates recommendations ranging from a strong recommendation based on high-quality evidence to weak recommendations based on very-low-quality evidence.
Strengths and Limitations
One of the major merits of GRADE is the simplicity of its two-category system of grading recommendations. The behavioral implications of strong and weak recommendations provide practical guidance to clinicians and other users (Table 1). The definition of categories of methodologic problems and merits allows an explicitness and transparency that other systems lack. The ATS makes no official recommendation to others for using the GRADE approach, but guideline panels considering using GRADE can anticipate support by the GRADE working group that is not available for other systems. Independent of ATS efforts, the large group of methodologists involved in GRADE conduct regular workshops around the world and have acted as resources for any group considering using GRADE. The approach offers the possibility of working electronically and making guideline material available on the World Wide Web (www.gradeworkinggroup.org). Evidence tables and recommendations could form the sole publication in print, whereas information required for decision making by guideline panels and for those clinicians who require an in-depth understanding of all the evidence could be deposited in electronic format and connected via hyperlinks. An additional advantage is that the GRADE system applies to diagnostic recommendations in much the same way as it applies to questions about therapy. The final recommendation from a diagnostic question depends on the balance between benefits and downsides of the diagnostic strategy in terms of patient-important outcomes, although, until recently, these outcomes have been measured infrequently. Finally, the novel approach of grading the quality of evidence for each important outcome and applying the quality of all critical outcomes to the final quality grade provides increased transparency about the evidence supporting recommendations that help in responding to health care questions.
The health care questions are often complex and associated with finely balanced benefits and downsides. Adopting the GRADE approach also has some disadvantages that are inherent to any grading system. Systems currently used to analyze scientific data for the purpose of creating CPGs have not been tested rigorously for validity and reproducibility. GRADE is no exception. Establishing criteria for the validity of rating of quality of evidence is extremely challenging. Establishing criteria for the validity of the direction and strength of recommendation is even more problematic because it depends on underlying values and preferences that would have to be precisely specified. The "impact" of a CPG (the degree to which it affects behaviors) does not qualify as a measure of outcome because impact does not necessarily reflect a guideline's internal validity (i.e., the extent to which the process of development has produced an approximation of "scientific truth"). Even if the process yields truth, it may not necessarily convince sufficiently to alter clinicians' behaviors. Many factors influence many behaviors, including the respect for the methods used to create a guideline, or reputations of panelists enlisted to draft it, the societies supporting it, or of the journal that publishes a guideline. Thus, the impact of a guideline on behaviors is a poor measure of the validity of the processes used to create it. These challenges result in a situation in which none of the competing systems have been validated; thus, validity was not a criterion that the ATS Documents Development and Implementation Committee applied when making its choices.
Consumers of grading systems often raise concerns about the reproducibility of the grading process. When the GRADE group assessed an early version of its grading system across medical specialty areas, there was varying agreement about the quality of evidence for the rated outcomes. Lack of reproducibility is of less concern if judgments are made transparent and consumers can track reasons for decisions about the grading by guideline panels. If guideline developers applied consistent approaches to evaluating quality of evidence and grading recommendations, differences in judgments could be more easily understood. One merit of the GRADE system is the transparency of the judgments, which is strongest for rating the quality of the evidence. Table 5 presents a summary of the sequential judgments a guideline panel would make following the GRADE approach. Readers of a graded guideline or recommendation should be aware that judgments about the quality of evidence, including those following the GRADE approach, require experience and expertise by guideline panels about the addressed health care question and research methodology. As described above, however, expert opinion does not constitute a form of evidence but an interpretation of existing evidence. Other disadvantages of adopting the GRADE approach include the requirement for resources to conduct detailed assessment of the evidence and the requirement of consumers to develop some basic understanding of the system. The latter is of concern for any grading system, and GRADE's choice of a simple two-category approach to the strength of the recommendation facilitates ease of understanding. Some users of recommendations may find a two-category rating of the strength of recommendation too simplistic for the problems clinicians encounter in daily practice. However, there are several issues to consider that speak for the simpler choice of a two-category grading of the strength of recommendations over more categories.
First, there are several ways (Table 1) consumers of guidelines can interpret these recommendations. Second, when balancing the continuum of benefits and downsides, often a very challenging process, guideline panels can choose between two categories of strength against an action and two categories for an action. Third, the GRADE system explicitly asks for a detailed and transparent description of the underlying judgments and values that influence a recommendation. Thus, consumers of guidelines have the option of making different choices (predominantly in the case of weak recommendations) when they have information that leads them to disagree with the judgments and have evidence that the values of their patients differ. The ATS Documents Development and Implementation Committee recognizes the limitations of GRADE, particularly with respect to validity and reproducibility. There are, however, no competing systems that are superior in this regard, and GRADE has many strengths. Because we see compelling arguments for adopting a single, uniform approach to grading recommendations that is consistent or nearly consistent with systems adopted by other leading organizations (9), the committee has chosen GRADE as the preferred current methodology for rating the quality of evidence and strength of recommendations. The ATS adopted the original GRADE four-category grading system for the quality of evidence. This represents an important distinction from the GRADE approach adapted by the ACCP, which combines the low and very low quality of evidence (9). The ACCP refrained from using the very-low-quality category in part because, for many of the therapeutic areas that ACCP guidelines focus on, such as antithrombotic guidelines, higher quality primary evidence exists (31).
In the grading system the Documents Development and Implementation Committee adopted for the ATS, the strength of any recommendation depends on two factors: the quality of the evidence regarding treatment effect and the tradeoff between benefits and downsides of an intervention. The system classifies methodologic quality in four categories: randomized trials that show consistent results, or observational studies with very large treatment effects (high quality); randomized trials with methodologic limitations, or observational studies with large effect (moderate quality); observational studies without exceptional strengths, or randomized trials with very serious limitations (low quality); and unsystematic clinical observations, such as case reports and case series (very low quality). The balance between benefits and downsides falls into one of two categories. Recommendations are either strong, defined as being "confident that adherence to the recommendation will do more good than harm or that the net benefits are worth the costs," or weak, defined as being "uncertain that adherence to the recommendation will do more good than harm OR that the net benefits are worth the costs." Panels can make recommendations for or against a given intervention. The language of strong recommendations (worded as "we recommend" or "should" in the actual recommendation) reflects the following clinical message: the recommendation applies to most patients under most circumstances. The language of weak recommendations (worded as "we suggest" or "might") reflects a different clinical message: the need to consider more carefully than usual individual patients' circumstances, preferences, and values.
The uncertainty associated with weak recommendations follows either from poor-quality evidence (if we are uncertain of benefits and downsides, it is not wise to make a strong recommendation for or against), or from closely balanced benefits versus downsides.
This statement was prepared by the ATS Documents Development and Implementation Committee. Members of the committee are as follows:
HOLGER J. SCHÜNEMANN, M.D., PH.D. (chair), Rome, Italy
ROMAN JAESCHKE, M.D., M.SC., Hamilton, Canada
DEBORAH J. COOK, M.D., M.SC., Hamilton, Canada
WILLIAM F. BRIA, M.D., Ann Arbor, Michigan
ALI A. EL-SOLH, M.D., M.P.H., Buffalo, New York
ARMIN ERNST, M.D., Boston, Massachusetts
BONNIE F. FAHY, R.N., M.S.N., Phoenix, Arizona
RICHARD L. GELULA, M.S.W., Washington, D.C.
MICHAEL K. GOULD, M.D., M.S., Stanford, California
KATHLEEN L. HORAN, M.D., Stanford, California
JERRY A. KRISHNAN, M.D., PH.D., Baltimore, Maryland
CONSTANTINE A. MANTHOUS, M.D., Providence, Rhode Island
JANET R. MAURER, M.D., Anthem, Arizona
WALTER T. MCNICHOLAS, M.D., Dublin, Ireland
ANDREW D. OXMAN, M.D., M.SC., Oslo, Norway
GORDON RUBENFELD, M.D., Seattle, Washington
GERARD M. TURINO, M.D. (vice-chair), New York, New York
GORDON GUYATT, M.D., M.SC., Hamilton, Canada
JEFFREY S. WAGENER, M.D., Denver, Colorado
Conflict of Interest Statement: H.J.S., R.J., and G.G. are members of the GRADE working group that developed the GRADE grading system. The GRADE working group is an informal group of methodologists and guideline developers with interest in improving guideline methodology. H.J.S., R.J., and G.G. participate in developing freely available software (GRADEpro) for applying the GRADE approach. They have no direct financial interest in the GRADE approach or the GRADEpro software. D.J.C., W.F.B., A.A.E.-S., A.E., B.F.F., M.K.G., K.L.H., J.A.K., C.A.M., J.R.M., W.T.M., A.D.O., G.R., and G.M.T. do not have a financial relationship with a commercial entity that has an interest in the subject of this manuscript.