| |
ABSTRACT |
|---|
|
|
|---|
To measure the reliability of chest radiographic diagnosis of acute respiratory distress syndrome (ARDS) we conducted an observer agreement study in which two of eight intensivists and a radiologist, blinded to one another's interpretation, reviewed 778 radiographs from 99 critically ill patients. One intensivist and a radiologist participated in pilot training. Raters made a global rating of the presence of ARDS on the basis of diffuse bilateral infiltrates. We assessed interobserver agreement in a pairwise fashion. For rater pairings in which one rater had not participated in the consensus process we found moderate levels of raw (0.68 to 0.80), chance-corrected (κ 0.38 to 0.55), and chance-independent (Φ 0.53 to 0.75) agreement. The pair of raters who participated in consensus training achieved excellent to almost perfect raw (0.88 to 0.94), chance-corrected (κ 0.72 to 0.88), and chance-independent (Φ 0.74 to 0.89) agreement. We conclude that intensivists without formal consensus training can achieve moderate levels of agreement. Consensus training is necessary to achieve the substantial or almost perfect levels of agreement optimal for the conduct of clinical trials.
Meade MO, Cook RJ, Guyatt GH, Groll R, Kachura JR, Bedard M, Cook DJ, Slutsky AS, Stewart TE. Interobserver variation in interpreting chest radiographs for the diagnosis of acute respiratory distress syndrome.
| |
INTRODUCTION |
|---|
|
|
|---|
Acute respiratory distress syndrome (ARDS) is an advanced form of acute lung injury characterized by diffuse pathophysiological changes of increased capillary permeability, inflammation, and tissue repair. The clinical syndrome includes a triad of hypoxemia, decreased lung compliance, and chest radiographic abnormalities. The high incidence of ARDS in patients with predisposing clinical conditions such as sepsis, gastric aspiration, and multiple trauma [up to 35% (1)], and the associated mortality of 20 to 74% (2, 3), have made ARDS a major concern for clinicians and investigators.
ARDS is a more severe form of the continuum of acute lung injury, the threshold for which is somewhat arbitrary. Because of the arbitrariness of this threshold, defining ARDS and identifying the syndrome in individual patients presents a challenge. This problem of definition has led to considerable difficulties in comparing epidemiologic data relating to ARDS incidence (4), difficulties that will be resolved only through multinational collaboration (5). Variability in defining and identifying ARDS is also an important concern in clinical trials that consider ARDS as an inclusion criterion or as a study outcome.
Whatever definition one chooses, the diagnosis of ARDS depends in part on identifying characteristic radiographic abnormalities. To be consistently useful, interpretation of a radiological investigation must be reliable. Highly desirable in the delivery of clinical care, reliability becomes crucial in clinical studies that rely on radiologic findings. Lack of reproducibility will inflate required sample sizes, and potentially lead to false-negative trial results. The limited interobserver agreement that investigators have usually observed when examining radiographic interpretation (6) suggests that both clinicians and scientists should attend to this issue.
We conducted a multicenter randomized trial of a pressure- and volume-limited ventilation strategy (10) in patients at high risk for ARDS. Our study demonstrated similar outcomes for the alternative strategies that we examined. When planning this study, we considered using ARDS as a possible inclusion criterion, and as a possible outcome. We rejected ARDS as an inclusion criterion because we ultimately decided our intervention might prevent the development of ARDS. We rejected ARDS as an outcome since the two different ventilation strategies used different mean airway pressures, and hence would potentially bias chest radiograph and oxygenation criteria of ARDS. We were nevertheless interested in the frequency with which ARDS occurred, and in the reliability with which we might measure that frequency. We therefore examined the extent to which intensive care physicians and a radiologist could agree on the radiologic diagnosis of ARDS.
| |
METHODS |
|---|
|
|
|---|
Source of Chest Radiographs
We used films from patients enrolled in our randomized trial at seven
participating hospitals in Toronto (Ontario, Canada). Adult patients
who met the following criteria were eligible for the trial: intubated
less than 24 h; peak airway pressures 30 cm H2O; hypoxemia:
PaO2/FIO2 < 250, on positive end-expiratory pressure (PEEP) = 5 cm H2O;
one or more known risk factors for ARDS. The trial excluded patients
with the following characteristics: anticipated duration of ICU admission < 48 h; very unlikely survival, defined by premorbid or acute life
expectancy; heart failure; acute asthmatic exacerbation; high risk of
cardiac arrhythmia or ischemia; intracranial abnormalities associated
with intracranial hypertension; or pregnancy.
Because of difficulty in obtaining the films, we omitted all films from the Ottawa center; from the Toronto Hospitals, we omitted films that film library personnel could not locate. Study patients had chest radiographs taken at least once daily, and we chose the first film from each consecutive day of participation. We included 841 films from 99 patients; individual patients provided from 1 to 32 films (median, 7).
Raters
Three raters interpreted each radiograph. Seven study intensivist/investigators, one from each participating hospital, provided the first interpretation by reading films done at their hospital (Rater 1). After the randomized trial was completed, two other raters, one an intensivist (M.M., Rater 2) and one a radiologist (J.K., Rater 3) interpreted each film, independently and without knowledge of other interpretations.
Preparation of Chest Radiographs
Site investigators reviewed films at the time they were taken. To prepare the films for review by Raters 2 and 3, we shuffled them in batches of approximately 150 as they arrived at the study office, numbered them in their new sequence, removed each film from the associated envelope, and covered the identification label with an opaque sticker bearing the study film number. The purpose of these preparations was to minimize bias that might occur if raters reviewed serial films from a single patient in sequence.
Review Process
Site investigators recorded their interpretations of study films on data
forms included in the randomized trial. Site investigators had no
study-specific training (no study-specific definitions or standardized
techniques) in judging the presence of ARDS-related infiltrates. Raters 2 and 3 began by independently interpreting 63 films from 11 patients;
we refer to these films as the "training set." Raters 2 and 3 then repeated the review of the training set films, this time with one
another, discussing the reasons for disagreement and refining the
standards and rules they would apply when the interpretation was difficult. We refer to this process as the "standardized review." Raters 2 and 3 then completed their interpretation of the full sample of 841 films, including the 63 films from the training set. The training set
films were included at random among the others and the raters were
thus unlikely to identify them.
Radiograph Interpretation
Each interpreter made two ratings in accordance with two definitions for ARDS that are commonly used in clinical trials. One rating, based on the definition of ARDS provided by an American-European Consensus Conference (AECC) statement (5), involved deciding whether a chest radiograph had diffuse bilateral infiltrates. Although the original AECC statement specifies "bilateral infiltrates" in the list of criteria defining ARDS, we specified "diffuse bilateral infiltrates" for two reasons. First, the AECC statement included a discussion of ARDS as a diffuse process that is therefore associated with diffuse infiltrates. Second, we were unwilling to interpret films with discrete bilateral subsegmental infiltrates as being consistent with ARDS. The refined criteria arising from the standardization process included conventional definitions from the radiology literature (11), defining infiltrate as "any ill-defined opacity in the lung that neither destroys nor displaces the gross morphology of the lung and is presumed to represent a pathophysiological process." The refined criteria also included defining diffuse as widespread and continuous, by which the reviewers meant involving at least 80% of a lung field and not excluding specific lung segments.
The other rating, based on the Lung Injury Severity Score (4), involved deciding how many quadrants contained an area of consolidation. The consensus standardization led to a definition of consolidation as "a homogeneous opacity in the lung characterized by little or no loss of volume, by effacement of pulmonary blood vessels, and sometimes by the presence of an air bronchogram" and excluded definite effusions and masses. To distinguish between the upper and lower quadrant of a lung field, Raters 2 and 3 agreed to use the horizontal plane of the ipsilateral pulmonary artery at its midpoint at the hilum. When this landmark was obscured, they used the contralateral pulmonary artery and, when both were obscured, they used the midpoint of the height of the lung fields.
Statistical Methods
We were interested in the level of agreement in each of the three possible pairings among Raters 1, 2, and 3. We refer to the pairing of Raters 2 and 3 as the "standardized pair" to distinguish this pairing from other pairings in which one rater (the site intensivist/investigator) had not participated in the consensus process.
We measured agreement among raters by addressing two questions. The first question, Is this chest radiograph consistent with ARDS?, is relevant to clinical practice or to the use of ARDS as an inclusion criterion in clinical trials. To address this issue, we would like as many films as possible. Therefore, it would be convenient if we could treat the 841 films from this study as if they were 841 films from different patients. However, we have serial films from 99 patients; therefore, we cannot treat our observations as if they came from different patients (using technical language, as if they were independent). If we did assume independence, results from standard κ-type analyses could be subject to major distortion. We will return to this issue shortly.
The second question, Did this patient develop ARDS?, applies to the use of ARDS as a study outcome. Measuring agreement among raters in this setting requires reviewing all films for each patient, and developing a criterion for a series of films being consistent with ARDS. We tested two possible criteria: (1) any film consistent with ARDS, and (2) films on two consecutive days consistent with ARDS.
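The two patient-level criteria can be written down compactly. The following is a minimal sketch, assuming each patient's ratings from one rater are stored as one boolean per consecutive study day with no missing days; the function names are illustrative and do not come from the study.

```python
from typing import Sequence


def any_film_positive(daily: Sequence[bool]) -> bool:
    """Criterion 1: at least one film in the series is rated consistent with ARDS."""
    return any(daily)


def two_consecutive_positive(daily: Sequence[bool]) -> bool:
    """Criterion 2: films on two consecutive days are both rated consistent with ARDS.

    Assumes (hypothetically) that `daily` holds exactly one rating per
    consecutive day, so adjacent list entries are adjacent days.
    """
    return any(today and tomorrow for today, tomorrow in zip(daily, daily[1:]))


# Hypothetical series of daily ratings for one patient from one rater.
series = [False, True, False, True, True]
print(any_film_positive(series))         # True
print(two_consecutive_positive(series))  # True (days 4 and 5)
```

Applying both functions to each rater's series for the same patient yields the paired patient-level ratings on which agreement is then assessed.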
Because seven intensivists contributed to "Rater 1," we began by comparing odds ratios of agreement between each of the seven and Raters 2 and 3 with respect to the presence of diffuse bilateral infiltrates on Day 1, two consecutive days, or any day. Testing failed to reject the null hypothesis, i.e., that the seven intensivists achieved the same levels of agreement with Raters 2 and 3. Testing for heterogeneity of odds ratios generated by different observers and pooling across estimates if no heterogeneity is found is standard statistical methodology (12).
For comparisons of rating of the presence or absence of diffuse bilateral infiltrates, we calculated raw agreement, chance-corrected agreement (using κ), and chance-independent agreement (using Φ). Table 1 presents the formulas for our measures of agreement based on a 2 × 2 table. The rationale for using these three methods is as follows. Raw agreement (the proportion of films in which both raters conclude that diffuse infiltrates were, or were not, present) can be misleading. In particular, if two raters both make a high or low proportion of positive ratings, raw agreement will be high even if the raters are just guessing. That is, their agreement will be high simply by chance. High agreement by chance tends to occur when two observers believe the prevalence of the clinical entity of interest is high or low in the population under study.
Because of this problem with raw agreement, we calculated chance-corrected agreement, using the κ statistic (13). While avoiding spuriously high levels of agreement due to chance, κ has its own limitations that have led to sharp criticism (14). One of the major difficulties with κ is that when the proportion of positive ratings is extreme, the possible agreement above chance agreement is small, and it is difficult to achieve even moderate values of κ. Thus, if one uses the same raters in a variety of settings, as the proportion of positive ratings becomes extreme, κ will decrease even if the way the raters interpret films does not change.
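To make the contrast between raw and chance-corrected agreement concrete, here is a minimal sketch of both measures for a 2 × 2 table. It assumes the usual two-rater (Cohen) form of κ and a cell layout in which a and d are the concordant cells (both raters positive, both raters negative) and b and c are the discordant cells; Table 1 is not reproduced here, so this layout and the example counts are assumptions.

```python
def raw_agreement(a: int, b: int, c: int, d: int) -> float:
    """Proportion of films on which the two raters give the same rating."""
    n = a + b + c + d
    return (a + d) / n


def chance_agreement(a: int, b: int, c: int, d: int) -> float:
    """Agreement expected if the raters rated independently with their own margins."""
    n = a + b + c + d
    both_positive = ((a + b) / n) * ((a + c) / n)
    both_negative = ((c + d) / n) * ((b + d) / n)
    return both_positive + both_negative


def cohen_kappa(a: int, b: int, c: int, d: int) -> float:
    """Chance-corrected agreement: (observed - expected) / (1 - expected)."""
    observed = raw_agreement(a, b, c, d)
    expected = chance_agreement(a, b, c, d)
    return (observed - expected) / (1 - expected)


# Hypothetical counts: balanced ratings versus 90% positive ratings.
print(raw_agreement(40, 10, 10, 40), round(cohen_kappa(40, 10, 10, 40), 3))  # 0.8 0.6
print(raw_agreement(85, 5, 5, 5), round(cohen_kappa(85, 5, 5, 5), 3))        # 0.9 0.444
```

In the second table raw agreement is higher (0.90 versus 0.80), yet κ is lower, because most of that agreement is already expected by chance when both raters call 90% of films positive.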
To address this limitation, we also calculated chance-independent agreement using Φ, a relatively new approach to assessing observer agreement (15). One begins by estimating the odds ratio from a 2 × 2 table displaying the agreement between two observers, such as the one presented in Table 1. The odds ratio is given by OR = ad/bc. In this case it is simply the odds of a positive classification by rater B when rater A gives a positive classification divided by the odds of a positive classification by rater B when rater A gives a negative classification. As such, it provides a natural measure of agreement. This agreement can be made more easily interpretable by converting it into a form that takes values from −1.0 (representing extreme disagreement) to 1.0 (representing extreme agreement). The Φ statistic makes this conversion by the following formula:

$$\Phi = \frac{\sqrt{OR} - 1}{\sqrt{OR} + 1} = \frac{\sqrt{ad} - \sqrt{bc}}{\sqrt{ad} + \sqrt{bc}} \qquad (1)$$
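As a minimal sketch of this conversion, the function below computes the chance-independent statistic (written Φ here) from the four cells of a 2 × 2 agreement table, using the square-root form given in the APPENDIX, which equals (√OR − 1)/(√OR + 1). The cell layout (a and d concordant, b and c discordant) and the example counts are assumptions for illustration.

```python
import math


def phi_statistic(a: int, b: int, c: int, d: int) -> float:
    """Chance-independent agreement from a 2 x 2 table.

    Equivalent to (sqrt(OR) - 1) / (sqrt(OR) + 1) with OR = ad / bc, but
    written with square roots of the cell products so that a single empty
    discordant cell does not cause a division by zero.
    """
    root_ad = math.sqrt(a * d)
    root_bc = math.sqrt(b * c)
    if root_ad + root_bc == 0:
        raise ValueError("phi is undefined when ad and bc are both zero")
    return (root_ad - root_bc) / (root_ad + root_bc)


# Two hypothetical tables with the same odds ratio (16) but different margins.
print(phi_statistic(40, 10, 10, 40))   # 0.6  (50% positive ratings)
print(phi_statistic(160, 10, 10, 10))  # 0.6  (about 89% positive ratings)
print(phi_statistic(10, 40, 40, 10))   # -0.6 (systematic disagreement)
```

Because the value depends on the table only through the odds ratio, the two tables with very different proportions of positive ratings give the same result, and the balanced table gives the same value as κ would (0.6).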
When both margins are 0.5 (that is, both raters conclude that 50% of the patients are positive and 50% negative for the trait of interest) Φ is equal to κ.
Φ has three important advantages over existing approaches. First, it is independent of the level of chance agreement. Thus, investigators could expect to find similar levels of Φ whether the distribution of results is 50% positive and 50% negative, or 90% positive and 10% negative. This is not true for measures of the κ statistic, a chance-corrected index of agreement. Second, Φ allows modeling approaches that the κ statistic does not. For example, in the present data set, because of the possible lack of independence in degree of agreement across multiple films from a single patient, κ would not allow us to take full advantage of the 841 films that our raters evaluated. Φ allowed us to adjust for the degree of intrapatient correlation in assessments of serial radiographs, and thus make more efficient use of the data and generate narrower confidence intervals around the level of agreement. Third, Φ also allowed us to test whether differences in agreement between pairings were significant, an option not available with κ.
For ratings of the presence or absence of diffuse bilateral infiltrates, we compared not only the agreement between the three pairings of raters, but also the agreement between the standardized pairing (Raters 2 and 3) on the training set before and after the standardized review. Because of the possibility that viewing the training set twice may have influenced the standardized raters' interpretation of those films, we omitted them from the primary comparisons. Thus, the primary comparisons of the three possible pairings of raters included only 778 films.
As we have mentioned, we could not use κ to calculate agreement on the presence or absence of diffuse bilateral infiltrates using all films, because of the lack of independence in multiple films on the same patients. We were able to assess agreement across all 778 films based on the Φ statistic and applied maximum likelihood estimation based on the noncentral hypergeometric distribution to generate estimates that account for the degree of correlation in multiple films coming from the same patient (the APPENDIX describes the approach to maximum likelihood estimation).
We also conducted significance tests on the agreement between the three pairings of raters on the three ratings of the presence of radiographic ARDS (bilateral infiltrates present on first film, any film, and two consecutive films), and on the agreement between the consensus raters before and after training. We interpreted both κ and Φ results as follows: values of less than 0, poor; 0 to 0.2, slight; 0.2 to 0.4, fair agreement; 0.4 to 0.6, moderate agreement; 0.6 to 0.8, substantial agreement; and values of 0.8 to 1.0 represent almost perfect agreement (16).
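The interpretation scale can be expressed as a small lookup. This is a sketch only: the text gives ranges that share their end points (for example 0.2 closes one band and opens the next), so assigning each shared boundary to the lower band is an assumption made here, and the function name is illustrative.

```python
def interpret_agreement(value: float) -> str:
    """Map an agreement statistic in [-1, 1] to the descriptive labels in the text.

    Shared boundaries (0.2, 0.4, 0.6, 0.8) are assigned to the lower band;
    that tie-breaking rule is an assumption, not something the text states.
    """
    if not -1.0 <= value <= 1.0:
        raise ValueError("agreement statistics must lie between -1 and 1")
    if value < 0:
        return "poor"
    if value <= 0.2:
        return "slight"
    if value <= 0.4:
        return "fair"
    if value <= 0.6:
        return "moderate"
    if value <= 0.8:
        return "substantial"
    return "almost perfect"


print(interpret_agreement(0.55))  # moderate
print(interpret_agreement(0.88))  # almost perfect
```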
Methods for calculating chance-independent agreement with multiple categories (in this case, multiple quadrants) remain undeveloped. Therefore, to assess agreement between the three pairings of raters on the rating of consolidation in 0 to 4 quadrants, we relied on weighted κ with quadratic weights allowing for partial agreement (17).
We have explained why, because of lack of independence, we could not use all films for assessing κ, and thus used the new methodology for chance-independent agreement. Because we did not have an equivalent methodology to deal with multiple quadrants, we used only the first film on each patient to address agreement on the rating of the number of quadrants involved.
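For readers unfamiliar with weighted κ, the following is a minimal sketch of the quadratic-weight version for ratings of 0 to 4 quadrants, one rating per rater for the first film of each patient. It uses the standard disagreement-weight form, 1 minus the ratio of observed to chance-expected weighted disagreement; the function name and the example ratings are hypothetical.

```python
from typing import Sequence


def weighted_kappa_quadratic(
    rater_x: Sequence[int], rater_y: Sequence[int], n_categories: int = 5
) -> float:
    """Weighted kappa with quadratic disagreement weights.

    Ratings are integers 0 .. n_categories - 1 (here, 0 to 4 quadrants).
    A disagreement of (i - j) categories is penalised by ((i - j) / (k - 1))**2,
    so near misses count as partial agreement.
    """
    if len(rater_x) != len(rater_y) or len(rater_x) == 0:
        raise ValueError("both raters must rate the same, non-empty set of films")
    n = len(rater_x)
    k = n_categories
    observed = [[0.0] * k for _ in range(k)]
    for x, y in zip(rater_x, rater_y):
        observed[x][y] += 1.0 / n
    row = [sum(observed[i]) for i in range(k)]
    col = [sum(observed[i][j] for i in range(k)) for j in range(k)]
    observed_disagreement = 0.0
    expected_disagreement = 0.0
    for i in range(k):
        for j in range(k):
            weight = ((i - j) / (k - 1)) ** 2
            observed_disagreement += weight * observed[i][j]
            expected_disagreement += weight * row[i] * col[j]
    # Undefined (division by zero) if both raters use one identical category throughout.
    return 1.0 - observed_disagreement / expected_disagreement


# Hypothetical quadrant counts from two raters on six first films.
print(weighted_kappa_quadratic([0, 1, 2, 3, 4, 2], [0, 1, 2, 3, 4, 3]))
```

A one-quadrant disagreement, as in the last film of the example, lowers the statistic only slightly, which is the sense in which quadratic weights allow for partial agreement.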
| |
RESULTS |
|---|
|
|
|---|
The patients contributed from 1 to 33 films each to the agreement process, with a mean of 8.9. The seven intensivists who contributed to the "Rater 1" comparisons evaluated films from between 3 and 27 patients. The proportion of patients judged by Raters 1, 2, and 3, respectively, to have bilateral infiltrates present on Day 1 were 0.54, 0.27, and 0.30; 0.70, 0.60, and 0.61 for the proportion of patients who had diffuse bilateral infiltrates present on any day; and 0.64, 0.40, and 0.41 for the proportion of patients who had diffuse bilateral infiltrates present on two consecutive days. For the seven intensivists who contributed to the Rater 1 ratings, the proportions of patients with diffuse bilateral infiltrates present on Day 1, any day, or two consecutive days, respectively, ranged from 0 to 0.78, 0.20 to 0.91, and 0.20 to 0.83. The rater with the 0 and 0.2 proportions reviewed films from only 5 patients.
Tables 2A to 2C present the agreement across the three pairings of raters for the three approaches to judging bilateral infiltrates present or absent, using raw agreement, κ, and Φ. Agreement between Raters 2 and 3 was substantial to almost perfect for all three criteria, using all three approaches. Raw agreement between the other two pairings varied from 0.68 to 0.80. The agreement between these two pairings was moderate for all three criteria, using κ, and moderate to substantial using Φ.
We were interested in whether the consistent trend showing higher agreement in the standardized pairing could be a chance phenomenon. While methods for testing the statistical significance of two κ values in this situation have not been developed, the methodology of chance-independent agreement allows this comparison. Despite the consistency of the trend toward greater agreement in the standardized pairing, the difference between the levels of agreement approached conventional levels of significance for only one of the three ratings related to bilateral infiltrates (p values of 0.83, 0.05, and 0.12 for first film positive, any film positive, and two consecutive films positive by Raters 2 and 3 versus 1 and 3; 0.24, 0.91, and 0.95 by Raters 2 and 3 versus 1 and 2).
This lack of significance could be a problem of power: we may not have had enough films to exclude chance as an explanation. This problem could be ameliorated by using all of the films. Using all films, however, requires adjustment for any lack of independence in ratings of multiple films from the same individual. Including all films in the evaluation of diffuse infiltrates and adjusting for lack of independence, the Φ for Raters 2 and 3 was 0.69 (95% CI, 0.60-0.77), for Raters 1 and 2 the Φ was 0.60 (95% CI, 0.44-0.72), and for 1 and 3 the Φ was 0.56 (95% CI, 0.41-0.69). The difference in these levels of Φ was highly significant (p values of < 0.001 comparing Raters 2 and 3 with either 1 and 2, or 1 and 3).
Table 3 addresses the hypothesis that the reason for the superior agreement of Raters 2 and 3 over the other pairs was the consensus process Raters 2 and 3 undertook in reviewing the
first 63 films together. Table 3 presents the level of agreement
related to the presence of bilateral infiltrates before and after
the consensus process. While the numbers are small, there is a
strong trend for a higher level of agreement after the consensus process. Here, the small data set leads to empty cells (cells with 0 observations), which makes it difficult to make meaningful calculations of Φ.
The weighted κ for the number of quadrants involved in the first film of each patient was as follows for the three pairings: Raters 2 and 3, 0.74 (95% CI, 0.63-0.85); Raters 1 and 2, 0.47 (95% CI, 0.31-0.63); Raters 1 and 3, 0.54 (95% CI, 0.42-0.67).
| |
DISCUSSION |
|---|
|
|
|---|
We found moderate to good agreement on the presence of diffuse bilateral infiltrates suggestive of ARDS, irrespective of which of a number of possible criteria we used. This level of agreement is high in comparison with most clinical ratings, and many of the radiographic interpretations, that clinicians use regularly in clinical practice. For instance, the intensivists in our study demonstrated considerably better agreement than did those who participated in a prior study of interpretation of chest radiographs of patients with ARDS. Beards and coworkers found a κ of only 0.05 for intensivists' rating of the number of quadrants in which consolidation was present (18).
In the clinical trial setting, however, agreement that is less than excellent compromises precision of measurement, and may result in misleading findings, large sample size requirements, or both. For instance, consider a trial enrolling patients with established ARDS, in which the presence of bilateral infiltrates would constitute one criterion for inclusion. The site intensivist and Rater 2 (the study intensivist) agreed on 68% of the ratings of the presence of bilateral infiltrates in the first film from each patient (Table 2A). This limited level of agreement would lead to appreciable differences in the patients enrolled in the study. Similarly, if a study considered ARDS as an outcome and the presence of diffuse infiltrates at any time while the patients stayed in the ICU contributed to the diagnosis, intensivists' ratings would agree only 78% of the time (Table 2A). This limited level of agreement could contribute substantial random error to the study results.
Fortunately, there is a partial solution to this problem. Development of standardized criteria and reporting forms; pilot testing; and training of raters through review of disagreements, discussion of the reasons, and agreement about how to deal with difficult judgments are accepted methods of maximizing agreement in a wide variety of clinical ratings. These methods have resulted in acceptable levels of agreement in interpretation of pediatric chest radiographs in a multicenter study (19). We have provided empirical evidence of the magnitude of improved agreement that clinical trialists studying radiological findings in critically ill patients can achieve by modest pilot testing and consensus development. This process decreased the disagreement on the presence of infiltrates on the first film of each patient to 10% and on any film to 8% (Table 2A).
Strengths of this study include the careful blinding of the radiographs, and of the raters, to one another's interpretation; the participation of both intensivists and a radiologist; the relatively large number of films read and the resulting relatively narrow confidence intervals; and our rigorous approach to data analysis. The study would have been stronger yet if we had the resources to include more radiologists and intensivists, and conducted a more systematic evaluation of a training period that would allow raters to develop consensus standards. The intensivist who read each film was a critical care fellow at the time of the study. Stage of training might have influenced the degree of improvement with training, and including additional readers at varying stages of training would have allowed us to explore this issue.
Inferences from our study may be limited by the lack of detail and explicitness in the current definitions of ARDS (5, 11). Available guidelines for reading and interpreting chest radiographs in patients receiving mechanical ventilation do not solve this problem, as they too offer only general approaches rather than explicit criteria (20). As a result, we developed our own detailed criteria; our criteria, however, do not have the benefit of a wide consensus. Ongoing work is likely to ameliorate or solve this problem in the future.
In reporting our results, we have relied on an innovative approach to measuring agreement with binary ratings. Like traditional measures of agreement, the Φ statistic takes values from −1.0 to 1.0. As we have described in METHODS, Φ has three important advantages over existing approaches. First, it is independent of the level of chance agreement. Second, Φ allows full use of information from nonindependent observations (in this case, multiple films from each patient). Third, Φ allows testing of whether variations in agreement between different pairings of the same raters are significant. These options are not available with κ. We believe these advantages of Φ may ultimately lead to its replacing κ as the standard measure of agreement for binary clinical ratings. Until we gain further experience with the new method, however, we suggest investigators report both the standard κ and the Φ statistic.
In summary, we have demonstrated that intensivists can achieve moderate levels of agreement in the radiologic diagnosis of ARDS without specific training. Further consensus training can increase the level of agreement to substantial or almost perfect. Clinicians involved in clinical trials should seriously consider pilot training and assessment of the level of agreement in making clinical and radiographic ratings to enhance the power and accuracy of their studies.
|
| |
Footnotes |
|---|
Correspondence and requests for reprints should be addressed to Thomas E. Stewart, M.D., Department of Medicine, Mount Sinai Hospital, Suite 427-600, University Avenue, Toronto, ON, Canada M5G 1X5. E-mail: tom.stewart{at}utoronto.ca
(Received in original form September 2, 1998 and in revised form July 7, 1999).
Acknowledgments: Supported in part by the Physicians' Services Incorporated Foundation of Ontario, the Ontario Thoracic Society, and Bayer Corporation.
| |
References |
|---|
|
|
|---|
1. Garber, B. G., P. C. Hebert, J. D. Yelle, R. V. Hodder, and J. McGowan. 1996. Adult respiratory distress syndrome: a systematic overview of incidence and risk factors. Crit. Care Med. 24: 687-695.
2. Miller, R. S., L. D. Nelson, S. M. Di Russo, E. J. Rutherford, K. Safcsak, and J. A. Morris. 1992. High-level positive end-expiratory pressure management in trauma-associated adult respiratory distress syndrome. J. Trauma 33: 284-291.
3. Bell, R. C., J. J. Coalson, J. D. Smith, and W. G. Johanson. 1983. Multiple organ system failure and infection in adult respiratory distress syndrome. Ann. Intern. Med. 99: 293-298.
4. Murray, J., M. Matthay, J. Luce, and M. Flick. 1988. An expanded definition of the adult respiratory distress syndrome. Am. Rev. Respir. Dis. 138: 720-723.
5. Bernard, G. R., A. Artigas, K. L. Brigham, J. Carlet, K. Falke, L. Hudson, M. Lamy, J. R. LeGall, A. Morris, and R. Spragg. 1994. Report of the American-European consensus conference on acute respiratory distress syndrome: definitions, mechanisms, relevant outcomes, and clinical trial coordination. Am. J. Respir. Crit. Care Med. 149: 818-824.
6. Tudor, G. R., D. Finlay, and N. Taub. 1997. An assessment of inter-observer agreement and accuracy when reporting plain radiographs. Clin. Radiol. 52: 235-238.
7. Guyatt, G. H., M. Lefcoe, S. D. Walter, L. E. Griffith, D. King, C. Zylak, N. Hickey, and G. Carrier. 1995. Interobserver variation in computerized tomographic diagnosis of intrathoracic lymphadenopathy in patients with potentially resectable lung cancer. Chest 107: 116-119.
8. Maguire, W. M., P. G. Herman, A. Kahn, M. Simon-Gabor, V. Cruz, and T. M. Eacobacci. 1994. Interobserver agreement using computed radiography in the adult intensive care unit. Acad. Radiol. 1: 10-14.
9. Bloomfield, F. H., R. L. Teele, M. Voss, D. B. Knight, and J. E. Harding. 1999. Inter- and intra-observer variability in the assessment of atelectasis and consolidation in neonatal chest radiographs. Pediatr. Radiol. 29: 459-462.
10. Stewart, T. E., M. O. Meade, D. J. Cook, J. T. Granton, R. V. Hodder, S. E. Lapinsky, C. D. Mazer, R. F. McLean, E. S. Rogovein, B. D. Schouten, T. R. J. Todd, and A. S. Slutsky. 1998. Evaluation of a ventilation strategy to prevent barotrauma in patients at high risk for acute respiratory distress syndrome. N. Engl. J. Med. 338: 355-361.
11. Fraser, R. G., J. A. Peter Pare, P. D. Pare, R. S. Fraser, and G. P. Genereux. 1988. Diagnosis of Diseases of the Chest, 3rd ed. W.B. Saunders, Philadelphia. xiii-xx.
12. Breslow, N. E., and N. E. Day. 1980. Statistical Methods in Cancer Research, Vol. 1: The Analysis of Case-control Studies. International Agency for Cancer Research.
13. Fleiss, J. L. 1971. Measuring nominal scale agreement among many raters. Psychol. Bull. 76: 378-382.
14. McClure, M., and W. C. Willett. 1987. Misinterpretation and misuse of the kappa statistic. Am. J. Epidemiol. 126: 161-169.
15. Cook, R. J., and V. T. Farewell. 1995. Conditional inference for subject-specific and marginal agreement: two families of agreement measures. Can. J. Stat. 23: 333-344.
16. Landis, J. R., and G. G. Koch. 1977. The measurement of observer agreement for categorical data. Biometrics 33: 159-174.
17. Cohen, J. 1968. Weighted kappa: nominal scale agreement with provision for scaled disagreement or partial credit. Psychol. Bull. 70: 213-220.
18. Beards, S. C., A. Jackson, L. Hunt, A. Wood, C. M. Frerk, G. Brear, J. D. Edwards, and P. Nightingale. 1995. Interobserver variation in the chest radiograph component of the lung injury score. Anaesthesia 50: 928-932.
19. Cleveland, R. H., M. Schlucter, B. P. Wood, W. E. Berdon, M. I. Boechat, K. A. Easley, M. Meziane, R. B. Mellins, K. I. Norton, E. Singleton, and L. Trautwein. 1997. Chest radiograph data acquisition and quality assurance in multicentre studies. Pediatr. Radiol. 27: 880-887.
20. Winer-Muram, H. T., S. A. Rubin, M. Miniati, and J. V. Ellis. 1992. Guidelines for reading and interpreting chest radiographs in patients receiving mechanical ventilation. Chest 102(Suppl.): 565S-570S.
| |
APPENDIX |
|---|
Let $y_{ijk} = 1$ if rater $j$ classifies subject $i$ as having diffuse bilateral infiltrates on day $k$, $k = 1, 2, \ldots, k_i$, $j = 1, 2, 3$, $i = 1, 2, \ldots, 99$, and let $y_{ijk} = 0$ otherwise. The classifications for a given subject are dependent over time, so $y_{ijk}$ and $y_{ijl}$ are correlated. Furthermore, we are interested in relating the classifications from different raters, which is best achieved through a regression model. We can relate assessments by Raters 1 and 2 through the following random effects model:
$$\operatorname{logit}(\pi_{i1k}) = \alpha_i + \beta\, y_{i2k},$$
where $\pi_{i1k}$ is the probability that $y_{i1k} = 1$, the $\alpha_i \sim N(\mu, \sigma^2)$ are iid, and $\beta$ is the log odds ratio, reflecting the association between Raters 1 and 2. It is preferable, however, to condition on $y_{i1\cdot} = \sum_k y_{i1k}$, which is sufficient for $\alpha_i$, to obtain a noncentral hypergeometric distribution that is a function of $\beta$ alone (McCullagh, P., and J. A. Nelder. 1989. Generalized Linear Models. Chapman & Hall, London). This can be done for every patient, and one can take the product of the resulting likelihoods to obtain an overall likelihood that is just a function of $\beta$. The resulting likelihood is the same as that arising from the conditional analysis of $2 \times 2 \times k$ tables in stratified case-control studies, and therefore it can be maximized using EGRET, SAS, Splus, and many other computer packages for statistical analysis.
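As a rough illustration of the conditional likelihood described above, the following Python sketch collapses each patient's paired daily ratings into a 2 × 2 table and maximizes the product of noncentral hypergeometric likelihoods over the log odds ratio. It is not the software used in the study. The function names, the data layout (a dict mapping each patient to a list of (rater 1, rater 2) ratings), the name `beta` for the log odds ratio, and the bisection bounds are all assumptions made for this example.

```python
from math import comb, exp, log

def patient_table(pairs):
    """Collapse one patient's daily (y1, y2) ratings into counts (a, b, c, d)."""
    a = sum(1 for y1, y2 in pairs if y1 == 1 and y2 == 1)  # both raters positive
    b = sum(1 for y1, y2 in pairs if y1 == 1 and y2 == 0)  # rater 1 only
    c = sum(1 for y1, y2 in pairs if y1 == 0 and y2 == 1)  # rater 2 only
    d = sum(1 for y1, y2 in pairs if y1 == 0 and y2 == 0)  # both negative
    return a, b, c, d

def expected_a(beta, a, b, c, d):
    """Mean of the both-positive count under the noncentral hypergeometric
    distribution obtained by conditioning on rater 1's total for this patient."""
    n, m1, n2 = a + b + c + d, a + b, a + c
    support = range(max(0, m1 - (n - n2)), min(n2, m1) + 1)
    weights = [comb(n2, u) * comb(n - n2, m1 - u) * exp(beta * u) for u in support]
    return sum(u * w for u, w in zip(support, weights)) / sum(weights)

def conditional_score(beta, tables):
    """Derivative of the conditional log-likelihood: observed minus expected."""
    return sum(t[0] - expected_a(beta, *t) for t in tables)

def conditional_mle(ratings, lo=-15.0, hi=15.0, tol=1e-10):
    """Bisection on the score, which is non-increasing in beta.
    Returns a bound (lo or hi) when the estimate is infinite or the data
    carry no information about beta."""
    tables = [patient_table(pairs) for pairs in ratings.values()]
    if conditional_score(lo, tables) <= 0:
        return lo
    if conditional_score(hi, tables) >= 0:
        return hi
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if conditional_score(mid, tables) > 0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

# Hypothetical toy data: two radiographs per patient, four patients.
ratings = {
    "p1": [(1, 1), (0, 0)],
    "p2": [(1, 1), (0, 0)],
    "p3": [(1, 1), (0, 0)],
    "p4": [(1, 0), (0, 1)],
}
beta_hat = conditional_mle(ratings)
print("estimated odds ratio:", round(exp(beta_hat), 3))
```

In this toy data set three patients are rated concordantly on both days and one discordantly, so the conditional estimate of the odds ratio is 3. With real data a dedicated conditional logistic regression routine would be the safer choice.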
The test for the effect of training is accomplished by adding a main effect of an indicator variable $X$ to a conditional logistic regression model, where $X$ indicates whether the responses were obtained before or after a retraining session. For example, we may fit
$$\log(\psi_{12}) = \beta_{12} + \gamma_{12} X,$$
where $\psi_{12}$ is the odds ratio reflecting the agreement between Observers 1 and 2, and we may test $H_0$: $\gamma_{12} = 0$. If $H_0$ is not rejected, then we would claim that the degree of agreement between Observers 1 and 2 does not depend on whether one is assessing agreement before training or after training. If $H_0$ is rejected, the sign and magnitude of $\gamma_{12}$ indicate whether the level of agreement has deteriorated or improved, and how much it has changed.
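One way to picture this test is as a likelihood-ratio comparison of a single pooled log odds ratio against separate log odds ratios before and after retraining. The Python sketch below is a simplification and not the authors' conditional logistic regression fit: it assumes every stratum (one patient's 2 × 2 table of counts a, b, c, d) falls entirely before or entirely after retraining, so the two periods can be fitted separately. All function names and the toy tables are hypothetical.

```python
from math import comb, erfc, exp, log, sqrt

def _weights(beta, a, b, c, d):
    """Unnormalised noncentral hypergeometric weights for one 2x2 stratum."""
    n, m1, n2 = a + b + c + d, a + b, a + c
    us = range(max(0, m1 - (n - n2)), min(n2, m1) + 1)
    return [(u, comb(n2, u) * comb(n - n2, m1 - u) * exp(beta * u)) for u in us]

def cond_loglik(beta, tables):
    """Conditional log-likelihood, up to an additive constant free of beta."""
    return sum(beta * t[0] - log(sum(w for _, w in _weights(beta, *t)))
               for t in tables)

def cond_mle(tables, lo=-15.0, hi=15.0, tol=1e-10):
    """Bisection on the score (observed minus expected both-positive count)."""
    def score(beta):
        total = 0.0
        for t in tables:
            ws = _weights(beta, *t)
            total += t[0] - sum(u * w for u, w in ws) / sum(w for _, w in ws)
        return total
    if score(lo) <= 0:
        return lo
    if score(hi) >= 0:
        return hi
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if score(mid) > 0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def training_lr_test(before, after):
    """Return (change in log odds ratio, LR statistic, p-value on 1 df)."""
    b0, b1, bp = cond_mle(before), cond_mle(after), cond_mle(before + after)
    stat = 2.0 * (cond_loglik(b0, before) + cond_loglik(b1, after)
                  - cond_loglik(bp, before + after))
    stat = max(stat, 0.0)            # guard against tiny negative rounding error
    p_value = erfc(sqrt(stat / 2.0)) # upper tail of chi-square with 1 df
    return b1 - b0, stat, p_value

# Hypothetical strata, each with one positive and one negative rating per rater.
before = [(1, 0, 0, 1)] * 2 + [(0, 1, 1, 0)] * 2
after = [(1, 0, 0, 1)] * 3 + [(0, 1, 1, 0)]
change, lr_stat, p = training_lr_test(before, after)
print(round(change, 3), round(lr_stat, 3), round(p, 3))
```

A positive change in the log odds ratio corresponds to improved agreement after retraining, and a negative change to deterioration. The toy data are far too small for the chi-square approximation to be trusted and serve only to show the mechanics.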
Note that these more sophisticated methods were adopted to handle the dependence of assessments within observers over time. Thus the formula
$$\phi = \frac{(ad)^{1/2} - (bc)^{1/2}}{(ad)^{1/2} + (bc)^{1/2}}$$
does not apply here. It does apply to all other applications discussed in this article.
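For the simpler applications where the closed-form expression does apply, a minimal Python sketch follows. It assumes the usual 2 × 2 layout in which a and d count the radiographs on which two raters agree (both positive, both negative) and b and c count the disagreements. The function name and the example counts are hypothetical.

```python
from math import sqrt

def chance_independent_agreement(a, b, c, d):
    """Closed-form agreement from a single 2x2 table of two raters.

    a, d: agreement cells; b, c: disagreement cells (assumed layout).
    """
    root_ad = sqrt(a * d)
    root_bc = sqrt(b * c)
    if root_ad + root_bc == 0:
        raise ValueError("undefined when both a*d and b*c are zero")
    return (root_ad - root_bc) / (root_ad + root_bc)

# Two hypothetical tables with the same odds ratio (16) but different margins.
balanced = chance_independent_agreement(40, 10, 10, 40)
skewed = chance_independent_agreement(80, 20, 10, 40)
print(round(balanced, 3), round(skewed, 3))
```

Both tables give the same value because the expression depends on the counts only through the odds ratio ad/bc, which is what makes the measure insensitive to how often each rater calls a radiograph positive.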