ABSTRACT
The aim was to determine reliability of lung function measurements performed according to recommendations of the American Thoracic Society (ATS) at a screening program in a large South African gold mine and to determine the usefulness of the reliability coefficient G for monitoring the reliability of lung function measurements in a mass screening program. The reliability coefficient G estimates the amount of random error of measurement, relative to the total variation in a measurement. The coefficient G was calculated as a correlation coefficient between two consecutive lung function tests performed within 6 mo, over a period of 43 mo on 3,378 miners. There was significant temporal variability in the reliability. For FEV1, the coefficient G showed increased variability over the first 5 mo and stabilized at a value of 0.93 for the next 23 mo, after which it systematically declined over the next 15 mo. We estimated that in a large screening program, an optimal sample size of around 900 miners, examined randomly throughout the year, on a yearly basis, would provide a sufficient sample to examine monthly or quarterly fluctuation in the reliability. The value of the reliability coefficient G did not change when the time between two consecutive tests increased up to 15 mo. In conclusion, monitoring of lung function reliability in a screening program by the reliability coefficient G should improve data quality, and provide a measure on which the confidence in a decision-making process could be based when examining temporal changes in lung function for individual subjects.
Hnizdo E, Churchyard G, Barnes D, Dowdeswell R. Assessment of reliability of lung function screening programs or longitudinal studies.
INTRODUCTION
Screening for lung function impairment in subjects exposed to respiratory hazards should be able to identify those individuals whose lung function falls below predicted values and those who demonstrate accelerated loss of lung function (1). The accuracy with which subjects with "true" undue loss of lung function are identified depends on the reliability of the measurements. In longitudinal lung function testing, continuous monitoring of reliability, in addition to the quality control measures recommended by the American Thoracic Society (ATS) (2), should improve the data quality and also provide an index of reliability on which decisions can be based.
Generally, the reliability of lung function measurements reflects both systematic errors (e.g., procedural differences) and random errors of measurement (e.g., due to temporary restriction) (3, 4). The amount of variation caused by random error of measurement, i.e., an error that cannot be explained by known systematic effects, can be measured by the reliability coefficient G (5). The reliability coefficient G was previously used to correct for the effect of lung function fluctuation within individual subjects when predicting "true" loss. Application of the reliability coefficient to monitoring of the reliability of lung function measurements in a screening program has not been described previously in the literature.
Exposure to silica dust is a known risk factor for chronic obstructive pulmonary disease (COPD) (6). Thus, an effective lung function screening program in silica-exposed workers could lead to prevention of COPD. In the present study we had the following objectives: (1) to evaluate the reliability of a lung function screening program in a South African gold mine and to determine the usefulness of the data for epidemiologic research; (2) to evaluate the applicability of the reliability coefficient G for the assessment of a temporal pattern in reliability; and (3) to develop a method of monitoring reliability of lung function measurements in such a screening program.
METHODS
Study Population
Miners at a large South African gold mining company have spirometry done routinely at an initial examination, periodically every 3 yr, and on leaving the company (exit). The population of miners decreased from 71,515 in 1994 to 43,359 in 1997, either because entire mine shafts were closed for economic reasons, in which case all miners were discharged, or because mine shafts were downsized, in which case miners mostly took voluntary retirement. From May 1994, when screening started, to March 1998, data from 113,120 tests were computerized (14,267 in 1994, 28,402 in 1995, 25,288 in 1996, 32,381 in 1997, and 12,782 in 1998).
All black males who had two lung function tests within 6 mo between May 1994 and March 1998 were used to investigate objectives (1) and (2). Six months was the shortest period that provided a sufficient number of subjects to evaluate a trend in the reliability coefficient G while ensuring no aging effect. The most frequent reason for a second examination was an exit examination, or miners returning from extended leave. Miners with medical reasons for a test were excluded. In total, there were 3,513 miners who qualified. Of these, we excluded 80 (2.3%) in whom FVC or FEV1 were outside the 99% confidence intervals (CI), and 55 (1.6%) in whom the within-person difference was outside the 99% CI, leaving 3,378 miners. To study objective (3), we used the same selection criteria, except that there was no limit on the time interval between the two tests, and identified 16,249 miners.
Lung Function Measurements
Maximal forced expiratory maneuvers are recorded in a computerized database using a Hans Rudolph pneumotachograph (Flowscan; Electromedical Systems Inc., South Africa). The system software requires and validates a calibration with a 3-L syringe. Calibration is done 3 to 4 times per day. Barometric pressure and temperature are entered via the keyboard for correction of volumes to BTPS. During testing, flow versus volume tracings are displayed. A minimum of three acceptable and reproducible forced expiratory maneuvers are obtained according to the standards recommended by the ATS. The miners only perform an exhalation maneuver. All testing is done by nursing personnel with a college diploma in spirometry testing, and trained in the techniques of performing spirometry to ATS standards. Height is measured to the nearest centimeter in stocking feet. Data recorded for each test includes the date of test, date of birth, height, weight, the highest FVC, the highest FEV1, and forced expiratory flow at 25 to 75% of forced vital capacity (FEF25-75%).
Statistical Methods
Reliability coefficient G: background. The lung function tests (FVC, FEV1, FEF25-75%, and the ratio of FEV1/FVC [FEV1%]) are continuous normally distributed variables with the mean, µ, and the variance, $\sigma^2$. It is well recognized that lung function tests are prone to measurement errors (3, 4). The errors can be broadly categorized as systematic errors of measurement and random errors of measurement. Theoretically, a systematic error could be removed from the data, provided we have information on its origin (e.g., procedural changes, a technician effect, seasonal variability). A systematic error changes mainly the mean, µ, i.e., it shifts the distribution. Generally, by a random error of measurement we understand not only the random error in the measurement procedure itself but also, and more importantly, random fluctuation in the measured quantity that reflects the variability in lung function within an individual subject. This fluctuation can be due to factors such as a subject's fatigue, bronchoconstriction, diurnal or seasonal variation, and acute response to allergens. By definition, the random error of measurement does not change the mean, but can change the size of the variance, $\sigma^2$. When testing the reliability of a lung function measurement, we estimate the size of the random error of measurement, relative to the total variation in the measurement across subjects, i.e., we compare the amount of the within-person variability relative to the between-person variability (5). The statistic that measures the relative size of the random error of measurement is the reliability coefficient G (5, 9, 10). The Appendix provides statistical details on the coefficient G.
The Appendix shows that the simplest method of estimating the reliability coefficient G is to repeat lung function tests for a set of subjects in a few weeks or months and to calculate the correlation coefficient $\rho_{M_aM_b}$ between the first and second measurements. The time interval T between the two tests should be long enough to include all the potential short-term effects involved in the random measurement error, but short enough to avoid systematic changes, e.g., due to age.
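The test-retest estimate described above can be illustrated with a small simulation. The following is only a sketch under stated assumptions: the data are simulated rather than taken from the study, the helper names (`pearson_r`, `simulate_paired_tests`) are hypothetical, and the variances are chosen arbitrarily so that the theoretical ratio of true-value variance to total variance is 0.93.

```python
import math
import random


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def simulate_paired_tests(n, sd_true, sd_error, mean=3.5, seed=0):
    """Simulate two tests per subject: true value plus independent random error."""
    rng = random.Random(seed)
    first, second = [], []
    for _ in range(n):
        true_value = rng.gauss(mean, sd_true)
        first.append(true_value + rng.gauss(0.0, sd_error))
        second.append(true_value + rng.gauss(0.0, sd_error))
    return first, second


# Hypothetical variances (assumption): 0.93 for true values, 0.07 for random error.
sd_true = math.sqrt(0.93)
sd_error = math.sqrt(0.07)

first, second = simulate_paired_tests(20000, sd_true, sd_error)
g_hat = pearson_r(first, second)                      # test-retest correlation
g_theory = sd_true**2 / (sd_true**2 + sd_error**2)    # variance ratio
print(f"estimated G = {g_hat:.3f}, theoretical G = {g_theory:.3f}")
```

With a large simulated sample, the correlation between the first and second tests should lie close to the variance ratio, which is the property that allows G to be estimated from repeated tests alone.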
Temporal changes in reliability. To determine temporal changes, we calculated the overall average within-person difference in lung function and the coefficient of reliability G. Then, we examined the temporal changes in the average monthly within-person differences and in the reliability coefficient G. We used the analysis of covariance (SAS PROC GLM) to adjust for the effect of age, the month of testing, and the time interval T between the two tests, on the monthly within-person difference in FEV1 (second test - first test) (11). Next, we identified a reliable period of testing and recalculated the reliability statistics for the reliable period only. To demonstrate usefulness of the data for epidemiologic purposes, we examined the loss of lung function within age strata for the reliable period.
Method of monitoring of reliability in a large screening program. We first examined the relationship between the reliability coefficient G and the time interval between two tests T, to establish whether the reliability coefficient G changed with time T. We used the model (5)

$$\rho_{t,t+T} = G\,e^{-\beta T} \tag{1}$$

where the correlation coefficient $\rho_{t,t+T}$, calculated for increasing time interval T, is related to T. Then G and $\beta$ are estimated from a simple linear regression obtained by taking the natural logarithm of Equation 1, i.e., $\log(\rho_{t,t+T}) = g - \beta T$. The estimated value of G, G = exp(g), provides the best estimate of G at time T = 0. The estimated slope $-\beta$ provides information on the change in $\rho_{t,t+T}$ with increasing time T.
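The log-linear fit can be sketched in a few lines. This is a minimal illustration, assuming the model takes the exponential-decay form log(rho) = g - beta * T (the sign convention is an assumption here); the function name `fit_log_linear` and the input correlations are hypothetical and are generated from chosen parameters, not taken from the study.

```python
import math


def fit_log_linear(intervals, correlations):
    """Least-squares fit of log(rho) = g - beta * T.

    Returns (G at T = 0, beta), where G = exp(g).
    """
    ys = [math.log(r) for r in correlations]
    n = len(intervals)
    mean_t = sum(intervals) / n
    mean_y = sum(ys) / n
    stt = sum((t - mean_t) ** 2 for t in intervals)
    sty = sum((t - mean_t) * (y - mean_y) for t, y in zip(intervals, ys))
    slope = sty / stt                      # slope of log(rho) against T
    intercept = mean_y - slope * mean_t    # g, the log of G at T = 0
    return math.exp(intercept), -slope


# Hypothetical correlations generated exactly from G = 0.93 and beta = 0.001.
months = list(range(1, 16))
rhos = [0.93 * math.exp(-0.001 * t) for t in months]

g0, beta = fit_log_linear(months, rhos)
print(f"G at T = 0: {g0:.3f}, beta: {beta:.4f}")
```

In practice each correlation would be computed from the subjects whose two tests were T months apart, so the points would scatter around the fitted line rather than lie on it.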
Next, we estimated the optimal sample size required for regular yearly testing, so that monthly monitoring of the reliability coefficient G could be done. We drew the required sample size by random sampling of subjects from a period of 1 yr; calculated the monthly reliability coefficients $G_i$, i = 1 to 12; and plotted the average reliability coefficient $G = \sum_{i=1}^{12} G_i/12$ and the coefficient of variation CV (calculated from $G_i$, i = 1 to 12), against the changing sample size. The sample size at which the CV became constant is considered optimal.
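The sample size procedure can be sketched as follows. This is an illustration on simulated data only: the standard deviations, the even allocation of subjects to months, and the helper names are assumptions, and the study itself drew random samples from the recorded screening data rather than simulating them.

```python
import math
import random
import statistics


def pearson_r(pairs):
    """Pearson correlation between the first and second element of each pair."""
    xs = [a for a, _ in pairs]
    ys = [b for _, b in pairs]
    mean_x, mean_y = statistics.fmean(xs), statistics.fmean(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def simulate_year(n_subjects, rng, sd_true=0.62, sd_error=0.17):
    """Spread n_subjects evenly over 12 months; each gets two simulated tests.

    The standard deviations are hypothetical and give a true G of about 0.93.
    """
    months = [[] for _ in range(12)]
    for i in range(n_subjects):
        true_value = rng.gauss(3.5, sd_true)
        first = true_value + rng.gauss(0.0, sd_error)
        second = true_value + rng.gauss(0.0, sd_error)
        months[i % 12].append((first, second))
    return months


def mean_and_cv(monthly_g):
    """Average of the monthly coefficients and their coefficient of variation."""
    mean_g = statistics.fmean(monthly_g)
    return mean_g, statistics.stdev(monthly_g) / mean_g


rng = random.Random(1)
summary = {}
for n_per_year in (120, 300, 600, 900, 1200):
    monthly_g = [pearson_r(pairs) for pairs in simulate_year(n_per_year, rng)]
    summary[n_per_year] = mean_and_cv(monthly_g)
    mean_g, cv = summary[n_per_year]
    print(f"n = {n_per_year:5d}  mean G = {mean_g:.3f}  CV = {cv:.4f}")
```

Plotting the mean and the CV against the yearly sample size gives the kind of curve from which a plateau in the CV can be read off. In this idealized simulation the CV keeps shrinking slowly with sample size, so the point at which it is judged constant remains a matter of inspection.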
RESULTS
Table 1 (A) shows the characteristics at the first and second lung function tests, for the 3,378 miners who had two tests within 6 mo. The average age at the first test was 41.8 yr (SD, 10.1) and the average period between the two tests was 3.73 mo. The average within-person differences for the lung function tests were negative and statistically significant. The reliability coefficient G was highest for FEV1.
Figure 1a shows the average difference between second and first FEV1, according to the month of the first test, adjusted for age and the interval T. There is a period of larger variability up to September 1994, a period of decreased variability up to August 1996, and a period of large negative changes from September 1996 to June 1997, followed by large positive changes. Figure 1b shows that the reliability coefficient G for FEV1 also declines from September 1996.
To obtain the "best" estimate of the random error of measurement for the individual lung function, we selected the period when the screening program was most reliable, i.e., from October 1994 to August 1996. Subjects whose first or second tests were done outside the "reliable" period were excluded. Table 1 (B) shows the improvement in the reliability statistics for the 1,001 subjects who had both tests done within the reliable period. To demonstrate the usefulness of the data for epidemiologic purposes, we present the reliability statistics by age categories for the reliable period only (Table 2).
Figure 2 shows fitted regression curves for the relation between the reliability coefficient G and time T between two tests, for all miners who had two tests regardless of the time interval. Figure 2a includes data on 2,802 miners who had two tests within the reliable period (October 1994 to August 1996). The maximum time interval T was 22 mo, but the number of subjects with tests more than 15 mo apart was small, and the coefficient G became unreliable after T = 15. The value of G was consistently above 0.90 up to T = 15, the estimated value of G at T = 0 was 0.93 (95% CI, 0.91-0.99), and the value of the slope was -0.0010 (p = 0.20). When the regression curve was fitted up to 22 mo, the estimated value of G at T = 0 was 0.95 (95% CI, 0.92-0.99). Figure 2b includes the whole screening period for 16,249 subjects. The maximum time interval T was 36 mo. The reliability coefficient G declined steeply within 15 mo; the estimated value of G at T = 0 was 0.87 (95% CI, 0.86-0.89), and the value of the slope was -0.0014 (p = 0.002). The number of subjects was large for all points (ranging from 206 to 790).
Figure 3 shows the relationship between the average reliability coefficient G (calculated from the monthly coefficients), the CV for the monthly reliability coefficients G, and the sample size required per year to monitor the lung function reliability on a monthly basis. The optimal sample size is approximately 900 subjects.
DISCUSSION
The reliability of lung function tests was evaluated in a screening program where the testing was designed and intended to be done according to ATS recommendations. A temporal trend
was established for within-person difference in FEV1 and for
the coefficient of reliability G calculated on 3,378 miners who
had two consecutive tests within 6 mo, over a period of 43 mo.
Figure 1 shows that the reliability stabilized after 5 mo of testing, remained consistently "reliable" over a period of 23 mo
(G = 0.93) after which it systematically declined. The negative
difference in FEV1 (2nd - 1st test) after September 1996 is a
result of much lower second tests. When subjects with any
measurements done outside of the reliable period were excluded, the reliability coefficient increased and became more consistent over time. During the initial period of 4 mo (May to September 1994), the variability may have been higher because of a learning process. The decrease in reliability from
September 1996 may have been caused by increased layoffs in
the mine and increased workload for the lung function technicians. In a retrospective evaluation of the screening program
practices, after the reliability analysis was completed, it was
also found that there were lapses in the auditing of the calibration log to ensure that calibration is done accurately.
With regard to the usefulness of the data for epidemiologic purposes, only the data from the "reliable" period (average reliability coefficient for FEV1 G = 0.93) appear to be useful. The within-person differences and the coefficient of reliability G improved substantially during the reliable period (see Table 1). The data also show consistency when stratified by age (see Table 2). The within-person differences show a decline in FEV1 from 35 to 44 yr of age. Although the within-person differences were not significantly different from zero, the pattern is consistent with a published longitudinal study in which the onset of decline in FEV1 for males was from 36 yr of age (12). In contrast, cross-sectional studies report the decline in FEV1 per year to be constant at 20 to 30 ml/yr (13). The cross-sectional means for FEV1 (first measurement means) in Table 2 also show a decline starting from 25 yr of age. The cross-sectional decline in younger ages in our study could be the result of a strong cohort effect, as height declined systematically with age and the young miners were much taller than the 50-yr-olds. Thus, the availability of the reliability coefficient G could provide a measure of usefulness of longitudinal studies, or screening program data for epidemiologic research.
How reliable should a lung function screening program be, so that it is able to identify accelerated loss of airflow in specific groups of subjects, for example, in smokers? Even if the coefficient of reliability G is approximately 0.93, a large sample size is required for small losses to reach statistical significance. For example, the observed change in FEV1 for the age category 35 to 44 yr of -22.5 ml per 3.78 mo (see Table 2) is much higher than expected. However, the minimal sample size required for this difference to be statistically significant is approximately 463 subjects. The literature suggests that at least 4 yr of follow-up are required to detect the effect of smoking in a group of subjects in a longitudinal study (16). Whether this is so depends on the data reliability. The higher the reliability coefficient G, the more likely it is that undue loss of lung function caused by disease, smoking, or occupational exposure can be detected. The reliability coefficient G provides a measure of confidence that can be assigned to a change in lung function observed in groups of subjects or in individual subjects. For example, if either of the two measurements is from a period with low reliability, then the confidence in the observed change in lung function is lower than if both tests were done during a reliable period.
The results demonstrate that monitoring of data reliability in screening programs, or longitudinal studies, could help to identify lapses in the reliability at an early stage, and provide a measure of confidence in the data. In a large screening program, as in our study, when the lung function testing is not done on a yearly basis, a "small" dynamic cohort of subjects tested on a yearly basis could provide a basis of a reliability program. How regularly should the subjects be tested? According to the data from the "reliable" period, the reliability coefficient G for FEV1 declined little up to the time interval T = 15 mo and the best estimate of G at T = 0 was 0.93 (or 0.95 when the model was fitted to T = 22) (Figure 2a). Thus, in a reliable program, the subjects could be tested every year and this should not have an effect on the reliability coefficient G. However, if the program is not reliable, then the reliability coefficient G would decline rapidly with time T (Figure 2b).
How large should the dynamic cohort be? According to Figure 3, the optimal sample size required to monitor reliability on a monthly basis using the reliability coefficient G is approximately 900 subjects. (At that sample size the variability [CV] in the monthly reliability coefficient G stabilizes.) For quarterly monitoring, the sample would be smaller. If the subjects have lung function tests done yearly, and the testing is evenly distributed throughout the year, then a trend in the monthly or quarterly reliability coefficients G can be monitored. For the first year of the program, the second tests could be done after 3 mo to get early feedback from the data, and to obtain good baseline data on each subject. The reliability coefficient is simply calculated as the correlation coefficient between the first and second tests, across all the subjects who had two tests within a year. The reliability can also be evaluated for individual technicians and spirometers.
A limitation of the present study is that the random error of measurement included systematic effects (e.g., technician, instruments) that, because of lack of recorded data on these effects, could not be excluded. Despite this, the "best" estimate of the random error of measurement, i.e., within-subject variation, was estimated as 5 to 7% of the total variation in the FEV1 for the reliable period. Another major limitation of the program is lack of records on the acceptability and reproducibility of each lung function. However, a longitudinal study of white South African gold miners who had a 1-yr interval between two lung function tests (10), and who were tested in one main lung function laboratory, reported similar values of the reliability coefficient G for FVC, FEV1, FEF25-75%, and FEV1% of 0.899, 0.929, 0.836, and 0.786, respectively.
In conclusion, the study shows that despite the fact that the lung function testing was designed and intended to be done according to the standards recommended by the ATS, there were lapses in reliability over time. In response to the reliability analysis, a retrospective evaluation of the program identified various limitations. Thus, continuous monitoring of data reliability should help to maintain good data quality. The coefficient of reliability G appears to be a simple tool for monitoring the data reliability that can also provide a measure of confidence on which the assessment of changes in lung function in groups of subjects or individual subjects can be based. The study also demonstrates that it is possible to have a reliable screening program that generates data for epidemiologic research, and that the availability of a reliability coefficient G provides a measure of usefulness of the data for epidemiologic research.
Footnotes
Correspondence and requests for reprints should be addressed to Eva Hnizdo, National Center for Occupational Safety and Health, 1095 Willowdale Road, MS PB 163, Morgantown, WV 26505. E-mail: EXH6@cdc.gov
(Received in original form February 5, 1999 and in revised form June 23, 1999).
Acknowledgments: The authors thank the Anglogold Health Services from the Freegold mines in Welcome, South Africa, for allowing us to use lung function data, Ms. Tanusha Singh who helped with computer programming, and Dr. Jill Murray from NCOH for her valuable comments.
Supported by the Safety in Mining Research Advisory Committee.
References
1. American Thoracic Society. 1982. Surveillance for respiratory hazards in the occupational setting. Am. Rev. Respir. Dis. 126: 952-956.
2. American Thoracic Society. 1995. Standardization of spirometry: 1994 update. Am. J. Respir. Crit. Care Med. 152: 1107-1136.
3. American Thoracic Society. 1991. Lung function testing: selection of reference values and interpretative strategies. Am. Rev. Respir. Dis. 144: 1202-1218.
4. Becklake, R. 1986. Concepts of normality applied to the measurement of lung function. Am. J. Med. 80: 1158-1164.
5. Shepard, D. S. 1981. Reliability of blood pressure measurements: implications for designing and evaluating programs to control hypertension. J. Chron. Dis. 34: 191-209.
6. Wiles, F. J., and M. H. Faure. 1977. Chronic obstructive lung disease in gold miners. In W. H. Walton, editor. Inhaled Particles IV, Part 2. Pergamon Press, Oxford. 727-735.
7. Cowie, R. L., and S. K. Mabena. 1991. Silicosis, chronic airflow limitation, and chronic bronchitis in South African gold miners. Am. Rev. Respir. Dis. 143: 80-84.
8. Hnizdo, E. 1990. Combined effect of silica dust and tobacco smoking on mortality from chronic obstructive lung disease in gold miners. Br. J. Ind. Med. 47: 656-664.
9. Gardner, M. J., and J. A. Heady. 1973. Some effects of within-person variability in epidemiological studies. J. Chron. Dis. 26: 781-795.
10. Irwig, L., H. Groeneveld, and M. Becklake. 1988. Relationship of lung function loss to level of initial function: correcting for measurement error using the reliability coefficient. J. Epidemiol. Community Health 42: 383-389.
11. Snedecor, G. W., and W. G. Cochran. 1967. Statistical Methods. Iowa State University Press, Ames, IA.
12. Burrows, B., M. D. Lebowitz, A. E. Camilli, and R. J. Knudson. 1986. Longitudinal changes in forced expiratory volume in one second in adults. Am. Rev. Respir. Dis. 133: 974-980.
13. Crapo, R. O., A. H. Morris, and R. M. Gardner. 1981. Reference spirometric values using techniques and equipment that meet ATS recommendations. Am. Rev. Respir. Dis. 123: 659-664.
14. Knudson, R. J., M. D. Lebowitz, C. J. Holberg, and B. Burrows. 1983. Changes in the normal maximal expiratory flow-volume curve with growth and aging. Am. Rev. Respir. Dis. 127: 725-734.
15. Quanjer, P. H., editor. 1983. Standardized lung function testing: report of the working party. Bull. Eur. Physiopathol. Respir. 19(Suppl. 5): 1-95.
16. Dales, R. E., J. A. Hanley, P. Ernst, and M. R. Becklake. 1987. Computer modelling of measurement error in longitudinal lung function data. J. Chron. Dis. 40: 769-773.
APPENDIX
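To describe the statistical theory for the reliability coefficient (4, 8, 9), let us assume that for an individual subject there is a "true" value of a lung function, L. This true value L is observed with a random error $\varepsilon$, resulting in measurement M, where $M = L + \varepsilon$. The observed value M is distributed normally with a mean µ and variance $\sigma^2_M$. Assuming that L and $\varepsilon$ are normally distributed, then the variance of M is $\sigma^2_M = \sigma^2_L + \sigma^2_\varepsilon$. The ratio of the true value variance $\sigma^2_L$ to that of the variance $\sigma^2_M$ of the observed value is referred to as the reliability coefficient G and can be expressed as

$$G = \frac{\sigma^2_L}{\sigma^2_M} = \frac{\sigma^2_L}{\sigma^2_L + \sigma^2_\varepsilon}$$

The variance $\sigma^2_\varepsilon$, required for calculation of G, can be estimated from repeated independent measurements ($M_a$ and $M_b$) of the same true value L on the same subject over a period of time T (weeks or months). It can be shown that the within-person variance of the difference of the two measurements ($\sigma^2_{M_a - M_b}$) is twice the variance of the random error of measurement, $2\sigma^2_\varepsilon$. This follows (10) because

$$\sigma^2_{M_a - M_b} = \sigma^2_{M_a} + \sigma^2_{M_b} - 2\sigma_{M_aM_b} \tag{2}$$

Further, we can assume that $\sigma^2_{M_a} = \sigma^2_{M_b} = \sigma^2_M$ and that $\varepsilon$ is independent of L. Then the covariance term is the same whether derived from M or L, i.e.:

$$\sigma_{M_aM_b} = \sigma_{LL} = \sigma^2_L \tag{3}$$

$$\sigma^2_{M_a - M_b} = 2\sigma^2_M - 2\sigma^2_L = 2\sigma^2_\varepsilon \tag{4}$$

Finally, using Equation 3 and the fact that $\sigma^2_M = \sqrt{\sigma^2_{M_a}} \cdot \sqrt{\sigma^2_{M_b}}$, one gets for the reliability coefficient G

$$G = \frac{\sigma^2_L}{\sigma^2_M} = \frac{\sigma_{M_aM_b}}{\sqrt{\sigma^2_{M_a}} \cdot \sqrt{\sigma^2_{M_b}}} = \rho_{M_aM_b} \tag{5}$$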
The above shows that the simplest method of estimating the reliability coefficient G is from a reexamination of lung function in a series of cases over a few weeks or months and calculation of the correlation coefficient $\rho_{M_aM_b}$. The 95% confidence interval (CI) for the observed correlation coefficient r is estimated by $\mathrm{CI}(\rho) = r \pm Z \cdot (1/\sqrt{n})$.
The value of G and the size of the random error of measurement can also be calculated directly from Equations 2 and 5 using variances from Table 1. For example, if we substitute the variances for FEV1 in Table 1 into Equation 2, then $\sigma_{M_aM_b}$ = 1/2 (0.4223 + 0.4318 - 0.0938) = 0.3802. Then from Equation 5, G = 0.3802/($\sqrt{0.4223} \cdot \sqrt{0.4318}$) = 0.8903. The variance of the observed FEV1 is defined as $\sigma^2_M = \sigma^2_L + \sigma^2_\varepsilon$, where the variance of the true values L, $\sigma^2_L$ = 0.3803, and the variance of the random error of measurement $\sigma^2_\varepsilon$ = 1/2 (0.0938) = 0.0469.
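The arithmetic of this worked example can be reproduced with a short script. It is a sketch only: the three variances are the FEV1 values quoted in the text, the variable names are hypothetical, and small differences in the fourth decimal place can arise depending on whether the covariance is rounded before it is divided.

```python
import math

# FEV1 variances quoted in the worked example.
var_first = 0.4223    # variance of the first measurement
var_second = 0.4318   # variance of the second measurement
var_diff = 0.0938     # variance of the within-person difference

# Rearranged Equation 2: covariance between the first and second measurements.
cov_ab = 0.5 * (var_first + var_second - var_diff)

# Equation 5: reliability coefficient as covariance over the product of SDs.
g = cov_ab / (math.sqrt(var_first) * math.sqrt(var_second))

# Variance of the random error of measurement: half the difference variance.
var_error = 0.5 * var_diff

print(f"covariance = {cov_ab:.4f}")
print(f"G = {g:.4f}")
print(f"error variance = {var_error:.4f}")
```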