ABSTRACT
Daytime sleepiness is a common consequence of repeated arousal in obstructive sleep apnea (OSA). Arousal indices are sometimes used to make decisions on treatment, but there is no evidence that arousals are detected similarly even by experienced observers. Using the American Sleep Disorders Association (ASDA) definition of arousal in terms of the accompanying electroencephalogram (EEG) changes, we have quantified interobserver agreement for arousal scoring and identified factors affecting it. Ten patients with suspected OSA were studied; three representative EEG events during each of light, slow-wave, and rapid-eye-movement (REM) sleep were extracted from each record (90 events total) and evaluated by experts in 14 sleep laboratories. Observers differed (ANOVA, p < 0.001) in the number of events scored as arousal (totals ranged from 23 to 53 of the 90 events). Overall agreement was moderate (κ = 0.47), but it was best for events during slow-wave sleep, moderate for REM, and poor for light sleep (κ = 0.60, 0.52, and 0.28, respectively). Agreement was unrelated to arousal duration. We conclude that the ASDA definition of arousal is only moderately repeatable. Account should be taken of this variability when results from different centers are compared.
INTRODUCTION
Obstructive sleep apnea (OSA) and related conditions have been implicated in road accidents and cardiovascular disease (1). The severity of daytime sleepiness in patients with these conditions has been shown to correlate with the frequency of short transient arousals from sleep (2). Therefore, the arousal index (average number of arousals per hour of sleep) is widely quoted in the literature as an indicator of the severity of sleep disorder and may be used to make decisions on treatment. Unfortunately, the standard epoch-based sleep assessment criteria of Rechtschaffen and Kales (3) have insufficient resolution for reliable documentation of these transient disturbances to sleep. In the absence of a definitive standard, many workers have devised their own definition of arousal, usually in terms of the accompanying electroencephalogram (EEG) changes. For example, Bonnet (4) defined arousals as "appearance of alpha, a sleep stage change or EEG speeding." Guilleminault and coworkers (5) introduced a duration criterion with "alpha bursts in the central EEG derivation lasting 3 s or longer," and Pitson and colleagues (6) added a submental electromyogram (EMG) criterion and defined three levels of arousal with "increases in high-frequency EEG and/or EMG of (1) shorter than 3 s, (2) 3 to 10 s, or (3) longer than 10 s."
No information is available on the repeatability of such definitions, and limited data on their relationships with each other (7) indicate that agreement is poor. Consequently, the American Sleep Disorders Association (ASDA) recently extended the criteria of Rechtschaffen and Kales (3) to a proposed standard that would "specifically and reliably identify the occurrence of transient arousals." ASDA (8) defined arousal as "an abrupt shift in EEG frequency, which may include theta, alpha and/or frequencies greater than 16 Hz but not spindles." There are a number of qualifying rules, the most important of which are: (1) 10 s of continuous sleep must precede the arousal, (2) arousals must be at least 3 s in duration, and (3) arousals during rapid-eye-movement sleep must be accompanied by an increase in submental EMG. This last criterion is not applied during light or slow-wave sleep.
The ASDA criteria have been adopted de facto as a standard for the assessment of arousal. Significant interobserver variation has been reported when sleep architecture is evaluated using the traditional criteria (9), but we are unaware of any data related to repeatability using this new definition of arousal.
The aim of the present study was to quantify the agreement between expert observers in the detection of arousal as defined by ASDA, and when possible, to identify the causes of disagreement.
METHODS
Patients and Sleep Investigations
The study was based on polysomnographic records obtained in 10 subjects (eight male, two female) presenting to a sleep clinic with complaints of excessive snoring and/or daytime sleepiness. Each underwent 6 h of polysomnography; the following signals were recorded continuously on digital tape (EDR128; Earth Data, Southampton, UK) and thermal chart recorder (TA4000; Gould Ltd, Ilford, UK) at 5 mm/s: two channels of electroencephalogram (EEG, derivations CZ/A1 and OZ/A1), two channels of electrooculogram (EOG, derivations LOC/A1 and ROC/A1), submental electromyogram (EMG), and airflow by nasoral thermistor. Each 6-h sleep record was assessed for sleep stage according to the criteria of Rechtschaffen and Kales (3), and for apneas and hypopneas (defined for this study as a 50% or greater reduction in airflow that lasted for 10 s or longer). The number of apneas and hypopneas per hour of sleep (apnea/hypopnea index) was determined for each patient; summary data are given in Table 1.
Classification and Selection of EEG Events
We wished to assess observer agreement in the classification of EEG "events" that were both likely and unlikely to reflect arousal. Therefore, each sleep record was reviewed to identify such events, defined for the purpose of this study as a transient change in EEG, with or without changes in EOG or EMG, and with a duration as long as 20 s. The arousal review was conducted by one of us (M.J.D.) with 5 yr of experience in reviewing polysomnographic records by ASDA criteria, and it is summarized in Table 2. For each event, the sleep stage for the preceding 10 s of sleep was determined according to the criteria of Rechtschaffen and Kales (3), and classified as light (Stages I/II), slow-wave (Stages III/IV), or rapid-eye-movement (REM) sleep. In order to ensure a wide range of events, they were chosen so that in each subject and in each of the three sleep stages there was:
one event that (by our judgment) clearly met the ASDA definition of arousal,
one event that (by our judgment) clearly did not meet the ASDA definition,
and one event that we could not easily categorize by the ASDA definition, for example, because the EEG change was indistinct or its duration was uncertain.
In all, 90 events were studied (10 patients × three stages × three types of event).
Creation of Chart Records
For each of the 90 events in turn, 40 s of physiological signals, including the entire event in question, was recorded on a page of chart paper. The 40-s period was adjusted so that the event was situated randomly in the last 30 s of the record. All chart records, an example of which is shown in Figure 1, were in exact accordance with the recommendations of ASDA, with derivations of left and right EOG (LOC/A1 and ROC/A1), submental EMG, and central and occipital EEG (CZ/A1 and OZ/A1).
Distribution of Chart Records
The 90 events were ordered using a five block random design; each block had 18 randomly ordered events with six each of light, slow-wave, and REM sleep stage. Because the ASDA criteria make some reference to sleep stage, and in order to avoid known variability in the assessment of sleep stage influencing the results, this (light, slow-wave, or REM) was indicated on the accompanying assessment sheet. A copy of all 90 records, the assessment sheet, and the ASDA definition of arousal (which includes 18 example events) were sent to sleep investigators at 20 different sleep laboratories in Europe.
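The block design described above can be sketched in a few lines. The paper does not say how the ordering was generated, so the function name `block_order`, the fixed seed, and the event labels below are assumptions made purely for illustration; the sketch only reproduces the stated constraints (five blocks of 18 events, six per sleep stage, random order within each block).

```python
import random
from collections import Counter

STAGES = ("light", "slow-wave", "REM")
KINDS = ("clear arousal", "clear non-arousal", "uncertain")


def block_order(events, n_blocks=5, per_stage=6, seed=0):
    """Split events into blocks holding `per_stage` events of each stage.

    `events` is a list of (patient, stage, kind) tuples. The seed is a
    hypothetical choice so that the example is repeatable.
    """
    rng = random.Random(seed)
    # Shuffle the events of each stage separately.
    pools = {s: [e for e in events if e[1] == s] for s in STAGES}
    for pool in pools.values():
        rng.shuffle(pool)
    blocks = []
    for b in range(n_blocks):
        block = []
        for s in STAGES:
            block.extend(pools[s][b * per_stage:(b + 1) * per_stage])
        rng.shuffle(block)  # random order within the block
        blocks.append(block)
    return blocks


# 10 patients x 3 stages x 3 kinds of event = 90 events
events = [(p, s, k) for p in range(1, 11) for s in STAGES for k in KINDS]
blocks = block_order(events)
```

Each element of `blocks` is then one run of 18 chart records with the sleep stages balanced.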
Observer's Assessment Protocol
Statistical Analysis
Summary of event classifications. Replies were received from 14 centers. All 14 observers classified all 90 events, so that there were 90 × 14 = 1,260 Classification values (Yes = 1 or No = 0), indicating how an observer had rated a specific event. Interobserver differences in Classification were investigated using one-way analysis of variance (ANOVA), with Observer (1 to 14) as the explanatory variable, assuming significance at p < 0.01.
For each individual event, pairwise agreement (the number of agreeing pairs of observers, as a proportion of the 91 possible pairs) was calculated. It could be interpreted as the probability that two randomly chosen observers would agree about the classification. We defined a consensus of 14 or 13 observers (100%, 86% pairwise agreement) as good agreement, 12 observers (74% pairwise) as moderate agreement, and 11 or fewer observers (≤ 64% pairwise) as poor agreement. As indicated below, these definitions were consistent with our use of good, moderate, and poor overall agreement with the Kappa statistic.
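The percentages quoted for each consensus size follow directly from counting agreeing pairs among 14 observers. A minimal sketch, assuming binary Yes/No ratings; the function names are hypothetical and are not taken from the paper:

```python
from math import comb


def pairwise_agreement(n_yes, n_observers=14):
    """Proportion of observer pairs giving the same classification."""
    n_no = n_observers - n_yes
    agreeing_pairs = comb(n_yes, 2) + comb(n_no, 2)
    return agreeing_pairs / comb(n_observers, 2)  # 91 pairs for 14 observers


def agreement_category(n_yes, n_observers=14):
    """Good / moderate / poor, using the consensus sizes defined in the text.

    The thresholds (13 and 12) are those stated for 14 observers.
    """
    consensus = max(n_yes, n_observers - n_yes)
    if consensus >= 13:
        return "good"
    if consensus == 12:
        return "moderate"
    return "poor"
```

For example, a 12-to-2 split gives 66 + 1 = 67 agreeing pairs out of 91, which is the 74% quoted for moderate agreement.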
Quantification of overall interobserver agreement. The Kappa statistic (10), which is not appropriate for individual events, was used to quantify overall agreement. Kappa (κ) is calculated from mean pairwise agreement, normalized for the expected overall agreement caused by chance. By definition, κ = 1.0 for complete agreement, and κ = 0.0 for agreement no better than chance alone. The definitions of good agreement (κ ≥ 0.6), moderate agreement (0.6 > κ ≥ 0.4), and poor agreement (κ < 0.4) were proposed by Fleiss (10) in line with the common understanding of these terms. For this study, the values of κ = 0.4 and κ = 0.6 corresponded to mean pairwise agreements of 70 and 80%, respectively.
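The relation between κ and mean pairwise agreement can be illustrated with the usual multi-rater form of the statistic, in which chance agreement is the sum of squared category proportions. This is a sketch under that assumption and may differ in detail from the authors' calculation; the function names are hypothetical, and the Yes proportion is the overall figure reported in the Results (542 of 1,260 classifications).

```python
def chance_agreement(p_yes):
    """Expected pairwise agreement if ratings were assigned at random
    with the observed overall proportion of Yes responses."""
    return p_yes ** 2 + (1 - p_yes) ** 2


def kappa(mean_pairwise, p_yes):
    """Kappa from mean pairwise agreement, normalized for chance."""
    pe = chance_agreement(p_yes)
    return (mean_pairwise - pe) / (1 - pe)


def pairwise_for_kappa(k, p_yes):
    """Inverse relation: mean pairwise agreement implied by a kappa value."""
    pe = chance_agreement(p_yes)
    return pe + k * (1 - pe)


p_yes = 542 / 1260  # overall proportion of Yes classifications

# Mean pairwise agreement implied by the category boundaries
boundary_moderate = pairwise_for_kappa(0.4, p_yes)
boundary_good = pairwise_for_kappa(0.6, p_yes)
```

Under these assumptions the two boundaries come out at roughly 0.71 and 0.80, close to the rounded 70 and 80% quoted in the text.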
Relation of agreement to patient, sleep stage, and arousal duration. Patient and sleep stage effects were investigated by two-way ANOVA, with individual values of pairwise agreement as the outcome variable. Each event was categorized by patient number (1 to 10) and sleep stage (light, slow-wave, or REM), which were the explanatory variables.
ASDA specified the 3-s minimum duration of arousals because "identification and agreement on events of shorter duration are difficult to achieve." However, the criterion may conceivably introduce uncertainty about the classification of events close to 3 s in duration, and we anticipated a reduced agreement for these events. Where one or more observers had marked a particular event as arousal, its mean duration was calculated. For these events, we investigated the relation of pairwise agreement with mean duration using Spearman's rank correlation coefficient (rs).
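For readers who want to reproduce the duration analysis on their own data, Spearman's coefficient is the Pearson correlation of the ranked values, with tied values sharing their average rank. The sketch below uses only the standard library; the function names are hypothetical, and nothing here reproduces the study data.

```python
def average_ranks(values):
    """Rank values from 1..n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to the end of the run of tied values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman_rs(x, y):
    """Spearman's rank correlation: Pearson correlation of the ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    return cov / (vx * vy) ** 0.5
```

Here `x` would hold the mean duration of each event and `y` the corresponding pairwise agreement.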
RESULTS
Summary of Event Classifications
Of the 1,260 event classifications, overall 542 (43%) were recorded as "Yes" and 718 (57%) as "No". The observers were certain about 957 of the classifications (76%) and uncertain (rated the classification as only probable) about the remaining 303 (24%). It was noteworthy that in 55 of the 90 events, there were observers who were certain that the event was an arousal, and others who were certain that it was not an arousal.
The observers differed significantly in the total number of events scored "Yes" (ANOVA, p < 0.001), with the number scored "Yes" ranging from 23 to 53 of the total of 90 (Figure 2).
Individual observers' responses for each of the 90 events are also shown in Figure 2. Agreement was good (14 or 13 observers agree) for 40 events (44%), moderate (12 observers agree) for 11 events (12%) and poor for the remaining 39 events (43%).
Quantification of Overall Interobserver Agreement
There was moderate agreement overall for the 90 events (κ = 0.47). For the 60 events we had classified as arousal or no arousal before distribution, agreement was better but still categorized as moderate (κ = 0.57). For the 30 events that we could not easily categorize, agreement was poor (κ = 0.27).
Relation of Agreement to Patient, Sleep Stage, and Arousal Duration
The level of agreement showed no significant difference between patients (ANOVA, p = 0.24), but it did differ with sleep stage (ANOVA, p = 0.007). Agreement by sleep stage is shown in Figure 3; corresponding values of κ = 0.28 (poor agreement), κ = 0.60 (good agreement), and κ = 0.52 (moderate agreement) were calculated for events during light, slow-wave, and REM sleep, respectively.
Two observers did not mark arousal duration on any records. A value of mean duration was available for 72 events; the mean ± SD of mean duration was 7.2 ± 3.0 s, giving an indication of the overall spread of events assessed by the experts. The mean duration of an event bore no significant relation to interobserver agreement (rs = 0.14).
DISCUSSION
Comments on the Experimental Protocol
Replies were received from 14 of the 20 centers that initially agreed to take part in the study. The reviewers comprised eight clinicians with a major interest in respiratory sleep disorders, two polysomnographers, and four clinical scientists. They reviewed a mean of more than 200 polysomnographic records per annum, with a mean 10 years of experience doing so; only two observers had less than 5 years of experience. The study was performed in Europe, and therefore only one expert was ASDA-certified; four were actively involved in teaching polysomnographic methods. All but two were assessing arousal routinely, and all but two of these had adopted the ASDA standard. Clearly, therefore, the ASDA criteria are in widespread use in respiratory sleep investigations.
Although there is no reason to think that the observers who replied were particularly good or bad, they may not be completely representative of sleep experts. For example, it is possible that the best observers were keenest to assist in the experiment, or, conversely, that the best observers were already satisfied with their reliability and saw no need for the study.
No observers had major reservations that prevented them from undertaking the task, but two important points were made. First, many laboratories use an EEG montage, channel gain, and paper speed that differ from the ASDA standard. This study adhered to the standard, supported by the published examples that accompany the ASDA definition. Nevertheless, some observers were rating what were, for them, nonstandard polygraphic data.
Second, some commented that they would have found it useful to see the entire chart record, instead of a 40-s epoch from which to assess the event. In the words of one observer, "alpha intrusion into sleep is usually 1-2 counts per second lower than the alpha frequency detected during wakefulness; this information may have been pertinent in the discrimination of alpha intrusion from transient arousals with alpha."
Although both these factors may have affected agreement, it should be noted that the ASDA criteria and examples stand by themselves, without reference to baseline polygraphic data.
Interobserver Agreement
There were marked differences between observers in the total number of arousals scored, leading to only moderate overall agreement. Whereas two raters giving random responses would tend to agree 50 times out of 100, the value of κ = 0.47 indicates that two expert observers would be expected to agree on about 74 of every 100 occasions. To obtain a wide range of data for the study, 30 events that the authors found difficult to categorize were deliberately included. Experience indicates that many events fit this description, and the data in Table 2 suggest that such events are about as common as those that may be readily classified as arousal. This would imply that the balance of events distributed was close to that observed in practice. However, even when these "uncertain" events were excluded from analysis, overall agreement was still only moderate (κ = 0.57, equivalent to agreement on 79 of every 100 events).
In practice, agreement would probably improve further by the addition of respiratory polygraphic variables since in patients of the type studied, arousal is usually accompanied by resumption or increase of airflow. Nevertheless, with a robust definition of arousal, an experienced observer should in principle be able to categorize events reliably using the definition alone.
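The conversion between κ and expected pairwise agreement used in this section can be checked with a short worked equation. It assumes the standard form of κ, with observed agreement P_o and chance agreement P_e, and takes P_e = 0.5 from the round figure of 50 agreements in 100 quoted above; the symbols P_o and P_e are introduced here for illustration only.

```latex
\kappa = \frac{P_o - P_e}{1 - P_e}
\quad\Longrightarrow\quad
P_o = P_e + \kappa\,(1 - P_e)

\kappa = 0.47:\quad P_o \approx 0.5 + 0.47 \times 0.5 = 0.735
\qquad
\kappa = 0.57:\quad P_o \approx 0.5 + 0.57 \times 0.5 = 0.785
```

These values are consistent, to rounding, with the 74 and 79 agreements per 100 quoted in the text; a chance agreement slightly above 0.5, as implied by the 43% overall proportion of Yes classifications, moves both figures up by a fraction of a percentage point.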
Factors Contributing to Poor Interobserver Agreement
Interobserver agreement was not related to either the patient or the duration of the event. The ASDA 3-s criterion has been criticized in the literature because shorter events, which may be of physiological importance, appear in the EEG record (11). ASDA claim that "identification and agreement on events of shorter duration are difficult to achieve," but the criterion might itself introduce disagreement for events of an approximately 3-s duration. Certainly, there is poor agreement on the duration of apneas and hypopneas (12), which are scored from relatively simple physiological criteria. In the current study, it was therefore thought short events would result in the least agreement, but this proved not to be the case. There was no evidence that the proposed 3-s criterion contributed to variability in recognition of arousals.
Of the factors examined, only sleep stage demonstrably affected agreement. It was best for slow-wave sleep, where the slow, high amplitude EEG is most clearly different from the high frequency, low amplitude EEG associated with arousal. Agreement was poor for light sleep. It can be seen in Table 2 that in the patients studied, greater than 80% of all arousals occurred during light sleep, perhaps because regular arousal leads to profound reductions in REM and slow-wave sleep. Good agreement for events during light sleep is therefore potentially important for a population of subjects such as these. The EEGs of light and REM sleep are similar, so that similar levels of agreement would be expected. The disappearance of rapid eye movements (not discussed by ASDA but noted by Schieber and colleagues [13]) may increase confidence in scoring some arousals from REM sleep. Furthermore, the requirement in REM sleep for a concurrent increase in EMG for recognition of arousal may have increased the observer's confidence. A similar rule might improve agreement for events during light sleep. In Figure 1, for example, the EMG increase would raise confidence in the Yes classification. Of course, EMG was displayed in the current study and we cannot say to what extent it was used by the observers in reaching their decisions.
It has been shown (7) that the addition of an EMG criterion to an ASDA-like definition can halve the number of arousals detected in sleep records by the same expert observer. Thus, an apparently minor change in the definition of arousal can be a source of considerable variability, which the ASDA criteria take an important step towards eliminating. Although a more specific definition of arousal might prove to be more repeatable, it might not necessarily be more relevant in functional terms; some evidence suggests that EEG-based criteria alone underestimate the number of events with physiological importance (11). The development of an ideal standard requires better understanding of the precise components of arousal that are responsible for daytime sleepiness.
Conclusions
We have shown only moderate agreement between observers in the assessment of transient arousal. The level of agreement depended on the sleep stage from which the arousal took place, with slow-wave sleep showing the best consistency, followed by REM and light sleep. Agreement was not related to interpatient differences, nor to uncertainties in arousal duration.
Although the ASDA criteria are intended to stand alone, it is likely that agreement would be better in clinical practice because the reviewers would normally have access to complete sleep records, which include respiratory variables. In addition, it is possible that agreement would be improved by making the criteria more specific, perhaps by extending an EMG criterion to all sleep.
Many members of the sleep research community use the ASDA definition of arousal or one similar. When presenting research data or making clinical decisions based on arousal indices, they should be aware of the underlying uncertainty in the definition of arousal.
Footnotes
Correspondence and requests for reprints should be addressed to Michael Drinnan, Regional Medical Physics Department, Freeman Hospital, Newcastle Upon Tyne NE7 7DN, UK.
(Received in original form May 14, 1997 and in revised form January 20, 1998).
Acknowledgments: The writers wish to thank the following and their colleagues who cooperated in the study: Prof. G. Aubert, Belgium; Dr. R. Cayton, UK; Dr. R. Conradt, Germany; Dr. P. Deegan, Ireland; Dr. H. Engleman, UK; Dr. M. Flanigan, UK; Ms. Y. Hewitt, UK; Dr. M. Morrell, UK; Dr. P. Rees, UK; Dr. A. Simonds, UK; Dr. D. Spence, UK; Dr. J. Stradling, UK; Dr. J. Wedzicha, UK; Dr. A. Woodcock, UK.
References
1. Findley, L. J., E. Unverzagt, and P. M. Suratt. 1988. Automobile accidents involving patients with obstructive sleep apnea. Am. Rev. Respir. Dis. 138: 337-340.
2. Bonnet, M. H. 1985. Effect of sleep disruption on sleep, performance and mood. Sleep 8: 11-19.
3. Rechtschaffen, A., and A. Kales, editors. 1968. A Manual of Standardized Terminology, Techniques and Scoring System for Sleep Stages in Human Subjects. Brain Information Service, UCLA, Los Angeles.
4. Bonnet, M. H. 1987. Sleep restoration as a function of periodic awakening, movement, or electroencephalographic change. Sleep 10: 364-373.
5. Guilleminault, C., R. Stoohs, A. Clerk, M. Cetel, and P. Maistros. 1993. A cause of excessive daytime sleepiness: the upper airway resistance syndrome. Chest 104: 781-787.
6. Pitson, D., N. Chhina, S. Knijn, M. Van Herwaaden, and J. Stradling. 1994. Changes in pulse transit time and pulse rate as markers of arousal from sleep in normal subjects. Clin. Sci. 87: 269-273.
7. Mathur, R., and N. J. Douglas. 1995. Frequency of EEG arousals from nocturnal sleep in normal subjects. Sleep 18: 330-333.
8. Atlas Task Force of the American Sleep Disorders Association. 1992. EEG arousals: scoring rules and examples. Sleep 15: 174-184.
9. Karacan, I., W. C. Orr, T. Roth, M. Kramer, J. T. Shurley, J. I. Thornby, S. F. Bingham, and P. J. Salis. 1978. Establishment and implementation of standardized sleep laboratory data collection and scoring procedures. Psychophysiology 15: 173-179.
10. Fleiss, J. L. 1981. The measurement of inter-rater agreement. In J. L. Fleiss, editor. Statistical Methods for Rates and Proportions. Wiley, New York. 212-225.
11. Rees, K., D. P. S. Spence, J. E. Earis, and P. M. A. Calverley. 1995. Arousal responses from apneic events during non-rapid-eye-movement sleep. Am. J. Respir. Crit. Care Med. 152: 1016-1021.
12. Bliwise, D., N. G. Bliwise, H. C. Kraemer, and W. Dement. 1984. Measurement error in visually scored electrophysiological data: respiration during sleep. J. Neurosci. Methods 12: 49-56.
13. Schieber, J. P., A. Muzet, and P. J. R. Ferriere. 1971. Les phases d'activation transitoire spontanées au cours du sommeil normal chez l'homme. Arch. Sci. Physiol. 25: 443-465.