The American Academy of Sleep Medicine Inter-Scorer Reliability Program: Sleep Stage Scoring

Study Questions:

What is the inter-scorer reliability of visual sleep stage scoring?

Methods:

More than 2,500 scorers, most with 3 or more years of experience scoring sleep studies, scored nine clinical sleep studies with 18,000 epochs, yielding more than 3 million scoring decisions. Data were collected between June 2011 and April 2012. Diagnostic sleep studies and positive airway pressure titration studies were used. Only adult studies were analyzed. Scorers were surveyed about their level of experience and training, but scoring data were not linked to survey responses. For each epoch, the analysis determined agreement with the stage chosen by the majority of scorers.
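The majority-based agreement measure described above can be illustrated with a minimal sketch. This is not the authors' code. The function name `majority_agreement`, the input layout (one list of stage labels per epoch), the tie handling, and the choice to group per-stage agreement by the majority stage are all assumptions made for illustration, and the example labels are invented.

```python
from collections import Counter, defaultdict


def majority_agreement(scores_by_epoch):
    """Return (overall, by_stage) agreement with the majority score.

    scores_by_epoch: list of lists; each inner list holds the stage labels
    (e.g. "W", "N1", "N2", "N3", "R") that individual scorers gave one epoch.
    Assumption: ties are broken by whichever tied label appears first.
    """
    agree = 0   # scoring decisions that matched the majority stage
    total = 0   # all scoring decisions
    per_stage = defaultdict(lambda: [0, 0])  # majority stage -> [agree, total]

    for labels in scores_by_epoch:
        majority, count = Counter(labels).most_common(1)[0]
        agree += count
        total += len(labels)
        per_stage[majority][0] += count
        per_stage[majority][1] += len(labels)

    overall = agree / total
    by_stage = {stage: a / n for stage, (a, n) in per_stage.items()}
    return overall, by_stage


# Invented toy data: three epochs scored by three or four scorers each.
example_epochs = [
    ["W", "W", "N1"],
    ["N2", "N2", "N2"],
    ["N1", "N2", "N1", "N1"],
]
overall, by_stage = majority_agreement(example_epochs)
print(f"overall agreement: {overall:.1%}")
for stage, value in sorted(by_stage.items()):
    print(f"{stage}: {value:.1%}")
```

In this toy input, 8 of 10 scoring decisions match the majority stage for their epoch. The study's reported figures come from its own data and method, which may differ in detail from this sketch.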

Results:

The nine studies comprised three normal studies, four with mild or minimal sleep apnea, one with severe sleep apnea, and one with severe periodic limb movement disorder. More than 95% of scorers were nonphysicians, 87% were identified as registered polysomnographic technologists, and 84% had received on-the-job training as a scorer. Sleep stage agreement averaged 82.6%. Agreement was highest for REM stage sleep (R), with stage 2 sleep (N2) and wake stage (W) approaching the same level. Agreement for stage 3 sleep (N3) was 67.4%, and agreement was lowest for stage 1 sleep (N1) at 63.0%. Scorers had particular difficulty with the last epoch of stage W before sleep onset, the first epoch of stage N2 after stage N1, and the first epoch of stage R after stage N2. Discrimination between stages N2 and N3 was also difficult for scorers.

Conclusions:

The authors concluded that with current rules, inter-scorer agreement in a large group is approximately 83%, a level similar to that reported for agreement between expert scorers. Agreement in the scoring of transitions between sleep stages was low. Modifying the scoring rules for sleep stage transitions may improve agreement.

Perspective:

Visual interpretation in medicine is common and by its nature subjective. Highly reliable scoring of sleep studies is necessary for a sleep center’s credibility. This study reported reliability similar to published inter-rater reliability among ‘expert’ scorers, which is interesting given that this large group of scorers had varied backgrounds, training, and experience.

Keywords: Sleep, REM, Cardiology, Nocturnal Myoclonus Syndrome, Inservice Training, Sleep Apnea Syndromes, United States

