Separating Myth from Fact: A Review of Research on the Field Sobriety Tests
Ronald H. Nowaczyk
Clemson, SC 29634
For over a decade Marcelline Burns, senior author of an often-cited 1977 NHTSA (National Highway Traffic Safety Administration) report and co-author of a 1981 NHTSA study, has traveled across the country extolling the virtues of the new and improved Field Sobriety Test (FST) battery. The FST battery, as recommended by NHTSA, consists of three tests that are supposed to predict an individual’s blood alcohol concentration (BAC) level. The tests are the Horizontal Gaze Nystagmus (HGN) test, the Walk-and-turn test and the One-leg stand test. None of these tests was specifically developed to identify BAC level; they have instead been used by law enforcement as indicators of driving impairment.
NHTSA claims that the new version of the FST battery is scientific and can differentiate between impaired and unimpaired drivers. Until recently Burns’ testimony has gone unchallenged because few attorneys have the prerequisite understanding of statistics and test development to critically evaluate the NHTSA reports and effectively cross-examine NHTSA’s witnesses. Judges who have recently heard the “rest of the story” are either refusing to admit the FST at all or declaring it unscientific and not allowing police to use such terms as “tests,” “results,” “passed,” or “failure.”
The prosecution in DUI trials has long held a decided advantage over the defense because of misconceptions about the effectiveness of the FST. Even defense attorneys have often accepted the premise that the FST has a measure of value in predicting driving impairment. In essence, NHTSA representatives have for over a decade enjoyed a free ride, but the road has recently developed some serious potholes.
Research (Cole & Cole, 1991; Cole & Nowaczyk, 1994) and expert testimony offered by Cole & Nowaczyk have enabled judges and attorneys to better understand the limitations of the FST. In the past, NHTSA representatives have made outlandish claims as to the effectiveness of the FST even though these claims are not supported by their own research data. Because of these sins of omission and an occasional sin of commission, many myths have developed concerning the validity and reliability of the FST battery. The present article attempts to separate the facts from the myths.
Myth 1: The Field Sobriety Test (FST) battery predicts driving impairment.
Fact: NHTSA never attempted to determine if the FST could predict driving impairment. There is not a single study linking the recommended FST battery directly to driving impairment. The fact is, there never will be a simple roadside coordination task that can predict driving impairment. In one of NHTSA’s own reports, the following statement is made: “… even valid, behavioral tests are likely to be poor predictors either of actual behind-the-wheel driving … or of accidents” (Snapper, Seaver & Schwartz, 1981, p. 2-7). The stated goal in the 1977 study was to determine the relationship between FST and intoxication and driving impairment. However, the authors did not investigate the relationship between the FST and driving impairment.
While there is a relationship between BAC level and driving impairment, the relationship is not likely to be a simple linear one. Therefore, it is not appropriate to assume that because 1) FST performance and BAC are related and 2) BAC and driving impairment are related, it follows that 3) FST performance and driving impairment are related. The relationships among these factors are too complex to assume a simple relationship as NHTSA might like you to conclude. There are comments among NHTSA researchers themselves alluding to this conclusion. In the 1981 NHTSA study, the researchers conclude, “… Individuals vary in alcohol tolerance, and an infrequent drinker may be severely impaired at a BAC of 0.05, whereas a heavy drinker may show only minimal impairment at this level” (p. 19). Dr. Moskowitz, one of the co-authors of both the 1977 and 1981 NHTSA studies, co-authored a later review of research on driving and alcohol levels and concluded in a presentation at a scientific conference that “... studies of driving simulator and on-the-road testing varied widely in results. This is due to the wide range of behavioral demands required by diverse control and visual search requirements” (Moskowitz & Robinson, 1987, p. 85). It is obvious that research is needed examining the relationship between FST and driving performance directly. That research has not yet been conducted. Dr. Burns herself indicated that the FST battery has its value in predicting BAC levels (Burns, 1984).
Myth 2: The FST battery is 80 percent accurate in differentiating between individuals with BAC levels above or below .10.
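The weakness of the chained inference criticized above can be illustrated with a small numerical sketch. The three lists below are invented purely for illustration and are not data from any NHTSA study; the variable names are hypothetical labels for centered scores. In this constructed case the first measure correlates with the second, the second correlates with the third, and yet the first and third are entirely uncorrelated.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Invented, centered illustrative scores (NOT study data).
fst_score = [1, -1, 1, -1]
driving_score = [1, 1, -1, -1]
# The middle variable is built to share variation with both of the others.
bac = [f + d for f, d in zip(fst_score, driving_score)]  # [2, 0, 0, -2]

print(round(pearson(fst_score, bac), 2))            # 0.71
print(round(pearson(bac, driving_score), 2))        # 0.71
print(round(pearson(fst_score, driving_score), 2))  # 0.0
```

The example does not show that FST performance and driving impairment are unrelated; it shows only that two separate relationships with BAC cannot establish a third relationship without direct evidence.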
Fact: The 1981 NHTSA study is the one cited by NHTSA as evidence of an 80 percent accuracy rate with the use of the FST battery. That study tested 296 subjects. Thirty-three percent of the subjects in the study had a BAC level of .00 and 34 percent were given dose levels calculated to raise BAC levels to .05. Another 11 percent of the subjects had BAC levels approximating .15, with some having BACs as high as .18. An officer should have no difficulty correctly identifying totally alcohol-free subjects as being unimpaired. Although it is slightly more difficult, one would expect officers to correctly classify subjects with BAC levels of .05 as being unimpaired. They should also have little difficulty correctly classifying subjects with BAC levels of .15. In effect, 78 percent of the subjects fall into these extreme categories. Only 22 percent of the subjects were in the critical BAC range around .10. When the tests must differentiate in this critical range, they fail miserably. The overall accuracy rate of .80 is misleading when over two-thirds of the decisions are “gimmes,” people with little or no alcohol or levels of .15.
For the remaining subjects, the officers have a 50/50 chance of being correct just on the basis of guessing. With the “easy” decisions and a guessing rate of .50, the reported 80 percent accuracy rate does not look exceptionally good. The question should not be how the FST helps officers correctly classify subjects 80 percent of the time. Instead, the question asked should be “Why doesn’t the FST do a better job helping the officers reach the correct decision?” In fact, the 1977 NHTSA report contains the following admonition, “Again, it should be pointed out that all the evidence from these data suggest it is unrealistic to attempt to use behavioral tests to discriminate BACs in a .02 margin around a given level” (p. 41).
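The arithmetic behind this argument can be sketched as follows. The subject shares come from the figures quoted above. The assumption that every “easy” case is classified correctly is an idealization introduced here for illustration, not a result reported by NHTSA, and the function name is hypothetical.

```python
def baseline_accuracy(easy_share, easy_accuracy, guess_rate):
    """Expected overall accuracy when 'easy' cases are classified at
    easy_accuracy and all remaining cases are decided by guessing."""
    hard_share = 1.0 - easy_share
    return easy_share * easy_accuracy + hard_share * guess_rate


# Shares quoted in the text: 33% at .00, 34% at .05, 11% near .15.
easy_share = 0.33 + 0.34 + 0.11  # 0.78 of all subjects

# ASSUMPTION: easy cases are always judged correctly (idealized),
# and the critical-range cases are pure 50/50 guesses.
expected = baseline_accuracy(easy_share, easy_accuracy=1.0, guess_rate=0.5)
print(round(expected, 2))  # 0.89
```

Under these idealized assumptions an officer would be right about 89 percent of the time without any help from the tests, which is the benchmark against which the reported 80 percent figure should be read.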
Myth 3: The FSTs are tests accepted by the scientific community.
Fact: Anastasi (1988) defines a test as being an objective and standardized measure of behavior. In the behavioral sciences, specific criteria must be met for a behavioral test to be accepted. The primary criteria include establishing the reliability, validity, and standardized administration of the test. Reliability and validity involve the consistency of test scores and the relationship of the score to the behavior it is designed to measure. Standardization includes uniformity of procedure in administering the test as well as the scoring of the test. For test scores to be meaningful, the conditions under which the tests are administered must not themselves cause differences in test scores. A test that has not been standardized or does not outline exact procedures for administration and scoring would not be considered a scientific test.
An important step in the standardization of a test is the development of norms. As the name suggests, a norm is the normal, average or typical score. Scores can only be interpreted by comparing them with scores obtained by others. There are no adequate norms for the FST battery. Common sense dictates and research supports the belief that motor skills decline with age. The FST, however, provides no basis for interpreting the results for individuals at various age levels. Although manuals for DWI training suggest that tests should not be given to individuals who are 60 years of age or older or to a person more than 50 pounds overweight, they provide no information on how to evaluate the performance of a 45 year old versus a 20 year old (NHTSA, 1992).
Examiners cannot adequately interpret a score unless they know the mean and the standard deviation of the distribution. NHTSA leads us to believe that the “norm” for a sober person would be a test score of 0; that is, no errors in performance. Yet, we know from the 1977 NHTSA study that all of the sober people in that study made at least one error. In fact, the mean number of error “cues” scored among the sober individuals was 10.56.
Even if one accepts NHTSA’s claim that the FST is not a norm-referenced test but rather a criterion-referenced test (that is, that a certain criterion score indicates failure), there are no data indicating how this criterion score might vary as a function of age, gender, or motor coordination. Even if such norms were produced from the NHTSA 1977 and 1981 studies, they would be of limited value given that they are based on laboratory testing, not testing in the field.
Myth 4: The field sobriety tests are reliable.
Fact: Reliability refers to the consistency of test scores. Reliability scores can range from a low of .00, which indicates no consistency, to 1.00, which indicates perfect consistency. A test with a reliability value of .90 would indicate that 90 percent of the variability in the test scores is attributable to true differences in performance and 10 percent would be due to error. Most well-established tests (e.g., Wechsler scales for IQ, SAT, GRE) have reliability values greater than .90. The scientific community expects reliability coefficients to be in the upper .80s or .90s for a test to be scientifically reliable (Anastasi, 1988; Rosenthal & Rosnow, 1991).
The HGN, One-leg stand, and Walk-and-turn tests have test-retest reliabilities of .66, .72, and .61 respectively, with a combined reliability of .77. This means that 34 percent of the variability in HGN scores, 28 percent in One-leg stand scores and 39 percent in Walk-and-turn scores can be attributed to error rather than to true differences in performance. If 23 percent of the score on a breathalyzer depended on the manufacturer of the device, would it be allowed into evidence? Quite possibly the most telling lack of reliability of the FST battery is that when different officers tested the same subjects at the same dose level on different days, the reliability was only .59. This means that 41 percent of the score was due to error. These reliabilities are far too low to be useful in making important decisions. By contrast, the reliability of the BAC machine readings was .96, indicating a high level of reliability.
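The error percentages quoted above follow from the interpretation of reliability given earlier in this section: the share of score variability attributable to error is one minus the reliability coefficient. A minimal sketch of that calculation, using the coefficients reported in the text (the dictionary keys and function name are illustrative labels only):

```python
def error_percent(reliability):
    """Percentage of score variability attributable to error,
    taken as (1 - reliability) and rounded to a whole percent."""
    return round((1.0 - reliability) * 100)


# Reliability coefficients as quoted in the text.
reliabilities = {
    "HGN": 0.66,
    "One-leg stand": 0.72,
    "Walk-and-turn": 0.61,
    "Combined battery": 0.77,
    "Different officers, different days": 0.59,
    "BAC machine readings": 0.96,
}

for name, r in reliabilities.items():
    print(f"{name}: reliability {r:.2f}, error {error_percent(r)} percent")
```

Run as written, this reproduces the 34, 28, 39, 23 and 41 percent figures, and gives 4 percent for the BAC machine readings.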
Myth 5: The field sobriety tests are scientifically valid.
Fact: The 1977 NHTSA study reported the results in terms of validity coefficients. The validity coefficients for the HGN, One-leg stand and Walk-and-turn tests were .67, .48, and .55 respectively, with a combined validity coefficient of .67. For example, if the officer used the individual FSTs, the accuracy in predicting BAC levels would increase by only 26 percent with the HGN test, 12 percent with the One-leg stand test and 16 percent with the Walk-and-turn test. If all three tests were administered, accuracy in predicting BAC levels would improve by only 26 percent. The error in predicting BAC levels using the HGN, the One-leg stand, and the Walk-and-turn combined would be 74 percent as large as it would be by chance.
For the FST battery to be a valid predictor of BAC, it must not only identify individuals above a BAC level of .10 as “failing,” but also identify individuals below .10 as “passing.” That is, the test must have discriminative power. In NHTSA’s own studies, a significant proportion of people who were below the .10 BAC standard in effect at that time were falsely viewed as being impaired. In the 1977 Burns and Moskowitz study, 46.5 percent of the “arrest” decisions by participating officers were incorrect. Of the 101 arrest decisions, 47 involved subjects with BAC levels less than .10. The authors themselves conclude, “Obviously, an error rate of 46.5 percent in making arrests is not acceptable” (p. 25).
In the follow-up study by Tharp et al. (1981), the false arrest rate was 32 percent. The decrease in false arrests from 46.5 percent in the ’77 NHTSA study to 32 percent in the 1981 study was due not solely to the “new improved FST,” but partly to the distribution of subjects across the dose levels. In the ’77 NHTSA study 27 percent of subjects were in the critical range (BAC in the middle range) and in the ’81 NHTSA study only 22 percent of subjects were in the middle range. In other words, the distribution in the ’81 NHTSA study made discriminations easier. If the ’81 NHTSA study had used the same distribution of BAC levels that was employed in the ’77 NHTSA study, the false arrest rate would have been higher than 32 percent and probably would have matched the “unacceptable” 46.5 percent level of the ’77 NHTSA study. These validity scores are quite low and suggest that the FST battery is of little benefit for an officer determining BAC levels.
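The article does not name the formula behind these percentages. They are, however, consistent with a standard psychometric quantity, the index of forecasting efficiency, E = 1 - sqrt(1 - r^2), where r is the validity coefficient; the companion quantity sqrt(1 - r^2) is the coefficient of alienation. The sketch below assumes that this is the calculation intended, and the function names are illustrative.

```python
import math


def coefficient_of_alienation(r):
    """Prediction error relative to chance prediction, sqrt(1 - r^2)."""
    return math.sqrt(1.0 - r ** 2)


def forecasting_efficiency(r):
    """Proportional reduction in prediction error relative to chance,
    E = 1 - sqrt(1 - r^2). ASSUMED to be the formula behind the text."""
    return 1.0 - coefficient_of_alienation(r)


# Validity coefficients as quoted in the text.
validity = {"HGN": 0.67, "One-leg stand": 0.48, "Walk-and-turn": 0.55}

for name, r in validity.items():
    print(f"{name}: {round(forecasting_efficiency(r) * 100)} percent improvement")

# Combined battery (r = .67): remaining error relative to chance.
print(round(coefficient_of_alienation(0.67) * 100))  # 74
```

With these inputs the sketch gives 26, 12 and 16 percent for the three tests and a remaining error of 74 percent for the combined battery, matching the figures in the text.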
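The 46.5 percent figure can be checked directly from the counts quoted above. This is a minimal sketch; the function name is hypothetical, and the count of correct arrests is simply the remainder of the 101 decisions.

```python
def false_arrest_rate(false_arrests, total_arrests):
    """Share of arrest decisions made on subjects below the legal limit."""
    return false_arrests / total_arrests


total_arrests = 101   # arrest decisions in the 1977 study, as quoted
false_arrests = 47    # of those, subjects with BAC below .10
correct_arrests = total_arrests - false_arrests

print(correct_arrests)                                                 # 54
print(round(false_arrest_rate(false_arrests, total_arrests) * 100, 1)) # 46.5
```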
Myth 6: NHTSA has validated the FST in a field setting.
Fact: The 1977 and 1981 NHTSA studies were conducted in a laboratory setting. Laboratory studies are very different from studies performed in a natural or field setting, and laboratories are quite different from real-life situations. For example, the influence of alcohol on the individual depends greatly on the social context, as well as the expectations of the person. Subjects in these NHTSA studies were told not to eat eight hours prior to the testing. Test subjects were tested at 15-minute intervals, and the study began early in the morning. This would mean that many subjects had not eaten for as long as 12 hours before being tested. It is doubtful that a person drinking in a natural setting would fast for hours and then consume alcohol at unknown ethanol levels.
Laboratories are artificial by nature and give only an indication of what one might expect in a field setting. In the conclusions of the 1981 NHTSA study, the authors recommended that the field sobriety test should be validated in the field for 18 months and in various localities across the nation. The 1983 NHTSA study by Anderson, et al., the purported “field validation” of the FST battery, did not meet those recommendations. A 3-month study was conducted in a limited number of locations on the east coast. Dr. Burns has testified on cross examination that the FST has never been adequately field tested. Most importantly, the FST has never been standardized or validated in a field setting.
Myth 7: The NHTSA studies have been published in peer-reviewed journals.
Fact: Neither the 1977 nor the 1981 NHTSA study has been published in a scientific peer-reviewed journal. The publications have been limited to technical reports issued by NHTSA. Dr. Burns has admitted on cross examination that the method and results sections were too lengthy to be published in a scientific journal. By this logic, lengthy but important studies would never be published. It is difficult to see how NHTSA could claim that the FST is accepted in the scientific community when results of studies on the validation of the FST have never appeared in a scientific peer-reviewed journal, which is a basic requirement for acceptance by the scientific community.
Myth 8: There is a consistent relationship between BAC levels and driving impairment.
Fact: The literature on the effects of alcohol is so diverse that one can only conclude that any demanding task may be impaired at almost any BAC level. Research indicates that there are substantial individual variations in the metabolism of alcohol which would most likely affect performance. Performance is also affected by individual differences, and individuals with identical BAC levels may very well have different levels of impairment (Hurst and Bagley, 1972; Moskowitz, Daily and Henderson, 1974). Many studies involving the influence of alcohol on impairment find a rather significant number of subjects whose performance actually improves after the consumption of alcohol. In a study conducted under the auspices of the California Highway Patrol and various law enforcement agencies, Giguire (1985) found that 17 percent of his subjects with doses calculated to achieve BAC levels of .10 improved driving performance on a closed course. Mangarin & Standery (1989) also found no effects of alcohol dose on video driving performance despite an unusually high dose calculated to achieve a BAC of .16. These studies and others suggest a complex relationship between BAC levels and performance. They offer little support for setting specific BAC impairment levels and certainly do not support the assumption that BAC levels could be used as a substitute criterion for driving impairment.
Myth 9: People who are not impaired can “pass” the Field Sobriety Tests.
Fact: Cole and Nowaczyk (1991) had 21 adults who were completely alcohol free, as confirmed by breath tests, perform field sobriety tests. The subjects were given six tests including a heel-to-toe test and a one-leg stand test. None of the subjects was under the extreme pressure that is associated with a roadside detention situation. Two separate groups of law enforcement officers gathered at different times to judge the performance of the participants. These were actual police officers who had received standard training in the observation and identification of intoxicated drivers. The officers were then asked to identify individuals who had too much to drink to drive. Of 147 responses by the police officers, 68 of those responses (46 percent) indicated that a completely sober person was too intoxicated to drive. The average police experience was 12 years. Interestingly, the officer with the least experience had the fewest wrong responses.
Compton (1985) found false positive rates for totally alcohol-free participants to be as high as 54 percent for some police departments. In the 1981 NHTSA study 18 percent of alcohol-free subjects and 31 percent of subjects with BAC levels of .05 were judged to be impaired. Clearly, there is a strong tendency for certified alcohol-free participants to fail Field Sobriety Tests.
Myth 10: The Horizontal Gaze Nystagmus (HGN) Test is the most sensitive test for measuring impairment.
Fact: Because the HGN test is a physiological task, unlike the other Field Sobriety Tests, which are psychomotor, divided-attention tasks, it is sometimes viewed as being the most sensitive of the three tests. Also, some of NHTSA’s research indicates it has the strongest relationship with BAC (e.g., Burns & Moskowitz, 1977 [p. 17]; Anderson, et al., 1983 [Table 2]). Yet, some of NHTSA’s own data raise questions about its ability to discriminate among individuals with different BAC levels.
In a report commissioned by NHTSA, Snapper, Seaver and Schwartz (1981) reviewed the Burns & Moskowitz study and concluded, “Nystagmus, on the other hand was not a highly-rated test. ... First, Burns and Moskowitz evaluated tests with respect to the relationship between performance on the test and blood alcohol concentration (BAC). A close relationship between these two variables does not necessarily imply a close relationship between performance on the nystagmus test and driving performance, or between test performance and accidents. Specifically, it is not apparent that performance on the nystagmus test reflects any skills related to driving. In addition, examining a driver for nystagmus may be difficult operationally and somewhat unsafe. Scoring is quite subjective and would require careful training for the test administrator” (p. 4-4).
The difficulty in scoring is illustrated in the Tharp, et al. study, where we find a weak relationship between an officer’s ability to judge the angle of nystagmus onset and the actual angle as measured by a machine. Officers are instructed that onset of nystagmus before 45 degrees of eye movement to the outside is an indication of a BAC above .10. Yet, we find that of the 10 officers who participated in the Tharp et al. study, 5 had correlation coefficients less than .44, with 2 in the .23 to .26 range. This indicates little relationship between what the officers judged the angle of onset to be and what the machine actually recorded as the angle of onset.
The 45 degree angle of onset itself is troubling. Based on NHTSA’s own research, a 45 degree angle corresponds to a BAC of approximately .05 or .06, not .10 (Tharp, et al., 1981). A more appropriate angle, based on their findings, is 40 or 41 degrees, not 45 degrees. A BAC level of .08 would correspond to an angle of onset of approximately 43 degrees. The task of detecting such small differences is quite daunting for the officer, if not impossible.
Follow-up research on impairment and performance with the HGN has shown it can lead officers to conclude falsely that a person has a BAC above .10. Compton, in a NHTSA study (1985), reported the findings of a study where individuals were stopped at simulated sobriety checkpoints. The subjects, dosed to different BAC levels, were encouraged to act as though they were not impaired. The officers gave “failing” scores (4 points or higher) to 15 percent of the sober individuals and 64 percent of those with BAC levels between .05 and .09 (the average BAC level in this condition was .07).
Giguire (1985) had 24 Navy personnel drive on a closed course under sober and intoxicated conditions. In addition to evaluating their driving performance, Giguire had officers administer the Field Sobriety Tests. Of the 13 subjects with BACs below .10 (between .064 and .099), 12 showed evidence of impairment based on the HGN. The HGN is not as accurate a test for determining BAC as NHTSA would like you to believe.
Because of its widespread use, the FST battery has been assumed to be a reliable and valid predictor of driving impairment. NHTSA has done little to dispel that assumption. Law enforcement cannot be blamed for its use of the FST battery. Training documents refer to NHTSA reports and provide what appears to be supporting evidence for the validity of the FST battery. In addition, there is little doubt that individuals who have high BAC levels will have difficulty performing the FST battery. However, what the law enforcement community and courts fail to realize is that the FST battery may mislead the officer on the road into misjudging individuals who are not impaired. To be valid, the FST battery must discriminate accurately between the impaired and non-impaired driver. NHTSA’s own research on that issue (Anderson, et al., 1983; Burns & Moskowitz, 1977; Tharp, et al., 1981) has not been subjected to peer review by the scientific community. In addition, a careful reading of the reports themselves provides support for the inadequacy of the FST battery. The reports include low reliability estimates for the tests, false arrest rates between 32 and 46.5 percent, and a field test of the FST battery that was flawed because officers in many cases had breathalyzer results at the time of the arrest. NHTSA clearly ignored the printed recommendations of its own researchers in conducting that field study.
What is needed is a careful examination of the complex relationships among motor coordination tasks, BAC level and driving impairment. Tests should be developed based on our understanding of these relationships. The current method of selecting the “best of what is out there” is not serving the public well.
Anderson, T. E., Schweitz, R. M. & Snyder, M. B. (1983). Field evaluation of a behavioral battery for DWI. Final Report, DOT-HS-806-676.
Anastasi, A. (1988). Psychological Testing, Sixth edition. NY: Macmillan Press.
Burns, M. & Moskowitz, H. (1977). Psychophysical tests for DWI arrest. Final Report, DOT-HS-802-424, NHTSA.
Coldwell, B. B., Penner, D. W., Smith, H. W., Lucas, G. H. W., Rodgers, R. F. & Darroch, F. (1958). Effect of ingestion of distilled spirits on automobile driving skill. Quarterly Journal of Studies on Alcohol, 19, 590-616.
Cole, R. M. & Cole, S. N. (1991). New proof that field sobriety tests are “failure designed.” DWI Journal, 6(2), 1-5.
Cole, S. & Nowaczyk, R. H. (1994). Field sobriety tests: are they designed for failure? Perceptual and Motor Skills, 79, 99-104.
Compton, R. P. (1985). Pilot test of selected DWI detection procedures for use at sobriety checkpoints. Final Report, DOT-HS-806-724.
Giguire, W. (1985). Impairment caused by moderate blood alcohol levels in a closed course: preliminary demonstration. In S. Kaye & G. Meier (Eds.), Alcohol, Drugs and Traffic Safety. Proceedings 9th International Conference.
Hurst, P. M. & Bagley, S. K. (1972). Acute adaptation to the effects of alcohol. Quarterly Journal of Studies on Alcohol, 33, 358-378.
Moskowitz, H., Daily, J. & Henderson, A. (1974). Acute tolerance to behavioral impairment by alcohol in moderate and heavy drinkers. DOT-NHTSA, TM (L) - 4970/013/00, 64 pp.
Moskowitz, H. & Robinson, C. (1987). Driving-related skills impairment at low blood alcohol levels. In P. C. Noordzij & A. Roczbach (Eds.), Alcohol, drugs and traffic safety. Elsevier Science Publishers. pp. 79-86.
Moskowitz, H. & Robinson, C. (1988). Effects of low doses of alcohol on driving-related skills: a review of the evidence. Final Report, DOT-HS-807-280.
NHTSA, National Highway Traffic Safety Administration (1992). DWI Detection and Standardized Field Sobriety Testing. DOT- PB94-780 228.
Rosenthal, R. & Rosnow, R. L. (1991). Essentials of Behavioral Research. (2nd ed.) New York: McGraw-Hill.
Snapper, K. J., Seaver, D. A., & Schwartz, J. P. (1981). An assessment of behavioral tests to detect impaired drivers. Final Report, DOT-HS-806-211.
Tharp, V., Burns, M. & Moskowitz, H. (1981). Development and field test of psychophysical tests for DWI arrests. Final Report, DOT-HS-805-864.