Voice Recognition - Forensic Psychology

Voice recognition, or “earwitness” identification, has not received the amount of research or public interest that eyewitness identification has received in recent years. Voice identifications are nevertheless far from rare in court: a 1983 survey of British legal cases found more than 180 cases at that time in which voice identifications were used as evidence. A growing body of research suggests that the use of voice identifications in court is at least as dangerous as reliance on eyewitness identification, if not more so. Research consistently shows that voice recognition is less accurate than face recognition under similar circumstances and that the same factors that affect eyewitness reliability can also create problems for the earwitness. Potential jurors, however, often overestimate the accuracy of voice recognition in forensic contexts.

Voice Recognition in the Courtroom

Perhaps the most famous use of voice recognition evidence in a criminal trial was in the trial of Bruno Richard Hauptmann, executed in 1936 for the kidnapping and murder of the infant son of the aviator Charles Lindbergh. The Lindbergh case was called the trial of the century, and one of the most dramatic moments in the trial was when Lindbergh himself took the stand. Describing the night of the ransom drop-off 3 years before the trial, Lindbergh spoke of hearing a voice from 100 yards away while he waited in his car for a friend to hand over the ransom. When Hauptmann was arrested 29 months later, Lindbergh was brought to the police station to listen to Hauptmann repeat the words of the kidnapper: “Hey doctor! Over here, over here.” Lindbergh testified under oath that he was certain that Hauptmann’s voice was the voice of the kidnapper. Experts still disagree over whether the jury reached the correct verdict in finding Hauptmann guilty.

Voice identification has played a role in at least one well-publicized case of erroneous conviction in Canada. In October 1984, a 9-year-old girl named Christine Jessop disappeared from her home in Ontario and was found dead almost 3 months later. She had been stabbed to death, apparently shortly after her disappearance. The investigation quickly focused on a neighbor, Guy Paul Morin, who was arrested in April 1985. Although Morin had a strong alibi, he was brought to trial in 1986 and was initially acquitted. But in Canada, the prosecution can appeal an acquittal, and Morin was retried in 1991. The second trial lasted almost 9 months, and the second jury found Morin guilty.

Although many errors occurred in the investigation of Christine Jessop’s death and the trials of Guy Paul Morin, one dramatic piece of evidence at the trial came from Christine’s mother, Janet. She testified that on the night of Christine’s funeral, she heard an unknown male voice crying out near her home, “Help me, help me, oh God, help me!” She later identified this voice as that of her neighbor, Morin, with whom she had spoken over the fence just a few times. The prosecution claimed that Morin experienced a fit of remorse after the funeral and cried out in emotional agony from his home. While we cannot know the role that this testimony played in the jury’s decision, one thing is clear: The wrong man was ultimately convicted. DNA testing revealed several years later that Morin could not possibly be the killer, and he was exonerated in 1995. The real killer has never been found.

Earwitness Research

Morin’s voice was mistakenly identified by a casual acquaintance. Lindbergh, in contrast, was called on to identify a voice that he had heard only once. Both types of identification have forensic relevance. Usually, voice identifications are made in situations in which the witness or victim was unable to see the perpetrator’s face because of darkness or because the perpetrator wore a mask. Sometimes, the victim of a crime may recognize the perpetrator’s voice as that of a former co-worker or even a relative. The victim tells the police that he or she recognized the voice, and the identified person becomes the main suspect. Many cases in which voice identification is used as evidence, however, involve the identification of a stranger’s voice. In such cases, when a suspect has come to light, a voice lineup may be played for the witness, usually in the form of a tape-recorded series of short clips of several parties speaking. The witness is asked to indicate whether any of the voices is the voice of the perpetrator. A voice showup may also be used, in which the witness is asked to listen to only one voice and to indicate whether this voice is the voice of the perpetrator. For example, witnesses to a bank robbery in North Carolina were asked to listen to a tape-recording from a previous convenience store robbery, in an effort to gather evidence that the two crimes were committed by the same person.

Why would such identifications result in errors? As with face recognition by eyewitnesses, it is important to recognize that memory for a voice does not operate like a tape recorder or a video camera. A listener encodes certain salient features of a voice into memory (e.g., pitch, loudness, accent, or unusual pronunciation or cadence) when it is heard, but later recognition of the voice as familiar is also heavily influenced by context, expectations, and logical reasoning. For example, if you answered your telephone right now, your identification of the voice at the other end of the line would depend partly on your actual auditory memory for voices and partly on your expectations of who might be calling you, your knowledge of people who know your phone number, and even considerations such as the time of day. And almost all of us have had the experience of picking up the telephone, expecting a particular caller, “identifying” the voice as that of a friend or relative, only to realize minutes later that the caller is actually a stranger who has dialed a wrong number.

In a typical earwitness experiment, participants listen to a recorded statement of a particular duration and may or may not be informed that they will be asked to recognize the voice later. After a period of time, the participants are exposed to a voice lineup consisting of several different voices and are asked to choose the voice that had uttered the original statement. Participants also often rate their confidence in their choice or are asked whether they are certain enough to testify in court regarding their identification. In a study by Daniel Read and Fergus Craik, for example, college students heard a series of statements, including a male target voice saying, “Help me, help me, oh God, help me!” (the words heard by Christine Jessop’s mother) and were asked to rate the emotionality of each statement. They did not know that they would be asked to recognize any of the voices in the future. At a class meeting 17 days later, the same students were asked to listen to a series of 20-second, conversational utterances by 6 male speakers and to choose the one that had uttered the statement in question. The target voice was one of the voices in the lineup. Pure guessing would have resulted in a chance performance level of about 17% (1 out of 6). In fact, only 20% of the students chose the correct voice, a rate not meaningfully better than chance.
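The comparison with chance can be made concrete with a short calculation. The sketch below is illustrative only: it computes the guessing rate for a six-voice, target-present lineup and then asks how often pure guessing would produce at least a 20% hit rate. The sample size (`n_listeners = 50`) is a hypothetical value chosen for illustration and is not taken from the study described above, and the function names are likewise invented for this example.

```python
from math import comb


def chance_rate(lineup_size: int) -> float:
    """Probability of picking the target by pure guessing in a target-present lineup."""
    return 1 / lineup_size


def binomial_upper_tail(successes: int, trials: int, p: float) -> float:
    """Exact P(X >= successes) for X ~ Binomial(trials, p)."""
    return sum(
        comb(trials, k) * p**k * (1 - p) ** (trials - k)
        for k in range(successes, trials + 1)
    )


lineup_size = 6    # six voices, one of which is the target
n_listeners = 50   # HYPOTHETICAL sample size, not reported in the text above
n_correct = 10     # 20% of the hypothetical sample

p_chance = chance_rate(lineup_size)
p_value = binomial_upper_tail(n_correct, n_listeners, p_chance)

print(f"Chance rate: {p_chance:.1%}")
print(f"P(at least {n_correct} of {n_listeners} correct by guessing): {p_value:.3f}")
```

Under these assumed numbers, a 20% hit rate falls well within what guessing alone would be expected to produce, which is the sense in which such a result is "no better than chance." With a different sample size the tail probability would change, so the figure is only a guide to the reasoning.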

Most studies also incorporate a “target-absent” lineup to measure the likelihood of a false identification when the lineup does not contain the actual perpetrator. Such research points out the danger of misidentifying an unfamiliar voice as familiar; even with a relatively lengthy exposure to a distinctive target voice, false identification rates in such a target-absent lineup can be as high as 90% to 100%.

Factors Influencing Voice Recognition Accuracy

The likelihood of correctly identifying a voice depends on a number of factors or estimator variables, many of which also influence eyewitness accuracy. Limited exposure to a voice can lead to decreased accuracy; the longer the time that the perpetrator spends talking, the more likely the witness is to properly encode the voice characteristics. It is important to recognize, however, that witnesses are likely to overestimate the length of time that the perpetrator spent speaking. A 30-second speech sample, for example, is typically remembered as having lasted from 90 seconds to more than 2 minutes. The amount of time that passes between initially hearing a voice and then being tested for recognition is also critical. The longer the delay between exposure and testing, the greater the chance of error becomes, particularly errors in the form of false recognitions of innocent persons’ voices. Background noise can interfere with the witness’s ability to encode voice characteristics. The proximity of the witness to the speaker is also important, with closer proximity being associated with greater accuracy.

The ability to see a perpetrator’s face may also adversely affect the recognition of the perpetrator’s voice, a phenomenon known as the face overshadowing effect. It is thought that a witness pays relatively more attention to the face when it is visible, resulting in decreased voice identification accuracy. Studies have shown, however, that instructions to pay attention to the voice do not significantly reduce the face overshadowing effect, suggesting a process that may not be under the witness’s conscious control. Use of voice recognition evidence in situations where the perpetrator’s face has been visible, then, is considered unreliable.

Studies of eyewitness identification consistently find superior performance in recognizing faces of one’s own race as opposed to faces of another race. There is a similar finding in voice recognition research regarding accents and languages. English speakers, for example, have been shown to be more accurate in recognizing unaccented English-speaking voices than heavily accented English-speaking voices and least accurate in recognizing voices speaking in a foreign language. Language familiarity, then, has a significant positive effect on voice identification accuracy. (Gender, on the other hand, has no consistent relationship to voice recognition.)

Stress can also decrease the accuracy of voice recognition. When viewing videotaped crimes in the laboratory, research participants typically make more errors in both face and voice recognition when violent threats are made or a weapon is present. Our ability to pay attention to all aspects of our surroundings is limited under any circumstances, and under conditions of stress it becomes more limited. When threats are made, it is more important to our survival to listen to and remember the content of the spoken message than to register the vocal qualities of the speaker.

Voices may easily be disguised, further decreasing the ability of a witness to accurately recognize a voice. When a witness hears a voice raised in anger during the commission of a crime and subsequently attempts to recognize the speaker saying something in a normal tone, accuracy is decreased. Whispering is an extremely effective way to disguise a voice, because it covers up many distinctive vocal characteristics such as pitch.

Earwitness accuracy may also be related to the age of the witness. Studies tend to show that very young children are not as accurate in recognizing voices as children above 10 years of age, who often perform comparably to adults. Speaker identification accuracy also decreases after the age of 40, probably because of increased hearing loss in older persons. Furthermore, blind persons are not superior to sighted persons in their ability to recognize voices or other natural sounds, in spite of popular opinion to the contrary.

Common sense tells us that recognizing the voice of an acquaintance, friend, or family member should be easier than recognizing the voice of a stranger. To a certain extent, research supports this conclusion. However, studies of the recognition of familiar voices find a wide range of accuracy levels, depending on the specific circumstances of the event. Although some studies find a high degree of accuracy (more than 95%) in recognizing familiar voices, studies often show accuracy rates of less than 70%, and sometimes significantly lower. Daniel Yarmey and colleagues, for example, compared participants’ recognition of highly familiar voices (immediate family members or best friends), moderately familiar voices (co-workers, teammates, or friends), and low-familiarity voices (casual acquaintances). They found that accuracy for identifying voices of low and moderate familiarity was only about 65%, and that participants misidentified the voices of strangers as being familiar almost 40% of the time. Thus, according to Yarmey, when a witness claims to recognize a perpetrator’s voice as that of a familiar person, police officers should not simply take this statement at face value but should construct a voice lineup to test the witness’s ability to identify the voice in question.

Unfortunately, the most salient indicator of voice recognition accuracy for a juror is often the witness’s confidence in the courtroom. Studies consistently show that voice identification accuracy is almost completely unrelated to confidence. Extremely confident witnesses are often wrong in their identification of a voice, and accurate witnesses often show little confidence in their identifications. Furthermore, jurors are likely to overestimate the likelihood of any voice identification being accurate. When psychology students, for example, are asked to estimate the percentage of accurate identifications in circumstances mirroring actual laboratory and field studies, they consistently give unrealistically high accuracy predictions. While it may not be surprising that laypersons have little knowledge of the problems associated with earwitness identification, a recent British study indicated that police officers were no more knowledgeable than the general population regarding voice recognition issues.

Voice Identification Procedures

In another parallel with eyewitness research, the use of one-person showups in voice identification has been criticized as unduly suggestive. In a study by Daniel Yarmey and his associates, a young woman approached citizens individually in a public place and interacted with them for about 15 seconds each. The participants were given a voice identification test approximately 5 minutes after the encounter. When the test was a one-person showup as opposed to a lineup of six voices, innocent suspects were significantly more likely to be identified. Accurate identifications of the real speaker’s voice were rare in both conditions.

Within the United States and, for the most part, internationally, there are few standardized procedures for use in forensic voice identification. Researchers at the Netherlands Forensic Institute have proposed the development of guidelines for voice lineup construction, similar to the guidelines in use in many police departments for eyewitness lineups. They advocate a minimum of five voices in the lineup in addition to the suspect, with foils being chosen for similarity to the suspect’s sex, age, accent, socioeconomic background, and vocal characteristics, such as pitch and speed of speaking. They also recommend the use of double-blind administrators and standardized instructions for the earwitness—recommendations that are becoming common in the realm of eyewitness procedure but need stronger advocacy in the voice recognition arena.

References:

Broeders, A. P. A., & van Amelsvoort, A. G. (1999). Lineup construction for forensic earwitness identification: A practical approach. In Proceedings of the 14th International Congress of Phonetic Sciences (pp. 1373-1376), University of California, Berkeley.
Read, D., & Craik, F. I. M. (1995). Earwitness identification: Some influences on voice recognition. Journal of Experimental Psychology: Applied, 1, 6-18.
Solan, L. M., & Tiersma, P. M. (2003, November/December). Falling on deaf ears [Electronic version]. Legal Affairs. Retrieved June 27, 2015, from http://www.legalaffairs.org/issues/November-December-2003/story_solan_novdec03.msp
Van Wallendael, L. R., Surace, A., & Hall-Parsons, D. (1994). “Earwitness” voice recognition: Factors affecting accuracy and impact on jurors. Applied Cognitive Psychology, 8, 661-677.
Yarmey, A. D. (1995). Earwitness speaker identification. Psychology, Public Policy, and Law, 1, 792-816.
Yarmey, A. D., Yarmey, A. L., Yarmey, M. J., & Parliament, L. (2001). Commonsense beliefs and the identification of familiar voices. Applied Cognitive Psychology, 15, 283-299.