02-03-2013, 12:06 PM
Speaker Recognition
Introduction
Speaker, or voice, recognition is a biometric modality that uses an individual's voice for recognition purposes. (It is a distinct technology from "speech recognition", which recognizes words as they are articulated and is not a biometric.) The speaker recognition process relies on features influenced by both the physical structure of an individual's vocal tract and the individual's behavioral characteristics. Speaker recognition is a popular choice for remote authentication because of the wide availability of devices for collecting speech samples (e.g., telephone networks and computer microphones) and its ease of integration. It differs from some other biometric methods in that speech samples are captured dynamically, over a period of time such as a few seconds. Analysis occurs on a model in which changes over time are monitored, which is similar to other behavioral biometrics such as dynamic signature, gait, and keystroke recognition.
History
Speaker verification has co-evolved with the technologies of speech recognition and speech synthesis because of the similar characteristics and challenges associated with each. In 1960, Gunnar Fant, a Swedish professor, published a model describing the physiological components of acoustic speech production, based on the analysis of x-rays of individuals making specified phonic sounds. [1] In 1970, Dr. Joseph Perkell expanded upon the Fant model, using motion x-rays and including the tongue and jaw. [1] Original speaker recognition systems used the average output of several analog filters to perform matching, often with the aid of humans "in the loop". [2,3,4,5,6] In 1976, Texas Instruments built a prototype system that was tested by the U.S. Air Force and The MITRE Corporation. [1,7] In the mid-1980s, the National Institute of Standards and Technology (NIST) established the NIST Speech Group to study and promote the use of speech processing techniques. Since 1996, under funding from the National Security Agency, the NIST Speech Group has hosted a yearly evaluation, the NIST Speaker Recognition Evaluation Workshop, to foster the continued advancement of the speaker recognition community.
Approach
The physiological component of voice recognition is related to the physical shape of an individual's vocal tract, which consists of an airway and the soft tissue cavities from which vocal sounds originate. [1] To produce speech, these components work in combination with the physical movement of the jaw, tongue, and larynx and with resonances in the nasal passages. The acoustic patterns of speech come from the physical characteristics of the airways; motion of the mouth and pronunciation are the behavioral components of this biometric.

There are two forms of speaker recognition: text dependent (constrained mode) and text independent (unconstrained mode). In a "text dependent" system, the individual presents either a fixed phrase (a password) or a prompted phrase ("Please say the numbers '33-54-63'") that is programmed into the system; this can improve performance, especially with cooperative users. A "text independent" system has no advance knowledge of the presenter's phrasing. It is much more flexible in situations where the individual submitting the sample may be unaware of the collection or unwilling to cooperate, but this presents a more difficult recognition challenge. [9]

Speech samples are waveforms, with time on the horizontal axis and loudness on the vertical axis. The speaker recognition system analyzes the frequency content of the speech and compares characteristics such as the quality, duration, intensity dynamics, and pitch of the signal.
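As a rough illustration of the analysis step described above, the sketch below reduces a waveform to a small vector of frequency-band energies and compares two samples by cosine similarity. This is a deliberately minimal toy, not how production systems work (those typically use cepstral features and statistical speaker models); the synthetic "voices", the 16-band layout, and the 50 Hz lower cutoff are all assumptions chosen for the example.

```python
import numpy as np

def spectral_features(signal, sample_rate, n_bands=16):
    """Summarize a waveform as mean spectral energy in log-spaced frequency bands."""
    spectrum = np.abs(np.fft.rfft(signal))
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / sample_rate)
    # Log-spaced band edges between an assumed 50 Hz floor and the Nyquist frequency
    edges = np.geomspace(50.0, sample_rate / 2.0, n_bands + 1)
    feats = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        band = spectrum[(freqs >= lo) & (freqs < hi)]
        feats.append(band.mean() if band.size else 0.0)
    feats = np.asarray(feats)
    return feats / (np.linalg.norm(feats) + 1e-12)  # unit-length feature vector

def similarity(a, b):
    """Cosine similarity of two unit-length feature vectors (1.0 = identical)."""
    return float(np.dot(a, b))

# Toy "speakers": pure tones standing in for different fundamental pitches
sr = 8000
t = np.arange(sr) / sr  # one second of audio
voice_a = np.sin(2 * np.pi * 120 * t) + 0.5 * np.sin(2 * np.pi * 240 * t)
# Same "speaker" again, with a little noise added
rng = np.random.default_rng(0)
voice_a2 = voice_a + 0.01 * rng.normal(size=sr)
# A different "speaker" with a higher fundamental
voice_b = np.sin(2 * np.pi * 210 * t) + 0.5 * np.sin(2 * np.pi * 420 * t)

fa = spectral_features(voice_a, sr)
fa2 = spectral_features(voice_a2, sr)
fb = spectral_features(voice_b, sr)

print(similarity(fa, fa2))  # close to 1.0: same toy speaker
print(similarity(fa, fb))   # noticeably lower: different toy speaker
```

A real text-independent system would additionally have to score such features against a statistical model of each enrolled speaker rather than against a single reference sample, since the phrasing is not known in advance.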
Conclusion
Thanks to the commitment of researchers and the support of the NSA and NIST, speaker recognition will continue to evolve as communication and computing technology advance, and their determination will help to further develop the technology.