25-02-2011, 12:06 PM
ABSTRACT
Speaker recognition is the process of automatically recognizing who is speaking on the basis of individual information included in speech waves. This technique makes it possible to use a speaker's voice to verify their identity and control access to services such as voice dialing, banking by telephone, telephone shopping, database access services, information services, voice mail, security control for confidential information areas, and remote access to computers.
The goal of this project is to build a simple, yet complete and representative, automatic speaker recognition system. Due to limited space, we will test our system only on a very small (but already non-trivial) speech database. There are 8 female speakers, labeled S1 to S8. Each speaker uttered the same single digit, "zero", once in a training session and once in a later testing session. The two sessions are at least 6 months apart to simulate voice variation over time. Digit vocabularies are used very often in testing speaker recognition because of their applicability to many security applications. For example, users may have to speak a PIN (Personal Identification Number) to gain access to a laboratory door, or speak their credit card number over the telephone line. By checking the voice characteristics of the input utterance with an automatic speaker recognition system similar to the one that we will develop, such an application can add an extra level of security.
1. Principles of Speaker Recognition
Speaker recognition can be classified into identification and verification. Speaker identification is the process of determining which registered speaker provides a given utterance. Speaker verification, on the other hand, is the process of accepting or rejecting the identity claim of a speaker. Figure shows the basic structures of speaker identification and verification systems.
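The difference between the two tasks can be sketched in a few lines of Python. This is only an illustration: it assumes that a distance between the utterance and each registered speaker's model has already been computed, and the function names, the distance values and the threshold are all hypothetical rather than taken from the project.

```python
def identify(distances):
    """Identification: pick the registered speaker with the smallest distance."""
    return min(distances, key=distances.get)


def verify(distances, claimed, threshold):
    """Verification: accept the identity claim only if the distance is small enough."""
    return distances[claimed] <= threshold


# Hypothetical distances between one utterance and each speaker's model
# (made-up numbers, not results from the project).
distances = {"S1": 0.22, "S2": 3.13, "S3": 3.32}

best = identify(distances)                           # one answer out of N speakers
accept_s1 = verify(distances, "S1", threshold=0.5)   # yes/no for a claimed identity
accept_s2 = verify(distances, "S2", threshold=0.5)
print(best, accept_s1, accept_s2)  # S1 True False
```

Identification always returns one of the registered speakers, while verification returns only an accept or reject decision for the claimed identity.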
Speaker recognition methods can also be divided into text-independent and text-dependent methods. In a text-independent system, speaker models capture characteristics of a person’s speech that show up irrespective of what is being said. In a text-dependent system, on the other hand, the recognition of the speaker’s identity is based on his or her speaking one or more specific phrases, like passwords, card numbers, PIN codes, etc.
Each of these technologies (identification and verification, text-independent and text-dependent) has its own advantages and disadvantages and may require different treatments and techniques. The choice of which technology to use is application-specific. The system that we will develop is classified as a text-independent speaker identification system, since its task is to identify the person who speaks regardless of what is being said.
At the highest level, all speaker recognition systems contain two main modules (refer to Figure): feature extraction and feature matching. Feature extraction is the process that extracts a small amount of data from the voice signal that can later be used to represent each speaker. Feature matching involves the actual procedure to identify the unknown speaker by comparing extracted features from his/her voice input with the ones from a set of known speakers. We will discuss each module in detail in later sections.
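As a rough illustration of the two modules, the Python sketch below reduces a sampled signal to two numbers and then matches them against known speakers. The features used here (mean frame energy and mean zero-crossing rate) are a deliberately simple stand-in chosen for this example, not necessarily the features used in the project; the synthetic tones, the frame length and all function names are assumptions.

```python
import math


def extract_features(signal, frame_len=4):
    """Feature extraction: reduce many samples to [mean energy, mean zero-crossing rate]."""
    frames = [signal[i:i + frame_len]
              for i in range(0, len(signal) - frame_len + 1, frame_len)]
    energies = [sum(s * s for s in f) / frame_len for f in frames]
    # Fraction of adjacent sample pairs inside each frame that change sign.
    zcrs = [sum(1 for a, b in zip(f, f[1:]) if a * b < 0) / (frame_len - 1)
            for f in frames]
    return [sum(energies) / len(frames), sum(zcrs) / len(frames)]


def match(features, known):
    """Feature matching: return the known speaker whose features are closest."""
    return min(known, key=lambda name: math.dist(features, known[name]))


def tone(freq, n=64):
    """Synthetic stand-in for a voice signal (hypothetical test data)."""
    return [math.sin(2 * math.pi * freq * t / n + 0.3) for t in range(n)]


# Features of two "known speakers", then an unknown input to be matched.
known = {"low": extract_features(tone(2)), "high": extract_features(tone(10))}
unknown = extract_features(tone(3))
print(len(tone(3)), "samples ->", len(unknown), "features")
print(match(unknown, known))
```

The point of the sketch is the data flow: 64 samples go in, two numbers come out, and only those two numbers are compared against the stored speakers.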
All speaker recognition systems have to serve two distinct phases. The first is referred to as the training phase and the second as the testing (or operational) phase. In the training phase, each registered speaker has to provide samples of their speech so that the system can build or train a reference model for that speaker. In the case of speaker verification systems, a speaker-specific threshold is also computed from the training samples. During the testing (operational) phase (see Figure), the input speech is matched with the stored reference model(s) and a recognition decision is made.
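The two phases can be sketched as follows in Python. This is a minimal illustration under stated assumptions, not the project's implementation: the reference model is taken to be the mean of the training feature vectors, the speaker-specific threshold is the largest training distance scaled by a hypothetical `margin`, and the feature values are made up.

```python
import math


class SpeakerRecognizer:
    """Toy recognizer with a training phase and a testing phase (illustrative only)."""

    def __init__(self, margin=1.5):
        self.models = {}      # speaker -> reference model (mean feature vector)
        self.thresholds = {}  # speaker -> speaker-specific distance threshold
        self.margin = margin  # assumed safety factor, not from the source

    def train(self, speaker, samples):
        """Training phase: build a reference model and a threshold from samples."""
        model = [sum(col) / len(samples) for col in zip(*samples)]
        spread = max(math.dist(s, model) for s in samples)
        self.models[speaker] = model
        self.thresholds[speaker] = self.margin * spread

    def identify(self, features):
        """Testing phase (identification): closest stored reference model."""
        return min(self.models, key=lambda s: math.dist(features, self.models[s]))

    def verify(self, features, claimed):
        """Testing phase (verification): compare against the claimed speaker's threshold."""
        return math.dist(features, self.models[claimed]) <= self.thresholds[claimed]


rec = SpeakerRecognizer()
rec.train("S1", [[1.0, 2.0], [1.2, 2.2]])  # made-up training feature vectors
rec.train("S2", [[4.0, 0.5], [4.4, 0.7]])

test_input = [1.15, 2.0]
print(rec.identify(test_input))      # S1
print(rec.verify(test_input, "S1"))  # True
print(rec.verify(test_input, "S2"))  # False
```

Everything learned about a speaker is stored during `train`; the testing methods only read the stored models and thresholds, which mirrors the separation between the two phases.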