
IMPLEMENTATION OF A VOICE-BASED BIOMETRIC SYSTEM


INTRODUCTION

The concept of a machine that can recognize the human voice has long been an accepted feature of science fiction. From 'Star Trek' to George Orwell's '1984' - "Actually he was not used to writing by hand. Apart from very short notes, it was usual to dictate everything into the speakwrite." - it has been commonly assumed that one day it will be possible to converse naturally with an advanced computer-based system. Indeed, in his book 'The Road Ahead', Bill Gates (co-founder of Microsoft Corp.) hails Automatic Speech Recognition (ASR) as one of the most important innovations for future computer operating systems.
From a technological perspective it is possible to distinguish between two broad types of ASR: 'direct voice input' (DVI) and 'large vocabulary continuous speech recognition' (LVCSR). DVI devices are primarily aimed at voice command-and-control, whereas LVCSR systems are used for form filling or voice-based document creation. In both cases the underlying technology is more or less the same. DVI systems are typically configured for small- to medium-sized vocabularies (up to several thousand words) and might employ word- or phrase-spotting techniques. Also, DVI systems are usually required to respond immediately to a voice command. LVCSR systems involve vocabularies of perhaps hundreds of thousands of words, and are typically configured to transcribe continuous speech. Also, LVCSR need not be performed in real time - for example, at least one vendor has offered a telephone-based dictation service in which the transcribed document is e-mailed back to the user.
From an application viewpoint, the benefits of using ASR derive from providing an extra communication channel in hands-busy, eyes-busy human-machine interaction (HMI), or simply from the fact that talking can be faster than typing. Also, whilst speaking to a machine cannot be described as natural, it can nevertheless be considered intuitive; as one ASR advertisement declared, "you have been learning since birth the only skill needed to use our system".


MOTIVATION

The motivation for ASR is simple: speech is man's principal means of communication and is, therefore, a convenient and desirable mode of communication with machines. Speech communication has evolved to be efficient and robust, and it is clear that the route to computer-based speech recognition is the modeling of the human system. Unfortunately, from a pattern recognition point of view, humans recognize speech through a very complex interaction between many levels of processing, using syntactic and semantic information as well as very powerful low-level pattern classification and processing. Powerful classification algorithms and sophisticated front ends are, in the final analysis, not enough; many other forms of knowledge, e.g. linguistic, semantic and pragmatic, must be built into the recognizer. Nor, even at a lower level of sophistication, is it sufficient merely to generate a "good" representation of speech (i.e. a good set of features to be used in a pattern classifier); the classifier itself must have a considerable degree of sophistication. It remains the case, however, that poor features do not effectively discriminate between classes and, further, that the better the features, the easier the classification task.
Automatic speech recognition is therefore an engineering compromise between the ideal, i.e. a complete model of the human, and the practical, i.e. the tools that science and technology provide and that costs allow.
At the highest level, all speaker recognition systems contain two main modules (refer to Figure 1.1): feature extraction and feature matching. Feature extraction is the process that extracts a small amount of data from the voice signal that can later be used to represent each speaker. Feature matching involves the actual procedure to identify the unknown speaker by comparing extracted features from his/her voice input with the ones from a set of known speakers. We will discuss each module in detail in later sections.
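To make the two-module structure concrete, here is a minimal MATLAB sketch. It is our illustration, not the report's implementation: the helper names (extract_features, match_speaker) and the use of plain log-magnitude spectra with a nearest-mean decision are simplifying assumptions chosen for brevity.

    % Minimal sketch of the two modules (illustrative, not the report's code).
    % extract_features: frame the signal and compute log-magnitude spectra.
    function feats = extract_features(x, fs)
        x = x(:);                               % force column vector
        frameLen = round(0.025 * fs);           % 25 ms analysis frames
        hopLen   = round(0.010 * fs);           % 10 ms hop between frames
        half     = floor(frameLen / 2);
        nFrames  = floor((length(x) - frameLen) / hopLen) + 1;
        win = 0.54 - 0.46 * cos(2*pi*(0:frameLen-1)' / (frameLen-1)); % Hamming
        feats = zeros(nFrames, half);
        for i = 1:nFrames
            seg = x((i-1)*hopLen + (1:frameLen)) .* win;  % windowed frame
            spec = abs(fft(seg));
            feats(i, :) = log(spec(1:half) + eps).';      % log-magnitude spectrum
        end
    end

    % match_speaker: pick the enrolled speaker whose mean feature vector is
    % closest (average squared Euclidean distance over all frames).
    function id = match_speaker(feats, refModels)
        % refModels: cell array with one mean feature (row) vector per speaker.
        d = cellfun(@(m) mean(sum((feats - m).^2, 2)), refModels); % R2016b+ expansion
        [~, id] = min(d);                       % closest reference model wins
    end

Each function would live in its own .m file; a real system would replace the raw spectra with the more compact feature vectors described in Chapter 3.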

All recognition systems have to serve two different phases. The first is referred to as the enrollment sessions or training phase, while the second is referred to as the operation sessions or testing phase. In the training phase, each registered speaker has to provide samples of their speech so that the system can build or train a reference model for that speaker. In the case of speaker verification systems, a speaker-specific threshold is also computed from the training samples. During the testing (operational) phase (see Figure 1.2), the input speech is matched with the stored reference model(s) and a recognition decision is made.
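The fragment below sketches the two phases for a verification system, reusing the hypothetical extract_features helper sketched earlier. The mean-vector model and the margin factor on the threshold are our simplifying assumptions, not details taken from the report.

    % Enrollment (training) phase: build a reference model per registered
    % speaker and derive a speaker-specific acceptance threshold.
    function [model, threshold] = enroll(trainUtterances, fs)
        allFeats = [];
        for k = 1:numel(trainUtterances)
            allFeats = [allFeats; extract_features(trainUtterances{k}, fs)];
        end
        model = mean(allFeats, 1);               % simple mean-vector model
        d = mean(sum((allFeats - model).^2, 2)); % average training distance
        threshold = 1.5 * d;                     % margin factor is an assumption
    end

    % Testing (operational) phase: accept the claimed identity only if the
    % distance to the claimant's reference model falls below the threshold.
    function accepted = verify(x, fs, model, threshold)
        feats = extract_features(x, fs);
        score = mean(sum((feats - model).^2, 2));
        accepted = (score <= threshold);
    end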
Speech recognition is a difficult task and it is still an active research area. Automatic speech recognition works on the premise that a person's speech exhibits characteristics that are unique to the speaker. However, this task is challenged by the high variability of input speech signals. The principal source of variance is the speaker himself: speech signals in training and testing sessions can be greatly different because, among other factors, people's voices change with time, health conditions vary (e.g. the speaker has a cold), and speaking rates differ. There are also other factors, beyond speaker variability, that present a challenge to speech recognition technology. Examples of these are acoustical noise and variations in recording environments (e.g. the speaker uses different telephone handsets). The challenge is to make the system "robust".
So what characterizes a "robust" system? When people use an automatic speech recognition (ASR) system in a real environment, they hope it can achieve recognition performance as good as that of the human ear, which constantly adapts to environmental characteristics such as the speaker, the background noise and the transmission channel. Unfortunately, at present, a machine's capacity to adapt to unknown conditions is far poorer than ours. In fact, the performance of speech recognition systems trained with clean speech may degrade significantly in the real world because of the mismatch between the training and testing environments. If the recognition accuracy does not degrade very much under mismatched conditions, the system is called "robust".
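One standard way to reduce such train/test mismatch, shown here purely as an illustration (the report does not commit to a particular technique at this point), is cepstral mean normalization: subtracting the per-utterance mean of each feature dimension cancels any stationary channel response, such as that of a fixed telephone handset.

    % Cepstral mean normalization (CMN), applied to an (nFrames x nCoeffs)
    % feature matrix: subtracting each coefficient's mean over the utterance
    % removes stationary convolutive channel effects.
    function normFeats = cmn(feats)
        normFeats = feats - mean(feats, 1);  % R2016b+ implicit expansion
    end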

ORGANIZATION OF THE REPORT

The report first introduces the reader to speech processing by giving a theoretical overview, with a stress on speech recognition, to build a foundation for our project. In Chapter 3, we describe the process of feature extraction, which outlines the steps involved in extracting the feature vectors required to suitably represent the speech uttered.
In Chapter 4, we discuss in detail the algorithms which we have simulated and tested using MATLAB, with a comparative study of both. This is followed by the architectural details of the TMS320C6713 digital signal processor in Chapter 5, which we have used to implement our project. In the next chapter, we discuss some optimization steps to be followed while implementing an ASR system on the DSK.