
VOICE CONTROLLED ROBOT


INTRODUCTION

When we say voice control, the first term to consider is speech recognition, i.e.
making the system understand human speech. Speech recognition is a technology in which
the system understands the words given through speech (though not their meaning). Speech
is an ideal method for robotic control and communication. The speech-recognition
circuit we will outline functions independently of the robot's main
intelligence [central processing unit (CPU)]. This is a good thing because word
recognition does not consume any of the robot's main CPU processing power. The CPU must
merely poll the speech circuit's recognition lines occasionally to check whether a command
has been issued to the robot. We can improve upon this further by connecting the recognition
line to one of the robot's CPU interrupt lines. By doing this, a recognized word would
cause an interrupt, letting the CPU know that a recognized word has been spoken. The
advantage of using an interrupt is that polling the circuit's recognition line
would no longer be necessary, further reducing CPU overhead.
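
To make the polling-versus-interrupt trade-off concrete, here is a minimal Python sketch. The helper functions read_recognition_line() and read_word_code(), the command table, and the 50 ms poll period are all assumptions for illustration; a real robot would read the speech circuit's recognition line and data bus through its own I/O layer.

```python
import time

# Hypothetical hardware-access helpers (assumed names); on a real robot these
# would read the speech circuit's recognition line and data bus.
def read_recognition_line():
    """Return True when the speech circuit has latched a recognized word."""
    return False  # stub for illustration

def read_word_code():
    """Return the index of the recognized word from the circuit's data bus."""
    return 0  # stub for illustration

COMMANDS = {1: "forward", 2: "reverse", 3: "left", 4: "right", 5: "stop"}

def poll_for_command(period_s=0.05):
    """Polling approach: the main loop checks the recognition line occasionally,
    spending a little CPU time even when nothing has been spoken."""
    while True:
        if read_recognition_line():
            code = read_word_code()
            print("command:", COMMANDS.get(code, "unknown"))
        time.sleep(period_s)

def on_recognition_interrupt():
    """Interrupt approach: registered against the CPU's interrupt line so it runs
    only when a word is actually recognized, removing the polling overhead."""
    code = read_word_code()
    print("command:", COMMANDS.get(code, "unknown"))
```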

Why build robots?

Robots are indispensable in many manufacturing industries. The reason is that the cost
per hour to operate a robot is a fraction of the cost of the human labor needed to perform
the same function. More than this, once programmed, robots repeatedly perform
functions with a high accuracy that surpasses that of the most experienced human
operator. Human operators are, however, far more versatile. Humans can switch job tasks
easily. Robots are built and programmed to be job specific. You wouldn’t be able to
program a welding robot to start counting parts in a bin. Today’s most advanced
industrial robots will soon become “dinosaurs.” Robots are in the infancy stage of their
evolution. As robots evolve, they will become more versatile, emulating the human
capacity and ability to switch job tasks easily. While the personal computer has made an
indelible mark on society, the personal robot hasn’t made an appearance. Obviously
there’s more to a personal robot than a personal computer. Robots require a combination
of elements to be effective: sophistication of intelligence, movement, mobility,
navigation, and purpose.

SPEECH RECOGNITION TYPES AND STYLES

Voice-enabled devices basically use the principle of speech recognition. It is the process
of electronically converting a speech waveform (as the realization of a linguistic
expression) into words (as a best-decoded sequence of linguistic units).
Converting a speech waveform into a sequence of words involves several essential steps:
1. A microphone picks up the signal of the speech to be recognized and converts it
into an electrical signal. A modern speech recognition system also requires that
the electrical signal be represented digitally by means of an analog-to-digital
(A/D) conversion process, so that it can be processed with a digital computer or a
microprocessor.
2. This speech signal is then analyzed (in the analysis block) to produce a
representation consisting of salient features of the speech. The most prevalent
feature of speech is derived from its short-time spectrum, measured successively
over short-time windows of length 20–30 milliseconds overlapping at intervals of
10–20 ms. Each short-time spectrum is transformed into a feature vector, and the
temporal sequence of such feature vectors thus forms a speech pattern.
3. The speech pattern is then compared to a store of phoneme patterns or models
through a dynamic programming process in order to generate a hypothesis (or a
number of hypotheses) of the phonemic unit sequence. (A phoneme is a basic unit
of speech and a phoneme model is a succinct representation of the signal that
corresponds to a phoneme, usually embedded in an utterance.) A speech signal
inherently has substantial variations along many dimensions.
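
As a rough illustration of step 2 above, the following Python sketch slices a digitized signal into overlapping short-time windows and turns each window's spectrum into a feature vector. The 25 ms window, 10 ms hop, Hamming taper, and log-magnitude features are assumed example choices within the ranges mentioned above, not a prescribed front end.

```python
import numpy as np

def short_time_features(signal, sample_rate, win_ms=25, hop_ms=10):
    """Frame the digitized speech into overlapping short-time windows and
    turn each window's spectrum into a feature vector (step 2 above)."""
    win = int(sample_rate * win_ms / 1000)   # e.g. 25 ms window
    hop = int(sample_rate * hop_ms / 1000)   # e.g. 10 ms step between windows
    frames = []
    for start in range(0, len(signal) - win + 1, hop):
        frame = signal[start:start + win] * np.hamming(win)  # taper the edges
        spectrum = np.abs(np.fft.rfft(frame))                # short-time spectrum
        frames.append(np.log(spectrum + 1e-10))              # log magnitude as features
    return np.array(frames)  # rows = feature vectors forming the speech pattern

# Example: one second of a synthetic 440 Hz tone sampled at 16 kHz
sr = 16000
t = np.arange(sr) / sr
pattern = short_time_features(np.sin(2 * np.pi * 440 * t), sr)
print(pattern.shape)  # (number of windows, number of spectral features)
```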

Recognition Style

Speech recognition systems have another constraint concerning the style of speech they
can recognize. There are three styles of speech: isolated, connected and continuous.
Isolated speech recognition systems can only handle words that are spoken separately.
This is the most common type of speech recognition system available today. The user must
pause between each word or command spoken. The speech recognition circuit is set up to
identify isolated words of 0.96-second length.
Connected speech recognition is a halfway point between isolated-word and continuous
speech recognition; it allows users to speak multiple words together. The HM2007 can be
set up to identify words or phrases 1.92 seconds in length. This reduces the word
recognition vocabulary to 20.
Continuous speech is the natural conversational speech we are used to in everyday life. It
is extremely difficult for a recognizer to sift through it because the words tend to merge
together. For instance, "Hi, how are you doing?" sounds like "Hi, howyadoin".
Continuous speech recognition systems are on the market and are under continual
development.

Hidden Markov model (HMM)-based speech recognition

Modern general-purpose speech recognition systems are generally based on hidden
Markov models (HMMs). This is a statistical model which outputs a sequence of symbols
or quantities.
One reason HMMs are used in speech recognition is that a speech signal can be viewed
as a piecewise stationary signal or a short-time stationary signal. That is, one could
assume that over a short time, on the order of 10 milliseconds, speech can be
approximated as a stationary process. Speech can thus be thought of as a Markov model
over many such stochastic processes (known as states).
Another reason HMMs are popular is that they can be trained automatically and are simple
and computationally feasible to use. In speech recognition, to give the very simplest
setup possible, the hidden Markov model would output a sequence of n-dimensional
real-valued vectors with n around, say, 13, outputting one of these every 10
milliseconds. The vectors, again in the very simplest case, would consist of cepstral
coefficients, which are obtained by taking a Fourier transform of a short-time window of
speech, de-correlating the spectrum using a cosine transform, and then taking the first
(most significant) coefficients. The hidden Markov model will tend to have, in each state,
a statistical distribution called a mixture of diagonal-covariance Gaussians, which
gives a likelihood for each observed vector. Each word, or (for more general speech
recognition systems) each phoneme, has a different output distribution; a hidden
Markov model for a sequence of words or phonemes is made by concatenating the
individually trained hidden Markov models for the separate words and phonemes.
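
A minimal sketch of the two ingredients just described, assuming numpy and scipy are available: cepstral coefficients for one short-time window, and the log-likelihood of an observed vector under a single diagonal-covariance Gaussian (one component of one state's mixture). The function names and the choice of 13 coefficients are illustrative assumptions, not part of the report's circuit.

```python
import numpy as np
from scipy.fftpack import dct

def cepstral_coefficients(frame, n_coeffs=13):
    """Rough cepstral features for one short-time window: Fourier transform,
    log magnitude, then a cosine transform to de-correlate the spectrum,
    keeping only the first (most significant) coefficients."""
    spectrum = np.abs(np.fft.rfft(frame * np.hamming(len(frame))))
    log_spectrum = np.log(spectrum + 1e-10)
    return dct(log_spectrum, norm='ortho')[:n_coeffs]

def diag_gaussian_log_likelihood(x, mean, var):
    """Log-likelihood of a feature vector under one diagonal-covariance
    Gaussian, i.e. one mixture component of one HMM state."""
    return -0.5 * np.sum(np.log(2 * np.pi * var) + (x - mean) ** 2 / var)

# Example: score a random 13-dimensional observation against a unit Gaussian.
x = np.random.randn(13)
print(diag_gaussian_log_likelihood(x, np.zeros(13), np.ones(13)))
```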

Neural network-based speech recognition

Another approach to acoustic modeling is the use of neural networks. They are capable of
solving much more complicated recognition tasks, but do not scale as well as HMMs
when it comes to large vocabularies. Rather than being used in general-purpose speech
recognition applications, they are typically applied where the data is of low quality or
noisy, or where speaker independence is required. Such systems can achieve greater
accuracy than HMM-based systems, as long as there is sufficient training data and the
vocabulary is limited. A more general approach using neural networks is phoneme
recognition. This is an active field of research, and the results are generally better
than for HMMs. There are also NN-HMM hybrid systems that use the neural network part
for phoneme recognition and the hidden Markov model part for language modeling.
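
The following is a toy sketch of frame-level phoneme classification with a small neural network, assuming scikit-learn is available. The random 13-dimensional "feature" vectors, the three phoneme classes, and the network size are made up purely for illustration; a real system would train on labelled, transcribed speech.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

# Illustrative stand-in data: random "cepstral" vectors and made-up phoneme labels.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 13))    # 13-dimensional feature vectors, one per frame
y = rng.integers(0, 3, size=300)  # phoneme labels 0, 1, 2

# Small multilayer perceptron acting as a frame-level phoneme classifier.
clf = MLPClassifier(hidden_layer_sizes=(32,), max_iter=500)
clf.fit(X, y)
print(clf.predict(X[:5]))  # phoneme hypotheses for the first five frames
```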

Dynamic time warping (DTW)-based speech recognition

Dynamic time warping is an algorithm for measuring similarity between two sequences
which may vary in time or speed. For instance, similarities in walking patterns would be
detected, even if in one video the person was walking slowly and if in another they were
walking more quickly, or even if there were accelerations and decelerations during the
course of one observation. DTW has been applied to video, audio, and graphics -- indeed,
any data which can be turned into a linear representation can be analyzed with DTW.
A well known application has been automatic speech recognition, to cope with different
speaking speeds. In general, it is a method that allows a computer to find an optimal
match between two given sequences (e.g. time series) with certain restrictions, i.e. the
sequences are "warped" non-linearly to match each other. This sequence alignment
method is often used in the context of hidden Markov models.
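
A compact sketch of the classic DTW recurrence, assuming numpy. The toy sequences at the end stand in for two utterances of the same word spoken at different speeds; the feature vectors, the Euclidean local distance, and the absence of path restrictions are assumptions made for illustration.

```python
import numpy as np

def dtw_distance(a, b):
    """Dynamic time warping cost between two sequences of feature vectors,
    allowing one sequence to be non-linearly stretched to match the other."""
    n, m = len(a), len(b)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = np.linalg.norm(np.asarray(a[i - 1]) - np.asarray(b[j - 1]))
            cost[i, j] = d + min(cost[i - 1, j],      # insertion
                                 cost[i, j - 1],      # deletion
                                 cost[i - 1, j - 1])  # match
    return cost[n, m]

# The same "word" spoken slowly and quickly still aligns with low cost.
slow = [[0], [0], [1], [1], [2], [2], [3], [3]]
fast = [[0], [1], [2], [3]]
print(dtw_distance(slow, fast))  # low warped cost despite the length mismatch
```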