21-07-2014, 03:21 PM
VOICE BASED WEB BROWSER
VOICE BASED WEB BROWSER.pdf (Size: 1.78 MB / Downloads: 44)
Introduction
During the past few years, the computer industry has seen speech technologies arrive
with a flash and quietly fold without a whimper. Speech processing nevertheless
remains an active area of research. Speech recognition systems
have been around for over twenty years, but the early systems were very expensive
and required powerful computers to run. During recent years, manufacturers have
reduced the prices of the speech recognition systems. The technology behind speech
output has also changed. Early systems used discrete speech, i.e. the user had to
speak one word at a time, with a short pause between words. Over the past few years
most systems have used continuous speech, allowing the user to speak in a more
natural way.
The main commercial continuous speech systems currently available for the PC are
Dragon NaturallySpeaking and IBM ViaVoice. With the advancement of technology,
many research projects today develop applications that use speech technology to
improve the user's experience. As pioneers, IBM, Microsoft and Sun Microsystems
have carried out a considerable amount of work in this area. Microsoft has offered
the ability to control Windows by voice commands and has embedded dictation into
applications such as Microsoft Word. Building on those promising achievements,
other developers besides Microsoft and Sun Microsystems have by now also released
speech APIs to help application developers.
Purpose
The Internet has brought about an incredible improvement in human access to
knowledge and information. Voice browsers allow people to access the Web using
speech synthesis, pre-recorded audio, and speech recognition. This can be
supplemented by keypads and small displays. Voice may also be offered as an adjunct
to conventional desktop browsers with high resolution graphical displays, providing
an accessible alternative to using the keyboard or screen, for instance in automobiles
where hands/eyes free operation is essential. Voice interaction can escape the
physical limitations on keypads and displays as mobile devices become ever smaller.
The browser will have an integrated text extraction engine that inspects the content
of the page to construct a structured representation. The internal nodes of the
structure represent various levels of abstraction of the content. This helps in easy and
flexible navigation of the page, so as to rapidly home in on objects of interest. Finally,
the browser is integrated with an automatic Text-To-Speech engine that outputs the
selected text in the form of speech.
Abstract
Voice based web browser is a solution basically for recognizing and interpreting
voice. The system will be capable of converting an acoustic signal, captured by a
microphone, into a set of words; the emphasis is that this acoustic wave represents
human speech. The recognized words can be used by applications as commands or
data entries, or can serve as the input to further linguistic processing in order to
achieve speech understanding and produce output that other applications can use
extensively. The system consists of two main components: one component processes
the acoustic signal captured by the microphone, while the other interprets the
processed signal and maps it to words. Hence the end product will be capable of
tracking human speech.
Problem/Requirement
Speech recognition is the process of converting an acoustic signal, captured by a
microphone or a telephone, into a set of words. The recognized words can be the final
results, as in applications such as command and control, data entry, and document
preparation. They can also serve as the input to further linguistic processing in order
to achieve speech understanding, a subject covered in a later section. Speech
recognition is an alternative to typing on a keyboard: the user simply talks to the
computer, and the computer captures the user's utterance. Such software has been
developed to provide a fast method of communicating with a computer and can help
people with a variety of disabilities. It is useful for people with physical disabilities
who often find typing difficult, painful or impossible. Voice recognition software can
also help those with spelling difficulties, including users with dyslexia, because
recognized words are always correctly spelled.
The speech recognition process is performed by a software component known as the
speech recognition engine. The primary function of the speech recognition engine is
to process spoken input and translate it into text that an application understands.
The application can then do one of two things:
1. The application can interpret the result of the recognition as a command. In
this case, the application is a command and control application. An example of
a command and control application is one in which the caller says “check
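As an illustration of the command-and-control case, the sketch below shows how an application might map a recognizer's text output onto actions. The command phrases, handler functions and URL here are entirely hypothetical, and no real speech API is involved; `recognized_text` simply stands in for the engine's output.

```python
def open_page(url: str) -> str:
    # Hypothetical handler: a real browser would navigate to the URL.
    return f"opening {url}"

def scroll(direction: str) -> str:
    # Hypothetical handler: a real browser would scroll the page.
    return f"scrolling {direction}"

# Map spoken phrases to (handler, argument) pairs.
COMMANDS = {
    "open home page": (open_page, "https://example.com"),
    "scroll down": (scroll, "down"),
    "scroll up": (scroll, "up"),
}

def interpret(recognized_text: str) -> str:
    """Treat the recognizer's text output as a command, if it matches."""
    entry = COMMANDS.get(recognized_text.strip().lower())
    if entry is None:
        return "unrecognized command"
    handler, argument = entry
    return handler(argument)
```

A dictionary dispatch like this keeps the recognizer and the application logic decoupled: the engine only has to produce text, and the application decides what the text means.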
HIDDEN MARKOV MODEL (HMM)-BASED SPEECH RECOGNITION
Most modern speech recognition systems are based on this approach, which provides
statistical models that output a sequence of symbols or quantities. A hidden Markov
model outputs a sequence of n-dimensional real-valued vectors, where each vector
consists of cepstral coefficients, obtained by taking a Fourier transform of a short
time window of speech, decorrelating the spectrum using a cosine transform, and
then taking the first (most significant) coefficients. In each state the hidden Markov
model has a statistical distribution that is a mixture of diagonal-covariance
Gaussians, which gives a likelihood for each observed vector. Each word or each
phoneme has a different output distribution. A hidden Markov model for a sequence
of words or phonemes is made by concatenating the individually trained hidden
Markov models for the separate words and phonemes. Decoding of the speech
typically uses the Viterbi algorithm to find the best path, and here there is a choice
between dynamically creating a combination HMM that includes both the acoustic
and language model information, or combining them statically beforehand.
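The front-end feature extraction described above (windowed Fourier transform, log magnitude spectrum, cosine transform, first coefficients) can be sketched as follows. This is a simplified illustration, not the exact pipeline of any particular recognizer: a production front end would also apply pre-emphasis and mel-scale filtering before the cosine transform.

```python
import numpy as np

def cepstral_coefficients(frame, n_coeffs=13):
    """Per-frame features: windowed FFT -> log magnitude -> DCT -> first coeffs."""
    # Window the frame and take the magnitude spectrum.
    windowed = frame * np.hamming(len(frame))
    spectrum = np.abs(np.fft.rfft(windowed))
    log_spec = np.log(spectrum + 1e-10)  # guard against log(0)
    # Decorrelate the log spectrum with a type-II DCT (cosine transform)
    # and keep only the first, most significant, coefficients.
    n = len(log_spec)
    k = np.arange(n_coeffs)[:, None]
    m = np.arange(n)[None, :]
    basis = np.cos(np.pi * k * (2 * m + 1) / (2 * n))
    return basis @ log_spec

# Example: one 32 ms frame of a 100 Hz tone sampled at 8 kHz.
frame = np.sin(2 * np.pi * 100 * np.arange(256) / 8000)
features = cepstral_coefficients(frame)  # 13-dimensional feature vector
```

A sequence of such vectors, one per short overlapping window, is what the HMM's Gaussian mixtures score during decoding.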
Design Constraints
Hardware Environment
The system microphone must produce a correspondingly uninterrupted voice level
throughout an operating session. It should pick up a consistent noise level from the
environment, and its components must add either no extra noise or only uniform
extra noise to the detected voice stream.
End User Environment
The system has one type of user, who feeds the voice stream in. He should have
general computer literacy. The most important concern is to stay in a noise-free
environment when operating the system, and the user must maintain an even accent
at all times.
Database Requirements
The VOICE BASED WEB-BROWSER (version 1.0) is not connected to any database
system.
Performance Requirements
The VOICE BASED WEB-BROWSER speech recognition system is designed to
operate in real time. For a given voice input wave, it gives the matching literal output
within a delay of one second.
Speech Synthesis
Speech synthesis is the artificial production of human speech. Synthesis is the
process of generating speech waveforms by machine from the phonetic transcription
of the message. Recent progress in speech synthesis has produced synthesizers with
very high intelligibility, but sound quality and naturalness remain major problems.
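As a rough illustration of waveform generation from a phonetic specification, the sketch below builds a vowel-like signal by weighting the harmonics of a 100 Hz fundamental with bell-shaped envelopes centred on formant frequencies (the /a/ values of roughly 600, 1000 and 2500 Hz). The additive scheme and the assumed 120 Hz formant bandwidth are simplifications for demonstration, not a production synthesis method.

```python
import numpy as np

def synthesize_vowel(f0=100.0, formants=(600.0, 1000.0, 2500.0),
                     duration=0.2, sr=16000):
    """Additive sketch: harmonics of f0, amplitude-shaped by formant envelopes."""
    t = np.arange(int(duration * sr)) / sr
    wave = np.zeros_like(t)
    # Sum every harmonic of f0 below the Nyquist frequency.
    for h in range(1, int((sr / 2) // f0)):
        freq = h * f0
        # Gaussian-shaped spectral envelope around each formant (120 Hz spread).
        amp = sum(np.exp(-((freq - f) ** 2) / (2 * 120.0 ** 2))
                  for f in formants)
        wave += amp * np.sin(2 * np.pi * freq * t)
    # Normalize to the range [-1, 1].
    return wave / (np.max(np.abs(wave)) + 1e-12)

vowel = synthesize_vowel()  # 0.2 s vowel-like waveform at 16 kHz
```

Played back, a signal like this sounds buzzy rather than natural, which is exactly the quality gap the paragraph above describes.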
Representation and analysis of Speech Signals
Continuous speech is a set of complicated audio signals, which makes producing it
artificially difficult. Speech signals are classified as voiced or unvoiced, but in some
cases they are something between the two. Voiced sounds consist of a fundamental
frequency (F0) and its harmonic components produced by the vocal cords. The vocal
tract modifies this excitation signal, causing formant (pole) and sometimes
antiformant (zero) frequencies. Each formant frequency has an amplitude and a
bandwidth, and it can be difficult to define some of these parameters correctly.
The fundamental and formant frequencies are important concepts in speech
synthesis and also in speech processing generally.
With a purely unvoiced sound there is no fundamental frequency in the excitation
signal, and therefore no harmonic structure either; the excitation can be considered
white noise. The airflow is forced through a vocal tract constriction. Some sounds are
produced with a complete stoppage of airflow followed by a sudden release,
producing an impulsive turbulent excitation followed by protracted turbulent
excitation. Unvoiced sounds are generally quieter.
Speech signals of three vowels (/a/, /i/, /u/) are presented in the time and frequency
domains. The fundamental frequency is about 100 Hz in all cases, and the formant
frequencies F1, F2 and F3 of vowel /a/ are approximately 600 Hz, 1000 Hz and
2500 Hz respectively. For vowel /i/ the first three formants are 200 Hz, 2300 Hz and
3000 Hz, and for /u/ they are 300 Hz, 600 Hz and 2300 Hz. The harmonic structure
of the excitation is easy to perceive from the frequency-domain presentation.
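The fundamental frequency of a voiced signal like those above (F0 of about 100 Hz) can be estimated from the lag of the autocorrelation peak. This is a minimal sketch assuming a clean signal and a 50-500 Hz pitch search range; real pitch trackers handle octave errors, noise and unvoiced frames far more carefully.

```python
import numpy as np

def estimate_f0(signal, sr, fmin=50.0, fmax=500.0):
    """Estimate F0 as the autocorrelation peak in the plausible lag range."""
    # Autocorrelation for non-negative lags only.
    corr = np.correlate(signal, signal, mode="full")[len(signal) - 1:]
    # Restrict the search to lags between the periods of fmax and fmin.
    lo, hi = int(sr / fmax), int(sr / fmin)
    lag = lo + int(np.argmax(corr[lo:hi]))
    return sr / lag

# A synthetic voiced frame: 100 Hz fundamental plus its second harmonic.
sr = 16000
t = np.arange(int(0.05 * sr)) / sr
voiced = np.sin(2 * np.pi * 100 * t) + 0.5 * np.sin(2 * np.pi * 200 * t)
f0 = estimate_f0(voiced, sr)  # close to 100 Hz
```

The same autocorrelation gives a flat, peakless picture for white-noise-like unvoiced excitation, which is one practical way a system separates voiced from unvoiced frames.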
Conclusion
In my opinion, the software does an admirable job of addressing the usability needs
of a wide audience. It is necessary to ensure that designers develop software products with
universal usability in mind.
Designers created a generally solid software product that almost anyone can use with
success. The software recorded and analyzed each test subject's voice successfully.
Afterwards, each user could dictate and the PC transcribed the user's dictation with
relative accuracy.
The voice to text transcription application is a proven feature of Fonix Embedded
Speech DDK. However, using this software application as a communication device
for automobiles is as yet unproven, at least in my opinion. First, I have not figured
out a way to input external files into a program that needs storage for its data. As an
example, in Samples Four and Five the programs need places to store the origins and
destinations input by the users, so that the program can use them later for the
directions. Because of this lack of features, I had to be creative and work around it to
find another alternative to make the Samples work the way I wanted them to.
Perhaps I have simply been unable to discover the best features of the software
product yet, and the features I was looking for are still hidden within the package
somewhere.
The built-in microphone on my notebook did not work well with the software
product in noisy environments. The product needed assistance from the Array
Microphone made by the Andreaelectronic Company. Despite its small size, the Array
Microphone's performance is excellent: it reduced noise to minimal levels and helped
a great deal with the software product, especially when users want to use it in a very
noisy environment such as a train station, a freeway, near a hospital, or on busy
commuter roads.
A suggestion for further research is to select users who actually would