05-09-2016, 11:16 AM
Introduction
Think about creating a car that is controlled by your voice: you give a command, and the car drives you to your destination. The voice recognition algorithm we used could be applied to daily life; for example, it could help disabled people perform their daily work. We created a speech-controlled car by drawing on several electrical and mechanical domains, including digital signal processing, analog circuit design, and interfacing the car with the Mega32.
We were both interested in building some kind of robot. We researched earlier robotics projects and found line-follower robots and sensor robots, but none of them used speech to control a robot. In the digital world, it would be cool to make a robot that obeys human speech commands and performs errands. The movie “I, Robot” (where our project idea came from) shows a high-tech robotic car that responds to vocal commands and drives according to human speech. We picked that part of the movie’s theme and set out to make a car move using speech recognition. The computation required to process speech would normally overflow the Mega32’s memory, but we found a nice algorithm on a website by Tor Aamodt, from the ECE department at the University of British Columbia, that would fit within our constraints.
• Background math:
Speech Analysis:
In order to analyze speech, we needed to look at the frequency content of the detected word. To do this we used several 4th order Chebyshev band pass filters. To create each 4th order filter, we cascaded two second order filters, each using the "Direct Form II Transposed" implementation of a difference equation.
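As an illustration of the cascade described above, here is a minimal sketch of a second order section in Direct Form II Transposed, written in Python rather than the Mega32 code we ran. The class name `Biquad`, the helper `cascade_step`, and the use of floating point are assumptions made for readability; the real filters used fixed-point arithmetic, and the Chebyshev coefficients are not reproduced here.

```python
class Biquad:
    """One second order section in Direct Form II Transposed (illustrative)."""

    def __init__(self, b, a):
        # b = (b0, b1, b2) feed-forward, a = (a1, a2) feedback; a0 is taken as 1.
        self.b0, self.b1, self.b2 = b
        self.a1, self.a2 = a
        self.s1 = 0.0  # first state variable
        self.s2 = 0.0  # second state variable

    def step(self, x):
        # y[n]  = b0*x[n] + s1
        # s1    = b1*x[n] - a1*y[n] + s2
        # s2    = b2*x[n] - a2*y[n]
        y = self.b0 * x + self.s1
        self.s1 = self.b1 * x - self.a1 * y + self.s2
        self.s2 = self.b2 * x - self.a2 * y
        return y


def cascade_step(sections, x):
    """Pass one sample through the sections in order (two sections = 4th order)."""
    for section in sections:
        x = section.step(x)
    return x
```

Two `Biquad` objects passed to `cascade_step` give one 4th order filter; the coefficients would come from a filter design tool.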
• Initial-Threshold Calculation:
At start-up, as part of initialization, the program reads the ADC input using Timer/Counter0 and accumulates its value 256 times. By interpreting each ADC reading as a fixed-point number between 1/256 and 1 and accumulating 256 of them, the average ADC value is calculated without a multiply or divide. Three average values are taken, each with a 16.4 ms delay between the samples. After the three average values have been obtained, the threshold is set to four times their median. The threshold is used to detect whether a word has been spoken.
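The averaging trick can be sketched as follows, assuming 8-bit ADC readings (an assumption; the function names are hypothetical too). Summing 256 readings and keeping everything above the low byte is the same as dividing by 256, so no multiply or divide is needed, and multiplying by four is a two-bit shift.

```python
def average_of_256(samples):
    """Average 256 ADC readings using only additions and a shift."""
    if len(samples) != 256:
        raise ValueError("expected exactly 256 samples")
    acc = 0
    for s in samples:
        acc += s       # fits in 16 bits if the readings are 8-bit
    return acc >> 8    # dropping the low byte divides the sum by 256


def initial_threshold(avg1, avg2, avg3):
    """Threshold = four times the median of the three averages."""
    median = sorted((avg1, avg2, avg3))[1]
    return median << 2  # shift left by two = multiply by four
```

For example, averages of 5, 3, and 9 have a median of 5, which gives a threshold of 20.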
• Fingerprint Generation:
The program considers a word detected when a sample value from the ADC exceeds the threshold value. Every ADC sample is typecast to an int and stored in a temporary variable, Ain. Once a word has been detected, the Ain value passes through 8 4th order Chebyshev band pass filters with a 40 dB stop band for 2000 samples (half a second). Each filter’s output is squared and added to the running sum of that filter’s previous squared outputs. After 125 samples the accumulated value is stored as a data point in the fingerprint of that word. The accumulator is then cleared and the process begins again. After 2000 samples 16 points have been generated from each filter, so every sampled word is divided into 16 parts. Our code is based around using 10 filters, and since each one outputs 16 data points, every fingerprint is made up of 160 data points.
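The accumulate-and-dump loop described above can be sketched like this. It is a simplified Python illustration, not our Mega32 code: `fingerprint` is a hypothetical name, each filter is modeled as any callable that maps one input sample to one output sample, and floating point stands in for the fixed-point math.

```python
SAMPLES_PER_WORD = 2000   # half a second at 4 kHz
SAMPLES_PER_POINT = 125   # 2000 / 125 = 16 points per filter


def fingerprint(samples, filters):
    """Return one list of 16 energy points per filter."""
    if len(samples) != SAMPLES_PER_WORD:
        raise ValueError("expected 2000 samples")
    points = [[] for _ in filters]
    acc = [0.0] * len(filters)
    for n, x in enumerate(samples, start=1):
        for i, f in enumerate(filters):
            y = f(x)
            acc[i] += y * y              # accumulate squared filter output
        if n % SAMPLES_PER_POINT == 0:   # every 125 samples: store and clear
            for i in range(len(filters)):
                points[i].append(acc[i])
                acc[i] = 0.0
    return points
```

With 10 filters the result holds 10 x 16 = 160 data points, matching the fingerprint size given above.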
Project Results
Since we had to pass the ADC output through all of the filters in less than one sample period, the time it took to do all the filter calculations was very important. We were able to run through 9 filters in under 4000 cycles, which is the number of cycles available when sampling from the ADC at 4 kHz. The fingerprint comparison function had no speed requirement, so its cycle time was unimportant. The program was able to recognize five words, but it would sometimes become confused and match the wrong word if the spoken word varied too much from the word stored in the dictionary. As a rough estimate, the program recognized the correct word about 70% of the time a valid word was spoken. The program achieved this success using Chirag’s voice, and with sufficient practice a person could say the same word with small enough variation for the program to recognize it most of the time. For the general person, though, the recognition program would have a much lower success rate. Also, the words in the dictionary were spoken by only one person. If someone else said the same words, it is unlikely the program would recognize the correct word most of the time, if at all.
For safety and testing, we made sure the PWM signals sent to the car were as close to neutral as possible while still letting the car move forward and backward. We did this to prevent the car from going out of control and potentially hurting others. Our project did not use any RF signals, and the board we used ran off a battery, so there were no physical connections to anything involving other people’s projects. Also, the only pins switching state were the PWM pins, which were mostly covered by wire.
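The timing budget mentioned above is simple arithmetic, sketched here with hypothetical helper names. The clock rate is not stated directly; it is only what the two quoted figures (4000 cycles per sample at 4 kHz) imply when multiplied together.

```python
SAMPLE_RATE_HZ = 4_000      # ADC sample rate quoted above
CYCLES_PER_SAMPLE = 4_000   # cycle budget per sample quoted above


def implied_clock_hz(sample_rate_hz, cycles_per_sample):
    """Clock rate implied by a per-sample cycle budget."""
    return sample_rate_hz * cycles_per_sample


def cycles_per_filter(cycles_per_sample, filter_count):
    """Even split of the budget across filters (integer cycles)."""
    return cycles_per_sample // filter_count


def fits_budget(cycles_used, cycles_per_sample=CYCLES_PER_SAMPLE):
    """True if the filter work finishes before the next sample arrives."""
    return cycles_used < cycles_per_sample
```

Under these figures, 9 filters sharing the budget evenly would each have roughly 444 cycles per sample.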
Conclusion
At the beginning of our project, we set a goal of recognizing five words, and by the end of the project we had five words recognized. However, our five words needed to be orthogonal to each other, because our filters did not give high enough resolution, and inaccuracy in the fingerprint calculations, caused by fixed-point arithmetic, made the lookup function error-prone. As a result, we had to pick words that sound quite different from one another. If we had to do this again, instead of using the Euclidean distance formula to match words, we would like to try performing a correlation of the two fingerprints. A correlation is less sensitive to amplitude differences and is a better way of identifying patterns between two objects. If we had a faster processor, we could have modified our algorithm to add more filters, perform a Fourier transform, or use floating-point arithmetic to improve our results.
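To show why a correlation might behave better than Euclidean distance, here is a small sketch comparing the two on plain lists of numbers. The Pearson correlation coefficient is used as one possible form of "correlation"; that choice, and the function names, are assumptions and not something we implemented on the Mega32.

```python
import math


def euclidean_distance(a, b):
    """Straight-line distance between two equal-length fingerprints."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def correlation(a, b):
    """Pearson correlation coefficient of two equal-length fingerprints."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    da = [x - mean_a for x in a]
    db = [y - mean_b for y in b]
    num = sum(x * y for x, y in zip(da, db))
    den = math.sqrt(sum(x * x for x in da) * sum(y * y for y in db))
    return num / den if den else 0.0  # flat fingerprints carry no pattern
```

If the same word is spoken twice as loudly, every point in the fingerprint scales up: the Euclidean distance between the two fingerprints grows, while the correlation stays at 1 because the shape is unchanged.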