13-08-2012, 11:27 AM
GENDER RECOGNITION USING SPEECH PROCESSING TECHNIQUES IN LABVIEW
ABSTRACT
Traditionally, interest in voice-gender conversion was more theoretical than founded in
real-life applications. However, with the growth of biometric security applications, mobile and automated
telephonic communication, and the resulting constraints on transmission bandwidth, practical applications of
gender recognition have increased manyfold. In this paper, using various speech processing techniques and
algorithms, two models were built: one for generating the formant values of a voice sample and the other for
generating its pitch value. These two models were used to extract gender-biased
features, i.e. Formant 1 and the pitch value of a speaker. A preprocessing model was prepared in LabVIEW for
filtering out noise components and for enhancing the high-frequency formants in the voice sample. To
calculate the mean of the formants and pitch over all the samples of a speaker, a model containing a loop and counters
was implemented, which generated the mean Formant 1 and pitch value of the speaker. Using the nearest-neighbor
method, computing the Euclidean distance of the generated mean Formant 1 and pitch values from the mean
values of the male and female classes, the speaker was classified as male or female. The algorithm was
implemented in real time using NI LabVIEW.
INTRODUCTION
Problem Definition
The aim of this paper is to identify the gender of a speaker from the speaker's voice, using
speech processing techniques in real time in LabVIEW. Gender-based differences in human
speech are partly due to physiological differences such as vocal fold thickness or vocal tract length and
partly due to differences in speaking style. Since these changes are reflected in the speech signal, we
hope to exploit these properties to automatically classify a speaker as male or female.
Proposed Solution
In finding the gender of a speaker we have used acoustic measures from both the voice source and the
vocal tract, the fundamental frequency (F0) or pitch and the first formant frequency (F1) respectively.
It is well-known that F0 values for male speakers are lower due to longer and thicker vocal folds. F0
for adult males is typically around 120 Hz, while F0 for adult females is around 200 Hz. Further, adult
males exhibit lower formant frequencies than adult females due to vocal tract length differences.
Linear predictive analysis is used to find both the fundamental frequency and the formant frequency of
each speech frame. The mean of all the frames is calculated to obtain the values for each speaker. The
Euclidean distance of this mean point is found from the preset means of the male class and the female
class. The smaller of the two distances determines whether the speaker is male or female. The preset
mean points for each class are found by training the system with 20 male and 20 female speakers.
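The classification step described above can be sketched as follows. This is a minimal illustration, not the paper's LabVIEW implementation; the class-mean values used here are assumed placeholders, since the paper's trained means are not given.

```python
# Sketch of the nearest-neighbor gender classifier described above.
# The class means (mean F1 in Hz, mean F0 in Hz) are ASSUMED values for
# illustration; the paper obtains them by training on 20 speakers per class.
import math

MALE_MEAN = (480.0, 120.0)    # (mean F1, mean F0) -- assumed, not from the paper
FEMALE_MEAN = (560.0, 200.0)  # assumed, not from the paper

def classify(f1_mean, f0_mean):
    """Return 'male' or 'female' by Euclidean distance to the class means."""
    d_male = math.dist((f1_mean, f0_mean), MALE_MEAN)
    d_female = math.dist((f1_mean, f0_mean), FEMALE_MEAN)
    return "male" if d_male < d_female else "female"

print(classify(500.0, 125.0))  # this point lies closer to the male mean
```

In practice F1 and F0 have different ranges, so a real system might normalize each feature before computing the distance; the paper does not state whether it does so.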
Work in Formant Tracking
The speech waveform can be modeled as the response of a resonator (the vocal tract) to a series of
pulses (quasi-periodic glottal pulses during voiced sounds, or noise generated at a constriction during
unvoiced sounds). The resonances of the vocal tract are called formants, and they are manifested in the
spectral domain by energy maxima at the resonant frequencies. The frequencies at which the formants
occur are primarily dependent upon the shape of the vocal tract, which is determined by the positions
of the articulators (tongue, lips, jaw, etc.). In continuous speech, the formant frequencies vary in time
as the articulators change position.
Linear Predictive Coding (LPC) Method:
This frequently used technique for formant location involves determining resonance peaks
from the filter coefficients obtained through LPC analysis of segments of the speech waveform. Once
the prediction polynomial A(z) has been calculated, the formant parameters are determined either by
"peak-picking" on the filter response curve or by solving for the roots of the equation A(z) = 0. Each
pair of complex-conjugate roots is used to calculate the corresponding formant frequency and bandwidth. The
computations involved in "peak-picking" consist of either the use of the fast Fourier transform with a
sufficiently large number of points to provide the prescribed accuracy in formant locations, or the
evaluation of the complex function A(e^(jω)) at an equivalently large number of points.
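The root-solving variant described above can be sketched as follows. This is a simplified illustration under stated assumptions (autocorrelation-method LPC on a synthetic single-resonance frame, no pre-emphasis, an assumed model order of 8), not the paper's implementation.

```python
# Sketch of formant estimation by solving A(z) = 0, as described above.
# LPC coefficients are computed with a plain Yule-Walker solve; a real
# system would use Levinson-Durbin on pre-emphasized, windowed frames.
import numpy as np

def lpc_coeffs(frame, order):
    """Autocorrelation-method LPC: solve the Yule-Walker normal equations."""
    r = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    R = np.array([[r[abs(i - j)] for j in range(order)] for i in range(order)])
    a = np.linalg.solve(R, r[1:order + 1])
    return np.concatenate(([1.0], -a))  # coefficients of A(z)

def formants(frame, fs, order=8):
    """Formant frequencies (Hz) from the complex roots of A(z) = 0."""
    roots = np.roots(lpc_coeffs(frame, order))
    # Keep one root per conjugate pair, and only sharp resonances.
    roots = roots[(np.imag(roots) > 0) & (np.abs(roots) > 0.9)]
    freqs = np.angle(roots) * fs / (2.0 * np.pi)
    return sorted(f for f in freqs if f > 50)  # drop near-DC roots

# Toy frame: a decaying resonance near 700 Hz sampled at 8 kHz, with a
# little noise to keep the normal equations well conditioned.
rng = np.random.default_rng(0)
fs = 8000
t = np.arange(400) / fs
frame = np.exp(-40 * t) * np.sin(2 * np.pi * 700 * t) + rng.normal(0, 1e-3, 400)
print([round(f) for f in formants(frame, fs)])  # a value near 700 Hz appears
```

The bandwidth of each formant can likewise be read off the root radius (narrower bandwidth for roots closer to the unit circle), which is why the sketch filters on root magnitude.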
Cepstral Analysis Method:
An improvement on the LPC analysis algorithm uses the cepstral coefficients derived from the LPC
spectrum to acquire the formant parameters. The log spectra display the resonant structure of the particular
segment; i.e., the peaks in the spectrum correspond to the formant frequencies. The improved
algorithm is more robust when acquiring the formants of vowel segments.
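The same cepstral decomposition also separates the pitch information that the other half of this paper relies on: the low-quefrency part of the real cepstrum carries the formant envelope, while a peak at higher quefrencies marks the pitch period. A minimal sketch on a synthetic voiced frame (the frame, sampling rate, and search range are assumptions, not taken from the paper):

```python
# Sketch of cepstral analysis on a synthetic voiced frame. The dominant
# high-quefrency cepstral peak gives the pitch period; the low-quefrency
# part carries the spectral (formant) envelope discussed above.
import numpy as np

def real_cepstrum(frame):
    spectrum = np.fft.rfft(frame * np.hamming(len(frame)))
    return np.fft.irfft(np.log(np.abs(spectrum) + 1e-12))

def cepstral_pitch(frame, fs, fmin=60.0, fmax=400.0):
    """Estimate F0 (Hz) from the dominant cepstral peak in the voiced range."""
    c = real_cepstrum(frame)
    lo, hi = int(fs / fmax), int(fs / fmin)
    peak = lo + np.argmax(c[lo:hi])
    return fs / peak

fs = 8000
t = np.arange(1024) / fs
# Harmonic-rich voiced frame with a 120 Hz fundamental.
frame = sum(np.sin(2 * np.pi * 120 * k * t) for k in range(1, 6))
print(round(cepstral_pitch(frame, fs)))  # close to 120 Hz
```

The quefrency resolution is one sample period, so the estimate is quantized to fs/n values; longer frames or interpolation around the peak tighten it.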
Mel Scale LPC Algorithm:
This algorithm combines linear predictive analysis with the Mel psychoacoustic
perceptual scale for F1 and F2 estimation. In some speech processing applications, it is useful to
employ a nonlinear frequency scale instead of the linear scale in Hz. In the analysis of speech signals
for speech recognition, for example, it is common to use psychoacoustic perceptual scales, especially
the Mel scale. These scales result from acoustic perception experiments and establish a nonlinear
spectral characterization for the speech signal. The relation between the linear scale (f in Hz) and the
nonlinear Mel scale (M in mel) is commonly given by M = 2595 log10(1 + f/700).
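The standard Hz-to-mel relation and its inverse can be written directly (the 2595/700 constants are the widely used O'Shaughnessy formulation; the paper does not state which variant it uses):

```python
# Hz <-> mel conversion using the common 2595 * log10(1 + f/700) formula.
import math

def hz_to_mel(f):
    return 2595.0 * math.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

print(round(hz_to_mel(1000)))  # 1000 Hz maps to roughly 1000 mel
```

The scale is roughly linear below 1 kHz and logarithmic above, which compresses the upper formant region in line with perceptual sensitivity.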
Conclusion
Considering the efficiency of the results obtained, it is concluded that the algorithm implemented in
LabVIEW works successfully. Since the algorithm does not extract the vowels from the speech, the
values obtained for Formant 1 were not completely correct, as they were obtained by processing all the
samples of the speech. It was also observed that increasing the unvoiced part of the speech, such as the
sound of 's', increases the pitch value, hampering gender detection for male samples.
Likewise, increasing the voiced part, such as the sound of 'a', decreases the pitch value, but the system
takes care of such dips and the results were not affected. Different utterances by the same
speaker, spoken under near-identical conditions, generated the same pitch value, establishing that the
system could be used for speaker identification after further work.