09-11-2016, 02:34 PM
1467812711-Infomining.docx
ABSTRACT-Information mining from speech signals, as the ultimate goal of data mining, is concerned with the science, technology, and engineering of discovering patterns and extracting potentially useful or interesting information automatically or semi-automatically from speech data. With the advent of inexpensive storage and faster processing over the past decade, data mining research has started to penetrate new ground in areas of speech and audio processing. Atypical speech can be broadly defined as speech with emotional content, speech affected by alcohol and drugs, speech from speakers with disabilities, and various kinds of pathological speech. Individual speech can vary because of different timing and amplitude of the movement of the speech articulators. The physical mechanism of speech undergoes changes that can affect the nasal cavity resonance and the mode of vibration of the vocal cords. A series of environmental variables, like background noise, reverberation, and recording conditions, also has to be taken into account.
I. INFORMATION IN SPEECH
There are several ways of characterizing the communication potential of speech. According to information theory, speech can be represented in terms of its message content. An alternative way of characterizing speech is in terms of the signal carrying the message information, that is, the acoustic waveform.
A central concern of information theory is the rate at which information is conveyed. The high information redundancy of the speech signal is associated with such factors as the loudness of the speech, environmental conditions, and the emotional, physical, and psychological state of the speaker.
II. BASIC MODEL OF SPEECH PRODUCTION
The appropriate model for speech corresponding to the electrical analogs of the vocal tract is shown in Figure 2. Such analog models are further developed into digital circuits suitable for simulation by computer. In modeling speech, the effects of the excitation source and the vocal tract are often considered independently.
The actual excitation function for speech is essentially either a quasi-periodic pulse train for voiced speech sounds or a random noise source for unvoiced speech sounds. In both cases, a speech signal s(t) can be modeled as the convolution of an excitation signal e(t) and an impulse response v(t) characterizing the vocal tract:
s(t)= e(t) * v(t) (1)
which also implies that the effect of lip radiation can be included in the source function. Since convolution of two signals corresponds to multiplication of their spectra, the output speech spectrum S(f) is the product of the excitation spectrum E(f) and the frequency response V(f) of the vocal tract:
S(f)=E(f) V(f) (2)
The excitation source is chosen by a switch whose position is controlled by the voiced/unvoiced character of the speech. The appropriate gain G of the source is estimated from the speech signal, and the scaled source is used as input to a filter, which is controlled by the vocal tract parameters characteristic of the speech being produced. The parameters of this model all vary with time. Unvoiced excitation is usually modeled as random noise with an approximately Gaussian amplitude distribution and a flat spectrum over most frequencies of interest. More research has been done on voiced excitation because the naturalness of synthetic speech is crucially related to accurate modeling of voiced speech. It is very difficult to obtain precise measurements of glottal pressure or glottal airflow.
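The source-filter model of equations (1) and (2) can be sketched numerically. The sketch below is illustrative only: the sampling rate, pitch, and single-resonance vocal-tract impulse response are assumed values, not part of the original model description.

```python
import numpy as np

fs = 8000                                   # sampling rate in Hz (assumed)
f0 = 100                                    # pitch of voiced excitation in Hz (assumed)
n = np.arange(int(0.05 * fs))               # 50 ms of samples

# Voiced excitation e(n): quasi-periodic impulse train, one pulse per pitch period
e = np.zeros(len(n))
e[::fs // f0] = 1.0

# Toy vocal-tract impulse response v(n): a single damped resonance (formant)
# at 500 Hz; a realistic tract would have several such resonances
v = np.exp(-n / (0.005 * fs)) * np.cos(2 * np.pi * 500 * n / fs)

# Eq. (1): s(n) = e(n) * v(n), discrete convolution
s = np.convolve(e, v)

# Eq. (2): the spectrum of the convolution equals the product of the spectra
L = len(e) + len(v) - 1
S, E, V = (np.fft.rfft(x, L) for x in (s, e, v))
assert np.allclose(S, E * V)
```

The final assertion verifies that equation (2) holds exactly (up to floating-point error) when both spectra are computed with zero-padding to the full convolution length.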
III. GENERAL PRINCIPLES OF SPEECH SIGNAL PROCESSING
The whole processing block chain common to all approaches to speech processing is shown in figure 3. The first step in the processing is speech pre-processing, which provides signal operations such as digitization, pre-emphasis, frame blocking, and windowing. Digitization of the analog speech signal starts the whole processing. The microphone and the A/D converter usually introduce undesired side effects.
The second step, feature extraction, represents the process of converting sequences of pre-processed speech samples s(n) to an observation vector x representing characteristics of the time-varying speech signal. The kind of features extracted from the speech signal and put together into the feature vector x corresponds to the final aim of the speech processing. For each application, the most efficient features, that is, the features that best carry the mined information, should be used. The first two blocks represent straightforward problems in digital signal processing. The subsequent classification is then optimized for the final expected information. In contrast to the blocks of feature extraction and classification, the block of pre-processing provides operations that are independent of the aim of the speech processing.
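The block chain described above can be sketched as three composed functions. All concrete choices here (pre-emphasis coefficient, log-energy features, a threshold classifier) are stand-ins chosen for illustration, not the paper's prescribed features or classifier.

```python
import numpy as np

def pre_process(s, lam=0.97):
    """Pre-processing block: pre-emphasis y(n) = s(n) - lam*s(n-1).
    Digitization is assumed already done; framing and windowing follow later."""
    return np.append(s[0], s[1:] - lam * s[:-1])

def extract_features(y, frame_len=200):
    """Feature-extraction block: the log energy of each frame, a stand-in
    for whatever features x best carry the mined information."""
    n_frames = len(y) // frame_len
    frames = y[:n_frames * frame_len].reshape(n_frames, frame_len)
    return np.log(np.sum(frames ** 2, axis=1) + 1e-12)

def classify(x, threshold=0.0):
    """Classification block: a trivial per-frame decision on the feature vector."""
    return x > threshold

signal = np.random.randn(8000)              # one second at 8 kHz (assumed)
decisions = classify(extract_features(pre_process(signal)))
```

The point of the sketch is the structure: pre-processing is generic, while the feature-extraction and classification blocks would be swapped out depending on the mining aim.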
A. PRE-EMPHASIS
The characteristics of the vocal tract define the currently uttered phoneme. Such characteristics are evidenced in the frequency spectrum by the location of the formants, that is, local peaks given by resonances of the vocal tract. Although possessing relevant information, high-frequency formants have smaller amplitude with respect to low-frequency formants. To spectrally flatten the speech signal, filtering is required. Usually, a one-coefficient FIR filter, known as a pre-emphasis filter, with the z-domain transfer function
H(z) = 1 – λ z^(-1) (3)
is used. In the time domain, the pre-emphasized signal is related to the input signal by the difference equation
Ŝ(n) = s(n) – λ s(n-1) (4)
A typical range of values for the pre-emphasis coefficient is λ ∈ <0.9, 1>. One possibility is to choose adaptive pre-emphasis, in which λ changes with time according to the ratio of the first two autocorrelation coefficients
λ = R (1) / R (0) (5)
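Equations (4) and (5) translate directly into code. In this sketch the treatment of the very first sample (passed through unchanged, since s(-1) is undefined) is a common convention assumed here rather than specified in the text.

```python
import numpy as np

def pre_emphasize(s, lam=0.97):
    """Eq. (4): y(n) = s(n) - lam * s(n-1); the first sample is passed unchanged."""
    s = np.asarray(s, dtype=float)
    y = np.empty_like(s)
    y[0] = s[0]
    y[1:] = s[1:] - lam * s[:-1]
    return y

def adaptive_lambda(s):
    """Eq. (5): lam = R(1)/R(0), the ratio of the first two autocorrelation values."""
    s = np.asarray(s, dtype=float)
    return np.dot(s[1:], s[:-1]) / np.dot(s, s)
```

For signals dominated by low frequencies, adjacent samples are strongly correlated, so R(1)/R(0) is close to 1 and the adaptive filter applies nearly full spectral flattening.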
B. FRAME BLOCKING
The most common approaches in speech signal processing are based on short-time analysis. The pre-emphasized signal is blocked into frames of N samples. Frame duration typically ranges between 10 and 30 msec. Values in this range represent a trade-off between the rate of change of the spectrum and system complexity. The proper frame duration ultimately depends on the velocity of the articulators in the speech production system.
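Frame blocking with overlap can be sketched as below. The 25 ms frame length falls in the 10-30 ms range given above; the 10 ms frame shift is a common choice assumed here, as the text does not specify an overlap.

```python
import numpy as np

def frame_signal(s, frame_len, hop):
    """Block a signal into overlapping frames of frame_len samples,
    advancing hop samples between consecutive frames."""
    n_frames = 1 + (len(s) - frame_len) // hop
    return np.stack([s[i * hop : i * hop + frame_len] for i in range(n_frames)])

fs = 8000                                   # sampling rate in Hz (assumed)
frame_len = int(0.025 * fs)                 # 25 ms frame, inside the 10-30 ms range
hop = int(0.010 * fs)                       # 10 ms frame shift (assumed)
frames = frame_signal(np.arange(fs, dtype=float), frame_len, hop)
```

Each frame shares frame_len - hop samples with its predecessor, so slowly varying spectral features change smoothly from frame to frame.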
C. WINDOWING
A signal observed for a finite interval of time may have distorted spectral information in the Fourier transform due to the ringing of the sin(f)/f spectral peaks of the rectangular window. To avoid or minimize this distortion, the signal is multiplied by a window-weighting function before parameter extraction is performed. Window choice is crucial for the separation of spectral components that are near one another in frequency or where one component is much smaller than another. In speech processing, the Hamming window is almost exclusively used. The Hamming window is a specific case of the generalized Hanning window.
A generalized Hanning window is defined as
w(n) = [α – (1 – α) cos(2πn/N)] / β for n = 1, 2, …, N (6)
and w(n) = 0 elsewhere, where α is defined as a window constant in the range <0,1> and N is the window duration in samples. To implement a Hamming window, the window constant is set to α = 0.54. β is defined as a normalization constant chosen so that the root-mean-square value of the window is unity:
β = sqrt( (1/N) Σ_{n=1}^{N} [α – (1 – α) cos(2πn/N)]² ) (7)
In practice, it is desirable to normalize the window so that the power in the signal after windowing is approximately equal to the power of the signal before windowing. Equation 7 describes such a normalization constant. This type of normalization is especially convenient for implementations using fixed-point arithmetic hardware. Windowing involves multiplying a speech signal s(n) by a finite-duration window w(n) of length N; widely used windows have durations of 10-25 msec.
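Equations (6) and (7) can be sketched together: compute the raw raised-cosine window, then divide by its RMS value so the normalized window has unit RMS. The function name and the indexing convention n = 1..N follow equation (6); everything else is straightforward.

```python
import numpy as np

def generalized_hanning(N, alpha=0.54):
    """Eq. (6): w(n) = [alpha - (1-alpha)*cos(2*pi*n/N)] / beta for n = 1..N,
    with beta per eq. (7) chosen so that the RMS value of the window is unity."""
    n = np.arange(1, N + 1)
    w = alpha - (1 - alpha) * np.cos(2 * np.pi * n / N)
    beta = np.sqrt(np.mean(w ** 2))     # eq. (7): RMS of the raw window
    return w / beta

w = generalized_hanning(200)            # alpha = 0.54 gives a Hamming window
```

With this normalization, multiplying a frame by w leaves its average power approximately unchanged, which is the property the fixed-point-hardware remark above relies on.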