08-08-2012, 11:19 AM
Audio data
Introduction
Audio data is an integral part of many modern computer and multimedia applications.
These applications handle large numbers of audio recordings, and the
effectiveness of their deployment depends greatly on the ability to classify and retrieve
audio files in terms of their sound properties or content. The rapid increase in the
amount of audio data demands a computerized method that allows efficient and
automated content-based classification and retrieval of audio databases. For these reasons,
commercial companies developing audio retrieval products are emerging. Wold et al. [14]
have developed a system called “MuscleFish”. That work distinguishes itself from earlier
work [5, 3, 2] in its content-based capability. There, various perceptual features are used to
represent a sound. A normalized Euclidean (Mahalanobis) distance and the nearest neighbor (NN)
rule are used to classify the query sound into one of the sound classes in the database. In Liu
et al. [10], the separability of different classes is evaluated in terms of the intra- and
inter-class scatters to identify highly correlated features. Foote [4] chooses to use 12
mel-frequency cepstral coefficients (MFCCs) as the audio features. Histograms of sounds are
compared and the classification is done by using the NN rule. In Pfeiffer et al. [12], audio
features are extracted by using gammatone filters. Recently, a new pattern recognition method, called
the Nearest Feature Line (NFL), has been developed. This method exploits the information contained
in multiple prototypes per class by using linear interpolation and extrapolation of each pair of
prototypes in the class. It has been shown to produce better results than Euclidean distance
based ranking methods such as k-NN in face recognition [9], image [8] and audio [7]
classification and retrieval.
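The difference between the NN rule and the NFL idea can be illustrated with a short sketch. It is not taken from the cited papers: it assumes plain (unnormalized) Euclidean distance, treats the "feature line" as the infinite straight line through a pair of prototypes of the same class, and uses made-up two-dimensional prototypes and hypothetical function names.

```python
import math
from itertools import combinations


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def nn_distance(query, prototypes):
    """Smallest Euclidean distance from the query to any single prototype."""
    return min(math.dist(query, p) for p in prototypes)


def feature_line_distance(query, p1, p2):
    """Distance from the query to the line through prototypes p1 and p2."""
    direction = [b - a for a, b in zip(p1, p2)]
    offset = [q - a for q, a in zip(query, p1)]
    denom = dot(direction, direction)
    if denom == 0.0:  # coincident prototypes: the line degenerates to a point
        return math.dist(query, p1)
    # mu in [0, 1] interpolates between p1 and p2; values outside extrapolate.
    mu = dot(offset, direction) / denom
    projection = [a + mu * d for a, d in zip(p1, direction)]
    return math.dist(query, projection)


def nfl_distance(query, prototypes):
    """Smallest feature-line distance over all prototype pairs of one class."""
    pairs = list(combinations(prototypes, 2))
    if not pairs:  # a single prototype has no feature line; fall back to NN
        return nn_distance(query, prototypes)
    return min(feature_line_distance(query, a, b) for a, b in pairs)


def classify(query, classes, distance):
    """Return the label whose prototypes are closest under the given distance."""
    return min(classes, key=lambda label: distance(query, classes[label]))


# Made-up prototypes, for illustration only.
classes = {
    "a": [(0.0, 0.0), (4.0, 0.0)],
    "b": [(3.0, 1.5), (3.0, 2.5)],
}
query = (2.0, 0.5)
print(classify(query, classes, nn_distance))   # nearest single prototype
print(classify(query, classes, nfl_distance))  # nearest feature line
```

In this toy case the query lies between the two prototypes of class "a", so the feature line picks up information that no single prototype carries, and the two rules disagree.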
In this paper, a support vector machine (SVM) [1, 13] based method is proposed for
content-based audio classification and retrieval. The SVM minimizes the structural risk,
that is, the probability of misclassifying yet-to-be-seen patterns for a fixed but unknown
probability distribution of the data. This is in contrast to traditional pattern recognition
techniques, which minimize the empirical risk, that is, optimize the performance on
the training data. This minimum structural risk principle is equivalent to minimizing an
upper bound on the generalization error.
Given a class-labeled training set, which in this work is a set of labeled feature vectors
composed of perceptual and cepstral features (Section 2), class boundaries between each
pair of classes are learned using SVMs (Section 3). A binary tree is formed for multi-class
problems. A new metric called distance-from-boundary (DFB) is used to measure
and rank similar audio patterns. Extensive experiments are performed (Section 3) to
compare SVMs with NFL, NN and MuscleFish using a database of 409 sounds from
MuscleFish.
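A rough sketch of how pairwise boundaries, a binary tree and a distance-from-boundary ranking could fit together is given below. It rests on several assumptions that this excerpt does not confirm: the boundaries are linear with hand-picked weights (in practice they would be learned by an SVM, possibly with a kernel), the tree is evaluated bottom-up as a tournament, and items are ranked by decreasing signed distance from one boundary. All names and numbers are hypothetical.

```python
import math


def signed_distance(x, w, b):
    """Signed distance of x from the hyperplane w.x + b = 0."""
    norm = math.sqrt(sum(wi * wi for wi in w))
    return (sum(wi * xi for wi, xi in zip(w, x)) + b) / norm


def pair_score(x, first, second, boundaries):
    """Positive if x falls on the side of `first`, negative for `second`."""
    if (first, second) in boundaries:
        w, b = boundaries[(first, second)]
        return signed_distance(x, w, b)
    w, b = boundaries[(second, first)]
    return -signed_distance(x, w, b)


def tournament(x, classes, boundaries):
    """Pair up classes, keep each pairwise winner, repeat until one remains."""
    remaining = list(classes)
    while len(remaining) > 1:
        winners = []
        for i in range(0, len(remaining) - 1, 2):
            first, second = remaining[i], remaining[i + 1]
            score = pair_score(x, first, second, boundaries)
            winners.append(first if score >= 0 else second)
        if len(remaining) % 2:  # odd class out advances without a comparison
            winners.append(remaining[-1])
        remaining = winners
    return remaining[0]


def rank_by_dfb(items, w, b):
    """Order item names by decreasing signed distance from one boundary."""
    return sorted(items, key=lambda name: signed_distance(items[name], w, b),
                  reverse=True)


# Hypothetical linear boundaries (w, b) for each pair of three classes.
boundaries = {
    ("A", "B"): ((-1.0, 0.0), 1.0),  # A side: x0 < 1
    ("A", "C"): ((0.0, -1.0), 1.0),  # A side: x1 < 1
    ("B", "C"): ((1.0, -1.0), 0.0),  # B side: x0 > x1
}
print(tournament((3.0, 0.5), ["A", "B", "C"], boundaries))

items = {"s1": (3.0, 0.5), "s2": (5.0, 0.0), "s3": (2.0, 1.9)}
print(rank_by_dfb(items, *boundaries[("B", "C")]))
```

The tournament needs only one pass over the remaining classes per round, which is the appeal of a tree over voting across every pair.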
Preprocessing
To extract features from the audio signal, the signal must be processed and divided into
successive windows or analysis frames. Throughout this work, the audio is in wave format:
16-bit monophonic pulse code modulation (PCM) at a sampling rate of 8 kHz. The audio
signal, which is recorded from TV broadcast audio using a TV tuner card, is
preprocessed before features are extracted. This involves detection of the begin and end points
of the utterance in the audio waveform, preemphasis and windowing of each frame. Preemphasis
provides high frequency emphasis, and windowing reduces the effect of
discontinuity at the ends of each frame of audio. The audio samples in each frame are
preprocessed using a difference operator to emphasize the high frequency components.
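The framing, preemphasis and windowing steps can be sketched as follows. The text does not state the frame length, the frame shift, the preemphasis coefficient or the window type, so the values here are assumptions: 256-sample frames with a 128-sample shift (32 ms and 16 ms at 8 kHz), a pure first difference (alpha = 1.0; values around 0.95 to 0.97 are also common), and a Hamming window. Endpoint detection is left out.

```python
import math


def preemphasize(samples, alpha=1.0):
    """First-order difference y[n] = x[n] - alpha * x[n-1].

    The first sample has no predecessor and is passed through unchanged.
    """
    if not samples:
        return []
    rest = [samples[n] - alpha * samples[n - 1] for n in range(1, len(samples))]
    return [samples[0]] + rest


def hamming(length):
    """Hamming window coefficients; tapers both ends of a frame towards 0.08."""
    if length == 1:
        return [1.0]
    return [0.54 - 0.46 * math.cos(2.0 * math.pi * n / (length - 1))
            for n in range(length)]


def split_frames(samples, frame_len=256, hop=128):
    """Overlapping frames; a trailing partial frame is dropped."""
    last_start = len(samples) - frame_len
    return [samples[s:s + frame_len] for s in range(0, last_start + 1, hop)]


def preprocess(samples, frame_len=256, hop=128, alpha=1.0):
    """Frame the signal, then preemphasize and window each frame."""
    window = hamming(frame_len)
    return [
        [x * w for x, w in zip(preemphasize(frame, alpha), window)]
        for frame in split_frames(samples, frame_len, hop)
    ]


# One second of a synthetic 440 Hz tone at 8 kHz stands in for recorded audio.
rate = 8000
tone = [math.sin(2.0 * math.pi * 440.0 * n / rate) for n in range(rate)]
frames = preprocess(tone)
print(len(frames), len(frames[0]))
```

Applying the difference operator per frame follows the wording above; applying it once to the whole signal before framing is an equally common arrangement and differs only in the first sample of each frame.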