22-03-2011, 12:11 PM
Voice morphing is the transition of one speech signal into another. Like image morphing, speech morphing aims to preserve the shared characteristics of the starting and final signals while generating a smooth transition between them. In image morphing, the in-between images all show one face smoothly changing its shape and texture until it turns into the target face. A speech morph should possess the same property: one speech signal should smoothly change into another, keeping the shared characteristics of the starting and ending signals while smoothly changing the other properties.
The major properties of concern in a speech signal are its pitch and envelope information. These two reside in convolved form in the signal, so an efficient method for extracting each of them is necessary. We have adopted an uncomplicated approach, namely cepstral analysis, to do the same. Pitch and formant information in each signal is extracted using the cepstral approach. The processing needed to obtain the morphed speech signal includes cross-fading of the envelope information, dynamic time warping to match the major signal features (pitch), and signal re-estimation to convert the morphed representation back into an acoustic waveform.
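The cepstral separation described above can be sketched as follows. This is a minimal Python/NumPy illustration, not the report's actual implementation: it builds a synthetic voiced frame with a known pitch period, computes the real cepstrum, reads the pitch period off the high-quefrency peak, and recovers a smooth spectral envelope by low-quefrency liftering. The sampling rate, pitch period, and lifter cutoff are illustrative assumptions.

```python
import numpy as np

# Synthetic voiced frame: an impulse train (pitch period = 100 samples,
# i.e. 80 Hz at fs = 8000 Hz) shaped by a decaying "vocal tract" response.
fs, period, n = 8000, 100, 1024
train = np.zeros(n)
train[::period] = 1.0
vocal_tract = 0.95 ** np.arange(64)          # illustrative envelope filter
x = np.convolve(train, vocal_tract)[:n] * np.hamming(n)

# Real cepstrum: inverse FFT of the log magnitude spectrum.
log_mag = np.log(np.abs(np.fft.rfft(x)) + 1e-12)
cepstrum = np.fft.irfft(log_mag)

# Pitch: the dominant high-quefrency peak, searched over 60-400 Hz.
q_lo, q_hi = fs // 400, fs // 60
est_period = q_lo + np.argmax(cepstrum[q_lo:q_hi])

# Envelope: keep only the low-quefrency part (the cepstrum is symmetric,
# so both ends are retained) and transform back to a smooth log spectrum.
cutoff = 30
lifter = np.zeros(n)
lifter[:cutoff] = 1.0
lifter[-(cutoff - 1):] = 1.0
smooth_log_mag = np.fft.rfft(cepstrum * lifter).real

print(est_period)  # close to the true period of 100 samples
```

Because the pitch excitation and the envelope are convolved in time, they are additive in the log spectrum, which is why a simple quefrency split separates them.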
INTROSPECTION OF THE MORPHING PROCESS
Speech morphing can be achieved by transforming the signal's representation from the acoustic waveform, obtained by sampling the analog signal, to another representation. To prepare the signal for the transformation, it is split into a number of 'frames' - sections of the waveform. The transformation is then applied to each frame of the signal. This provides another way of viewing the signal information. The new representation (said to be in the frequency domain) describes the average energy present in each frequency band.
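The framing step can be sketched as follows; this is an illustration rather than the system's actual code, and the frame length and hop size are arbitrary choices:

```python
import numpy as np

def frame_signal(x, frame_len=256, hop=128):
    """Split a 1-D signal into overlapping frames (illustrative sizes)."""
    n_frames = 1 + (len(x) - frame_len) // hop
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    return x[idx]

# A 1-second test tone at 8 kHz.
fs = 8000
t = np.arange(fs) / fs
x = np.sin(2 * np.pi * 440 * t)

frames = frame_signal(x)                        # shape: (n_frames, 256)
window = np.hanning(frames.shape[1])
spectra = np.abs(np.fft.rfft(frames * window))  # energy per frequency band

print(frames.shape, spectra.shape)
```

Each row of `spectra` is the frequency-domain view of one frame described in the text.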
Further analysis enables two pieces of information to be obtained: pitch information and the overall envelope of the sound. A key element in the morphing is the manipulation of the pitch information. If two signals with different pitches were simply cross-faded, it is highly likely that two separate sounds would be heard, because the signal would have two distinct pitches, causing the auditory system to perceive two different objects. A successful morph must exhibit a smoothly changing pitch throughout.
The pitch information of each sound is compared to find the best match between the two signals' pitches. To achieve this match, the signals are stretched and compressed so that important sections of each signal match in time. The interpolation of the two sounds can then be performed, creating the intermediate sounds of the morph. The final stage is to convert the frames back into a normal waveform.
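The stretch-and-compress alignment described above is what dynamic time warping (DTW) does. The following is a generic textbook DTW over two toy pitch contours (the contours are invented for illustration), followed by a cross-fade along the warping path:

```python
import numpy as np

def dtw_path(a, b):
    """Classic DTW between two 1-D sequences; returns the warping path."""
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    # Backtrack from the end to recover the alignment.
    path, i, j = [], n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        step = np.argmin([D[i - 1, j - 1], D[i - 1, j], D[i, j - 1]])
        if step == 0:
            i, j = i - 1, j - 1
        elif step == 1:
            i -= 1
        else:
            j -= 1
    return path[::-1]

# Two toy "pitch contours" (Hz) of different lengths.
src = np.array([110.0, 115.0, 130.0, 150.0, 150.0])
tgt = np.array([120.0, 140.0, 160.0, 165.0])

path = dtw_path(src, tgt)
t = 0.5  # halfway through the morph
morphed = [(1 - t) * src[i] + t * tgt[j] for i, j in path]
```

Interpolating along the path, rather than index by index, is what keeps a single smoothly changing pitch in the intermediate sounds.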
INTRODUCTION
Voice morphing, which is also referred to as voice transformation and voice conversion, is a technique for modifying a source speaker's speech to sound as if it were spoken by some designated target speaker. There are many applications of voice morphing, including customizing voices for text-to-speech (TTS) systems, transforming voice-overs in adverts and films to sound like that of a well-known celebrity, and enhancing the speech of impaired speakers such as laryngectomees. Two key requirements of many of these applications are that, firstly, they should not rely on large amounts of parallel training data in which both speakers recite identical texts, and secondly, the high audio quality of the source should be preserved in the transformed speech.

The core process in a voice morphing system is the transformation of the spectral envelope of the source speaker to match that of the target speaker, and various approaches have been proposed for doing this, such as codebook mapping, formant mapping, and linear transformations. Codebook mapping, however, typically leads to discontinuities in the transformed speech. Although some discontinuities can be resolved by some form of interpolation technique, the conversion approach can still suffer from a lack of robustness as well as degraded quality. Formant mapping, on the other hand, is prone to formant tracking errors. Hence, transformation-based approaches are now the most popular. In particular, the continuous probabilistic transformation approach introduced by Stylianou provides the baseline for modern systems. In this approach, a Gaussian mixture model (GMM) is used to classify each incoming speech frame, and a set of linear transformations weighted by the continuous GMM probabilities is applied to give a smoothly varying target output. The linear transformations are typically estimated from time-aligned parallel training data using least mean squares. More recently, Kain has proposed a variant of this method in which the GMM classification is based on a joint density model. However, like the original Stylianou approach, it still relies on parallel training data. Although the requirement for parallel training data is often acceptable, there are applications which require voice transformation with nonparallel training data. Examples can be found in the entertainment and media industries, where recordings of unknown speakers need to be transformed to sound like well-known personalities. Further uses are envisaged in applications where the provision of parallel data is impossible, such as when the source and target speakers speak different languages.

Although interpolated linear transforms are effective in transforming speaker identity, the direct transformation of successive source speech frames to yield the required target speech will result in a number of artifacts. The reasons for this are as follows. First, the reduced dimensionality of the spectral vector used to represent the spectral envelope and the averaging effect of the linear transformation result in formant broadening and a loss of spectral detail. Second, unnatural phase dispersion in the target speech can lead to audible artifacts, and this effect is aggravated when pitch and duration are modified. Third, unvoiced sounds have very high variance and are typically not transformed; in that case, however, residual voicing from the source is carried over to the target speech, resulting in a disconcerting background whispering effect. To achieve high-quality voice conversion, the system therefore includes a spectral refinement approach to compensate for the spectral distortion, a phase prediction method for natural phase coupling, and an unvoiced sound transformation scheme. Each of these techniques is assessed individually, and the overall performance of the complete solution is evaluated using listening tests. Overall, it is found that the enhancements significantly improve the quality of the converted speech.
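The Stylianou-style continuous transformation can be sketched as follows. This is a minimal NumPy illustration with hand-picked GMM parameters (two diagonal-covariance components in two dimensions, all values invented for the example), not a trained system: each spectral vector is softly classified by the GMM, and the per-component linear transforms are blended by the posterior probabilities.

```python
import numpy as np

# Hand-picked 2-component GMM over 2-D spectral vectors (illustrative only).
weights = np.array([0.5, 0.5])
means = np.array([[0.0, 0.0], [5.0, 5.0]])
variances = np.array([[1.0, 1.0], [1.0, 1.0]])   # diagonal covariances

# One linear transform (A_i, b_i) per mixture component.
A = np.array([[[1.1, 0.0], [0.0, 0.9]],
              [[0.8, 0.1], [0.0, 1.2]]])
b = np.array([[0.5, -0.5], [1.0, 0.0]])

def gaussian(x, mu, var):
    """Diagonal-covariance Gaussian density."""
    norm = np.sqrt((2 * np.pi) ** len(mu) * np.prod(var))
    return np.exp(-0.5 * np.sum((x - mu) ** 2 / var)) / norm

def convert(x):
    """Posterior-weighted sum of transforms: y = sum_i p_i(x) (A_i x + b_i)."""
    dens = np.array([w * gaussian(x, m, v)
                     for w, m, v in zip(weights, means, variances)])
    post = dens / dens.sum()
    return sum(p * (Ai @ x + bi) for p, Ai, bi in zip(post, A, b))

x = np.array([0.1, -0.2])   # frame lying near component 0
y = convert(x)
```

Because the posteriors vary continuously with the input frame, the output changes smoothly between components instead of jumping as a hard codebook lookup would.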
TRANSFORM-BASED VOICE MORPHING SYSTEM
2.1 Overall Framework
Transform-based voice morphing technology converts the speaker
identity by modifying the parameters of an acoustic representation of
the speech signal. It normally includes two parts, the training
procedure and the transformation procedure. The training procedure
operates on examples of speech from the source and the target
speakers. The input speech examples are first analyzed to extract the
spectral parameters that represent the speaker identity. Usually
these parameters encode the short-term acoustic features, such as the
spectrum shape and the formant structure. After the feature
extraction, a conversion function is trained to capture the
relationship between the source parameters and the corresponding
target parameters. In the transformation procedure, the new spectral
parameters are obtained by applying the trained conversion functions
to the source parameters. Finally, the morphed speech is synthesized
from the converted parameters.
There are three interdependent issues
that must be decided before building a voice morphing system. First,
a mathematical model must be chosen which allows the speech signal to
be manipulated and regenerated with minimum distortion. Previous
research suggests that the sinusoidal model is a good candidate
since, in principle at least, this model can support modifications to
both the prosody and the spectral characteristics of the source
signal without inducing significant artifacts. However, in practice, conversion quality is always compromised by phase incoherency in the regenerated signal, and to minimize this problem a pitch-synchronous sinusoidal model is used in our system. Second, the acoustic features which enable humans to identify speakers must be extracted and coded. These features should be independent of the message and the environment, so that whatever and wherever the source speaker speaks, his/her voice characteristics can be successfully transformed to sound like the target speaker. Clearly, the changes applied to these features must be capable of straightforward realization by the speech model. Third, the type of conversion function and the method of training and applying the conversion function must be decided.
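As an illustration of the training step, the least-mean-squares estimation of a single linear conversion function from time-aligned parallel data (as mentioned in the introduction) can be sketched like this. The data here are synthetic, and a real system would estimate one such transform per GMM component:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic time-aligned parallel data: target = A_true @ source + b_true.
d, n = 4, 200                       # feature dimension, number of frames
A_true = rng.normal(size=(d, d))
b_true = rng.normal(size=d)
X = rng.normal(size=(n, d))         # source spectral vectors (one per frame)
Y = X @ A_true.T + b_true           # corresponding target vectors

# Least-squares fit of an affine conversion function y = W.T @ [x; 1].
X_aug = np.hstack([X, np.ones((n, 1))])
W, *_ = np.linalg.lstsq(X_aug, Y, rcond=None)

# Transformation procedure: apply the trained conversion to a new frame.
x_new = rng.normal(size=d)
y_pred = np.append(x_new, 1.0) @ W
```

Appending a constant 1 to each source vector lets a single least-squares solve recover both the matrix and the bias of the affine transform.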
2.2 Spectral Parameters
As indicated above, the overall shape of the spectral envelope
provides an effective representation of the vocal tract
characteristics of the speaker and the formant structure of voiced
sounds. Generally, there are several ways to estimate the spectral envelope, such as linear predictive coding (LPC), cepstral coefficients, and line spectral frequencies (LSF). The main steps in estimating the LSF envelope for each speech frame are as follows.
1. Use the amplitudes of the K harmonics determined by the pitch-synchronous sinusoidal model to represent the magnitude spectrum. K is determined by the fundamental frequency; its value can typically range from 50 to 200.
2. Resample the magnitude spectrum nonuniformly according to the Bark-scale frequency warping, using cubic spline interpolation.
3. Compute the LPC coefficients by applying the Levinson- Durbin
algorithm to the autocorrelation sequence of the warped power
spectrum.
4. Convert the LPC coefficients to LSF.
5. In order to maintain adequate encoding of the formant structure, LSF spectral vectors of order p = 15 were used throughout our voice conversion experiments.
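The five steps above can be sketched end to end in NumPy. This is an illustrative reconstruction, not the authors' code: the harmonic amplitudes are synthetic, the cubic-spline resampling of step 2 is simplified to linear interpolation, and the Bark formula used is one common approximation.

```python
import numpy as np

def bark(f):
    """A common approximation of the Bark scale (an assumption here)."""
    return 13.0 * np.arctan(0.00076 * f) + 3.5 * np.arctan((f / 7500.0) ** 2)

def levinson_durbin(r, order):
    """Step 3: LPC coefficients from an autocorrelation sequence."""
    a = np.zeros(order + 1)
    a[0], err = 1.0, r[0]
    for i in range(1, order + 1):
        k = -(r[i] + np.dot(a[1:i], r[i - 1:0:-1])) / err
        a[1:i + 1] = a[1:i + 1] + k * a[i - 1::-1]
        err *= 1.0 - k * k
    return a

def lpc_to_lsf(a):
    """Step 4: LSFs as the root angles of the sum/difference polynomials."""
    a_ext = np.append(a, 0.0)
    P = a_ext + a_ext[::-1]
    Q = a_ext - a_ext[::-1]
    angles = np.angle(np.concatenate([np.roots(P), np.roots(Q)]))
    return np.sort(angles[(angles > 1e-3) & (angles < np.pi - 1e-3)])

# Step 1: K harmonic amplitudes from a (synthetic) 120 Hz voiced frame.
f0, K = 120.0, 60
freqs = f0 * np.arange(1, K + 1)
mags = 1.0 / (1.0 + ((freqs - 500.0) / 300.0) ** 2) \
     + 0.5 / (1.0 + ((freqs - 1500.0) / 400.0) ** 2)   # toy formant shape

# Step 2: resample the magnitude spectrum uniformly on the Bark axis
# (linear interpolation stands in for the cubic spline).
grid = np.linspace(bark(freqs[0]), bark(freqs[-1]), 256)
warped_mags = np.interp(grid, bark(freqs), mags)

# Autocorrelation of the warped power spectrum, then steps 3-5 with p = 15.
r = np.fft.irfft(warped_mags ** 2)
lsf = lpc_to_lsf(levinson_durbin(r, 15))
```

For a stable LPC polynomial, the roots of P and Q lie on the unit circle and interlace, so the p = 15 sorted angles in (0, π) are the LSF vector of step 5.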