
Voice morphing

ABSTRACT
Voice morphing means the transition of one speech signal into another. The morphed signal has the same information content as the two input speech signals but a different pitch, which is determined by the morphing algorithm. To do this, each signal's information has to be converted into another representation in which the pitch and spectral envelope are encoded on orthogonal axes. Individual components of the speech signals are then matched, and their amplitudes are interpolated to produce a new speech signal. The representation of this new signal then has to be converted back to an acoustic waveform. This report describes the signal representations required to effect the morph, as well as the techniques required to match the signal components, interpolate the amplitudes and invert the new signal's representation back to an acoustic waveform.

Voice morphing, also referred to as voice transformation or voice conversion, is a technique for modifying a source speaker's speech so that it sounds as if it were spoken by a designated target speaker. There are many applications of voice morphing, including customizing voices for text-to-speech (TTS) systems, transforming voice-overs in adverts and films to sound like a well-known celebrity, and enhancing the speech of impaired speakers such as laryngectomees. Two key requirements of many of these applications are, first, that they should not rely on large amounts of parallel training data in which both speakers recite identical texts and, second, that the high audio quality of the source should be preserved in the transformed speech. The core process in a voice morphing system is the transformation of the spectral envelope of the source speaker to match that of the target speaker. Various approaches have been proposed for doing this, such as codebook mapping, formant mapping, and linear transformations.
Codebook mapping, however, typically leads to discontinuities in the transformed speech. Although some discontinuities can be resolved by some form of interpolation technique, the conversion approach can still suffer from a lack of robustness as well as degraded quality. Formant mapping, on the other hand, is prone to formant tracking errors. Hence, transformation-based approaches are now the most popular. In particular, the continuous probabilistic transformation approach introduced by Stylianou provides the baseline for modern systems. In this approach, a Gaussian mixture model (GMM) is used to classify each incoming speech frame, and a set of linear transformations weighted by the continuous GMM probabilities is applied to give a smoothly varying target output. The linear transformations are typically estimated from time-aligned parallel training data using least mean squares. More recently, Kain has proposed a variant of this method in which the GMM classification is based on a joint density model. However, like the original Stylianou approach, it still relies on parallel training data. Although the requirement for parallel training data is often acceptable, some applications require voice transformation from nonparallel training data. Examples can be found in the entertainment and media industries, where recordings of unknown speakers need to be transformed to sound like well-known personalities. Further uses are envisaged in applications where the provision of parallel data is impossible, such as when the source and target speakers speak different languages. Although interpolated linear transforms are effective in transforming speaker identity, the direct transformation of successive source speech frames to yield the required target speech results in a number of artifacts. The reasons for this are as follows.
First, the reduced dimensionality of the spectral vector used to represent the spectral envelope and the averaging effect of the linear transformation result in formant broadening and a loss of spectral detail. Second, unnatural phase dispersion in the target speech can lead to audible artifacts, and this effect is aggravated when pitch and duration are modified. Third, unvoiced sounds have very high variance and are typically not transformed. In that case, however, residual voicing from the source is carried over to the target speech, resulting in a disconcerting background whispering effect. To achieve high-quality voice conversion, the system includes a spectral refinement approach to compensate for the spectral distortion, a phase prediction method for natural phase coupling, and an unvoiced sounds transformation scheme. Each of these techniques is assessed individually, and the overall performance of the complete solution is evaluated using listening tests. Overall, it is found that the enhancements significantly improve the quality of the converted speech.
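The GMM-weighted linear transformation described in the abstract can be illustrated with a minimal sketch. It assumes diagonal-covariance Gaussians and transforms that have already been estimated; the function names (`gmm_posteriors`, `convert_frame`) and the toy parameters are hypothetical and are not taken from Stylianou's or Kain's systems.

```python
import numpy as np

def gmm_posteriors(x, weights, means, variances):
    """Posterior probability of each diagonal-covariance Gaussian given frame x."""
    # Log-density of x under every mixture component, shape (K,).
    log_dens = -0.5 * np.sum(
        np.log(2.0 * np.pi * variances) + (x - means) ** 2 / variances, axis=1
    )
    log_post = np.log(weights) + log_dens
    log_post -= log_post.max()  # stabilise before exponentiating
    post = np.exp(log_post)
    return post / post.sum()

def convert_frame(x, weights, means, variances, A, b):
    """Blend the per-component linear maps A[k] @ x + b[k] by the GMM posteriors."""
    post = gmm_posteriors(x, weights, means, variances)
    per_component = np.einsum("kij,j->ki", A, x) + b  # shape (K, D)
    return post @ per_component

# Toy two-component model in a 2-D feature space (illustrative values only).
weights = np.array([0.5, 0.5])
means = np.array([[0.0, 0.0], [4.0, 4.0]])
variances = np.ones((2, 2))
A = np.stack([np.eye(2), 2.0 * np.eye(2)])  # one matrix per component
b = np.array([[1.0, 1.0], [0.0, 0.0]])      # one offset per component

y_near_first = convert_frame(np.array([0.0, 0.0]), weights, means, variances, A, b)
y_midpoint = convert_frame(np.array([2.0, 2.0]), weights, means, variances, A, b)
```

Because the weights are continuous posteriors rather than a hard class decision, a frame lying halfway between the two components receives an equal blend of both transforms, which is what gives the smoothly varying output mentioned above.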


INTRODUCTION

Voice morphing means the transition of one speech signal into another. Like image morphing, speech morphing aims to preserve the shared characteristics of the starting and final signals while generating a smooth transition between them. In image morphing, the in-between images all show one face smoothly changing its shape and texture until it turns into the target face. A speech morph should possess the same feature: one speech signal should smoothly change into another, keeping the shared characteristics of the starting and ending signals but smoothly changing the other properties. The major properties of concern in a speech signal are its pitch and envelope information. These two reside in a convolved form in a speech signal, so an efficient method for extracting each of them is necessary. We have adopted a simple approach, cepstral analysis, for this purpose: the pitch and formant information in each signal is extracted using the cepstrum. The processing needed to obtain the morphed speech signal includes cross-fading of the envelope information, dynamic time warping to match the major signal features (pitch), and signal re-estimation to convert the morphed speech signal back into an acoustic waveform.
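The separation of pitch and envelope by cepstral analysis can be sketched as follows. This is a minimal illustration under stated assumptions, not the report's implementation: the helper names (`real_cepstrum`, `split_cepstrum`), the lifter cutoff of 20 coefficients, the 60 to 400 Hz pitch search range and the synthetic test frame are all hypothetical choices.

```python
import numpy as np

def real_cepstrum(frame):
    """Real cepstrum: inverse FFT of the log magnitude spectrum of a windowed frame."""
    windowed = frame * np.hamming(len(frame))
    log_mag = np.log(np.abs(np.fft.rfft(windowed)) + 1e-10)  # offset avoids log(0)
    return np.fft.irfft(log_mag, n=len(frame))

def split_cepstrum(cepstrum, fs, cutoff=20, fmin=60.0, fmax=400.0):
    """Return (smoothed log envelope, pitch period in samples)."""
    n = len(cepstrum)
    # Low quefrencies carry the slowly varying spectral envelope.
    lifter = np.zeros(n)
    lifter[:cutoff] = 1.0
    lifter[n - cutoff + 1:] = 1.0  # mirrored half keeps the cepstrum symmetric
    log_envelope = np.fft.rfft(cepstrum * lifter).real
    # The strongest high-quefrency peak marks the pitch period.
    lo, hi = int(fs / fmax), int(fs / fmin)
    pitch_period = lo + int(np.argmax(cepstrum[lo:hi + 1]))
    return log_envelope, pitch_period

# Synthetic voiced frame: 100 Hz pulse train through a damped 700 Hz resonance.
fs = 8000
excitation = np.zeros(1024)
excitation[::80] = 1.0  # one pulse every 80 samples
t = np.arange(200)
impulse_response = np.exp(-t / 20.0) * np.sin(2 * np.pi * 700 * t / fs)
frame = np.convolve(excitation, impulse_response)[:1024]

cepstrum = real_cepstrum(frame)
log_envelope, pitch_period = split_cepstrum(cepstrum, fs)
pitch_hz = fs / pitch_period
```

Taking the logarithm turns the convolution of excitation and vocal-tract response into a sum, so the two parts occupy different quefrency regions and can be separated with a simple cutoff.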

This report is divided into six chapters. The first chapter gives a concise overview of the various processes involved in this project. The second chapter presents, in simple terms, a thorough analysis of the procedure used to accomplish morphing and the necessary theory. Processes such as preprocessing, cepstral analysis, dynamic time warping and signal re-estimation are described with the necessary diagrams. The third chapter gives a deeper insight into the actual morphing process. The conversion of the morphed signal into an acoustic waveform is dealt with in detail in the fourth chapter. Chapter five summarizes the whole morphing process with the help of a block diagram. Chapter six lists the conclusions that have been drawn from this project.

AN INTROSPECTION OF THE MORPHING PROCESS

We undertook this work because it sounded both challenging and interesting. We were eager to find out whether speech morphing would be feasible using the cepstral approach. Processes such as cepstral analysis and the re-estimation of the morphed speech signal into an acoustic waveform are intricate and challenging. The project also digs deep into the basics of digital signal processing, and of speech processing in particular, and so covers a lot of ground in that field.

Speech morphing can be achieved by transforming the signal's representation from the acoustic waveform obtained by sampling the analog signal, with which many people are familiar, to another representation. To prepare the signal for the transformation, it is split into a number of 'frames', which are short sections of the waveform. The transformation is then applied to each frame of the signal. This provides another way of viewing the signal information. The new representation (said to be in the frequency domain) describes the average energy present in each frequency band.
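The framing and frequency-domain steps above can be sketched as follows. The frame length of 256 samples, the hop of 128 samples, the Hann window and the names `frame_signal` and `magnitude_spectrogram` are assumptions made for this illustration; the report does not specify them.

```python
import numpy as np

def frame_signal(signal, frame_len, hop):
    """Split a 1-D signal into overlapping frames of frame_len samples."""
    n_frames = 1 + (len(signal) - frame_len) // hop
    return np.stack(
        [signal[i * hop : i * hop + frame_len] for i in range(n_frames)]
    )

def magnitude_spectrogram(signal, frame_len=256, hop=128):
    """Window each frame and return its magnitude spectrum, one row per frame."""
    frames = frame_signal(signal, frame_len, hop) * np.hanning(frame_len)
    return np.abs(np.fft.rfft(frames, axis=1))

# One second of a 1 kHz tone sampled at 8 kHz (illustrative input).
fs = 8000
t = np.arange(fs) / fs
tone = np.sin(2 * np.pi * 1000 * t)

spec = magnitude_spectrogram(tone)
# Frequency of the strongest band, averaged over all frames.
peak_hz = np.argmax(spec.mean(axis=0)) * fs / 256
```

Each row of `spec` is the frequency-domain view of one frame, so later stages can work on the signal frame by frame.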

Further analysis enables two pieces of information to be obtained: pitch information and the overall envelope of the sound. A key element in the morphing is the manipulation of the pitch information. If two signals with different pitches were simply cross-faded, it is highly likely that two separate sounds would be heard. This occurs because the combined signal would have two distinct pitches, causing the auditory system to perceive two different objects. A successful morph must exhibit a smoothly changing pitch throughout. The pitch information of each sound is compared to provide the best match between the two signals' pitches. To make this match, the signals are stretched and compressed so that important sections of each signal coincide in time. The two sounds can then be interpolated, which creates the intermediate sounds in the morph. The final stage is to convert the frames back into a normal waveform.
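The time alignment and interpolation steps can be sketched with a basic dynamic time warping routine. This is a simplified illustration that aligns two per-frame pitch contours and interpolates them along the warping path; the names `dtw_path` and `morph_contour`, the absolute-difference cost and the sample contours are assumptions, and a full morph would interpolate the envelope information as well.

```python
def dtw_path(a, b):
    """Return the minimum-cost alignment path between sequences a and b."""
    n, m = len(a), len(b)
    inf = float("inf")
    # cost[i][j] is the best cumulative cost of aligning a[:i] with b[:j].
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            cost[i][j] = d + min(
                cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1]
            )
    # Backtrack from the end to recover which frames were matched.
    path = []
    i, j = n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        steps = [
            (cost[i - 1][j - 1], i - 1, j - 1),  # advance both signals
            (cost[i - 1][j], i - 1, j),          # stretch b
            (cost[i][j - 1], i, j - 1),          # stretch a
        ]
        _, i, j = min(steps)
    path.reverse()
    return path

def morph_contour(a, b, alpha):
    """Interpolate aligned values; alpha=0 gives a, alpha=1 gives b."""
    return [(1 - alpha) * a[i] + alpha * b[j] for i, j in dtw_path(a, b)]

# Per-frame pitch values in Hz for two utterances (illustrative numbers).
pitch_a = [100, 120, 140]
pitch_b = [110, 110, 130, 150]

path = dtw_path(pitch_a, pitch_b)
halfway = morph_contour(pitch_a, pitch_b, 0.5)
```

Here the first frame of `pitch_a` is matched to the first two frames of `pitch_b`, which is the stretching described above. Sweeping `alpha` from 0 to 1 over the course of the morph would give a single, smoothly changing pitch rather than two competing ones.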

Guest

Please give the full report on voice morphing.

Guest

Hi, good morning,
I need voice morphing documentation for CSE.
prashanth.M