
Combining Statistical Parametric Speech Synthesis and Unit-Selection for Automatic Voice Cloning
The ability to use recorded audio of a subject's voice to produce an open-domain synthesis system has generated much interest in both academic research and commercial speech technology. The ability to produce synthetic versions of a subject's voice has potential commercial applications, such as virtual celebrity actors, and potential clinical applications, such as offering a synthetic replacement voice in the case of a laryngectomy. Recent developments in HMM-based speech synthesis have shown it is possible to produce synthetic voices from quite small amounts of speech data. However, mimicking the depth and variation of a speaker's prosody, as well as synthesising natural voice quality, is still a challenging research problem. In contrast, unit-selection systems have shown it is possible to strongly retain the character of the voice, but only with sufficient original source material; often this runs into hours and may require significant manual checking and labelling. In this paper we present two state-of-the-art systems: an HMM-based system, HTS-2007, developed by CSTR and the Nagoya Institute of Technology, and a commercial unit-selection system, CereVoice, developed by Cereproc. Both systems have been used to mimic the voice of George W. Bush (43rd President of the United States) using freely available audio from the web. In addition, we present a hybrid system which combines both technologies. We demonstrate examples of synthetic voices created from 10, 40 and 210 minutes of randomly selected speech. We then discuss the underlying problems associated with voice cloning using found audio, and the scalability of our solution.

Index Terms: speech synthesis, unit-selection, statistical parametric synthesis, voice cloning, HMM, speaker adaptation

1. Introduction

Vocal mimicry by computers is regarded with both awe and suspicion [1].
This is partly because perfect vocal mimicry is also the mimicry of our own sense of individuality: the use of a certain voice carries much more than the voice itself; it also carries the associations we have with that voice. Conveying this sense of character is becoming important in a whole range of innovative applications for human-computer interfaces that use speech for input and output. For example, the ability to produce synthetic versions of a subject's voice has attractive potential commercial applications, such as virtual celebrity actors, and beneficial potential clinical applications, such as offering a synthetic replacement voice in the case of a laryngectomy. In addition, the ability to retain the character of a speaker could be combined with translation systems, where it would help personalise speech-to-speech translation so that a user's speech in one language can be used to produce corresponding speech in another language while continuing to sound like the user's voice. In the future, this might eliminate the need for subtitles and onerous voice-over acting in international broadcasts or films. In this paper we investigate the reproduction/mimicry capabilities of current speech synthesis technologies: how well can we take a well-known speaker and duplicate his acoustic features, linguistic features, and speaking style so that a listener immediately recognises the speaker? Furthermore, how effective is this mimicry for conveying the character of the speaker in an amusing manner? We term the process of producing a speech synthesis system that can effectively mimic a speaker "voice cloning".
We apply two major competing technologies to this voice cloning problem. The first is a well-established and well-studied technique called "unit-selection", which concatenates segments of a speaker's source speech to create new utterances [2]. The second is often termed "statistical parametric synthesis", where a statistical acoustic model is trained or adapted from a speaker's source speech [3]. In our experiments, we apply both techniques to the problem of cloning the voice of George W. Bush (the 43rd President of the United States) and produce a short rendition of the introduction of a well-known children's story, "The Emperor's New Clothes". In addition, we explore the use of a new hybrid system which attempts to exploit the strengths of both approaches to create a more scalable means of mimicking voices.
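The unit-selection idea mentioned above can be illustrated with a minimal sketch: for each target sound, choose one recorded candidate unit so that the total of a "target cost" (mismatch with the desired unit) plus a "join cost" (discontinuity between consecutive chosen units) is minimised by dynamic programming. The function names, toy scalar features, and cost definitions below are illustrative assumptions for exposition only, not the HTS-2007 or CereVoice implementations.

```python
def select_units(targets, candidates, target_cost, join_cost):
    """Viterbi-style search: pick one candidate unit per target position,
    minimising summed target costs plus join costs between neighbours."""
    # best[i][j]: minimal total cost of a path ending with candidate j at position i
    best = [[target_cost(targets[0], c) for c in candidates[0]]]
    back = [[None] * len(candidates[0])]
    for i in range(1, len(targets)):
        row, ptr = [], []
        for c in candidates[i]:
            # Cost of arriving at candidate c from each previous candidate.
            costs = [best[i - 1][k] + join_cost(p, c)
                     for k, p in enumerate(candidates[i - 1])]
            k = min(range(len(costs)), key=costs.__getitem__)
            row.append(costs[k] + target_cost(targets[i], c))
            ptr.append(k)
        best.append(row)
        back.append(ptr)
    # Trace back the lowest-cost path of candidate indices.
    j = min(range(len(best[-1])), key=best[-1].__getitem__)
    path = [j]
    for i in range(len(targets) - 1, 0, -1):
        j = back[i][j]
        path.append(j)
    path.reverse()
    return [candidates[i][j] for i, j in enumerate(path)]


# Toy usage: "units" are bare pitch values; target cost is distance to the
# desired pitch, join cost penalises pitch jumps between consecutive units.
targets = [1.0, 2.0, 3.0]
candidates = [[0.9, 5.0], [2.1, 0.0], [3.0, 10.0]]
chosen = select_units(targets, candidates,
                      target_cost=lambda t, c: abs(t - c),
                      join_cost=lambda a, b: 0.1 * abs(a - b))
print(chosen)  # → [0.9, 2.1, 3.0]
```

In a real system each unit would carry spectral, pitch, and duration features rather than a single number, and the two cost functions encode perceptual mismatch and join discontinuity; the search structure, however, is essentially this dynamic program.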