16-09-2014, 02:15 PM
IMPLEMENTATION OF A TEXT-TO-SPEECH SYSTEM WITH
MACHINE LEARNING ALGORITHMS IN TURKISH
ABSTRACT
This study builds the framework of a concatenative TTS (Text-to-Speech) system for
Turkish, based on the concatenative, unit-selection approach. The system contains two
speech databases comprising units that were either recorded directly or cut from
continuous speech; the units were cut both manually and automatically. Digital signal
features such as zero crossing rate and short-time energy were used for automatic cutting.
While concatenating the units, the PSOLA (Pitch Synchronous Overlap and Add)
algorithm was used for smoothing.
Subjective tests were used to measure the system's success. The quality of the
synthesized speech was measured on two criteria: intelligibility and naturalness. For
naturalness, defined as closeness to human speech, the Mean Opinion Score (MOS) was
applied; for intelligibility, defined as the ability to be understood, the Diagnostic Rhyme
Test (DRT) and Comprehension Test (CT) were applied.
Although the system uses simple techniques, it provides promising results for Turkish
TTS, since the selected concatenative method is very well suited to the structure of the
Turkish language.
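The two signal features mentioned above for automatic unit cutting, zero crossing rate and short-time energy, can be sketched as follows. This is a minimal illustration requiring NumPy; the frame length, hop size and silence thresholds are assumptions for the example, not the thesis's actual parameters.

```python
# Sketch of the two features used for automatic unit cutting:
# short-time energy and zero crossing rate (ZCR), computed per frame.
# Frame length, hop and thresholds are illustrative assumptions.
import numpy as np

def frame_features(signal, frame_len=400, hop=200):
    """Return (energy, zcr) arrays, one value per analysis frame."""
    energies, zcrs = [], []
    for start in range(0, len(signal) - frame_len + 1, hop):
        frame = signal[start:start + frame_len]
        energies.append(float(np.sum(frame ** 2)))            # short-time energy
        signs = np.sign(frame)
        zcrs.append(int(np.sum(np.abs(np.diff(signs)) > 0)))  # sign changes = crossings
    return np.array(energies), np.array(zcrs)

def is_silence(energy, zcr, e_thresh=1e-3, z_thresh=50):
    # Low energy combined with low ZCR marks candidate cut points.
    return (energy < e_thresh) & (zcr < z_thresh)

# Toy example: 1 s of voiced "speech" (a tone) followed by 1 s of silence.
sr = 16000
t = np.arange(sr) / sr
speech = np.concatenate([0.5 * np.sin(2 * np.pi * 220 * t), np.zeros(sr)])
e, z = frame_features(speech)
cut_mask = is_silence(e, z)   # True in the silent second half
```

In a real system the thresholds would be tuned to the recordings, and the silence frames would be used as boundaries when cutting units out of continuous speech.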
INTRODUCTION
1.1 GENERAL PURPOSE
Text-to-Speech (TTS) is the technology that lets a computer speak to you. A TTS
system takes text as input; a program called the TTS engine analyzes and preprocesses the
text, then synthesizes speech using mathematical models. The TTS engine usually
produces audio data, such as a WAV or MP3 file, as output.
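The input-to-output flow just described can be sketched as a minimal pipeline. The helper names and the toy unit inventory below are hypothetical, standing in for a real engine's modules and recorded units.

```python
# Minimal sketch of a concatenative TTS pipeline, mirroring the
# text-in / preprocess / synthesize / audio-out flow described above.
# The helper names and the toy inventory are hypothetical.

def preprocess(text):
    """Normalize the input text (here: lowercase, keep letters and spaces)."""
    return "".join(ch for ch in text.lower() if ch.isalpha() or ch.isspace())

def to_units(text):
    """Split normalized text into synthesis units (here: single letters)."""
    return [ch for ch in text if not ch.isspace()]

def synthesize(units, inventory):
    """Concatenate the recorded waveform of each unit (lists stand in for audio)."""
    audio = []
    for u in units:
        audio.extend(inventory.get(u, []))
    return audio

# Toy inventory: each "recording" is just a short list of samples.
inventory = {"a": [0.1, 0.2], "b": [0.3], "c": [0.4, 0.5]}
out = synthesize(to_units(preprocess("Ab, c!")), inventory)
# out is the concatenated "audio" for units a, b, c
```

A real engine would of course use recorded waveforms as units and apply smoothing (e.g. PSOLA) at the joins rather than plain concatenation.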
In the computer science taxonomy, TTS falls under Natural Language Processing (NLP).
To understand TTS better, NLP should be understood first; so, in the following two
sections of the introduction, NLP and TTS will be examined in detail.
NATURAL LANGUAGE PROCESSING
NLP is a field of computer science concerned with the interactions between
computers and human (natural) languages.
NLP is a branch of artificial intelligence that deals with analyzing, understanding
and generating the languages that humans use naturally, so that humans can interface with
computers in both written and spoken contexts using natural human languages instead of
computer languages. To summarize, NLP aims to teach computers how humans learn and
use language, and how to speak to humans in their natural language.
Speech
Speech is the vocal form of human communication. It is based upon the
syntactic combination of lexical items and names drawn from very large vocabularies
(usually more than 10,000 different words). Each spoken word is created out of the
phonetic combination of a limited set of vowel and consonant speech sound units.
The process of human speech production is shown in Figure 1.4. The mechanism of
speech comprises four processes: language processing, in which the content of an
utterance is converted into phonemic symbols in the brain's language center; generation of
motor commands to the vocal organs in the brain's motor center; articulatory movement of
the vocal organs based on these motor commands; and the emission of air from the lungs
in the form of speech (Honda 2003). The air shaped by the vocal organs after being
emitted from the lungs constitutes the speech signal, a continuous acoustic waveform
created by the operation of the vocal organs in response to motor control commands from
the brain (Taylor, 2007). The speech signal waveform is shown in Figure 1.5.
FESTIVAL
Festival is a general multi-lingual speech synthesis system originally developed at
Centre for Speech Technology Research (CSTR) at the University of Edinburgh.
Substantial contributions have also been provided by Carnegie Mellon University and other
sites. Festival is free software. Festival and the speech tools are distributed under an X11-
type license allowing unrestricted commercial and non-commercial use alike.
Festival offers a full text-to-speech system with various APIs, as well as an environment
for the development and research of speech synthesis techniques. It is written in C++ with
a Scheme-like command interpreter for general customization and extension.
Festival is designed to support multiple languages, and comes with support for
English (British and American pronunciation), Welsh, and Spanish. Voice packages exist
for several other languages, such as Castilian Spanish, Czech, Finnish, Hindi, Italian,
Marathi, Russian and Telugu (Black et al., 1999).
Festival's support for MBROLA already covers a number of diphone sets, including
French, Spanish, German, Romanian, Hindi, Swedish and Turkish.
CONCLUSION
In this study, the framework of a Turkish TTS system that uses a concatenative synthesis
approach is implemented and evaluated. Although the system uses simple techniques, it
provides promising results for Turkish, since the selected approach, the concatenative
method, is very well suited to Turkish. This method is flexible enough to allow the
synthesis of all types of texts, and the concatenation units are obtained from the atomic units.
The system can be improved by improving the quality of the recorded speech files.
Sound files from news broadcasts, films, etc., can be mined to extract the recurrent sound
units of Turkish instead of recording the diphones one by one. There are ongoing projects
on the analysis of speech signals for various applications; these projects can be helpful
for obtaining a wide range of phonemes for synthesis.
Punctuation marks are removed in the preprocessing step to eliminate some
inconsistencies and obtain the core system. In future versions of the TTS, the text can be
synthesized in accordance with the punctuation, so as to convey emotion and intonation,
as has been partially achieved in some earlier research. A sentence ending with a question
mark can be synthesized with an interrogative intonation, and a sentence ending with an
exclamation mark with an exclamatory intonation. In addition, other punctuation marks
can help bring the synthesized speech closer to its human form, for example by pausing at
the end of sentences ending with a full stop and pausing after a comma.
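The punctuation-aware synthesis suggested above can be sketched as a small preprocessing step. The intonation tag names and pause lengths below are illustrative assumptions, not values from the thesis.

```python
# Sketch of punctuation handling: instead of stripping punctuation,
# map sentence-final marks to prosody hints (intonation tag + pause).
# Tag names and pause lengths are illustrative assumptions.

PROSODY = {
    "?": ("rising", 0.4),        # interrogative intonation
    "!": ("exclamatory", 0.4),   # exclamatory intonation
    ".": ("falling", 0.5),       # full stop: falling contour + longer pause
    ",": ("neutral", 0.2),       # comma: short pause only
}

def annotate(sentence):
    """Return (text_without_final_mark, intonation_tag, pause_seconds)."""
    if sentence and sentence[-1] in PROSODY:
        tag, pause = PROSODY[sentence[-1]]
        return sentence[:-1].rstrip(), tag, pause
    return sentence, "neutral", 0.0

print(annotate("Geliyor musun?"))  # ('Geliyor musun', 'rising', 0.4)
```

The synthesis stage could then select a pitch contour and insert silence according to the returned tag and pause, rather than treating all sentences identically.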
The evaluation process, which yields high accuracy for both the naturalness and
intelligibility criteria, is carried out using the MOS, CT and DRT techniques, as these are
the most frequently employed evaluation approaches in this field.
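As a reference point, the MOS used in the evaluation is simply the mean of listener ratings on a 1-to-5 scale. A minimal computation, with made-up example ratings rather than results from the thesis:

```python
# Minimal MOS (Mean Opinion Score) computation: listeners rate each
# synthesized utterance on a 1-5 scale and the scores are averaged.
# The ratings below are made-up example data.

def mos(ratings):
    if not ratings:
        raise ValueError("no ratings")
    if any(r < 1 or r > 5 for r in ratings):
        raise ValueError("ratings must be on the 1-5 scale")
    return sum(ratings) / len(ratings)

scores = [4, 5, 3, 4, 4]
print(mos(scores))  # 4.0
```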
The capabilities of the system are as follows:
• All words can be vocalized, since the units are very small.
• Special characters can be vocalized.
• Abbreviations can be recognized and vocalized.
• Dates can be recognized and vocalized.
• Currency values and numbers can be vocalized by default.
• Decimal numbers can be vocalized.
• Western-originated words such as train, plan and professor can be vocalized
correctly. For instance, the English word plan is vocalized as pi + lan in Turkish.
• Words borrowed from Arabic and Persian that are pronounced differently in
Turkish can be vocalized successfully. To illustrate, the syllable kâ in kâğıt is
vocalized with the circumflex, and me in me:mur is vocalized with a long
vowel e.
• The system has a fast reading option and a syllable-by-syllable reading option
(for new learners and children).
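The number vocalization listed above can be illustrated with a minimal digit-by-digit expansion. A full normalizer would handle place values ("123" as "yüz yirmi üç"); this sketch only shows the lookup step and is not the thesis's actual module.

```python
# Minimal sketch of number vocalization: map each digit to its Turkish name.
# A full text normalizer would handle place values; this digit-by-digit
# version only illustrates the lookup step.

DIGITS = {
    "0": "sıfır", "1": "bir", "2": "iki", "3": "üç", "4": "dört",
    "5": "beş", "6": "altı", "7": "yedi", "8": "sekiz", "9": "dokuz",
}

def spell_digits(number_text):
    """Spell a numeric string digit by digit in Turkish."""
    return " ".join(DIGITS[d] for d in number_text if d in DIGITS)

print(spell_digits("2014"))  # iki sıfır bir dört
```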
The sources of previously implemented NLP applications ought to be utilized by
future researchers for building more comprehensive and better TTS systems. For
instance, if Zemberek had not existed, a syllabication module would have had to be
implemented. Similarly, if a speech database is developed for Turkish, researchers can
reallocate the time otherwise spent on developing a speech database to increasing the
quality of the synthesized speech, such as by adding intonation and emotion. For
example, the TIMIT and Blizzard Challenge databases were developed for English, and
researchers utilize these databases to implement more comprehensive TTS applications.
The TIMIT database is a continuous, speaker independent, phonetically-balanced and
phonetically-labeled speech corpus developed by the Advanced Research Projects Agency
(ARPA). TIMIT contains broadband recordings of 630 speakers of eight major dialects of
American English, each reading ten phonetically rich sentences. The TIMIT corpus