Designing and Recording an Emotional Speech Database for Corpus Based Synthesis in Basque


Abstract

This paper describes an emotional speech database recorded for standard Basque. The database has been designed with the twofold purpose of being used for corpus based synthesis and of allowing the study of prosodic models for the emotions. The database is therefore large, in order to obtain good corpus based synthesis quality, and contains the same texts recorded in the six basic emotions plus a neutral style. The recordings were carried out by two professional dubbing actors, a man and a woman. The paper explains the whole creation process, beginning with the design stage, continuing with the corpus creation and the recording phases, and finishing with some lessons learned and hints.

Introduction

In recent years, progress in speech synthesis has largely passed the milestone of intelligibility, shifting the research effort towards naturalness and fluency. These features become more and more necessary as the synthesis tasks get larger and more complex: natural sound and good fluency and intonation are mandatory if a long synthesized text is to be understood.

Seeking naturalness, corpus based (or unit selection based) synthesis methods appeared around the second half of the last decade. These methods use concatenative speech synthesis techniques and try to minimize signal manipulation. In this way they preserve the original naturalness of the speech, minimizing the number of joins between voice fragments by using unit selection algorithms that favour large units (Sagisaka, 1998; Sagisaka et al., 1992).
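As a rough sketch of how such unit selection algorithms favour large units, the toy code below (not the system described in the paper; the data structures and cost weights are made up for illustration) scores candidate sequences with a target cost plus a concatenation cost that is zero for units that were adjacent in the recordings, so long contiguous stretches of recorded speech win:

```python
from dataclasses import dataclass
from itertools import product

# Illustrative structures only; the paper does not describe its unit
# selection implementation at this level of detail.
@dataclass(frozen=True)
class Candidate:
    corpus_pos: int    # position of the unit in the recorded corpus
    pitch: float       # mean F0 in Hz
    duration: float    # duration in seconds

def target_cost(c, want_pitch, want_dur):
    # Mismatch between a candidate and the predicted prosodic target.
    return abs(c.pitch - want_pitch) / 50.0 + abs(c.duration - want_dur) / 0.05

def join_cost(left, right):
    # Units that were adjacent in the recordings concatenate for free,
    # which rewards selecting long contiguous stretches of speech.
    return 0.0 if right.corpus_pos == left.corpus_pos + 1 else 1.0

def best_sequence(slots, targets):
    """Exhaustive search over one candidate per slot (fine for a toy example)."""
    best, best_cost = None, float("inf")
    for seq in product(*slots):
        cost = sum(target_cost(c, p, d) for c, (p, d) in zip(seq, targets))
        cost += sum(join_cost(a, b) for a, b in zip(seq, seq[1:]))
        if cost < best_cost:
            best, best_cost = seq, cost
    return best, best_cost
```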

Controlling speakers’ variability

Previous work has also taught us that it is impossible for the speakers to keep a constant reference level for their rhythm, tone, volume, etc. throughout a long recording session. The expected recording time for this database was spread over several sessions, so the effects of these variations were expected to be even more important. In order to quantify these deviations and remain able to compare prosodic parameters among emotions, a control text was also designed.

This control text consists of a short continuous passage (400 words long) that had to be read in neutral style at the beginning, middle and end of every session. In this way, the reference levels of the prosodic parameters for each session can be extracted from the control text, and the data of every emotion can be normalized against these reference levels.
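The paper does not detail the normalization procedure itself, but one simple way to realize it is to express each emotion's measurements relative to the mean and spread of the same parameter in that session's control-text readings. The hypothetical sketch below does this for F0:

```python
from statistics import mean, stdev

def session_reference(control_f0_values):
    """Reference level estimated from the control-text readings of one session
    (read at the start, middle and end of the session)."""
    return mean(control_f0_values), stdev(control_f0_values)

def normalize(emotion_f0_values, control_f0_values):
    """Express an emotion's F0 measurements relative to the session reference,
    so that sessions recorded on different days remain comparable."""
    ref_mean, ref_std = session_reference(control_f0_values)
    return [(f0 - ref_mean) / ref_std for f0 in emotion_f0_values]

# Example: F0 values (Hz) measured in the control text vs. an 'anger' passage.
control = [118.0, 121.5, 119.2, 120.8, 122.1]
anger = [155.0, 162.3, 149.8]
print(normalize(anger, control))   # values in session-relative units
```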

Phonetic balance

Besides ensuring that large units will be found in the database, it is also necessary to ensure that all the possible phonemes and certain phoneme combinations of the language are included. If we want the unit selection synthesis to produce at least the quality of other concatenative methods, we have to design the database so that it contains at least all the smallest units used by those other methods. A reasonable minimum size for these units is the diphoneme.

Once the minimum unit is selected, the purpose of the phonetic balance is to keep the appearance rate of these units in the database corpus as close as possible to their appearance rate in the actual language. In this way, frequent diphonemes will appear many times in the recorded database, in multiple contexts, while rare ones will appear perhaps only once, or may even have to be added explicitly.
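To make the balance criterion concrete, the illustrative snippet below (not the actual tooling used for this database) estimates diphoneme relative frequencies from phonemically transcribed utterances and measures how far a candidate corpus drifts from a large reference corpus of the language:

```python
from collections import Counter

def diphoneme_rates(utterances):
    """Relative frequency of each diphoneme (pair of adjacent phonemes).
    Each utterance is given as a list of phoneme symbols."""
    counts = Counter()
    for phones in utterances:
        counts.update(zip(phones, phones[1:]))
    total = sum(counts.values())
    return {dp: n / total for dp, n in counts.items()}

def balance_distance(corpus_rates, language_rates):
    """Total absolute deviation between corpus and language rates;
    0 means the corpus reproduces the language distribution exactly."""
    keys = set(corpus_rates) | set(language_rates)
    return sum(abs(corpus_rates.get(k, 0.0) - language_rates.get(k, 0.0))
               for k in keys)
```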

Requirements for unit selection techniques


In the previous section we saw that the prosodic study posed requirements that affect what we could call the "external" structure of the database. In this section we will see that the unit selection synthesis objective sets the requirements for the actual text contents of the database: the "internal" structure.

As said before, unit selection techniques need large databases to provide the selection algorithm with a good choice of candidate units. The main objective of the corpus design for these systems is to ensure that there are candidate units for the largest possible number of cases, that is, that the database coverage is broad enough.

The part of the database that will be used for this purpose is the one called the Main Corpus, so the following requirements only affect the contents of this part.

Creation of the corpus

Once the initial requirements have been stated, the next step in the process is to create the actual corpus to be recorded. Some of these requirements define coverage with respect to a pre-existing corpus of the language, and this is actually the first step: gathering a large amount of text from which the final recording corpus will be extracted.

In this work the initial corpus is a set of texts coming from different sources: the main portion consists of two years of text from a Basque newspaper, other texts come from several novels, and a number of smaller corpora, previously obtained for other work at Aholab, were already cleaned and balanced for phonetic coverage.
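Extracting the final recording corpus from such a large text collection is often done with a greedy selection that repeatedly picks the sentence contributing the most not-yet-covered units. The sketch below illustrates this general idea only; the actual selection procedure used for this database is not reproduced here:

```python
def greedy_select(sentences, max_sentences):
    """Pick sentences that add the most new diphonemes first.
    `sentences` maps a sentence id to its list of phonemes."""
    covered, chosen = set(), []
    remaining = dict(sentences)
    while remaining and len(chosen) < max_sentences:
        def gain(item):
            _, phones = item
            return len(set(zip(phones, phones[1:])) - covered)
        best_id, best_phones = max(remaining.items(), key=gain)
        if gain((best_id, best_phones)) == 0:
            break                      # nothing new left to cover
        chosen.append(best_id)
        covered |= set(zip(best_phones, best_phones[1:]))
        del remaining[best_id]
    return chosen, covered
```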

Conclusions

The recorded database consists of approximately 1.5 hours per emotion, which makes up 10.5 hours of recordings per speaker and more than 20 hours in total. This database represents a new linguistic resource that will allow the study of emotional speech in standard Basque, as well as high quality unit selection based synthesis. The large extent of the database will also enable future research in other areas, such as speech modification, corpus based prosody and so on.