Unit-Selection Text-to-Speech Synthesis Using an Asynchronous Interpolation Model
Abstract
We describe the Asynchronous Interpolation Model, which represents
speech as a composition of several different types of
feature streams that are computed using asynchronous interpolation
of neighboring basis vectors, according to transition
weights. When applied to the acoustic inventory of a concatenative
Text-to-Speech synthesizer, the model eliminates concatenation
errors and affords opportunities for high rates of
compression and voice transformation. We propose a particular
instance of the model that uses formant frequency values
and formant-normalized complex spectra as two types of
streams, in conjunction with a unit-selection synthesizer. During
analysis, basis vectors and transition weights were estimated
automatically, using three different labeling schemes and dynamic
programming methods. An evaluation of the intelligibility
and quality of the synthesized speech showed significant improvements
over a standard, size-matched compression scheme.
The proposed method was also able to convincingly transform
speaker characteristics through replacement of basis vectors.
Introduction
Today’s most natural sounding Text-to-Speech (TTS) synthesis
systems are based on the concatenative synthesis approach,
which uses a multitude of pre-recorded speech “chunks” (contiguous
sections of natural speech) from a single speaker, stored in
an acoustic inventory, to stitch together a new output signal.
The quality of the resulting speech relates directly to the size
of the database: the larger the chunks, the fewer the
concatenation points at which audible artifacts can
occur. Moreover, when the prosodic space is not covered by
the acoustic inventory, prosodic modification becomes necessary,
further degrading the speech signal. The concatenative approach
can be contrasted with the formant synthesis approach,
which is compact in size, gives full prosodic and spectral control
over the speech signal, and is highly intelligible, but which
does not sound very natural.
The Asynchronous Interpolation Model
The core idea of the Asynchronous Interpolation Model (AIM) is to represent a short region (on the order
of 5–10 ms) of speech as a composition of several types of features
called streams. Each stream is computed by asynchronous
interpolation of neighboring basis vector features. Each basis
vector is associated (labeled) with a particular phoneme, allophone,
or more specialized unit and may contain additional
information about phonetic and prosodic context. Thus, the
speech region is described by the varying degrees of influence
of several types of preceding and following acoustic features. In
this section, we extend and improve upon the notation reported
previously [6].
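
To make the core interpolation concrete, here is a minimal Python sketch of one plausible reading of this step, assuming plain linear interpolation between the previous and next basis vectors and one independent weight per stream; the function and variable names are illustrative, not from [6].

import numpy as np

def interpolate_stream(prev_basis, next_basis, weight):
    # One stream value for a short (~5-10 ms) region: a weighted mix of
    # the two neighboring basis vectors. A weight of 0 reproduces the
    # previous basis vector, a weight of 1 the next.
    return (1.0 - weight) * prev_basis + weight * next_basis

def compose_region(prev_vectors, next_vectors, weights):
    # Each stream carries its own weight trajectory, so streams move
    # between basis vectors asynchronously, hence the model's name.
    return {name: interpolate_stream(prev_vectors[name],
                                     next_vectors[name],
                                     weights[name])
            for name in weights}

In the instance proposed below, prev_vectors and next_vectors would hold the formant-frequency and formant-normalized-spectrum features of the two neighboring units, and weights the per-stream transition weights estimated during analysis.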
Implementation
In our specific implementation, we reduced phonetic and
prosodic context by constraining the summation of Equation 2
to only depend on the previous and the next unit; in other words,
the influence of a basis vector never extends beyond its neighbor.
We chose two types of features, namely formant frequency
locations and the formant-normalized complex spectrum. The
latter is the result of modifying the complex spectrum so that
formants appear at constant neutral values, allowing the interpolation
of spectra without adding extraneous formants.
Composition Operation
The task of the composition operation is to receive a vector of
stream values and then render a short segment of speech. In
our case the inputs are formant-normalized complex spectra and
formant frequency values, and the composition consists of returning
a modified complex spectrum with the neutral formant
frequency locations changed to the specified ones.
Modifying formant frequencies in the natural spectrum has
been previously researched [10, 11, 12]. Our implementation
consists of non-uniformly resampling the original spectrum (see
Figure 1). In addition to formant frequencies, we specify a
modification-cutoff frequency at 6000 Hz to stop modification
of the spectrum at and above that frequency. Conversely, the
formant-normalized spectra themselves were initially created
by modifying the original spectrum with associated original formant
frequency locations to have formants at a constant neutral
location.
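
The resampling function itself is not given in this excerpt; the sketch below shows one common way such a modification can be realized, a piecewise-linear frequency warp that maps assumed neutral formant positions to the requested targets and pins the spectrum at 0 Hz and at the 6000 Hz cutoff. The neutral positions and the bin layout are assumptions for illustration.

import numpy as np

NEUTRAL_FORMANTS = [500.0, 1500.0, 2500.0]  # assumed neutral F1-F3 (Hz)
CUTOFF_HZ = 6000.0  # spectrum is left untouched at and above this frequency

def move_formants(norm_spectrum, freqs, target_formants):
    # Piecewise-linear map sending each neutral formant to its target,
    # anchored at 0 Hz, the modification cutoff, and the top of the band.
    src = np.array([0.0] + NEUTRAL_FORMANTS + [CUTOFF_HZ, freqs[-1]])
    dst = np.array([0.0] + list(target_formants) + [CUTOFF_HZ, freqs[-1]])
    # For each output bin, find which source frequency to read from, then
    # resample the complex spectrum there (real and imaginary parts).
    read_at = np.interp(freqs, dst, src)
    return (np.interp(read_at, freqs, norm_spectrum.real)
            + 1j * np.interp(read_at, freqs, norm_spectrum.imag))

Under this reading, the normalization step is simply the inverse warp, with the measured formant frequencies as sources and the neutral locations as targets.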
Analysis
During analysis, synthesis, and evaluation, the system utilizes
a small unit-selection database of a female speaker “AS” [13],
which covers all diphones and specific triphones that are known
to have a significant amount of coarticulation, but which does
not have complete prosodic coverage.
Basis Vectors
In the proposed implementation, basis vectors contain information
about both the complex spectrum and formant frequency
locations. Therefore, the analysis process begins by making
initial estimates of formant frequency trajectories F1, F2, and
F3, using the ESPS get_formant algorithm [14].
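
get_formant is essentially an LPC-based tracker; as a generic illustration (not a reimplementation of the ESPS algorithm, which adds continuity constraints across frames), the standard LPC-root method for raw per-frame formant candidates looks like this:

import numpy as np

def lpc_formant_candidates(frame, fs, order=12):
    # Autocorrelation method: window the frame and solve the normal
    # equations for the linear-prediction coefficients.
    x = frame * np.hamming(len(frame))
    r = np.correlate(x, x, mode="full")[len(x) - 1 : len(x) + order]
    R = np.array([[r[abs(i - j)] for j in range(order)] for i in range(order)])
    a = np.linalg.solve(R, r[1 : order + 1])
    # Each complex-conjugate pole pair of the prediction polynomial is a
    # resonance; the pole angle gives a candidate formant frequency.
    roots = np.roots(np.concatenate(([1.0], -a)))
    roots = roots[np.imag(roots) > 0.0]  # one root per conjugate pair
    freqs = np.angle(roots) * fs / (2.0 * np.pi)
    return sorted(f for f in freqs if 90.0 < f < fs / 2.0 - 90.0)

Taking the three lowest surviving candidates per frame yields rough F1, F2, and F3 trajectories of the kind used as initial estimates here.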
Evaluation
Intelligibility and Quality
The following four conditions were compared: (1) the standard
OGI TTS baseline system [13] at 352.8 kbps, (2) the baseline
compressed with the Speex CELP coder [16] at 8.0 kbps, (3) the
baseline compressed with the Speex CELP coder at 3.4 kbps,
and (4) the BioSpeech AIM TTS system using the global labeling
scheme at 3.4 kbps. The average bit rate for AIM was
computed as follows: 54 basis vectors with an average dimension
of 62, with each component represented by 16 bits,
yield 53,568 bits. Each of the 63,716 frames of the acoustic
inventory contains an 8-bit number that marks the position
of the frame, plus two 4-bit transition
weights, for a total of 1,019,456 bits. Finally, the 132,300-
bit wave library is added, so the entire
database is represented in 1,205,324 bits, or 3,414 bps. Compared
to the original representation of 124,530,928 bits, or 352.8 kbps,
this is a 103:1 compression ratio.
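
The arithmetic can be verified directly from the figures quoted above:

basis_bits = 54 * 62 * 16          # basis vectors: count x avg. dimension x bits
frame_bits = 63716 * (8 + 4 + 4)   # per frame: 8-bit position + two 4-bit weights
wave_bits = 132300                 # wave library
total_bits = basis_bits + frame_bits + wave_bits
print(total_bits)                  # 1205324
print(124530928 / total_bits)      # ~103.3, i.e., the 103:1 ratio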
Speaker Recognizability
In this test, a source speaker’s basis vectors of an acoustic inventory
were replaced with basis vectors from a target speaker’s
acoustic inventory, while leaving the transition weights unchanged.
Prosody was kept exactly constant for all stimuli to
ensure that speaker recognizability performance was measured
based on spectral cues only, and not on prosodic cues.
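
Schematically, the manipulation amounts to swapping in the target speaker's basis vectors while reusing everything else from the source inventory; the dictionary layout below is purely illustrative.

def replace_speaker(source_inventory, target_basis_vectors):
    # Keep the source's frame positions and transition weights (and hence
    # its timing and coarticulation patterns), but take every basis
    # vector from the target speaker, so only speaker identity changes.
    return {
        "basis_vectors": target_basis_vectors,
        "transition_weights": source_inventory["transition_weights"],
        "frame_positions": source_inventory["frame_positions"],
    }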
The text material used in this experiment consisted of 40
sentences, randomly selected from the IEEE Harvard Psychoacoustic
Sentences [17]. The sentences were synthesized using
AIM with representations derived from the acoustic inventories
of five male voices, aged 21–39, whose native language
was American English. The local labeling scheme was used for
highest synthesis quality. For 20 of the sentences, the original
basis vectors were replaced by basis vectors derived from exactly
one of the other four voices.
Conclusion
We have described a speech synthesis system based on the
Asynchronous Interpolation Model, which represents speech as
a composition of several streams that are computed using asynchronous
interpolation of neighboring basis vectors. Applied
to a concatenative TTS system’s acoustic inventory, the model
avoids concatenation errors during synthesis, and affords opportunities
for variable compression and a new approach to voice
transformation. During evaluation, AIM produced significantly
higher quality and intelligibility than speech that had been compressed
by traditional methods at bit rates equal to AIM's or
more than twice as large. The AIM compression ratio
in this study was 103:1; this could be increased further
by additional parametrization of the transition weights. Results also
showed that AIM produces speech that can be reliably identified
with a desired target speaker, using an extremely small amount
of training speech.