08-11-2012, 02:32 PM
Dialogic® Continuous Speech Processing API
Purpose
This publication provides guidelines for building applications using the Dialogic® Continuous
Speech Processing (CSP) Software and Dialogic® Voice Software in a Linux or Windows®
environment.
It is a companion guide to the Dialogic® Continuous Speech Processing API Library Reference,
which provides details on functions and parameters in the Dialogic® CSP library.
Applicability
This document version (05-1699-006) is published for Dialogic® Host Media Processing Software
Release 3.1LIN.
This document may also be applicable to other software releases (including service updates) on
Linux or Windows® operating systems. Check the Release Guide for your software release to
determine whether this document is supported.
How to Use This Publication
Refer to this publication after you have installed the hardware and the system software, which
includes the CSP software.
This publication assumes that you are familiar with the Linux or Windows® operating system and
the C programming language. It is helpful to keep the Dialogic® Voice API Library Reference and
Dialogic® Voice API Programming Guide handy as you develop your application.
Key Features
The Dialogic® CSP Software provides a high-level interface to Dialogic® boards and is a building
block for creating host-based automatic speech recognition (ASR) applications. Dialogic® CSP
Software gives you the ability to stream voice-activated, pre-speech buffered, echo-cancelled voice
data to an ASR engine.
Dialogic® CSP Software consists of a library of functions, device drivers, firmware, sample
demonstration programs and technical documentation to help you create leading-edge ASR
applications. It is an enhancement to existing echo cancellation resource (ECR) and barge-in
technology.
Key features of CSP include:
• Full-duplex operation, which means the capability of simultaneously sending and receiving
(playing and recording) voice data on a single CSP channel.
• Echo canceller that significantly reduces echo in the incoming signal (up to 64 ms on select
Dialogic® DM3 boards and up to 16 ms on Dialogic® Springware boards).
• Voice activity detector (VAD) that determines when significant audio energy is detected on the
channel and enables data to be sent only when speech is present, thereby reducing CPU
loading.
CSP Components
The Dialogic® CSP Software consists of several components, many of which reside at the
firmware level of the board:
• echo canceller
• voice activity detector (VAD)
• pre-speech buffer
• barge-in and voice event signaling
• streaming or recording
Figure 1 depicts the data flow from the network to the CSP voice channel. This figure shows how
echo is introduced in the signal in the network and how it is cancelled. It also illustrates the option
of sending echo-cancelled data over the TDM bus to another board, regardless of whether this
second board is CSP-capable or not.
Echo Canceller Overview
The echo canceller is a component of the Dialogic® CSP Software that applications use to
reduce echo in the incoming signal. In the scenario described in Section 1.2, “CSP
Components”, the incoming signal is the utterance “Steve Smith.” After echo cancellation,
the “Steve Smith” signal contains only insignificant echo and can be processed more accurately by
the speech recognition engine.
Figure 3 shows a close-up view of how the echo canceller works. After the incoming signal is
processed by the echo canceller, the resulting signal no longer has significant echo and is then sent
to the host application.
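The excerpt does not say which algorithm the firmware uses to cancel echo. As a rough illustration of the general idea only, the sketch below uses a normalized least-mean-squares (NLMS) adaptive filter, a common textbook technique chosen here as an assumption, not the Dialogic® implementation. The names Nlms, nlms_init and nlms_process, the step size, and the 128-tap length are all hypothetical. The filter learns how the outgoing (played) signal reappears in the incoming signal and subtracts that estimate.

```c
#include <assert.h>
#include <stddef.h>

#define NLMS_TAPS 128   /* 128 taps x 125 us = 16 ms of echo delay (illustrative) */
#define NLMS_STEP 0.5   /* adaptation step size, an illustrative value */
#define NLMS_EPS  1e-6  /* keeps the normalisation away from division by zero */

/* Hypothetical state for a textbook NLMS echo canceller. */
typedef struct {
    double weight[NLMS_TAPS]; /* running estimate of the echo path */
    double ref[NLMS_TAPS];    /* recent outgoing samples, ref[0] is the newest */
} Nlms;

void nlms_init(Nlms *f)
{
    assert(f != NULL);
    for (size_t i = 0; i < NLMS_TAPS; i++) {
        f->weight[i] = 0.0;
        f->ref[i] = 0.0;
    }
}

/* Takes one outgoing (played) sample and one incoming sample.
 * Returns the incoming sample with the estimated echo subtracted. */
double nlms_process(Nlms *f, double outgoing, double incoming)
{
    assert(f != NULL);

    /* Shift the delay line and store the newest outgoing sample. */
    for (size_t i = NLMS_TAPS - 1; i > 0; i--)
        f->ref[i] = f->ref[i - 1];
    f->ref[0] = outgoing;

    /* Predict the echo and measure the energy of the reference window. */
    double echo_estimate = 0.0;
    double power = NLMS_EPS;
    for (size_t i = 0; i < NLMS_TAPS; i++) {
        echo_estimate += f->weight[i] * f->ref[i];
        power += f->ref[i] * f->ref[i];
    }

    /* What remains after subtracting the predicted echo. */
    double residual = incoming - echo_estimate;

    /* Nudge the weights so the next prediction is closer. */
    double gain = NLMS_STEP * residual / power;
    for (size_t i = 0; i < NLMS_TAPS; i++)
        f->weight[i] += gain * f->ref[i];

    return residual;
}
```

An application-level experiment would call nlms_init once and then nlms_process once per sample. Echo whose delay exceeds the tap count cannot be modelled by the filter and stays in the output.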
Tap Length
The duration of an echo is measured in tens of milliseconds. An echo canceller can remove echo
only up to a limited duration, and this limit is known as the length of the echo canceller.
The length of an echo canceller is sometimes given in “taps,” where each tap is 125 microseconds
long.
The longer the tap length, the more echo is cancelled from the incoming signal, but the more
processing power is required. When determining the tap length value, consider the
length of the echo delay in your system as well as your overall system configuration.
On Dialogic® DM3 boards, the media load which is downloaded when you start the board
determines what tap length values are supported. Some Dialogic® DM3 boards support one value
only, 128 taps (16 ms). Other Dialogic® DM3 boards support 512 taps (64 ms). For information on
media loads, see the appropriate Configuration Guide for your product or product family. For
information on tap length support on Dialogic® DM3 boards, see the Release Guide for the system
release you are using.
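The relationship between taps and milliseconds is simple arithmetic: each tap is 125 microseconds, so 128 taps cover 16 ms and 512 taps cover 64 ms. The helper functions below are a minimal sketch of that conversion. The names taps_to_ms and ms_to_taps are hypothetical and are not part of the Dialogic® CSP library.

```c
#include <assert.h>

/* One tap spans 125 microseconds, as stated in the text. */
#define TAP_DURATION_US 125

/* Echo canceller length in milliseconds for a given tap count. */
double taps_to_ms(int taps)
{
    assert(taps >= 0);
    return (double)taps * TAP_DURATION_US / 1000.0;
}

/* Smallest tap count whose length covers echo_ms milliseconds of echo. */
int ms_to_taps(double echo_ms)
{
    assert(echo_ms >= 0.0);
    double us = echo_ms * 1000.0;
    int taps = (int)(us / TAP_DURATION_US);
    if ((double)taps * TAP_DURATION_US < us)
        taps++; /* round up so the whole echo is covered */
    return taps;
}
```

For example, an echo delay of 10 ms needs at least 80 taps, which fits within a 128-tap canceller. The result says nothing about whether a given board supports that tap length.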
Voice Activity Detector (VAD)
When a caller begins to speak over a prompt (also known as barge-in), the application typically
stops the playing of the prompt so that it isn’t distracting to the caller.
A voice activity detector (VAD) is a component in the Dialogic® CSP Software that examines the
caller’s incoming signal and determines whether the signal contains significant energy and is likely
to be speech rather than, for example, a click. What counts as significant is set by configurable
parameters, such as the energy threshold applied during prompt play and the threshold applied
after the prompt has completed play. For more information, see the parameter descriptions for
ec_setparm( ) in the Dialogic® Continuous Speech Processing API Library Reference.
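To make the idea of an energy threshold concrete, here is a minimal host-side sketch of an energy-based detector. It is an illustration under stated assumptions, not the algorithm used by the CSP firmware. The type VadSketch, the functions frame_energy and vad_update, and the idea of requiring several consecutive loud frames to reject clicks are all hypothetical. It keeps one threshold for use while the prompt is playing and another for after the prompt has finished.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical detector state; none of these names come from the CSP API. */
typedef struct {
    double play_threshold;    /* mean-square energy required while a prompt plays */
    double nonplay_threshold; /* mean-square energy required after the prompt ends */
    int    trigger_frames;    /* consecutive loud frames needed to declare speech */
    int    loud_run;          /* loud frames seen in a row so far */
} VadSketch;

/* Mean-square energy of one frame of 16-bit linear PCM samples. */
double frame_energy(const int16_t *pcm, size_t n)
{
    assert(pcm != NULL && n > 0);
    double sum = 0.0;
    for (size_t i = 0; i < n; i++)
        sum += (double)pcm[i] * (double)pcm[i];
    return sum / (double)n;
}

/* Feeds one frame to the detector. Returns 1 once enough consecutive
 * frames exceed the active threshold, 0 otherwise. */
int vad_update(VadSketch *v, const int16_t *pcm, size_t n, int prompt_playing)
{
    assert(v != NULL);
    double threshold = prompt_playing ? v->play_threshold
                                      : v->nonplay_threshold;
    if (frame_energy(pcm, n) >= threshold)
        v->loud_run++;    /* energy is still significant */
    else
        v->loud_run = 0;  /* an isolated burst, such as a click, resets the run */
    return v->loud_run >= v->trigger_frames;
}
```

With trigger_frames set to 3, a single loud frame followed by silence is ignored, while three loud frames in a row are reported as speech. An application could use that result to stop the prompt for barge-in.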
For more information on the VAD, see Chapter 5, “Using the Voice Activity Detector”. For information
on the available operating modes for the VAD, see Section 5.1, “Voice Activity Detector Operating
Modes”.