01-10-2012, 12:00 PM
Statistical Machine Translation
statistical_mt.pdf (Size: 225.33 KB / Downloads: 42)
Abstract
Machine Translation (MT) refers to the use of computers for the task of translating
automatically from one language to another. The differences between languages
and especially the inherent ambiguity of language make MT a very difficult problem.
Traditional approaches to MT have relied on humans supplying linguistic knowledge
in the form of rules to transform text in one language to another. Given the vastness
of language, this is a highly knowledge-intensive task. Statistical MT is a radically
different approach that automatically acquires knowledge from large amounts of
training data. This knowledge, which is typically in the form of probabilities of various
language features, is used to guide the translation process. This report provides
an overview of MT techniques, and looks in detail at the basic statistical model.
Machine Translation: an Overview
Machine Translation (MT) can be defined as the use of computers to automate
some or all of the process of translating from one language to another. MT is an
area of applied research that draws ideas and techniques from linguistics, computer
science, Artificial Intelligence (AI), translation theory, and statistics. Work began
in this field as early as the late 1940s, and various approaches — some ad hoc,
others based on elaborate theories — have been tried over the past five decades.
This report discusses the statistical approach to MT, which was first suggested by
Warren Weaver in 1949 [Weaver, 1949], but has found practical relevance only in
the last decade or so. This approach has been made feasible by the vast advances
in computer technology, in terms of speed and storage capacity, and the availability
of large quantities of text data.
This chapter provides the context for the detailed discussion of Statistical MT
that appears in the following chapters. In this chapter, we look at the issues in
MT, and briefly describe the various approaches that have evolved over the last five
decades of MT research.
Difficulties in Machine Translation
Although the ultimate goal of MT, as in AI, may be to equal the best human efforts,
the current targets are much less ambitious. MT aims to translate not literary
works but technical documents, reports, instruction manuals, etc. Even here, the
goal usually is not fluent translation, but only correct and understandable output.
To appreciate the difficulty of MT, we will look at some examples of language
features that are especially problematic from the point of view of translation.
Structural Differences
Every language follows a characteristic sentence structure: English, for example,
uses a Subject-Verb-Object (SVO) ordering, whereas Hindi is a Subject-Object-Verb
(SOV) language. Apart from this basic feature, languages also differ in the
structural (or syntactic) constructions that they allow and disallow. These differences
have to be respected during translation.
For instance, post-modifiers in English become pre-modifiers in Hindi, as can
be seen from the following pair of sentences. These sentences also illustrate the
SVO and SOV sentence structure in these languages. Here, S is the subject of the
sentence, S_m is the subject modifier, and similarly for the verb (V) and the object
(O).
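The reordering described above can be sketched in a few lines of Python. This is only a toy illustration under the constituent labels used in the text (S, S_m, V, O); the example sentence and the reordering function are assumptions for illustration, not part of the report.

```python
# Toy sketch: English uses SVO order with post-modifiers (S S_m V O);
# Hindi uses SOV order with pre-modifiers (S_m S O V).

def english_to_hindi_order(constituents):
    """Reorder English constituents into the order used by Hindi."""
    s = constituents["S"]      # subject
    s_m = constituents["S_m"]  # subject modifier (post-modifier in English)
    v = constituents["V"]      # verb
    o = constituents["O"]      # object
    # Hindi: modifier precedes the subject, and the verb comes last.
    return [s_m, s, o, v]

# Hypothetical example: "the president of India visited the city"
parts = {"S": "the president", "S_m": "of India",
         "V": "visited", "O": "the city"}
print(" ".join(english_to_hindi_order(parts)))
# -> "of India the president the city visited" (gloss of the Hindi order)
```

A real system would of course operate on full parse trees rather than a flat dictionary of constituents, but the word-order divergence is the same.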
Approaches to Machine Translation
We have seen in the previous section that languages differ in vocabulary and structure.
MT, then, can be thought of as a process that reduces these differences to the
extent possible. This perspective leads to what is known as the transfer model for
MT.
The Transfer Approach
The transfer model involves three stages: analysis, transfer, and generation. In the
analysis stage, the source language sentence is parsed, and the sentence structure and
the constituents of the sentence are identified. In the transfer stage, transformations
are applied to the source language parse tree to convert the structure to that of the
target language. The generation stage translates the words and expresses the tense,
number, gender etc. in the target language.
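The three stages of the transfer model can be sketched as a small pipeline. Everything here is a stand-in: the "parser" assumes a pre-split (S, V, O) triple, the transfer rule handles only SVO-to-SOV reordering, and the lexicon is a hypothetical three-word English-to-Hindi gloss table.

```python
# Minimal sketch of the analysis -> transfer -> generation pipeline.

def analyze(sentence):
    """Analysis: identify the constituents of the source sentence.
    A real system would parse the sentence; we assume a (S, V, O) triple."""
    s, v, o = sentence
    return {"S": s, "V": v, "O": o}

def transfer(tree):
    """Transfer: rewrite the English SVO structure as Hindi SOV."""
    return [tree["S"], tree["O"], tree["V"]]

def generate(constituents, lexicon):
    """Generation: choose target-language words for each constituent.
    Tense, number, and gender agreement are omitted in this sketch."""
    return " ".join(lexicon.get(word, word) for word in constituents)

# Hypothetical romanized Hindi glosses.
lexicon = {"boy": "ladkaa", "ate": "khaayaa", "apple": "seb"}
print(generate(transfer(analyze(("boy", "ate", "apple"))), lexicon))
# -> "ladkaa seb khaayaa"
```

The value of this decomposition is that each stage isolates one kind of knowledge: syntax in analysis, structural divergence in transfer, and morphology and lexical choice in generation.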
Corpus-based Approaches
The approaches that we have seen so far all use human-encoded linguistic knowledge
to solve the translation problem. We will now look at some approaches that do not
explicitly use such knowledge, but instead use a training corpus (plur. corpora) of
already translated texts — a parallel corpus — to guide the translation process. A
parallel corpus consists of two collections of documents: a source language collection,
and a target language collection. Each document in the source language collection
has an identified counterpart in the target language collection.
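In practice, corpus-based systems usually work with sentence-aligned data: two parallel sequences whose i-th entries are translations of each other. A minimal sketch, with invented English-German sentence pairs standing in for a real corpus:

```python
# A sentence-aligned parallel corpus as two parallel lists: the i-th
# source sentence corresponds to the i-th target sentence.
source = ["the house is small", "the book is old"]
target = ["das Haus ist klein", "das Buch ist alt"]

# Pair them up into (source, target) training examples.
corpus = list(zip(source, target))
for src, tgt in corpus:
    print(f"{src}  |||  {tgt}")
```

Statistical models are then trained by counting co-occurrences of words and phrases across these aligned pairs.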