Development of a Web-based Service to Transcribe Between Multiple
Orthographies of the Iu Mien Language
Abstract
The goal of this study was to explore the use of machine learning techniques in the
development of a web-based application that transcribes between multiple orthographies
of the same language. To this end, the source text files used to publish the
Iu Mien Bible translation in four scripts were merged into a single textbase that served
as a text corpus for this study.
All syllables in the corpus were combined into a list of parallel renderings, which
were subjected to ID3 decision-tree induction and to neural networks with backpropagation
in an attempt to achieve machine learning of transcription between the different
Iu Mien orthographies. The most effective set of neural-net transcription rules was captured and
incorporated into a web-based service where visitors could submit text in one writing
system and receive a webpage containing the corresponding text rendered in the other
writing systems of this language. Transcription accuracy in excess of 90% was
achieved between one Roman script and another, or between one non-Roman script
and another. Transcriptions between a Roman script and a non-Roman script yielded
output that was only 50% correct. The system is still being tested
and improved by linguists and volunteers from various organizations associated with
the target community within Thailand, Laos, Vietnam and the USA.
This study demonstrates the potential of this approach for developing written
materials in languages with multiple scripts. It also provides useful insights
into how this technology might be improved.
Introduction
Multiple orthographies of a language
All writing systems are an attempt to use orthographic symbols to represent various
linguistic features of a language. However, individual writing systems differ in the
level of phonetic and linguistic information captured, the range of symbols used, and
the formal constraints governing the mapping between these entities. This is true even
when multiple writing systems are used to describe the same language.[6]
Multiple orthographies commonly arise within a language to address specific needs
of subgroups of readers. For example, the majority of readers of English are familiar
with the Roman script used in most English publications. However, this is not the only
orthography for English in current use. Visually impaired readers prefer Grade 3 Braille,
which uses contractions rendered in six-dot Braille cells to improve tactile reading speed.
Linguists use the International Phonetic Alphabet (IPA) to record and describe the
regional accents of spoken English. American dictionary publishers commonly use
some variant of the pronunciation symbols derived by Noah Webster to describe
the standard pronunciation of words. Gregg shorthand, Speedwriting, and court
stenographic systems are all different attempts to increase the speed and accuracy of
manual transcription of English dictation. The ideographic system of emoticons and
acronyms used in English text messaging has become very popular among Internet
users and is still evolving.
The value of parallel texts
Parallel texts provide excellent opportunities for data-mining and machine learning.
Ever since the Rosetta Stone was used in the early 19th century to crack the secrets
of hieroglyphics, parallel texts have been used for gleaning rules for translation and
transcription. Lessons learned with hieroglyphics were quickly applied to other sets
of parallel texts to determine the phonetics of various Semitic languages.[17] With
the dawn of computers and machine learning techniques, corresponding elements in
pairs of text in large parallel text sets can be tagged, linked and analyzed as a means
for unraveling the meaning of natural human language. However, a large corpus of
text can also contain noise arising from inconsistencies in text entry, regional
language variation, and human error. Because these errors and inconsistencies give rise to
artifacts that lower the accuracy and efficiency of the machine learning, considerable
effort is required to develop filters that would result in consistent parallel text.[18]
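To make the filtering requirement concrete, the following is a minimal sketch in Ruby, the implementation language of this project. It assumes a pre-tokenized textbase with one verse per line and space-separated syllables, and it aligns two scripts pairwise; the file names and the review policy are illustrative assumptions, not the study's actual pipeline.

# Illustrative sketch (not the study's pipeline): align syllables from two
# parallel renderings and flag inconsistent mappings for manual review.
# Assumes one verse per line, pre-tokenized into space-separated syllables;
# the file names below are hypothetical.
roman = File.readlines("mien_roman.txt", chomp: true)
thai  = File.readlines("mien_thai.txt",  chomp: true)

renderings = Hash.new { |h, k| h[k] = Hash.new(0) }

roman.zip(thai).each do |r_line, t_line|
  next if t_line.nil?                         # files assumed equal in length
  r_syls = r_line.split
  t_syls = t_line.split
  next unless r_syls.length == t_syls.length  # skip verses that do not align
  r_syls.zip(t_syls).each { |r, t| renderings[r][t] += 1 }
end

# Syllables mapped to more than one rendering are likely noise (typos,
# regional variants, or misalignments) and are set aside for review.
suspect = renderings.select { |_, alts| alts.size > 1 }
puts "#{suspect.size} of #{renderings.size} syllable types need review"

A filter of this kind turns raw parallel text into the consistent syllable pairs that machine learning techniques such as ID3 and backpropagation require as training data.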
Support for Unicode encodings
This project uses Roman, Thai and Lao scripts to encode Iu Mien words. Its multiscript
nature makes it vulnerable to undocumented features and bugs in both application
software and operating systems, especially since the source files were encoded in
legacy 8-bit proprietary codepages. Editing software like
Microsoft Word is designed to catch common English, Thai or Lao misspellings and
non-standard characters. However, correct Iu Mien character sequences in the source
files are often flagged as typos. While the spell checker can be turned off, there are
some character sequences that cause Microsoft Office 2007 products to enter a mode
of operation that prevents entry of additional characters until one of the preceding
characters is removed.
However, this is not the only problem requiring attention. At this time, not all
programming languages and application software provide full support for Unicode
characters and extended ASCII character sets that use the full 8 bits of a byte (8-bit
ASCII). The ability to handle Thai and Lao characters as distinct entities is essential
to this project and can be demonstrated with a simple, 3-character regular expression
(regex) as shown in Code Frag. 1.1.
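Code Frag. 1.1 itself is not reproduced in this excerpt, but the kind of test it describes can be sketched in Ruby, whose regex engine exposes Unicode script properties; the sample strings below are illustrative assumptions.

text = "mien \u0E44\u0E17\u0E22 \u0EA5\u0EB2\u0EA7"  # Roman, Thai and Lao samples

# \p{Thai} and \p{Lao} match characters by Unicode script, a distinction
# legacy 8-bit codepages cannot make because the Thai and Lao codepages
# assign overlapping byte values.
puts "contains Thai characters" if text =~ /\p{Thai}/
puts "contains Lao characters"  if text =~ /\p{Lao}/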
Online service
This study aims to deliver the transcription service online as a Ruby on Rails
application running on Heroku, a cloud application platform for Ruby founded in
2007 and built upon the Amazon Elastic Compute Cloud (Amazon EC2). The system
was set up so that Ruby on Rails applications could be designed and developed
locally; once written and tested, they could be deployed to the cloud using Git
version control commands. As a cloud-based solution, the platform's system monitor
provides practical tools for measuring performance and the use of computing
resources, and it leaves room for handling bottlenecks and future expansion if the
service becomes popular.
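As a sketch of how such a service might be wired up, the fragment below shows a minimal Rails controller; the controller, route and Transcriber class are hypothetical stand-ins, since the study's actual application structure is not shown in this excerpt.

# Hypothetical controller sketch; all names here are illustrative only.
class TranscriptionsController < ApplicationController
  # POST /transcriptions -- a visitor submits text in one writing system
  def create
    source = params[:text].to_s
    # Transcriber is assumed to wrap the transcription rules captured offline.
    @results = Transcriber.new(source).to_all_scripts
    render :show  # webpage with the text rendered in the other writing systems
  end
end

Once an application of this shape passes local testing, deployment amounts to committing the code and pushing the repository to Heroku with ordinary Git commands (at the time of this study, git push heroku master).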