11-05-2012, 11:30 AM
Breaking Audio CAPTCHAs
Introduction
CAPTCHAs [1] are automated tests designed to tell computers and humans apart by
presenting users with a problem that humans can solve but current computer programs
cannot. Because CAPTCHAs can distinguish between humans and computers with high
probability, they are used for many different security applications: they prevent bots from
voting continuously in online polls, automatically registering for millions of spam email
accounts, automatically purchasing tickets to buy out an event, etc. Once a CAPTCHA is
broken (i.e., computer programs can successfully pass the test), bots can impersonate
humans and gain access to services that they should not. Therefore, it is important for
CAPTCHAs to be secure.
To pass the typical visual CAPTCHA, a user must correctly type the characters displayed in
an image of distorted text. Many visual CAPTCHAs have been broken with machine
learning techniques [2]-[3], though some remain secure against such attacks. Because
visually impaired users who surf the Web using screen-reading programs cannot see this type
of CAPTCHA, audio CAPTCHAs were created. Typical audio CAPTCHAs consist of one
or several speakers saying letters or digits at randomly spaced intervals. A user must
correctly identify the digits or characters spoken in the audio file to pass the CAPTCHA. To
make this test difficult for current computer systems, specifically automatic speech
recognition (ASR) programs, background noise is injected into the audio files.
Since no official evaluation of existing audio CAPTCHAs has been reported, we tested the
security of audio CAPTCHAs used by many popular Web sites by running machine learning
experiments designed to break them. In the next section, we provide an overview of the
literature related to our project. Section 3 describes our methods for creating training data,
and section 4 describes how we create classifiers that can recognize letters, digits, and noise.
In section 5, we discuss how we evaluated our methods on widely used audio CAPTCHAs
and we give our results. In particular, we show that the audio CAPTCHAs used by sites
such as Google and Digg are susceptible to machine learning attacks. Section 6 proposes
the design of a new, more secure audio CAPTCHA based on our findings.
Literature review
To break the audio CAPTCHAs, we derive features from the CAPTCHA audio and use
several machine learning techniques to perform ASR on segments of the CAPTCHA. There
are many popular techniques for extracting features from speech. The three techniques we use
are mel-frequency cepstral coefficients (MFCC), perceptual linear prediction (PLP), and
relative spectral transform-PLP (RASTA-PLP). MFCC is one of the most widely used
speech feature representations. Like a fast Fourier transform (FFT), MFCC transforms an
audio file into frequency bands, but (unlike FFT) MFCC uses mel-frequency bands, which
are better for approximating the range of frequencies humans hear. PLP was designed to
extract speaker-independent features from speech [4]. Therefore, by using PLP and a variant
such as RASTA-PLP, we were able to train our classifiers to recognize letters and digits
independently of who spoke them. Since many different people recorded the digits used in
one of the types of audio CAPTCHAs we tested, PLP and RASTA-PLP were needed to
extract the features that were most useful for solving them.
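As a concrete illustration of the mel-frequency idea behind MFCC, the standard mapping between Hz and mels (2595 · log10(1 + f/700)) can be sketched in a few lines of Python. The formula is the conventional one; the helper names and the band-edge routine are our own illustration, not code from the system described above:

```python
import math

def hz_to_mel(f_hz):
    """Map a frequency in Hz to the mel scale (O'Shaughnessy's formula)."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)

def mel_to_hz(m):
    """Inverse mapping: a mel value back to Hz."""
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_band_edges(f_min, f_max, n_bands):
    """Band edges equally spaced on the mel scale, as used when
    constructing the triangular filterbank for MFCC features."""
    lo, hi = hz_to_mel(f_min), hz_to_mel(f_max)
    step = (hi - lo) / (n_bands + 1)
    return [mel_to_hz(lo + i * step) for i in range(n_bands + 2)]
```

Because the mel scale is roughly logarithmic above 1 kHz, equally spaced mel bands place more filters at low frequencies, matching how humans resolve pitch.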
In [4]-[5], the authors conducted experiments on recognizing isolated digits in the presence
of noise using both PLP and RASTA-PLP. However, the noise used consisted of telephone
or microphone static caused by recording in different locations. The audio CAPTCHAs we
use contain this type of noise, as well as added vocal noise and/or music, which is supposed
to make the automated recognition process much harder.
The authors of [3] emphasize how many visual CAPTCHAs can be broken by successfully
splitting the task into two smaller tasks: segmentation and recognition. We follow a similar
approach in that we first automatically split the audio into segments, and then we classify
these segments as noise or words.
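A minimal sketch of such a segmentation stage, assuming a simple short-time-energy threshold (the real system would likely add smoothing and overlapping windows; the function and parameter names here are illustrative, not taken from the paper):

```python
def segment_by_energy(samples, frame_len=160, threshold=0.01, min_frames=3):
    """Split a 1-D list of samples into (start, end) index ranges whose
    short-time energy exceeds `threshold`; everything else is treated as
    noise or silence. Runs of fewer than `min_frames` loud frames are
    discarded as spurious."""
    segments = []
    start = None
    n_frames = len(samples) // frame_len
    for i in range(n_frames):
        frame = samples[i * frame_len:(i + 1) * frame_len]
        energy = sum(x * x for x in frame) / frame_len
        if energy > threshold:
            if start is None:
                start = i  # a candidate segment begins here
        else:
            if start is not None and i - start >= min_frames:
                segments.append((start * frame_len, i * frame_len))
            start = None
    if start is not None and n_frames - start >= min_frames:
        segments.append((start * frame_len, n_frames * frame_len))
    return segments
```

Each returned range would then be handed to the noise-vs-word classifier described in the next section.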
In early March 2008, concurrent to our work, the blog of Wintercore Labs [6] claimed to
have successfully broken the Google audio CAPTCHA. After reading their Web article and
viewing the video of how they solve the CAPTCHAs, we are unconvinced that the process
is entirely automatic, and it is unclear what their exact pass rate is. Because we are unable to
find any formal technical analysis of this program, we can neither be sure of its accuracy nor
the extent of its automation.
Creation of training data
Since automated programs can attempt a CAPTCHA repeatedly, a CAPTCHA is effectively
broken once a program can pass it a non-trivial fraction of the time; e.g., a 5% pass rate
is enough.
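The arithmetic behind this criterion is simple: if each attempt passes independently with probability p, then the chance that at least one of n attempts succeeds is 1 - (1 - p)^n. A short sketch (the helper names are ours):

```python
def success_prob(p, n):
    """Probability that at least one of n independent attempts passes,
    given a per-attempt pass rate p."""
    return 1.0 - (1.0 - p) ** n

def attempts_for(p, target):
    """Smallest number of attempts giving success probability >= target."""
    n = 0
    prob = 0.0
    while prob < target:
        n += 1
        prob = success_prob(p, n)
    return n
```

With a 5% per-attempt pass rate, roughly 14 attempts already give a bot a better-than-even chance of getting through, which is why even a modest pass rate breaks a CAPTCHA in practice.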
Our approach to breaking the audio CAPTCHAs began by first splitting the audio files into
segments of noise or words: for our experiments, the words were spoken letters or digits. We
used manual transcriptions of the audio CAPTCHAs to get information regarding the
location of each spoken word within the audio file. We were able to label our segments
accurately by using this information.
We gathered 1,000 audio CAPTCHAs from each of the following Web sites: google.com,
digg.com, and an older version of the audio CAPTCHA on recaptcha.net. Each of the
CAPTCHAs was annotated with the information regarding letter/digit locations provided by
the manual transcriptions. For each type of CAPTCHA, we randomly selected 900 samples
for training and used the remaining 100 for testing.
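The random 900/100 split can be reproduced with a few lines of Python; this is a sketch, and the seed and function name are our own choices rather than details from the experiments:

```python
import random

def train_test_split(samples, n_train=900, seed=0):
    """Randomly select n_train samples for training and return the rest
    as the test set (mirroring the 900/100 split described above)."""
    rng = random.Random(seed)  # fixed seed so the split is repeatable
    shuffled = samples[:]
    rng.shuffle(shuffled)
    return shuffled[:n_train], shuffled[n_train:]
```

Fixing the seed keeps the split reproducible across runs, so classifier results on the held-out 100 samples are comparable between experiments.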