24-11-2012, 01:16 PM
An Introduction to Text Mining CAS 2009 RPM Seminar
Objectives
• Present a new data mining technology
• Show how the technology uses a
combination of
• String processing functions
• Natural language processing
• Common multivariate procedures available
in most statistical software
• Discuss practical issues for
implementing the methods
• Discuss software for text mining
Parsing Text
Separate words from spaces and
punctuation
Clean up
Remove redundant words
Remove words with no content
The cleaned-up list of words is referred to
as tokens
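The parsing steps above can be sketched in a few lines of Python. This is only a minimal illustration, not the method used in the seminar: the stop-word list is a hypothetical placeholder, the function names are made up for this example, and "redundant words" is assumed here to mean repeated words.

```python
import re

# Hypothetical stop-word list, for illustration only.
STOP_WORDS = {"a", "the", "to", "from", "and", "of"}

def tokenize(text):
    """Separate words from spaces and punctuation, then drop stop words."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    return [w for w in words if w not in STOP_WORDS]

def unique_tokens(tokens):
    """Drop repeated tokens, keeping the first occurrence of each
    (assumes 'redundant' means 'repeated')."""
    seen = set()
    result = []
    for tok in tokens:
        if tok not in seen:
            seen.add(tok)
            result.append(tok)
    return result

tokens = tokenize("The claimant fell from a ladder, and the ladder broke.")
distinct = unique_tokens(tokens)
```

Here tokens keeps the repeated word "ladder", which matters if word frequencies are needed later, while distinct holds each remaining word once.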
Term Document Matrix/Index
Uses a frequency measure for each word
instead of an on-off binary indicator
“The Index representation does not do justice
to the complexity of human language but is
dictated by the practical difficulty of storing
more information objects”
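As a rough sketch of the difference between a frequency measure and an on-off indicator, the following Python builds a small term-document matrix from token lists. The function names and the two sample documents are hypothetical, and real text mining software typically stores such a matrix in sparse form.

```python
from collections import Counter

def term_document_matrix(docs):
    """docs is a list of token lists. Returns the sorted vocabulary and
    one row of term counts per document."""
    vocab = sorted({tok for doc in docs for tok in doc})
    rows = []
    for doc in docs:
        counts = Counter(doc)
        rows.append([counts[term] for term in vocab])
    return vocab, rows

def to_binary(rows):
    """Collapse counts to an on-off indicator (1 if the term occurs)."""
    return [[1 if count > 0 else 0 for count in row] for row in rows]

# Made-up example documents, already tokenized.
docs = [["ladder", "fell", "ladder"], ["car", "fell"]]
vocab, tdm = term_document_matrix(docs)
binary = to_binary(tdm)
```

The count version records that "ladder" occurs twice in the first document; the binary version only records that it occurs.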
Natural Language Processing
Draws on many disciplines
Artificial Intelligence
Linguistics
Statistics
Speech Recognition
Includes lexical analysis, multiword phrase
groupings, sense disambiguation, part of
speech tagging
Arguments against: it is error-prone, and its
output contains too much detail and noise
Consequences of Zipf's Law
There are a few very frequent tokens or words
that add little information
Known as stop words
Examples: a, the, to, from
Usually
Small number of very common words (i.e., stop
words)
Medium number of medium frequency words
Large number of infrequent words
The medium frequency words are the most useful
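One simple way to see the three groups is to count tokens and split them by frequency. The sketch below is only an illustration: the function name, the cutoff values, and the sample tokens are all assumptions, and suitable cutoffs depend on the collection being analyzed.

```python
from collections import Counter

def frequency_bands(tokens, high_cutoff, low_cutoff):
    """Group distinct tokens into high, medium and low frequency bands.
    Both cutoffs are hypothetical tuning values chosen by the analyst."""
    counts = Counter(tokens)
    bands = {"high": [], "medium": [], "low": []}
    for term, n in counts.most_common():
        if n >= high_cutoff:
            bands["high"].append(term)
        elif n <= low_cutoff:
            bands["low"].append(term)
        else:
            bands["medium"].append(term)
    return bands

# Made-up token list: one very common word, two medium, three rare.
tokens = (["the"] * 6 + ["claim"] * 3 + ["injury"] * 3
          + ["ladder", "scaffold", "hydrant"])
bands = frequency_bands(tokens, high_cutoff=5, low_cutoff=1)
```

With these cutoffs, "the" lands in the high band, "claim" and "injury" in the medium band, and the three words that occur once in the low band.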