26-02-2013, 03:51 PM
An Overview of Text Mining
An Overview.ppt (Size: 130 KB / Downloads: 116)
What Is Text Mining?
“The objective of Text Mining is to exploit information contained in textual documents in various ways, including …discovery of patterns and trends in data, associations among entities, predictive rules, etc.” (Grobelnik et al., 2001)
“Another way to view text data mining is as a process of exploratory data analysis that leads to heretofore unknown information, or to answers for questions for which the answer is not currently known.” (Hearst, 1999)
Challenges in Text Mining
Data collection is “free text”
Data is not well-organized
Semi-structured or unstructured
Natural language text contains ambiguities on many levels
Lexical, syntactic, semantic, and pragmatic
Learning techniques for processing text typically need annotated training examples
Consider bootstrapping techniques
Text Mining Tasks
Exploratory Data Analysis
Using text to form hypotheses about diseases (Swanson and Smalheiser, 1997).
Information Extraction
(Semi)automatically create (domain specific) knowledge bases, and then use standard data-mining techniques.
Bootstrapping methods (Riloff and Jones, 1999).
Text Classification
A useful intermediate step for information extraction
Bootstrapping method using EM (Nigam et al., 2000).
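The bootstrapping idea cited above (Riloff and Jones, 1999) alternates between scoring extraction patterns against a seed lexicon and adding what the best pattern extracts back into the lexicon. The toy sketch below illustrates that loop under loudly invented assumptions: the corpus, the seed word, and the regex patterns are all made up for illustration and are not from the slides or the paper.

```python
import re

# Invented toy corpus (not from the original slides or paper).
corpus = [
    "The senator traveled to France last week.",
    "She traveled to Japan for the summit.",
    "He was born in Japan.",
    "Troops were deployed in France.",
]

# Seed semantic lexicon for a LOCATION-like category (assumed seed).
lexicon = {"France"}

# Candidate extraction patterns: regexes with one capture group.
patterns = [r"traveled to (\w+)", r"born in (\w+)", r"deployed in (\w+)"]

for iteration in range(2):
    # Score each pattern by how many known lexicon entries it extracts.
    best, best_score, best_hits = None, 0, set()
    for p in patterns:
        hits = {m for s in corpus for m in re.findall(p, s)}
        score = len(hits & lexicon)
        if score > best_score:
            best, best_score, best_hits = p, score, hits
    if best is None:
        break
    # Add everything the winning pattern extracts to the lexicon,
    # then retire that pattern so a different one can win next round.
    lexicon |= best_hits
    patterns.remove(best)

print(sorted(lexicon))  # "Japan" is learned via the shared "traveled to <x>" context
```

The key point the sketch shows: "Japan" enters the lexicon only because it shares the "traveled to &lt;x&gt;" context with the seed "France" — no labeled examples are needed, which is why bootstrapping is attractive when annotation is expensive.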
Challenges in Data Exploration
How can valid inference links be found without succumbing to combinatorial explosion of possibilities?
Need better models of lexical relationships and semantic constraints (very hard)
How should the information be presented to the human experts to facilitate their exploration?
Information Extraction (IE)
Extract domain-specific information from natural language text
Need a dictionary of extraction patterns (e.g., “traveled to <x>” or “presidents of <x>”)
Constructed by hand
Automatically learned from hand-annotated training data
Need a semantic lexicon (dictionary of words with semantic category labels)
Typically constructed by hand
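Putting the two hand-built resources together, IE amounts to applying each extraction pattern and keeping fillers the semantic lexicon licenses. A minimal sketch, reusing the slide's example patterns; the slot names, lexicon entries, and category labels are assumptions added for illustration.

```python
import re

# Hand-constructed extraction patterns (examples from the slide),
# each mapped to an illustrative slot name.
patterns = {
    r"traveled to (\w+)": "destination",
    r"presidents? of (\w+)": "country",
}

# Small hand-built semantic lexicon: word -> semantic category (invented entries).
lexicon = {"France": "LOCATION", "Chile": "LOCATION", "Tuesday": "TIME"}

def extract(text):
    """Apply every pattern; keep only fillers the lexicon labels as LOCATION."""
    results = []
    for pat, slot in patterns.items():
        for filler in re.findall(pat, text):
            if lexicon.get(filler) == "LOCATION":
                results.append((slot, filler))
    return results

print(extract("The delegation traveled to France on Tuesday."))
# [('destination', 'France')]
```

Note how the lexicon acts as a filter: "Tuesday" would never be extracted as a destination even if a pattern happened to capture it, which is exactly why both resources are needed.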
Challenges in IE
Automatic learning methods are typically supervised (i.e., need labeled examples)
But annotating training data is a time-consuming and expensive task.
Can we develop better unsupervised algorithms?
Can we make better use of a small set of labeled examples?
Parameter Estimation with Unlabeled Documents
EM: for “incomplete data” problems
Maximize prob. of model generating observed data
Build initial classifier (initialize the parameters to “reasonable” starting values)
Repeat until convergence
E-Step: Use the current classifier parameters, θt, to estimate P(c | d; θt) for all documents d in the unlabeled set Du
M-Step: Re-estimate the classifier parameters, θt+1, using the expected counts from the E-Step