30-11-2012, 04:37 PM
Text Mining
What is Text Mining?
•The discovery by computer of new, previously unknown information, by automatically extracting it from a usually large number of different unstructured textual resources.
•What does previously unknown mean?
–Implies discovering genuinely new information.
–Hearst’s analogy: discovering new knowledge versus merely finding patterns is like the difference between a detective following clues to find the criminal and analysts looking at crime statistics to assess overall trends in car theft.
•What about unstructured?
–Free, naturally occurring text.
–As opposed to HTML, XML, …
Document Clustering
•Large volume of textual data
–Billions of documents must be handled in an efficient manner.
•No clear picture of which documents suit the application.
•Solution: use Document Clustering (Unsupervised Learning).
•The most popular Document Clustering methods are:
–K-Means clustering.
–Agglomerative hierarchical clustering.
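As a rough illustration of the first method, here is a minimal K-Means-style sketch in plain Python. It makes several assumptions that are not in the slides: documents are represented as bag-of-words term counts, similarity is measured with cosine similarity rather than Euclidean distance, and the initial centroids are picked by index (the hypothetical `seeds` parameter) so that the result is repeatable. Practical implementations usually choose initial centroids randomly and use weighted terms.

```python
import math
from collections import Counter

def vectorize(text):
    """Bag-of-words term-frequency vector (assumed representation)."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse term vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def centroid(vectors):
    """Mean of a non-empty list of term vectors."""
    total = Counter()
    for v in vectors:
        total.update(v)
    return Counter({t: c / len(vectors) for t, c in total.items()})

def kmeans(docs, seeds, max_iter=20):
    """Cluster docs; seeds are indices of the documents used as initial centroids."""
    vectors = [vectorize(d) for d in docs]
    centroids = [vectors[i] for i in seeds]
    labels = None
    for _ in range(max_iter):
        # Assignment step: each document joins its most similar centroid.
        new_labels = [
            max(range(len(centroids)), key=lambda k: cosine(v, centroids[k]))
            for v in vectors
        ]
        if new_labels == labels:  # assignments are stable, so stop
            break
        labels = new_labels
        # Update step: recompute each centroid from its current members.
        for k in range(len(centroids)):
            members = [v for v, lab in zip(vectors, labels) if lab == k]
            if members:
                centroids[k] = centroid(members)
    return labels

# Toy corpus, invented for illustration only.
docs = [
    "the cat sat on the mat",
    "a cat chased the mouse",
    "stocks fell as markets closed",
    "markets rallied and stocks rose",
]
labels = kmeans(docs, seeds=[0, 2])
print(labels)
```

On this toy corpus the two animal sentences and the two market sentences end up in separate clusters. Agglomerative hierarchical clustering would instead start with one cluster per document and repeatedly merge the two most similar clusters.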
Text Characteristics
•Several input modes
–Text is intended for different consumers, so it comes in different languages (human consumers) and different formats (automated consumers).
•Dependency
–Words and phrases create context for each other.
Text Processing again
•Semantic Structures:
–Two methods:
•Full parsing: Produces a parse tree for a sentence.
•Chunking with partial parsing: Produces syntactic constructs like Noun Phrases and Verb Groups for a sentence.
–Which is better?
•Producing a full parse tree often fails due to grammatical inaccuracies, novel words, bad tokenization, wrong sentence splits, errors in POS tagging, …
•Hence, chunking with partial parsing is more commonly used.
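To make the chunking idea concrete, here is a minimal sketch that groups already POS-tagged tokens into Noun Phrases (NP) and Verb Groups (VG). The tag sets, the `chunk` function, and the hand-tagged sentence are assumptions made for illustration (the tags follow Penn Treebank naming); a real chunker would use richer patterns or a trained model, and this one naively merges any adjacent tokens of the same kind.

```python
# Assumed tag sets, using Penn Treebank style tag names.
NOUN_TAGS = {"DT", "JJ", "NN", "NNS", "NNP"}
VERB_TAGS = {"MD", "VB", "VBD", "VBG", "VBN", "VBP", "VBZ"}

def chunk(tagged):
    """Group consecutive (word, tag) pairs into NP / VG chunks.

    Tokens that fit neither set are emitted on their own with label "O".
    """
    chunks = []
    for word, tag in tagged:
        if tag in NOUN_TAGS:
            label = "NP"
        elif tag in VERB_TAGS:
            label = "VG"
        else:
            label = "O"
        # Extend the previous chunk only if it has the same non-"O" label.
        if chunks and label != "O" and chunks[-1][0] == label:
            chunks[-1][1].append(word)
        else:
            chunks.append((label, [word]))
    return [(label, " ".join(words)) for label, words in chunks]

# Hand-tagged example sentence (POS tagging is assumed to be done already).
tagged = [
    ("The", "DT"), ("quick", "JJ"), ("fox", "NN"),
    ("has", "VBZ"), ("been", "VBN"), ("chasing", "VBG"),
    ("the", "DT"), ("rabbit", "NN"), (".", "."),
]
result = chunk(tagged)
for label, text in result:
    print(label, text)
```

Because no full tree is built, a tagging error or an unknown word only affects the chunk it falls in rather than causing the whole sentence to fail, which is the robustness argument made above.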
Data Mining
•At this point the Text Mining process merges with the traditional Data Mining process.
•Classic Data Mining techniques are used on the structured database that resulted from the previous stages.
•This is a purely application-dependent stage.
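As one possible illustration of this stage, the sketch below counts which extracted terms frequently occur together, a simple form of frequent-pair mining. Everything here is an assumption: the `frequent_pairs` function, the `min_support` threshold, and the records themselves are hypothetical, standing in for whatever structured output the earlier stages produce. The appropriate technique depends on the application.

```python
from collections import Counter
from itertools import combinations

def frequent_pairs(records, min_support):
    """Return term pairs that co-occur in at least min_support records."""
    counts = Counter()
    for terms in records:
        # Sort so each pair has one canonical ordering.
        for pair in combinations(sorted(set(terms)), 2):
            counts[pair] += 1
    return {pair: n for pair, n in counts.items() if n >= min_support}

# Hypothetical structured records: one set of extracted terms per document.
records = [
    {"aspirin", "headache", "dosage"},
    {"aspirin", "headache"},
    {"ibuprofen", "headache"},
    {"aspirin", "dosage"},
]
pairs = frequent_pairs(records, min_support=2)
print(pairs)
```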