26-02-2013, 03:51 PM
An Overview of Text Mining
An Overview.ppt (Size: 130 KB / Downloads: 116)
What Is Text Mining?
“The objective of Text Mining is to exploit information contained in textual documents in various ways, including …discovery of patterns and trends in data, associations among entities, predictive rules, etc.” (Grobelnik et al., 2001)
“Another way to view text data mining is as a process of exploratory data analysis that leads to heretofore unknown information, or to answers for questions for which the answer is not currently known.” (Hearst, 1999)
Challenges in Text Mining
Data collection is “free text”
Data is not well-organized
Semi-structured or unstructured
Natural language text contains ambiguities on many levels
Lexical, syntactic, semantic, and pragmatic
Learning techniques for processing text typically need annotated training examples
Consider bootstrapping techniques
Text Mining Tasks
Exploratory Data Analysis
Using text to form hypotheses about diseases (Swanson and Smalheiser, 1997).
Information Extraction
(Semi)automatically create (domain specific) knowledge bases, and then use standard data-mining techniques.
Bootstrapping methods (Riloff and Jones, 1999).
Text Classification
A useful intermediate step for information extraction
Bootstrapping method using EM (Nigam et al., 2000).
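The bootstrapping idea cited above (Riloff and Jones, 1999) alternates between scoring extraction patterns against a seed lexicon and adding what the best pattern extracts back into the lexicon. The toy sketch below illustrates that loop under loudly invented assumptions: the corpus, the seed word, and the regex patterns are all made up for illustration and are not from the slides or the paper.

```python
import re

# Invented toy corpus (not from the original slides or paper).
corpus = [
    "The senator traveled to France last week.",
    "She traveled to Japan for the summit.",
    "He was born in Japan.",
    "Troops were deployed in France.",
]

# Seed semantic lexicon for a LOCATION-like category (assumed seed).
lexicon = {"France"}

# Candidate extraction patterns: regexes with one capture group.
patterns = [r"traveled to (\w+)", r"born in (\w+)", r"deployed in (\w+)"]

for iteration in range(2):
    # Score each pattern by how many known lexicon entries it extracts.
    best, best_score, best_hits = None, 0, set()
    for p in patterns:
        hits = {m for s in corpus for m in re.findall(p, s)}
        score = len(hits & lexicon)
        if score > best_score:
            best, best_score, best_hits = p, score, hits
    if best is None:
        break
    # Add everything the winning pattern extracts to the lexicon,
    # then retire that pattern so a different one can win next round.
    lexicon |= best_hits
    patterns.remove(best)

print(sorted(lexicon))  # "Japan" is learned via the shared "traveled to <x>" context
```

The key point the sketch shows: "Japan" enters the lexicon only because it shares the "traveled to &lt;x&gt;" context with the seed "France" — no labeled examples are needed, which is why bootstrapping is attractive when annotation is expensive.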
Challenges in Data Exploration
How can valid inference links be found without succumbing to combinatorial explosion of possibilities?
Need better models of lexical relationships and semantic constraints (very hard)
How should the information be presented to the human experts to facilitate their exploration?
Information Extraction (IE)
Extract domain-specific information from natural language text
Need a dictionary of extraction patterns (e.g., “traveled to <x>” or “presidents of <x>”)
Constructed by hand
Automatically learned from hand-annotated training data
Need a semantic lexicon (dictionary of words with semantic category labels)
Typically constructed by hand
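Putting the two hand-built resources together, IE amounts to applying each extraction pattern and keeping fillers the semantic lexicon licenses. A minimal sketch, reusing the slide's example patterns; the slot names, lexicon entries, and category labels are assumptions added for illustration.

```python
import re

# Hand-constructed extraction patterns (examples from the slide),
# each mapped to an illustrative slot name.
patterns = {
    r"traveled to (\w+)": "destination",
    r"presidents? of (\w+)": "country",
}

# Small hand-built semantic lexicon: word -> semantic category (invented entries).
lexicon = {"France": "LOCATION", "Chile": "LOCATION", "Tuesday": "TIME"}

def extract(text):
    """Apply every pattern; keep only fillers the lexicon labels as LOCATION."""
    results = []
    for pat, slot in patterns.items():
        for filler in re.findall(pat, text):
            if lexicon.get(filler) == "LOCATION":
                results.append((slot, filler))
    return results

print(extract("The delegation traveled to France on Tuesday."))
# [('destination', 'France')]
```

Note how the lexicon acts as a filter: "Tuesday" would never be extracted as a destination even if a pattern happened to capture it, which is exactly why both resources are needed.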
Challenges in IE
Automatic learning methods are typically supervised (i.e., need labeled examples)
But annotating training data is a time-consuming and expensive task.
Can we develop better unsupervised algorithms?
Can we make better use of a small set of labeled examples?
Parameter Estimation with Unlabeled Documents
EM: for “incomplete data” problems
Maximize prob. of model generating observed data
Build initial classifier (initialize the parameters to “reasonable” starting values)
Repeat until convergence
E-Step: Use the current classifier parameters, θt, to estimate P(c | d; θt) for all documents d in the unlabeled set Du
M-Step: Re-estimate the classifier parameters, θt+1, using the expected counts from the E-Step