
Text Mining

The explosion of online information has given rise to many query-based search engines and manually constructed topic hierarchies. But at the current growth rate in the amount of information, query results grow incomprehensibly large, and manual classification into topic hierarchies creates an immense bottleneck. Search engines return millions of relevant sites, but sites with similar content are not grouped. Cluster search groups similar sites, giving users a greater chance of finding more sites relevant to their search.

In this dissertation, we address these problems with a system for topical information space navigation that combines the query-based and taxonomic approaches. Our system, Racimo, enables the creation of dynamic hierarchical document clusters based on the full text of articles. A major challenge in document clustering is the extremely high dimensionality: the vocabulary of a document set can easily run to thousands of words, yet each document often contains only a small fraction of the words in that vocabulary. These features require special handling. Another requirement is hierarchical clustering, in which clustered documents can be browsed by increasing specificity of topic. In this system, we propose to use the notion of frequent itemsets, which comes from association rule mining, for document clustering. The intuition behind our clustering criterion is that each cluster is identified by some common words, called frequent itemsets, shared by the documents in the cluster. Frequent itemsets are also used to produce a hierarchical topic tree for the clusters. By focusing on frequent items, the dimensionality of the document set is drastically reduced. We show that this method outperforms the best existing methods in terms of both clustering accuracy and scalability.
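
The clustering idea described above can be illustrated with a small Python sketch. This is not the Racimo implementation, only a toy version under stated assumptions: the function names, the min_count threshold, the brute-force itemset search (limited to itemsets of at most max_size words) and the rule "assign each document to the largest frequent itemset it contains" are all choices made for illustration.

```python
from collections import defaultdict
from itertools import combinations

def frequent_itemsets(docs, min_count, max_size=2):
    """Return {itemset: ids of covering docs} for word sets found in >= min_count docs.

    Brute-force enumeration over the vocabulary; only suitable for toy inputs.
    """
    word_sets = [set(d.lower().split()) for d in docs]
    vocab = sorted(set().union(*word_sets))
    found = {}
    for size in range(1, max_size + 1):
        for items in combinations(vocab, size):
            cover = {i for i, ws in enumerate(word_sets) if ws.issuperset(items)}
            if len(cover) >= min_count:
                found[frozenset(items)] = cover
    return found

def cluster_by_itemsets(docs, min_count=2, max_size=2):
    """Assign each document to the largest frequent itemset it contains.

    Documents containing no frequent itemset go to the cluster labeled frozenset().
    The assignment rule is an assumption of this sketch.
    """
    itemsets = frequent_itemsets(docs, min_count, max_size)
    clusters = defaultdict(list)
    for doc_id in range(len(docs)):
        candidates = [s for s, cover in itemsets.items() if doc_id in cover]
        if not candidates:
            clusters[frozenset()].append(doc_id)
            continue
        # Prefer longer itemsets (more specific topics); break ties deterministically.
        best = max(candidates, key=lambda s: (len(s), -len(itemsets[s]), sorted(s)))
        clusters[best].append(doc_id)
    return dict(clusters)

if __name__ == "__main__":
    toy_docs = [
        "stock market trading",
        "stock market crash",
        "football match score",
        "football match result",
        "weather",
    ]
    for label, members in cluster_by_itemsets(toy_docs).items():
        print(sorted(label), members)
```

Each cluster label is itself a set of words, so a label such as {market, stock} could be placed under a shorter label such as {market} to form a topic tree. That step is not shown here.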

Text Databases


.Consist of large collections of documents from various sources, e.g., articles, books, research papers, and digital libraries.

.Semi-structured data
.A document contains a few structured fields, such as title and authors, and unstructured text components, such as abstract and contents.

.Information retrieval techniques, such as indexing methods, have been developed to handle unstructured documents.


Information Retrieval (IR)


.It is a field that has been developing in parallel with database systems.

.Database systems focused on query and transaction processing on structured data.

.Information retrieval focused on organization and retrieval of information from a large number of text-based documents.


F-score
Recall can be traded off for precision, and vice versa; the F-score combines the two into a single measure.
It is the harmonic mean of precision and recall.
It discourages a system that sacrifices one measure for the other.
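
A short Python sketch of the F-score follows. The slides do not define precision and recall, so the standard definitions are assumed here: precision is the fraction of retrieved documents that are relevant, and recall is the fraction of relevant documents that are retrieved. The function name and the document ids are made up for the example.

```python
def precision_recall_f1(retrieved, relevant):
    """Return (precision, recall, F-score) for two collections of document ids."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    # Harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

if __name__ == "__main__":
    retrieved = ["d1", "d2", "d3", "d4"]
    relevant = ["d1", "d2", "d5", "d6", "d7", "d8"]
    p, r, f = precision_recall_f1(retrieved, relevant)
    print(round(p, 2), round(r, 2), round(f, 2))
```

The harmonic mean is what penalizes a lopsided system: with precision 1.0 and recall 0.1 the arithmetic mean would be 0.55, but the F-score is only about 0.18.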



Document Selection



.A query is used to specify constraints for selecting relevant documents.

.Boolean Model
.A document is represented as a set of keywords, and the user provides a Boolean expression of keywords.
E.g., tea or coffee; database systems but not DB2.
.The retrieval system takes such a Boolean query and returns the documents that satisfy it.
.Works well when the user knows a lot about the document collection.
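
One possible way to answer the two example queries above is with an inverted index and set operations, sketched below in Python. The toy documents, the function name and the whitespace tokenization are assumptions for illustration, and "database systems" is treated as the two keywords "database" and "systems" rather than as a phrase.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each lowercase keyword to the set of ids of the documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index

if __name__ == "__main__":
    toy_docs = {
        1: "green tea brewing guide",
        2: "coffee roasting basics",
        3: "database systems with DB2",
        4: "database systems concepts",
    }
    index = build_inverted_index(toy_docs)

    # "tea or coffee" is a set union.
    print(index["tea"] | index["coffee"])  # {1, 2}

    # "database systems but not DB2" is an intersection followed by a difference.
    print((index["database"] & index["systems"]) - index["db2"])  # {4}
```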



Text mining, sometimes alternately referred to as text data mining, roughly equivalent to text analytics, refers to the process of deriving high-quality information from text.

High-quality information is typically derived by identifying patterns and trends through means such as statistical pattern learning. Text mining usually involves structuring the input text (usually by parsing it, adding some derived linguistic features, removing others, and inserting the result into a database), deriving patterns within the structured data, and finally evaluating and interpreting the output.

'High quality' in text mining usually refers to some combination of relevance, novelty, and interestingness. Typical text mining tasks include text categorization, text clustering, concept/entity extraction, production of granular taxonomies, sentiment analysis, document summarization, and entity relation modeling (i.e., learning relations between named entities).




Text Databases and IR

Text databases (document databases)
.Large collections of documents from various sources: news articles, research papers, books, digital libraries, library databases, e-mail messages, Web pages, etc.
.Data stored is usually semi-structured
.Traditional information retrieval techniques become inadequate for the increasingly vast amounts of text data
Information retrieval
.A field developed in parallel with database systems
.Information is organized into (a large number of) documents
.Information retrieval problem: locating relevant documents based on user input, such as keywords or example documents

Information Retrieval

Typical IR systems
.Online library catalogs
.Online document management systems
Information retrieval vs. database systems
.Some DB problems are not present in IR, e.g., update, transaction management, complex objects
.Some IR problems are not addressed well in DBMS, e.g., unstructured documents, approximate search using keywords and relevance

Information Retrieval Techniques

Basic Concepts
.A document can be described by a set of representative keywords called index terms.
.Different index terms have varying relevance when used to describe document contents.
.This effect is captured by assigning a numerical weight to each index term of a document (e.g., frequency, tf-idf).
DBMS Analogy
.Index terms correspond to attributes
.Weights correspond to attribute values
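
As an illustration of index term weights, here is a minimal tf-idf sketch in Python. tf-idf has many variants; this one assumes tf is the raw count of a term in a document and idf is log(N / df), where N is the number of documents and df is the number of documents containing the term. The function name and the toy documents are hypothetical.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {term: weight} dict per document, using tf * log(N / df)."""
    tokenized = [d.lower().split() for d in docs]
    n = len(tokenized)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        tf = Counter(tokens)
        weights.append({t: tf[t] * math.log(n / df[t]) for t in tf})
    return weights

if __name__ == "__main__":
    toy_docs = ["text mining text", "data mining", "text retrieval"]
    for doc, w in zip(toy_docs, tf_idf(toy_docs)):
        print(doc, {t: round(v, 3) for t, v in w.items()})
```

In this variant a term that occurs in every document gets weight 0, while a term that is frequent in one document and rare elsewhere gets a high weight, which matches the point that index terms differ in how well they describe a document.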

Boolean Model

Index terms are considered to be either present or absent in a document
As a result, the index term weights are all assumed to be binary
A query is composed of index terms linked by three connectives: not, and, or
E.g., car and repair, plane or airplane
The Boolean model predicts that each document is either relevant or non-relevant, based on whether the document matches the query
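
The relevant / non-relevant decision of the Boolean model can be sketched as a small recursive evaluator in Python. The nested-tuple query format, the function name and the toy documents are assumptions made for this example and do not come from the slides.

```python
def matches(query, terms):
    """Evaluate a Boolean query against a document's set of index terms.

    A query is either a term (str) or a tuple such as ("and", q1, q2),
    ("or", q1, q2) or ("not", q).
    """
    if isinstance(query, str):
        return query in terms  # binary weight: the term is present or absent
    op, *args = query
    if op == "not":
        return not matches(args[0], terms)
    if op == "and":
        return all(matches(a, terms) for a in args)
    if op == "or":
        return any(matches(a, terms) for a in args)
    raise ValueError(f"unknown connective: {op!r}")

if __name__ == "__main__":
    toy_docs = {
        "d1": {"car", "repair", "manual"},
        "d2": {"car", "rental"},
        "d3": {"airplane", "engine"},
    }
    for query in [("and", "car", "repair"), ("or", "plane", "airplane")]:
        relevant = [d for d, terms in toy_docs.items() if matches(query, terms)]
        print(query, relevant)
```

Every document gets a yes-or-no answer and there is no ranking among the matching documents, which is the behaviour the model describes.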

Keyword-Based Retrieval

A document is represented by a string, which can be identified by a set of keywords
Queries may use expressions of keywords
.E.g., car and repair shop, tea or coffee, DBMS but not Oracle
.Queries and retrieval should consider synonyms, e.g., repair and maintenance
Major difficulties of the model
.Synonymy: a keyword T does not appear anywhere in the document, even though the document is closely related to T, e.g., data mining
.Polysemy: the same keyword may mean different things in different contexts, e.g., mining
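
A simple way to make keyword retrieval consider synonyms is query expansion, sketched below in Python. The synonym table, the function names and the toy documents are hypothetical. A real system would need a proper thesaurus, and this sketch does nothing about polysemy.

```python
def expand(keyword, synonyms):
    """Return the keyword together with its known synonyms."""
    return {keyword} | synonyms.get(keyword, set())

def search(keywords, docs, synonyms):
    """Return ids of documents that contain, for every query keyword,
    either the keyword itself or one of its synonyms."""
    hits = []
    for doc_id, text in docs.items():
        words = set(text.lower().split())
        if all(words & expand(k, synonyms) for k in keywords):
            hits.append(doc_id)
    return hits

if __name__ == "__main__":
    toy_docs = {
        1: "car repair shop",
        2: "car maintenance tips",
        3: "mining equipment repair",
    }
    toy_synonyms = {"repair": {"maintenance"}}  # hypothetical synonym table

    print(search(["car", "repair"], toy_docs, {}))            # [1]
    print(search(["car", "repair"], toy_docs, toy_synonyms))  # [1, 2]
```

Without the synonym table, document 2 is missed even though it is closely related to the query, which is the synonymy problem described above.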

Types of Text Data Mining

Keyword-based association analysis
Automatic document classification
Similarity detection
.Cluster documents by a common author
.Cluster documents containing information from a common source
Link analysis: unusual correlation between entities
Sequence analysis: predicting a recurring event
Anomaly detection: find information that violates usual patterns
Hypertext analysis
.Patterns in anchors/links
.Anchor text correlations with linked objects