20-09-2014, 11:06 AM
Text Mining
ABSTRACT
The volume of information circulating in a typical enterprise continues to increase.
Knowledge hidden in that information, however, is not fully utilized, because most of the
information is described in textual form (as sentences). A large amount of text information
can be analyzed objectively and efficiently with text mining. The field of text mining has
received a lot of attention due to the ever increasing need for managing the information that
resides in the vast amount of available text documents. Text documents are characterized by
their unstructured nature. Ever increasing sources of such unstructured information include
the World Wide Web, biological databases, news articles, emails, etc.
Text mining is defined as the discovery by computer of new, previously unknown
information, by automatically extracting information from different written resources. A key
element is the linking of the extracted information to form new facts or new
hypotheses to be explored further by more conventional means of experimentation. As the
amount of unstructured data increases, text-mining tools will be increasingly valuable. A
future trend is integration of data mining and text mining into a single system, a combination
known as duo-mining.
INTRODUCTION
In the future, books and magazines will be used only for special purposes because
electronic documents will become the primary means of storing, accessing and sorting written
communication. As many people become overwhelmed with information, it will become physically
impossible for any individual to process all the information available on a particular topic.
Massive amounts of data will reside in cyberspace, generating demand for text mining
technology and solutions.
Text Mining is the discovery by computer of new, previously unknown information,
by automatically extracting information from different written resources. A key element is
the linking of the extracted information to form new facts or new
hypotheses to be explored further by more conventional means of experimentation. Text
mining is different from what we are familiar with in web search. In search, the user is typically
looking for something that is already known and has been written by someone else. The
problem is pushing aside all the material that currently is not relevant to your needs in order
to find the relevant information. In text mining, the goal is to discover unknown information,
something that no one yet knows and so could not yet have written down.
TECHNOLOGY FOUNDATIONS
Although the differences between human and computer languages are vast, there have
been technological advances which have begun to close the gap. The field of natural language
processing has produced technologies that teach computers natural language so that they may
analyze, understand, and even generate text. Some of the technologies that have been
developed and can be used in the text mining process are information extraction, topic
tracking, summarization, categorization, clustering, concept linkage, information
visualization, and question answering. In the following sections we will discuss each of these
technologies and the role that they play in text mining. We will also illustrate the type of
situations where each technology may be useful in order to help readers identify tools of
interest to themselves or their organizations.
INFORMATION EXTRACTION
A starting point for computers to analyze unstructured text is to use information
extraction. Information extraction software identifies key phrases and relationships within
text. It does this by looking for predefined sequences in text, a process called pattern
matching. The software infers the relationships between all the identified people, places, and
time to provide the user with meaningful information. This technology can be very useful
when dealing with large volumes of text. Traditional data mining assumes that the
information to be “mined” is already in the form of a relational database. Unfortunately, for
many applications, electronic information is only available in the form of free natural
language documents rather than structured databases. Since information extraction (IE)
addresses the problem of transforming a corpus of textual documents into a more structured
database, the database constructed by an IE module can be provided to the knowledge
discovery in databases (KDD) module for further mining of
knowledge as illustrated in Figure
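The pattern-matching step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the three patterns and the function name extract_record are hypothetical, and a real IE module would rely on far richer rules or trained models. The sketch shows how free text can be turned into a structured record that a later mining step could load into a table.

```python
import re

# Hypothetical, deliberately simple patterns (assumptions for illustration only).
MONTHS = ("January|February|March|April|May|June|July|August|"
          "September|October|November|December")
PATTERNS = {
    # A title followed by a capitalized surname, e.g. "Dr. Rao".
    "person": re.compile(r"\b(?:Mr|Ms|Mrs|Dr)\. [A-Z][a-z]+\b"),
    # A capitalized word after "in", e.g. "in Mumbai" (captures only the name).
    "place": re.compile(r"\bin ([A-Z][a-z]+)\b"),
    # Day, month name, four-digit year, e.g. "12 March 2014".
    "date": re.compile(r"\b\d{1,2} (?:" + MONTHS + r") \d{4}\b"),
}


def extract_record(sentence):
    """Return one structured record (field -> list of matches) for a sentence."""
    record = {}
    for field, pattern in PATTERNS.items():
        # Use the capture group when the pattern has one, else the whole match.
        record[field] = [m.group(m.lastindex or 0)
                         for m in pattern.finditer(sentence)]
    return record


if __name__ == "__main__":
    print(extract_record("Dr. Rao met Ms. Iyer in Mumbai on 12 March 2014."))
```

Each record has the same fields, so a list of such records already has the shape of a database table.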
TEXT SUMMARIZATION
Text summarization is immensely helpful for trying to figure out whether or not a
lengthy document meets the user’s needs and is worth reading for further information. With
large texts, text summarization software processes and summarizes the document in the time
it would take the user to read the first paragraph. The key to summarization is to reduce the
length and detail of a document while retaining its main points and overall meaning. The
challenge is that, although computers are able to identify people, places, and time, it is still
difficult to teach software to analyze semantics and to interpret meaning.
Generally, when humans summarize text, we read the entire selection to develop a full
understanding, and then write a summary highlighting its main points. Since computers do
not yet have the language capabilities of humans, alternative methods must be considered.
One of the strategies most widely used by text summarization tools, sentence extraction,
extracts important sentences from an article by statistically weighting the sentences. Further
heuristics such as position information are also used for summarization.
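Sentence extraction can be illustrated with a minimal Python sketch. It assumes a simple scoring scheme that the report does not specify: each sentence is weighted by the average corpus frequency of its words, and the first sentence receives a position bonus. The stopword list, the lead_bonus parameter, and the function names are all hypothetical.

```python
import re
from collections import Counter

# Tiny illustrative stopword list (an assumption, not a standard resource).
STOPWORDS = {"the", "a", "an", "of", "to", "and", "is", "in", "it", "that", "for", "are"}


def tokenize(text):
    """Lowercase word tokens with stopwords removed."""
    return [w for w in re.findall(r"[a-z']+", text.lower()) if w not in STOPWORDS]


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def summarize(text, max_sentences=2, lead_bonus=1.5):
    """Keep the highest-weighted sentences, returned in their original order."""
    sentences = split_sentences(text)
    freq = Counter(tokenize(text))
    scored = []
    for index, sentence in enumerate(sentences):
        tokens = tokenize(sentence)
        if not tokens:
            continue
        # Statistical weight: average frequency of the sentence's words.
        score = sum(freq[w] for w in tokens) / len(tokens)
        # Position heuristic: favor the opening sentence.
        if index == 0:
            score *= lead_bonus
        scored.append((score, index))
    keep = sorted(index for _, index in sorted(scored, reverse=True)[:max_sentences])
    return " ".join(sentences[i] for i in keep)


if __name__ == "__main__":
    sample = ("Text mining finds patterns in text. "
              "The weather was pleasant yesterday. "
              "Mining text at scale needs good text tools.")
    print(summarize(sample))
```

The off-topic sentence about the weather scores lowest in the sample because its words occur only once, so it is the one dropped.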
CLUSTERING
Clustering is a technique used to group similar documents, but it differs from
categorization in that documents are clustered on the fly instead of through the use of
predefined topics. Another benefit of clustering is that documents can appear in multiple
subtopics, thus ensuring that a useful document will not be omitted from search results. A
basic clustering algorithm creates a vector of topics for each document and measures the
weights of how well the document fits into each cluster. Clustering technology can be useful
in the organization of management information systems, which may contain thousands of
documents.
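A basic version of this idea can be sketched with K-means over term-frequency vectors. The sketch below is an assumption-laden illustration, not the algorithm of any particular product: it seeds the clusters with the first k documents, uses cosine similarity, and returns, besides a hard label, a vector of weights showing how well each document fits every cluster. All function names are hypothetical.

```python
import math
from collections import Counter


def term_vector(text):
    """Term-frequency vector for one document (whitespace tokens)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm = (math.sqrt(sum(w * w for w in a.values()))
            * math.sqrt(sum(w * w for w in b.values())))
    return dot / norm if norm else 0.0


def mean_vector(vectors):
    """Centroid of a non-empty list of sparse vectors."""
    total = Counter()
    for vec in vectors:
        total.update(vec)
    return {term: count / len(vectors) for term, count in total.items()}


def kmeans(docs, k, iterations=10):
    """Return (labels, weights); weights[i][c] is the fit of doc i to cluster c."""
    if not 1 <= k <= len(docs):
        raise ValueError("k must be between 1 and the number of documents")
    vectors = [term_vector(d) for d in docs]
    # Assumption: seed centroids with the first k documents (deterministic).
    centroids = [dict(v) for v in vectors[:k]]
    labels = [0] * len(vectors)
    for _ in range(iterations):
        # Assignment step: each document joins its most similar centroid.
        labels = [max(range(k), key=lambda c: cosine(v, centroids[c]))
                  for v in vectors]
        # Update step: recompute each centroid from its current members.
        for c in range(k):
            members = [v for v, label in zip(vectors, labels) if label == c]
            if members:
                centroids[c] = mean_vector(members)
    weights = [[cosine(v, c) for c in centroids] for v in vectors]
    return labels, weights


if __name__ == "__main__":
    docs = ["stock market prices fall", "football match score",
            "market prices rise again", "football team wins match"]
    labels, weights = kmeans(docs, k=2)
    print(labels)
    print(weights)
```

No topics are defined in advance; the groups emerge from the documents themselves, and a document with a non-zero weight for several clusters could be listed under each of them.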
In one variant of the K-means clustering algorithm, the similarity between text documents
is calculated not only from a feature vector based on term frequency statistics but also from
the degree of association between words, so that the relationship between keywords is
taken into consideration. This lessens sensitivity to input sequence and frequency and, to a
certain extent, takes semantic understanding into account. It effectively raises the
similarity accuracy for short texts and simple sentences, as well as the precision and recall
of the text clustering result. The algorithm model with the idea of co-mining is shown in Fig.
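One way to combine term frequency with word association is a soft cosine measure, sketched below in Python. This is a stand-in technique chosen for illustration and is not claimed to be the model referred to above. It assumes that the association between two words is the Jaccard overlap of the sets of documents in which they occur; the function names are hypothetical.

```python
import math
from collections import Counter, defaultdict


def term_vector(text):
    """Term-frequency vector for one document (whitespace tokens)."""
    return Counter(text.lower().split())


def association_table(corpus):
    """Build assoc(t, u): Jaccard overlap of the document sets of two words."""
    postings = defaultdict(set)
    for doc_id, text in enumerate(corpus):
        for term in set(text.lower().split()):
            postings[term].add(doc_id)

    def assoc(t, u):
        if t == u:
            return 1.0
        docs_t, docs_u = postings.get(t, set()), postings.get(u, set())
        union = docs_t | docs_u
        return len(docs_t & docs_u) / len(union) if union else 0.0

    return assoc


def soft_dot(a, b, assoc):
    """Dot product in which associated words partially match each other."""
    return sum(wa * wb * assoc(t, u)
               for t, wa in a.items() for u, wb in b.items())


def soft_cosine(text_a, text_b, assoc):
    """Similarity from term frequencies plus word-to-word association."""
    a, b = term_vector(text_a), term_vector(text_b)
    norm = math.sqrt(soft_dot(a, a, assoc) * soft_dot(b, b, assoc))
    return soft_dot(a, b, assoc) / norm if norm else 0.0


if __name__ == "__main__":
    corpus = ["cheap flights to paris", "cheap airfare to paris",
              "flights and airfare deals", "rain in london"]
    assoc = association_table(corpus)
    print(soft_cosine("cheap flights", "cheap airfare", assoc))
```

With plain term-frequency cosine, "cheap flights" and "cheap airfare" share only one of two words. The soft measure also credits the association between "flights" and "airfare", which matters most for short texts.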
QUESTION ANSWERING
Another application area of natural language processing is natural language queries,
or question answering (Q&A), which deals with how to find the best answer to a given
question. Many websites equipped with question answering technology allow end
users to “ask” the computer a question and receive an answer. Q&A can utilize multiple text
mining techniques. For example, it can use information extraction to extract entities such as
people, places, and events, or question categorization to assign questions to known types (who,
where, when, how, etc.). In addition to web applications, companies can use Q&A techniques
internally for employees who are searching for answers to common questions. The education
and medical fields may also find uses for Q&A wherever there are frequently asked
questions that people wish to search.
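Question categorization can be sketched as a small rule table in Python. The labels and the function name categorize_question are hypothetical, and the rules look only at the leading question word, so this is a starting point rather than a complete classifier.

```python
import re

# Hypothetical mapping from the leading question word to an expected answer type.
QUESTION_TYPES = [
    ("person", re.compile(r"^\s*who\b", re.IGNORECASE)),
    ("place", re.compile(r"^\s*where\b", re.IGNORECASE)),
    ("time", re.compile(r"^\s*when\b", re.IGNORECASE)),
    ("method", re.compile(r"^\s*how\b", re.IGNORECASE)),
]


def categorize_question(question):
    """Assign a question to a known type, or 'other' if no rule applies."""
    for label, pattern in QUESTION_TYPES:
        if pattern.search(question):
            return label
    return "other"


if __name__ == "__main__":
    for q in ("Who approves travel requests?", "Where is the claims form?",
              "When does enrollment close?", "How do I reset my password?"):
        print(q, "->", categorize_question(q))
```

Once the type is known, a Q&A system can restrict its search to entities of that type, for example to the people found by information extraction when the question starts with "who".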
CONCLUSION
As the amount of unstructured data increases, text-mining tools will be
increasingly valuable. Text-mining methods are useful to government intelligence and
security agencies. In education, students and educators are better able to find
information relating to their topics. In business applications, text-mining tools can help
companies analyze their competition, customer base, and marketing strategies. A future
trend is integration of data mining and text mining into a single system, a combination
known as duo-mining.