24-04-2014, 03:59 PM
EFFECTIVE PATTERN DISCOVERY FOR TEXT MINING
ABSTRACT:
Data mining techniques have been proposed for discovering useful patterns in text documents. Text mining is the discovery of interesting knowledge in text documents, and the quality of the extracted features is its key issue because of the large number of terms and phrases. Finding accurate knowledge (or features) in text documents is therefore a challenging problem. Previously, Information Retrieval (IR) provided many term-based methods to address this challenge.
Most existing text mining methods are based on term-based approaches. These approaches extract a set of keywords from a document to form a vector for text representation, which causes problems with polysemy (one word with several meanings) and synonymy (several words with the same meaning). To evaluate the proposed approach, we adopt the feature extraction method used for pattern-based approaches.
In this project, we propose a new, effective pattern discovery approach for text mining. The technique includes the processes of pattern deploying and pattern evolving, which improve the effectiveness of using and updating discovered patterns for finding relevant and interesting information.
We focus on developing a knowledge discovery model that effectively uses and updates the discovered patterns, and we apply it to the field of text mining. Earlier pattern mining-based approaches have shown some improvement in effectiveness. Experiments conducted on RCV1 (Reuters Corpus Volume 1) and TREC (Text REtrieval Conference) topics indicate that the proposed approach performs well.
SCOPE OF THE PROJECT:
Given the setbacks of phrase-based approaches, sequential patterns used in the data mining community have turned out to be a promising alternative to phrases, because sequential patterns enjoy good statistical properties, as terms do. To overcome the disadvantages of phrase-based approaches, pattern mining-based approaches (or pattern taxonomy models (PTM)) have been proposed, which adopt the concept of closed sequential patterns and prune nonclosed patterns. These pattern mining-based approaches have shown some improvement in effectiveness. The paradox, however, is that pattern-based approaches are expected to be a significant alternative, yet they have delivered only modest improvements in effectiveness over term-based methods.
There are two fundamental issues regarding the effectiveness of pattern-based approaches: low frequency and misinterpretation. For a given topic, a highly frequent pattern (normally a short pattern with large support) is usually a general pattern, whereas a specific pattern usually has low frequency. If we decrease the minimum support, many noisy patterns are discovered. Misinterpretation means that the measures used in pattern mining turn out to be unsuitable when discovered patterns are used to answer what users want. The difficult problem, then, is how to use discovered patterns to accurately evaluate the weights of useful features (knowledge) in text documents.
To resolve the above paradox, this paper presents an effective pattern discovery technique. To address the misinterpretation problem, it first calculates the specificities of discovered patterns and then evaluates term weights according to the distribution of terms in the discovered patterns rather than their distribution in documents. To address the low-frequency problem, it also considers the influence of patterns from the negative training examples in order to find ambiguous (noisy) patterns and reduce their influence. The process of updating ambiguous patterns is referred to as pattern evolution. The proposed approach can improve the accuracy of evaluating term weights because discovered patterns are more specific than whole documents.
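The idea of weighting terms by their distribution in discovered patterns can be sketched as follows. This is a minimal illustration, not the exact formula of the proposed approach: the function name `deploy_patterns`, the sample patterns, and the rule that each pattern shares its support equally among its terms are all assumptions made for this example.

```python
from collections import defaultdict

def deploy_patterns(patterns):
    """Turn discovered patterns into normalised term weights.

    `patterns` is a list of (terms, support) pairs. Each pattern shares its
    support equally among its terms (an assumption of this sketch), and the
    resulting weights are normalised so that they sum to 1.
    """
    weights = defaultdict(float)
    for terms, support in patterns:
        share = support / len(terms)
        for term in terms:
            weights[term] += share
    total = sum(weights.values())
    if total == 0:
        return {}
    return {term: weight / total for term, weight in weights.items()}

# Hypothetical patterns discovered in relevant documents.
patterns = [
    (("carbon", "emission"), 3),
    (("carbon",), 2),
    (("emission", "tax", "policy"), 3),
]
term_weights = deploy_patterns(patterns)
for term, weight in sorted(term_weights.items(), key=lambda kv: -kv[1]):
    print(f"{term}: {weight:.4f}")
```

Terms that occur in many short, well-supported patterns receive the largest weights, regardless of how often they occur in whole documents.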
EXISTING SYSTEM:
Term-based approaches: These approaches extract a set of keywords from a document to form a vector for text representation. Term-based methods face challenging problems such as the very high dimensionality of text data and the uncertain meaning of words. They also suffer from polysemy and synonymy, where polysemy means that a word has multiple meanings and synonymy means that multiple words have the same meaning. The semantic meaning of many discovered terms is therefore uncertain when answering what users want.
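A term-based representation and its synonymy and polysemy problems can be illustrated with a small sketch. The vocabulary, the two sample sentences, and the function name `term_vector` are made up for this example.

```python
from collections import Counter

def term_vector(text, vocabulary):
    """Represent a text as a vector of term counts over a fixed vocabulary."""
    counts = Counter(text.lower().split())
    return [counts[term] for term in vocabulary]

vocabulary = ["bank", "car", "automobile", "river"]
doc_a = "the car stopped near the bank"
doc_b = "the automobile stopped near the river bank"

vec_a = term_vector(doc_a, vocabulary)
vec_b = term_vector(doc_b, vocabulary)
print(vec_a)
print(vec_b)
```

The synonyms "car" and "automobile" fall into different dimensions, so the vectors look less similar than the texts are, while the two senses of "bank" share one dimension, so the vectors look more similar than they should.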
Phrase-based approaches: Phrase-based approaches could perform better than term-based ones, as phrases may carry more “semantics” than individual terms and are less ambiguous and more discriminative. However, phrases suffer from low frequency of occurrence, and the related pattern-based measures suffer from misinterpretation, meaning that the measures used in pattern mining are not suitable for using discovered patterns to answer what users want.
PROPOSED SYSTEM:
To overcome the disadvantages of phrase-based approaches, pattern mining-based approaches, or pattern taxonomy models (PTM), have been proposed, which adopt the concept of closed sequential patterns and prune nonclosed patterns. The proposed system builds on sequential patterns and pattern mining-based approaches.
Sequential patterns: Sequential pattern mining is a data mining technique that extracts significant patterns from sequential data. Sequential patterns used in the data mining community have turned out to be a promising alternative to phrases.
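As a rough illustration of closed sequential patterns, the sketch below enumerates frequent sequential patterns by brute force and then prunes the nonclosed ones. It is not one of the efficient algorithms used in practice. The function names, the sample paragraphs, the paragraph-level support count, and the `max_length` cap are assumptions of this example, and closedness is only checked against patterns up to that cap.

```python
from itertools import combinations

def is_subsequence(pattern, sequence):
    """True if `pattern` occurs in `sequence` in order (gaps allowed)."""
    remaining = iter(sequence)
    return all(term in remaining for term in pattern)

def frequent_sequential_patterns(paragraphs, min_support, max_length=3):
    """Enumerate candidate subsequences and keep those with enough support.

    Support is the number of paragraphs that contain the pattern.
    """
    candidates = set()
    for seq in paragraphs:
        for n in range(1, max_length + 1):
            for idx in combinations(range(len(seq)), n):
                candidates.add(tuple(seq[i] for i in idx))
    supports = {}
    for cand in candidates:
        count = sum(is_subsequence(cand, seq) for seq in paragraphs)
        if count >= min_support:
            supports[cand] = count
    return supports

def closed_patterns(supports):
    """Drop every pattern that has a longer super-pattern with equal support."""
    return {
        p: s for p, s in supports.items()
        if not any(
            len(q) > len(p) and supports[q] == s and is_subsequence(p, q)
            for q in supports
        )
    }

# Hypothetical document split into paragraphs of terms.
paragraphs = [
    ["carbon", "emission", "tax"],
    ["carbon", "emission"],
    ["emission", "tax"],
]
frequent = frequent_sequential_patterns(paragraphs, min_support=2)
closed = closed_patterns(frequent)
print(closed)
```

Here "carbon" alone is pruned because the longer pattern ("carbon", "emission") occurs in exactly the same number of paragraphs, so the shorter pattern adds no information.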
Pattern mining-based approaches: These approaches have shown some improvement in effectiveness. Pattern-based approaches were expected to be a significant alternative, yet they have delivered only modest improvements in effectiveness over term-based methods. A variety of efficient algorithms have been proposed, such as Apriori-like algorithms, PrefixSpan, FP-tree, SPADE, SLPMiner, and GST.
FUNCTIONAL REQUIREMENTS
Functional requirements specify which output should be produced from a given input; they describe the relationship between the input and the output of the system. For each functional requirement, a detailed description of all data inputs, their sources, and the range of valid inputs must be specified. A typical functional requirement contains a unique name and number, a brief summary, and a rationale. This information helps the reader understand why the requirement is needed and makes it possible to track the requirement through the development of the system.
NON FUNCTIONAL REQUIREMENTS
Non-functional requirements describe user-visible aspects of the system that are not directly related to its functional behavior. They include quantitative constraints, such as response time (i.e., how fast the system reacts to user commands) or accuracy (i.e., how precise the system's numerical answers are). The plan for implementing functional requirements is detailed in the system design. The plan for implementing non-functional requirements is detailed in the system architecture.
PSEUDO REQUIREMENTS
These requirements are imposed by the client and restrict the implementation of the system. Typical pseudo requirements are the implementation language and the platform on which the system is to be implemented. They usually have no direct effect on the user’s view of the system.
INFORMATION RETRIEVAL:
Information retrieval (IR) is the area of study concerned with searching for documents, for information within documents, and for metadata about documents, as well as with searching structured storage, relational databases, and the World Wide Web. There is overlap in the usage of the terms data retrieval, document retrieval, information retrieval, and text retrieval, but each also has its own body of literature, theory, praxis, and technologies. IR has developed many mature techniques which demonstrate that terms are important features in text documents. However, many terms with larger weights are general terms, because they are frequently used in both relevant and irrelevant information. It is therefore not adequate to evaluate the weights of terms based only on their distribution in documents for a given topic, although this evaluation method has been frequently used in developing IR models.
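The weakness of weighting terms by their distribution in documents can be seen in a small sketch. The weighting scheme (plain relative frequency), the function name `term_frequency_weights`, and the sample documents are assumptions chosen for illustration, not a specific IR model from the literature.

```python
from collections import Counter

def term_frequency_weights(documents):
    """Weight each term by its relative frequency across the documents."""
    counts = Counter(term for doc in documents for term in doc)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {term: n / total for term, n in counts.items()}

# Hypothetical tokenised documents for one topic (oil prices).
relevant = [
    ["market", "market", "oil", "price"],
    ["market", "report", "oil"],
]
irrelevant = [
    ["market", "report", "football"],
    ["market", "report", "weather"],
]

relevant_weights = term_frequency_weights(relevant)
irrelevant_weights = term_frequency_weights(irrelevant)
top_term = max(relevant_weights, key=relevant_weights.get)
general_terms = [t for t in relevant_weights if t in irrelevant_weights]
print(top_term, general_terms)
```

The highest-weighted term in the relevant documents is "market", which is just as common in the irrelevant ones, so a large document-based weight does not by itself indicate a topic-specific term.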
QUERY WEIGHT:
Query weight is the process used to find a particular word across a collection of n files. It uses discovered patterns to accurately evaluate the weights of useful features (knowledge) in text documents. The proposed approach can improve the accuracy of evaluating term weights because discovered patterns are more specific than whole documents. In our project, query weight returns the elements related to the search from a large amount of data, together with the files that contain them.
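One simple way to use term weights for retrieving related files is sketched below: each file is scored by the sum of the weights of the weighted terms it contains, and files are ranked by that score. The scoring rule, the function names, the file names, and the weights are all hypothetical and only illustrate the idea.

```python
def document_score(document_terms, term_weights):
    """Sum the weights of all weighted terms that appear in the document."""
    present = set(document_terms)
    return sum(w for term, w in term_weights.items() if term in present)

def rank_documents(documents, term_weights):
    """Return (name, score) pairs sorted from most to least relevant."""
    scored = [
        (name, document_score(terms, term_weights))
        for name, terms in documents.items()
    ]
    return sorted(scored, key=lambda item: item[1], reverse=True)

# Hypothetical term weights and tokenised files.
term_weights = {"carbon": 0.5, "emission": 0.3, "tax": 0.2}
documents = {
    "a.txt": ["carbon", "tax", "debate"],
    "b.txt": ["football", "score"],
    "c.txt": ["carbon", "emission", "tax"],
}
ranking = rank_documents(documents, term_weights)
for name, score in ranking:
    print(f"{name}: {score:.2f}")
```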
EFFECTIVE PATTERN:
The effective pattern discovery technique includes the processes of pattern deploying and pattern evolving, which improve the effectiveness of using and updating discovered patterns for finding relevant and interesting information. By contrast, text mining systems that use phrases as the text representation showed no significant improvement in effectiveness. The effective pattern process returns results that are related in meaning to what the user is looking for. In our project, these patterns are used to mine the correct words and to present their meanings.
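Pattern evolving can be sketched, in a heavily simplified form, as reducing the weights of terms that make a negative (irrelevant) document look relevant. The rule below, its `threshold` and `mu` parameters, the function name `evolve_weights`, and the sample data are assumptions of this example rather than the exact update used by the proposed approach.

```python
def evolve_weights(term_weights, negative_documents, threshold, mu=2.0):
    """Reduce the weights of terms shared with high-scoring negative documents.

    A negative document whose score reaches `threshold` is treated as a sign
    that the terms it shares with the model are ambiguous, so their weights
    are divided by `mu`.
    """
    weights = dict(term_weights)
    for terms in negative_documents:
        present = {t for t in terms if t in weights}
        score = sum(weights[t] for t in present)
        if score >= threshold:
            for t in present:
                weights[t] /= mu
    return weights

# Hypothetical weights and negative training documents.
term_weights = {"carbon": 0.5, "emission": 0.3, "tax": 0.2}
negatives = [
    ["tax", "football"],           # score 0.2, below the threshold
    ["carbon", "tax", "holiday"],  # score 0.7, triggers an update
]
evolved = evolve_weights(term_weights, negatives, threshold=0.6)
print(evolved)
```

Only the second negative document scores high enough to trigger an update, so "carbon" and "tax" are down-weighted while "emission" keeps its weight.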
DATA FLOW DIAGRAM
1. The data flow diagram (DFD) is also called a bubble chart. It is a simple graphical formalism that can be used to represent a system in terms of the input data to the system, the various processing carried out on this data, and the output data generated by the system.
2. The DFD is one of the most important modeling tools. It is used to model the system components: the system processes, the data used by the processes, the external entities that interact with the system, and the information flows in the system.
3. The DFD shows how information moves through the system and how it is modified by a series of transformations. It is a graphical technique that depicts information flow and the transformations that are applied as data moves from input to output.
4. A DFD may be used to represent a system at any level of abstraction. It may be partitioned into levels that represent increasing information flow and functional detail.
USE CASE DIAGRAM:
A use case diagram in the Unified Modeling Language (UML) is a type of behavioral diagram defined by and created from a use-case analysis. Its purpose is to present a graphical overview of the functionality provided by a system in terms of actors, their goals (represented as use cases), and any dependencies between those use cases. In short, it shows which system functions are performed for which actor.