Most common text mining techniques are based on the statistical analysis of a term, either a word or a phrase. Statistical analysis of term frequency captures the importance of a term within a document only. However, two terms can have the same frequency in their documents while one contributes more to the meaning of its sentences than the other. The underlying text mining model should therefore indicate the terms that capture the semantics of the text. Such a model can capture the terms that present the concepts of a sentence, which leads to the discovery of the topic of the document. A new concept-based mining model is introduced that analyzes terms at the sentence, document, and corpus levels. The concept-based mining model can effectively discriminate between terms that are unimportant with respect to sentence semantics and terms that hold the concepts representing the meaning of the sentence.
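The limitation of frequency-only analysis can be seen in a small sketch. The example sentence and the helper `term_frequencies` are illustrative assumptions, not taken from the work described here; the point is only that raw counts cannot separate a content-bearing term from an incidental one.

```python
import re
from collections import Counter

def term_frequencies(text):
    """Count lowercase word tokens in a text (hypothetical helper)."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

# Made-up example document: "probe" carries the topic, "also" does not.
doc = "The probe also reached the comet. The probe also mapped the comet."
tf = term_frequencies(doc)

# Both terms occur twice, so term frequency alone treats them as equally important.
print(tf["probe"], tf["also"])
```

Here "probe" and "also" receive the same count even though only "probe" contributes to what the sentences are about, which is the gap a semantics-aware analysis is meant to close.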
The proposed mining model consists of sentence-based concept analysis, document-based concept analysis, corpus-based concept analysis, and a concept-based similarity measure. A term that contributes to sentence semantics is analyzed at the sentence, document, and corpus levels, rather than through the traditional analysis of the document only. The proposed model can efficiently find significant matching concepts between documents according to the semantics of their sentences. The similarity between documents is calculated with a new concept-based similarity measure, which takes full advantage of the concept analysis measures at the sentence, document, and corpus levels. Large sets of experiments using the proposed concept-based mining model are conducted on different data sets in text clustering. The experiments provide an extensive comparison between concept-based analysis and traditional analysis. Experimental results demonstrate a substantial improvement in clustering quality when sentence-based, document-based, and corpus-based concept analysis is used.
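One way to picture a similarity measure that draws on all three levels is the sketch below. It is not the measure proposed here: the weighting formula, the function names `concept_weights` and `concept_similarity`, and the toy corpus are all assumptions made for illustration, and concepts are assumed to have been extracted from each sentence already.

```python
import math
from collections import Counter

def concept_weights(doc, corpus):
    """Weight each concept in `doc` from sentence, document, and corpus statistics.

    `doc` is a list of sentences; each sentence is a list of concept strings.
    The formula is an illustrative assumption, not the published measure.
    """
    n_docs = len(corpus)
    doc_counts = Counter(c for sent in doc for c in sent)
    total = sum(doc_counts.values())
    weights = {}
    for concept, count in doc_counts.items():
        # Sentence level: share of this document's sentences that mention the concept.
        sent_share = sum(concept in sent for sent in doc) / len(doc)
        # Document level: relative frequency of the concept in the document.
        doc_share = count / total
        # Corpus level: concepts found in fewer documents get a larger factor.
        # max(1, ...) guards against a document that is not part of `corpus`.
        df = max(1, sum(any(concept in s for s in d) for d in corpus))
        weights[concept] = (sent_share + doc_share) * math.log(1 + n_docs / df)
    return weights

def concept_similarity(doc_a, doc_b, corpus):
    """Cosine similarity between the concept-weight vectors of two documents."""
    wa = concept_weights(doc_a, corpus)
    wb = concept_weights(doc_b, corpus)
    dot = sum(wa[c] * wb[c] for c in wa.keys() & wb.keys())
    norm_a = math.sqrt(sum(v * v for v in wa.values()))
    norm_b = math.sqrt(sum(v * v for v in wb.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Toy corpus (made up): two documents about a space probe, one about prices.
corpus = [
    [["probe", "comet"], ["probe", "surface"]],
    [["probe", "comet"], ["comet", "orbit"]],
    [["market", "price"], ["price", "inflation"]],
]
sim_close = concept_similarity(corpus[0], corpus[1], corpus)
sim_far = concept_similarity(corpus[0], corpus[2], corpus)
print(sim_close > sim_far)
```

Documents that share concepts score above zero, while documents with no concept in common score exactly zero. A pairwise matrix of such scores could then be passed to a clustering algorithm, which is the setting in which the model is evaluated.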