Distributional Features for Text Categorization

Abstract

Text categorization is the task of assigning predefined categories to natural language text. With the widely used “bag-of-words” representation, previous research usually assigns each word a value that expresses whether the word appears in the document concerned or how frequently it appears. Although these values are useful for text categorization, they do not fully capture the abundant information contained in the document. This project explores the effect of other types of values, which express the distribution of a word in the document. These novel values are called distributional features; they include the compactness of the appearances of the word and the position of its first appearance. The proposed distributional features are exploited by a tf-idf style equation, and different features are combined using ensemble learning techniques. We conclude that the distributional features are useful for text categorization, especially when they are combined with term frequency or with each other.

Existing system:

The existing system assigns each word a value that expresses whether the word appears in the document concerned or how frequently it appears. Another existing approach uses statistical phrases, usually called n-grams: sequences of words that occur contiguously in text in a statistically interesting way.
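
As an illustration of these traditional values, the following minimal sketch computes binary presence, term frequency, and contiguous word n-grams for one document. The function names and the whitespace tokenizer are assumptions made for this example, and the n-gram function returns every contiguous sequence without the statistical filtering that a real phrase extractor would apply.

```python
from collections import Counter


def tokenize(text):
    # Assumption: lowercase whitespace tokenization is enough for the example.
    return text.lower().split()


def binary_features(tokens):
    # 1 for every word that appears at least once in the document.
    return {word: 1 for word in set(tokens)}


def term_frequencies(tokens):
    # Number of times each word appears in the document.
    return Counter(tokens)


def ngrams(tokens, n=2):
    # All contiguous word sequences of length n (no statistical filtering).
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


tokens = tokenize("the cat sat on the mat")
presence = binary_features(tokens)
frequency = term_frequencies(tokens)
bigrams = ngrams(tokens, 2)
```

Both representations record that a word occurs, or how often, but neither records where in the document it occurs.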

Existing system disadvantages:

The existing features are not enough for fully capturing the information contained in a document.
The system is comparatively slow.

Proposed system:

The proposed distributional features are exploited by a tf-idf style equation, and different features are combined using ensemble learning techniques. The distributional features are extracted efficiently using the inverted index constructed for the corpus. With this type of index, for a given word-document pair, we can obtain not only the frequency of the word but also the positions where it appears. From the position information and the length of the document, the distribution of the word is constructed and the distributional features are computed.
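
The extraction step described above can be sketched as follows. This is a minimal illustration, not the project's implementation: the function names are hypothetical, and the `spread` value (normalized distance between the first and last appearance) is only one simple stand-in for a compactness measure, since the text does not give the exact formula.

```python
from collections import defaultdict


def build_positional_index(docs):
    # docs maps doc_id -> list of tokens.
    # index[word][doc_id] holds the sorted positions of word in that document.
    index = defaultdict(dict)
    lengths = {}
    for doc_id, tokens in docs.items():
        lengths[doc_id] = len(tokens)
        for pos, word in enumerate(tokens):
            index[word].setdefault(doc_id, []).append(pos)
    return index, lengths


def distributional_features(index, lengths, word, doc_id):
    # Look up positions without creating new index entries for unknown words.
    positions = index.get(word, {}).get(doc_id, [])
    if not positions:
        return None
    n = lengths[doc_id]
    return {
        "tf": len(positions),                 # term frequency
        "first_appearance": positions[0] / n,  # relative position of first use
        # Assumed compactness proxy: smaller spread means more compact.
        "spread": (positions[-1] - positions[0]) / n,
    }


docs = {
    "d1": "a b a c a d".split(),
    "d2": "x a y z".split(),
}
index, lengths = build_positional_index(docs)
feats_d1 = distributional_features(index, lengths, "a", "d1")
feats_d2 = distributional_features(index, lengths, "a", "d2")
```

Because the positions are already stored in the index, the frequency and the distributional values come from the same lookup, which is why the extra cost stays small.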

Proposed system advantages:

Adding distributional features to text categorization requires only a little additional cost.
Combining traditional term frequency with the distributional features improves the performance of the system.
The effect of the distributional features is most pronounced when the documents are long and the writing style is informal.