25-02-2013, 01:02 PM
Movie Rating and Review Summarization in Mobile Environment
Abstract
In this paper, we design and develop a movie-rating
and review-summarization system in a mobile environment. The
movie-rating information is based on the sentiment-classification
result. The condensed descriptions of movie reviews are generated
from the feature-based summarization. We propose a novel approach
based on latent semantic analysis (LSA) to identify product
features. Furthermore, we find a way to reduce the size of the summary
based on the product features obtained from LSA. We consider
both sentiment-classification accuracy and system response time
to design the system. The rating and review-summarization system
can be extended to other product-review domains easily.
INTRODUCTION
People’s opinions have become an extremely important
source for various services in ever more popular
social networks. In particular, online opinions have turned into
a kind of virtual currency for businesses looking to market their
products, identify new opportunities, and manage their reputations.
Meanwhile, cellular phones have become a
vital part of our lives. There is no doubt that the mobile
platform is currently one of the most popular platforms in the
world. However, digital content displayed in cellular phones
is limited in size, since cellular phones are physically small.
Hence, a mechanism that can provide users with condensed
descriptions of documents will facilitate the delivery of digital
content in cellular phones. This paper explores and designs
a mobile system for movie rating and review summarization in
which semantic orientation of comments, the limitation of small
display capability of cellular devices, and system response time
are considered.
LATENT-SEMANTIC-ANALYSIS-BASED
PRODUCT-FEATURE IDENTIFICATION
In this paper, we propose a novel approach based on LSA
to identify related product-feature terms. Essentially, LSA is
a theory and method to analyze relationships between a set
of documents and the terms they contain by producing a set
of concepts related to the documents and terms. LSA can be
applied to any type of count data over a discrete dyadic domain,
so-called two-mode data [16]. Suppose that
a collection of documents D = {d1, . . . , dn} with terms from
W = {w1, . . . , wm} is given. The system can then construct a
co-occurrence matrix M of dimension n × m, where each
entry Mij denotes the number of times the term wj occurs
in document di. Each document di is represented by a row
vector, while each term wj is represented by a column vector.
As shown in (1), LSA applies singular-value decomposition
(SVD) to the term-document matrix M, and a low-rank approximation
of the matrix M could be used to determine patterns in
the relationships between the terms and concepts contained in
the text.
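The decomposition described above can be sketched in a few lines. This is
a minimal illustration only, assuming NumPy is available; the vocabulary,
the counts, the rank k = 2, and the helper name related_terms are all made
up for this example and are not taken from the paper.

```python
import numpy as np

# Toy document-term count matrix M (n documents x m terms).
# Vocabulary and counts are invented for illustration.
terms = ["actor", "acting", "plot", "story", "music"]
M = np.array([
    [2, 1, 0, 0, 0],   # d1: mostly about acting
    [1, 2, 0, 0, 0],   # d2: mostly about acting
    [0, 0, 2, 1, 0],   # d3: mostly about plot
    [0, 0, 1, 2, 1],   # d4: about plot, mentions music
], dtype=float)

# Singular-value decomposition: M = U @ diag(s) @ Vt.
U, s, Vt = np.linalg.svd(M, full_matrices=False)

# Low-rank approximation: keep only the k largest singular values.
k = 2  # assumed number of latent concepts
M_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

# Represent each term by its coordinates in the k-dimensional concept space.
term_vectors = (np.diag(s[:k]) @ Vt[:k, :]).T   # shape (m, k)


def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


def related_terms(seed, top=2):
    """Rank the other terms by cosine similarity to the seed term."""
    i = terms.index(seed)
    scores = [(t, cosine(term_vectors[i], term_vectors[j]))
              for j, t in enumerate(terms) if j != i]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:top]
```

With this toy matrix, related_terms("actor") ranks "acting" first, because
the two terms occur in the same documents and therefore end up on the same
latent concept.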
Dataset
In this paper, we collected Chinese movie reviews from
Internet blogs. Since the original data are hypertext markup
language (HTML) documents, an HTML-tag-removal process is required
to extract the text information. Training data are necessary
for SVM to train a classification model, and manual classification
is performed to label the training reviews as positive
or negative. We randomly selected 500 positive reviews
and 500 negative reviews as the data for classification-model
building. In addition to the model-building data, we further collected
around 8000 movie reviews from the Internet, and these
reviews are used as the movie-review database.
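The HTML-tag-removal step might look like the following sketch, which
uses only the Python standard library. The class and function names are
hypothetical, and the paper does not say which tool the authors used.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect text nodes, skipping the contents of <script> and <style>."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self._chunks.append(data)

    def text(self):
        # Join the text nodes and collapse runs of whitespace.
        return " ".join(" ".join(self._chunks).split())


def strip_html(html):
    """Return the visible text of an HTML document."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()
```

Joining text nodes with a space suits English text; for Chinese reviews
one would probably join without a separator, since words are not
delimited by spaces.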
Sentiment Classification
As mentioned above, sentiment classification is similar to a
traditional binary-classification problem. Currently, many classification
algorithms such as SVM [1], [10], [18], [19], decision
trees [20], and neural networks [21] have been proposed and
have shown their capabilities in different domains. SVM is one of the
state-of-the-art algorithms and has been shown to be highly
effective in traditional text categorization. SVM measures the
complexity of hypotheses based on the margin with which they
separate the data instead of the number of features. One remarkable
property of SVM is that its ability to learn can be
independent of the dimensionality of the feature space.
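To make the margin idea concrete, here is a small sketch of a linear SVM
trained by batch subgradient descent on the regularized hinge loss,
assuming NumPy. It is not the authors' implementation (the excerpt does
not name the SVM package they used); the toy reviews, vocabulary, and
hyperparameters are invented for illustration.

```python
import numpy as np


def bag_of_words(docs, vocab):
    """Count how often each vocabulary word occurs in each document."""
    index = {word: j for j, word in enumerate(vocab)}
    X = np.zeros((len(docs), len(vocab)))
    for i, doc in enumerate(docs):
        for word in doc.split():
            if word in index:
                X[i, index[word]] += 1
    return X


def train_linear_svm(X, y, lam=0.01, epochs=200, lr=0.1):
    """Minimize lam/2 * ||w||^2 + mean hinge loss by subgradient descent."""
    n, m = X.shape
    w, b = np.zeros(m), 0.0
    for _ in range(epochs):
        margins = y * (X @ w + b)
        active = margins < 1          # examples that violate the margin
        grad_w = lam * w - X[active].T @ y[active] / n
        grad_b = -y[active].sum() / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b


def predict(X, w, b):
    """Return +1 (positive review) or -1 (negative review) per row."""
    return np.where(X @ w + b >= 0, 1, -1)


# Invented toy data: +1 = positive review, -1 = negative review.
vocab = ["great", "moving", "boring", "terrible", "story", "acting"]
train_docs = ["great moving story", "great acting",
              "boring story", "terrible boring acting"]
y_train = np.array([1.0, 1.0, -1.0, -1.0])

X_train = bag_of_words(train_docs, vocab)
w, b = train_linear_svm(X_train, y_train)
X_test = bag_of_words(["great moving", "terrible boring"], vocab)
```

Only the examples that violate the margin contribute to each update,
which is why the learned model depends on the margin rather than directly
on the number of features.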
EXPERIMENT
Several experiments are performed to evaluate our system. In
the sentiment-classification experiment, SVM is employed to perform
the sentiment-classification task. Several feature combinations
are used to evaluate the system performance. Since the
application runs on a mobile platform, classification
accuracy is not the only factor in system design. The system
will be infeasible if it takes a long time to respond. Therefore,
a system-response-time-evaluation experiment is conducted
as well. In product-feature identification, we propose an LSA-based
approach to identify the product features and compare the
LSA-based approach with frequency-based and PLSA-based
approaches using the movie-review-glossary dataset.
Product-Feature Identification
In product-feature identification, we compared our LSA-based
approach with two other approaches, which are frequency-based
and PLSA-based. We performed experiments using the
movie-review documents mentioned above, which are available
at http://www.cs.cornell.edu/People/pabo/movie-review-data/.
The dataset includes 1000 positive and 1000 negative
movie reviews. Since nouns are the candidates for product features,
only nouns are used in this experiment; the total
number of nouns is 29 632. In addition to the movie-review
dataset, we employed the movie-review glossary, which is available
at http://www.movieprofilermovieglossary, as the basis
of the comparison. The movie-review glossary is created
for movie reviewers, critics, and film students alike, as well
as the general public interested in movie-reviewing and
filmmaking-related terminology. The glossary contains
1069 terms. Since many terms are used only in the movie industry,
additional filtering is applied to the glossary. Only the terms
appearing in the movie-review data are kept, which leaves
383 terms.
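The glossary filtering described above amounts to a set intersection. The
sketch below shows that step, plus one possible way to score candidate
product features against the filtered glossary. The excerpt does not
state which metric the authors used, so the precision/recall scoring is
an assumption, and all function names are hypothetical.

```python
def filter_glossary(glossary_terms, review_nouns):
    """Keep glossary terms that occur at least once among the review nouns."""
    seen = {noun.lower() for noun in review_nouns}
    unique_terms = {term.lower() for term in glossary_terms}
    return sorted(term for term in unique_terms if term in seen)


def precision_recall(candidates, reference):
    """Score candidate features against the reference terms (assumed metric)."""
    candidates, reference = set(candidates), set(reference)
    hits = len(candidates & reference)
    precision = hits / len(candidates) if candidates else 0.0
    recall = hits / len(reference) if reference else 0.0
    return precision, recall
```

For example, filter_glossary(["Plot", "Gaffer", "Actor"], ["actor", "plot",
"music"]) keeps "actor" and "plot" and drops "gaffer", which never appears
in the reviews.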
CONCLUSION
In this paper, we design and implement a movie-rating and
review-summarization system in a mobile environment. Sentiment
classification is applied to the movie reviews, and rating
information is based on sentiment-classification results.
In feature-based summarization, product-feature identification
plays an essential role, and we propose a novel approach based
on LSA to identify related product features. Moreover, we use a
statistical approach to identify opinion words. Product features
and opinion words are used as the basis for feature-based
summarization.
In a system-performance-analysis experiment, the number of
features plays an important role in SVM-model loading and
prediction. We use a frequency criterion to reduce the number of
features, and the experiment shows that it takes less than 6 s to
load the SVM model and classify the reviews. Furthermore, we
propose an LSA-based filtering approach to reduce the size of
the summary based on the user’s preferred aspect.
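A frequency criterion for reducing the feature set can be sketched as
follows. The excerpt does not give the actual threshold or counting
scheme, so min_count and the use of total corpus frequency are
assumptions made for illustration.

```python
from collections import Counter


def select_frequent_features(tokenized_docs, min_count=2):
    """Keep terms whose total corpus frequency is at least min_count."""
    counts = Counter(tok for doc in tokenized_docs for tok in doc)
    return sorted(term for term, c in counts.items() if c >= min_count)
```

A smaller vocabulary means a smaller feature vector per review, which is
what shortens SVM-model loading and prediction on the mobile side.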