20-04-2013, 04:54 PM
A Probabilistic Approach to Spatiotemporal Theme Pattern Mining on Weblogs
ABSTRACT
Mining subtopics from weblogs and analyzing their
spatiotemporal patterns have applications in multiple domains.
In this paper, we define the novel problem of mining
spatiotemporal theme patterns from weblogs and propose a novel
probabilistic approach to model the subtopic themes and
spatiotemporal theme patterns simultaneously. The proposed
model discovers spatiotemporal theme patterns by (1) extracting
common themes from weblogs; (2) generating theme life cycles
for each given location; and (3) generating theme snapshots for
each given time period. Evolution of patterns can be discovered
by comparative analysis of theme life cycles and theme
snapshots. Experiments on three different data sets show that
the proposed approach can discover interesting spatiotemporal
theme patterns effectively. The proposed probabilistic model is
general and can be used for spatiotemporal text mining on any
domain with time and location information.
INTRODUCTION
With their rapid growth in recent years, weblogs (or blogs for
short) have become a prevalent type of media on the Internet
[7]. At the same time, a growing body of research is conducted
on weblogs, treating blogs not only as a new information
source but also as an appropriate testbed for many novel
research problems and algorithms [16, 26, 11, 10]. We consider
weblogs as online diaries published and maintained by
individual users, ordered chronologically with time stamps, and
usually associated with a profile of their authors. Compared
with traditional media such as online news sources (e.g., CNN
online) and public websites maintained by companies or
organizations (e.g., Yahoo!), weblogs have several unique
characteristics: 1) The content of weblogs is highly personal
and rapidly evolving. 2) Weblogs are usually associated with
the personal information of their authors.
Parameter Estimation
In this section, we discuss how we estimate the parameters of
the spatiotemporal theme model above using the maximum
likelihood estimator, which chooses parameter values to
maximize the data likelihood.
The general model has many parameters to estimate. However,
for the purpose of spatiotemporal weblog mining, we will
regularize the model by fixing some parameters.
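As a rough illustration of maximum likelihood estimation with some
parameters held fixed, the sketch below runs EM (a standard way to
maximize the likelihood of mixture models) on a simplified theme
mixture. It is not the spatiotemporal model itself: it assumes a plain
document-level mixture with no time or location variables, in which the
background word distribution and its weight lambda_b are fixed and only
p(w | theme) and p(theme | d) are estimated. The function name
fit_themes, its arguments, and the toy counts are hypothetical.

```python
import random


def fit_themes(counts, k, lambda_b=0.9, iters=50, seed=0):
    """EM for a simplified theme mixture with a FIXED background model.

    counts: list of per-document word-count lists (all the same length).
    Assumes every document is non-empty and 0 <= lambda_b < 1.
    Returns (theme_word, doc_theme): p(w | theme j) and p(theme j | d).
    """
    rng = random.Random(seed)
    n_docs, n_words = len(counts), len(counts[0])
    total = sum(map(sum, counts))
    # Fixed parameter: background p(w | B) from collection frequencies.
    background = [sum(doc[w] for doc in counts) / total
                  for w in range(n_words)]

    def normalise(row):
        s = sum(row)
        return [x / s for x in row]

    # Free parameters, initialised to random positive distributions.
    theme_word = [normalise([rng.random() + 0.1 for _ in range(n_words)])
                  for _ in range(k)]
    doc_theme = [normalise([rng.random() + 0.1 for _ in range(k)])
                 for _ in range(n_docs)]

    for _ in range(iters):
        word_acc = [[0.0] * n_words for _ in range(k)]
        doc_acc = [[0.0] * k for _ in range(n_docs)]
        for d in range(n_docs):
            for w in range(n_words):
                c = counts[d][w]
                if c == 0:
                    continue
                joint = [doc_theme[d][j] * theme_word[j][w]
                         for j in range(k)]
                mix = sum(joint)
                # E-step: probability that this word came from the themes
                # rather than from the fixed background model.
                from_themes = ((1 - lambda_b) * mix /
                               (lambda_b * background[w]
                                + (1 - lambda_b) * mix))
                for j in range(k):
                    share = c * from_themes * joint[j] / mix
                    word_acc[j][w] += share
                    doc_acc[d][j] += share
        # M-step: re-normalise the expected counts.
        theme_word = [normalise(row) for row in word_acc]
        doc_theme = [normalise(row) for row in doc_acc]
    return theme_word, doc_theme


# Toy usage with made-up counts: 4 documents, 5 vocabulary words.
toy_counts = [[5, 3, 0, 0, 1], [4, 4, 0, 1, 0],
              [0, 0, 6, 5, 1], [0, 1, 4, 6, 0]]
theme_word, doc_theme = fit_themes(toy_counts, k=2, lambda_b=0.5, iters=30)
```

Fixing the background distribution and lambda_b leaves fewer free
parameters to estimate, which is one simple way to read "regularize the
model by fixing some parameters".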
Generality of the Model
Although motivated by specific needs in blog mining, the
probabilistic model we proposed is quite general. In this
section, we show that several existing models can be viewed as
special cases of the spatiotemporal model when we make
different simplification assumptions about
$p(w, \theta_j \mid d, t, l)$.
Parameter Setting
In the spatiotemporal theme model, there are several
user-input parameters which provide flexibility for the
spatiotemporal theme analysis. These parameters are set
empirically. In principle, it is not easy to optimize these
parameters without relying on domain knowledge and information
about the goal of the data analysis. However, the nature of
this mining task is to give users the flexibility to explore
the spatiotemporal text data according to their beliefs about
the data. We expect that changing these parameters will not
affect the major themes and trends, but will provide
flexibility in analyzing them. The effect of the parameters is
as follows.
Generally, we expect each discovered theme to be semantically
coherent and distinctive from the general information of the
collection, which is captured by the background model.
$\lambda_B$ controls the strength of the background model, and
should be set based on how discriminative we would like the
extracted themes to be. In practice, a larger $\lambda_B$ would
cause the stop words to be automatically excluded from the top
probability words in each theme language model. However, an
extremely large $\lambda_B$ could attract too much useful
information into the background model and make the component
themes difficult to interpret. Empirically, a suitable
$\lambda_B$ for blog documents can be chosen between 0.9 and
0.95.
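The trade-off can be made concrete with a small sketch. It assumes
that a word is generated from the background model with weight
lambda_b and from the themes with weight 1 - lambda_b, and computes
the posterior probability that one word occurrence is attributed to
the background. The function name background_posterior and all
probabilities below are made up for illustration.

```python
def background_posterior(p_w_background, p_w_themes, lambda_b):
    """Posterior probability that a word occurrence is explained by the
    background model, under an assumed two-way mixture:
    lambda_b * p(w | B) + (1 - lambda_b) * p(w | themes)."""
    num = lambda_b * p_w_background
    return num / (num + (1.0 - lambda_b) * p_w_themes)


# Hypothetical word probabilities: a stop word that is common
# everywhere, and a content word that is rare in the background but
# prominent in a theme.
for lambda_b in (0.5, 0.9, 0.95, 0.99):
    stop_word = background_posterior(0.05, 0.05, lambda_b)
    content_word = background_posterior(0.0001, 0.02, lambda_b)
    print(f"lambda_b={lambda_b}: stop word {stop_word:.3f}, "
          f"content word {content_word:.3f}")
```

Under these assumed numbers, the stop word is absorbed by the
background more and more as lambda_b grows, while the content word
stays mostly with the themes until lambda_b becomes very large.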
Hurricane Katrina Data Set
The Hurricane Katrina data set is the largest one in our
experiments. 7118 documents out of 9377 have location
information. We vary the time granularity from a day to a
week. The extracted themes are not sensitive to this
granularity change. We set the granularity of location as a
state and analyze the theme snapshot within the United States.
The most salient themes extracted from the Hurricane Katrina
data set are presented in Table 1, where we show the top
probability words of each theme language model. The semantic
labels of each theme are presented in the second row of Table
1. We manually label each theme with the help of the documents
with highest $p(\theta \mid d)$. A few less meaningful themes
are dropped as noise.
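The two lookups that support this manual labeling, namely the top
probability words of a theme and the documents with the highest
probability for that theme, can be sketched as follows. The helper
names top_words and top_documents, the dictionary layout, and every
number are hypothetical; none of the values are taken from Table 1.

```python
def top_words(theme_word_probs, n=3):
    """Return the n words with the highest p(w | theme)."""
    return sorted(theme_word_probs, key=theme_word_probs.get,
                  reverse=True)[:n]


def top_documents(doc_theme_probs, theme, n=2):
    """Return the n document ids with the highest p(theme | d)."""
    return sorted(doc_theme_probs,
                  key=lambda d: doc_theme_probs[d].get(theme, 0.0),
                  reverse=True)[:n]


# Made-up theme language model and document-theme weights.
theme_3 = {"oil": 0.05, "price": 0.04, "gas": 0.03, "the": 0.001}
doc_themes = {
    "doc_a": {"theme_3": 0.9},
    "doc_b": {"theme_3": 0.2},
    "doc_c": {"theme_3": 0.6},
}

print(top_words(theme_3))
print(top_documents(doc_themes, "theme_3"))
```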
From Table 1, we can tell that theme 1 suggests the concern
about "Government Response" to the disaster; theme 2 discusses
the subtopic related to "New Orleans"; theme 3 represents
people's concern about the increase of "Oil Price"; theme 4 is
about "praying and blessing" for the victims; and theme 5
covers the aid and donations made for victims. Unlike themes 1
to 5, the semantics of which can be inferred from the top
probability words, theme 6 is hard to interpret directly from
the top words. By linking back to the original documents, we
find that the documents with highest probability
$p(\theta \mid d)$ for theme 6 tend to talk about the personal
life and experiences of the author. This is interesting and
reasonable because weblogs are associated with personal
content. Indeed, we observe that a similar theme also occurs
in the other two data sets.
RELATED WORK
To the best of our knowledge, the problem of spatiotemporal
text mining has not been well studied in existing work.
Most existing text mining work (e.g., [22, 21]) does not
consider the temporal and location context of text. Li and
others proposed a probabilistic model to detect retrospective
news events by explaining the generation of "four Ws" from
each news article [18]. However, their work considers time and
location as independent variables, and aims at discovering the
recurring peaks of events rather than extracting the
spatiotemporal patterns of themes.
Some other related work can be summarized in the following
several lines.
CONCLUSIONS
Weblogs usually have a mixture of subtopics and exhibit
spatiotemporal content patterns. Discovering themes and
modeling their spatiotemporal patterns are beneficial not only
for weblog analysis, but also for many other applications and
domains. In this paper, we define the general problem of
spatiotemporal theme pattern discovery and propose a novel
probabilistic mixture model which explains the generation of
themes and spatiotemporal theme patterns simultaneously. With
this model, we discover spatiotemporal theme patterns by (1)
extracting common themes from weblogs; (2) generating theme
life cycles for each location.