Post-Level Spam Detection for Social Bookmarking Web Sites
Abstract
Social bookmarking Web sites have emerged recently
for collecting and sharing interesting Web sites among
users. People can add Web pages to such sites as bookmarks and
allow themselves as well as others to manipulate them. One of
the key features of social bookmarking sites is the ability
to annotate a Web page when it is being bookmarked. The
annotation usually contains a set of words or phrases, which are
collectively known as tags, that could reveal the semantics of
the annotated Web page. Efficient and effective search of Web
pages can then be achieved via such tags. However, spam tags
that are irrelevant to the content of Web pages often appear,
intended to deceive other users for malicious or commercial purposes.
Various techniques have been devised to tackle this tag spam
detection problem. Most of these techniques can detect
a user who always annotates spam tags. However, finer levels of
detection are seldom discussed. In this work, we propose a
method based on a text mining approach to discover the relations
between Web pages and their tag posts. These relations are then
used to compute the similarity between a Web page and its tag
post to decide whether the post is spam. Preliminary experiments show
that the accuracy of the post-level spam detection task is 83%.
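The core decision described above, comparing a Web page with one of its tag posts, can be sketched with cosine similarity over term-weight vectors. The vectors and the threshold below are purely illustrative stand-ins, not values from the paper:

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two sparse term-weight vectors
    represented as {term: weight} dictionaries."""
    terms = set(u) | set(v)
    dot = sum(u.get(t, 0.0) * v.get(t, 0.0) for t in terms)
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

# Hypothetical term weights for a page and two of its tag posts.
page = {"music": 0.8, "pop": 0.5, "album": 0.3}
good_post = {"music": 1.0, "pop": 1.0}
spam_post = {"cheap": 1.0, "pills": 1.0}

THRESHOLD = 0.3  # illustrative cutoff, not taken from the paper
print(cosine_similarity(page, good_post) >= THRESHOLD)  # True: relevant post
print(cosine_similarity(page, spam_post) >= THRESHOLD)  # False: flagged as spam
```

A post whose tags share no vocabulary with the page scores zero and falls below any positive threshold, which is the intuition behind post-level detection.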
INTRODUCTION
Social bookmarking Web sites, which belong to a kind of
social network site [1], are a class of Web sites where people
can bookmark Web pages they like, share them with others,
and save the pages bookmarked by others. Besides the bookmarking
facility, these Web sites also provide collaborative
annotation of Web pages. People can annotate interesting Web
pages using 'tags', which are words or phrases that they think
reflect the semantics of the annotated pages. Usually
the sites rank Web pages according to the popularity
of the pages and their annotations. For example, the well-known
social bookmarking site Delicious ranks a Web page by
the number of users who have bookmarked it. Tags can also
be ranked by the number of users who use them. It is common
for a user to surf Web pages according to such rankings. As
a consequence, approaches to altering the rankings, mostly by
promoting specific pages, have emerged. The use of tag spams,
or spam tags, is one of these approaches.
User-Level Detection
Markines et al. [6] identified six features and used various
machine learning algorithms provided in Weka library
(http://www.cs.waikato.ac.nz/ml/weka) to classify tag spams.
Two features they proposed are related to the semantics of tags.
One of these features, i.e. TagSpam, measures the probability
of a user being a spammer according to a predefined spam
tag vocabulary. This scheme is common [7] but needs to
identify both spam tags and spammers a priori. The other
feature, namely TagBlur, measures the degree of unrelatedness
among the tags in a post using a measure they proposed
earlier [8]. Another work by the same group [9] used a
set of 25 features, including 6 semantic features, to detect
spammers. These semantic features, however, measure the
relatedness of tags and spam tags or users and spammers
rather than the semantics of the tags. In fact, since user-level
detection only needs to decide who is a spammer, it is rather
common to use features based on the user behavior and tag
relatedness, but not tag semantics. Generally, the semantics of
tags should be represented by the tags themselves and nothing
else. That is, we should discover the semantics of tags solely
from the tags, which are sets of keywords. Kyriakopoulou
and Kalamboukis [10] combined classification and clustering
techniques to detect tag spams. They treated spam detection as
a text classification problem and used classical tf ·idf scheme
[11] to represent tags as vectors. These vectors were then
clustered and classified to decide the spammers. Madkour et
al. [12] analyzed and improved the semantic features proposed
in [9]. A better precision was achieved using their features.
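The classical tf·idf weighting used in [10] to turn tags into vectors can be sketched as follows; the tag posts below are hypothetical examples of our own:

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Weight each term by term frequency times inverse document
    frequency: tf(t, d) * log(N / df(t))."""
    n = len(docs)
    # document frequency: number of documents containing each term
    df = Counter(t for doc in docs for t in set(doc))
    return [{t: tf[t] * math.log(n / df[t]) for t in tf}
            for tf in (Counter(doc) for doc in docs)]

# Hypothetical tag posts, each treated as a small document.
posts = [["music", "pop"], ["music", "jazz"], ["viagra", "cheap"]]
vecs = tfidf_vectors(posts)
# "music" occurs in 2 of 3 posts, so its weight is log(3/2);
# "viagra" occurs in only 1, so it gets the larger weight log(3).
```

Terms concentrated in few posts receive higher weights, which is what makes the resulting vectors useful for the clustering and classification steps described above.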
Post-Level Detection
Bogers and van den Bosch [13] stated that similar
users and posts tend to use the same language. Therefore,
they built language models [14], using Indri toolkit
(http://www.lemurproject.org) and based on unigram occurrence
probabilities, to detect spam posts as well as spammers.
Liu et al. [15] were inspired by the collaborative nature
of the social bookmarking service and used such collaborative
knowledge to detect post-level as well as user-level spam.
They measured the information value of a tag according to the
number of times the tag has been used. The information value of
a post was then calculated from the information values of its
tags. Spam posts are those with small post information values. Although their
methods are effective, the semantics of tags are not considered;
instead, the usage patterns of tags are used.
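A simplified reading of this idea can be sketched as follows. The scoring formula below is our own illustrative stand-in, not the measure actually defined in [15]: a tag used in many posts carries more collaborative evidence, and a post is scored by averaging over its tags:

```python
import math
from collections import Counter

def post_information_value(post_tags, all_posts):
    """Illustrative stand-in for a usage-based information value:
    score each tag by log(1 + usage count) and average over the post."""
    usage = Counter(t for p in all_posts for t in p)
    scores = [math.log(1 + usage[t]) for t in post_tags]
    return sum(scores) / len(scores) if scores else 0.0

# Hypothetical corpus of tag posts.
corpus = [["music", "pop"], ["music", "rock"], ["music", "jazz"],
          ["zzzbuy", "zzzcheap"]]
popular = post_information_value(["music", "pop"], corpus)
obscure = post_information_value(["zzzbuy", "zzzcheap"], corpus)
# the post made of rarely used tags receives the smaller value
```

Under this scheme a post falls under suspicion when its information value is small, mirroring the detection rule described above, though the real measure in [15] may differ in form.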
RELATED WORK
Heymann et al. [5] gave a survey on early works in fighting
tag spams. According to their taxonomy, three schemes could
be applied in battling social spams, namely detection-based,
prevention-based, and demotion-based schemes. Here we only
review works in the detection-based scheme, focusing on the
representation of tag semantics. As we stated earlier, there are
three levels of granularity in tag spam detection, namely user,
post, and tag levels. To the best of our knowledge, there is no
prior work on tag-level detection.
Document Preprocessing
In this work each Web page in the training corpus was preprocessed
to extract descriptive keywords. The page was then
transformed into a vector following the vector space model
for further training. First we removed all markup language tags,
which are generally irrelevant to the semantics of the page.
The remaining text was then segmented into individual words. It
is well known that not all words are important in describing
the meaning of a text [11]. Therefore, techniques such as
stopword elimination, stemming, and keyword selection were
used to remove trivial or redundant words. We assemble all
remaining words into the vocabulary V of the training corpus.
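The page-preprocessing steps above can be sketched as follows. The stopword list and the suffix-stripping stemmer are tiny stand-ins for the real components:

```python
import re

STOPWORDS = {"the", "a", "an", "is", "are", "of", "and", "to"}  # stand-in list

def strip_markup(html):
    """Remove markup language tags, keeping only the visible text."""
    return re.sub(r"<[^>]+>", " ", html)

def stem(word):
    # crude suffix stripping as a stand-in for a real stemmer
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def preprocess(page_html):
    text = strip_markup(page_html).lower()
    words = re.findall(r"[a-z]+", text)               # segmentation
    words = [w for w in words if w not in STOPWORDS]  # stopword elimination
    return [stem(w) for w in words]                   # stemming

tokens = preprocess("<p>The singer is recording new songs</p>")
# -> ['singer', 'record', 'new', 'song']
```

The union of the tokens produced over all training pages forms the vocabulary V described above.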
The processing of tags differs from that of Web pages in
several aspects. First, tags are usually individual words which
need no segmentation. However, users often use tags that are
composed of several words, e.g. 'MichaelJackson' or
'Michael_Jackson'. Hyphens and underscores can easily be
removed. It is rather difficult, on the other hand, to segment a
concatenated word such as 'MichaelJackson' into 'Michael
Jackson' without the use of elaborate dictionary matching
techniques. For simplicity, we discard tags, as well as
segmented words of Web pages, that cannot be found in
a dictionary. We also omitted numerals and punctuation marks.
Generally there is no need to eliminate stopwords, since they
are seldom used as tags.
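The tag-processing rules above can be sketched as follows; the dictionary is a hypothetical stand-in for the real word list:

```python
DICTIONARY = {"michael", "jackson", "music", "pop"}  # stand-in word list

def normalize_tag(tag):
    """Split on hyphens and underscores, lowercase, and keep the tag
    only if every part is a dictionary word; concatenations like
    'MichaelJackson' are discarded rather than segmented, and
    numerals/punctuation are omitted."""
    parts = tag.replace("-", " ").replace("_", " ").lower().split()
    words = [w for w in parts if w.isalpha() and w in DICTIONARY]
    # drop the tag entirely if any part is out of vocabulary
    return words if parts and len(words) == len(parts) else []

print(normalize_tag("Michael_Jackson"))  # ['michael', 'jackson']
print(normalize_tag("MichaelJackson"))   # [] -- not found in the dictionary
print(normalize_tag("2012"))             # [] -- numerals are omitted
```

Discarding out-of-vocabulary tags trades recall for simplicity, avoiding the dictionary-matching segmentation described above.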
CONCLUSION
In this work we developed a method based on a machine
learning approach to detect tag spam in social bookmarking
Web sites. Traditional tag spam detection algorithms focus
on user-level detection, which can only decide whether a contributor
is a spammer. Our method, on the other hand, achieves
post-level detection, so that individual spam posts can be
identified. Social bookmarking services could benefit
from such post-level detection. We conducted experiments on the
ECML/RSDC 2008 dataset, although the dataset was designed for the
user-level detection task. The experimental results show that
the proposed method achieves an acceptable accuracy of
83%. Our method may also be extended to user-level or even
tag-level spam detection, both of which are still under development.