Seminar Topics & Project Ideas On Computer Science Electronics Electrical Mechanical Engineering Civil MBA Medicine Nursing Science Physics Mathematics Chemistry ppt pdf doc presentation downloads and Abstract

Full Version: Scalable Learning of Collective Behavior Based on Sparse Social Dimensions pdf
You're currently viewing a stripped down version of our content. View the full version with proper formatting.
Scalable Learning of Collective Behavior Based on Sparse Social Dimensions

[attachment=52911]

ABSTRACT

The study of collective behavior is to understand how individuals
behave in a social network environment. Oceans
of data generated by social media like Facebook, Twitter,
Flickr and YouTube present opportunities and challenges to
studying collective behavior in a large scale. In this work,
we aim to learn to predict collective behavior in social media.
In particular, given information about some individuals,
how can we infer the behavior of unobserved individuals
in the same network? A social-dimension based approach
is adopted to address the heterogeneity of connections
presented in social media. However, the networks in
social media are normally of colossal size, involving hundreds
of thousands or even millions of actors. The scale
of networks entails scalable learning of models for collective
behavior prediction. To address the scalability issue, we
propose an edge-centric clustering scheme to extract sparse
social dimensions. With sparse social dimensions, the socialdimension
based approach can efficiently handle networks of
millions of actors while demonstrating comparable prediction
performance as other non-scalable methods.

INTRODUCTION

Social media such as Facebook, MySpace, Twitter, Blog-
Catalog, Digg, YouTube and Flickr, facilitate people of all
walks of life to express their thoughts, voice their opinions,
and connect to each other anytime and anywhere. For instance,
popular content-sharing sites like Del.icio.us, Flickr,
and YouTube allow users to upload, tag and comment different
types of contents (bookmarks, photos, videos). Users
registered at these sites can also become friends, a fan or
follower of others. The prolific and expanded use of social
media has turn online interactions into a vital part of human
experience. The election of Barack Obama as the President
of United States was partially attributed to his smart Internet
strategy and access to millions of younger voters through
the new social media, such as Facebook. As reported in the
New York Times, in response to recent Israeli air strikes in
Gaza, young Egyptians mobilized not only in the streets of
Cairo, but also through the pages of Facebook.

COLLECTIVE BEHAVIOR LEARNING

The recent boom of social media enables the study of collective
behavior in a large scale. Here, behavior can include
a broad range of actions: join a group, connect to a person,
click on some ad, become interested in certain topics, date
with people of certain type, etc. When people are exposed in
a social network environment, their behaviors are not independent
[6, 22]. That is, their behaviors can be influenced
by the behaviors of their friends. This naturally leads to
behavior correlation between connected users.
This behavior correlation can also be explained by homophily.
Homophily [12] is a term coined in 1950s to explain
our tendency to link up with one another in ways that
confirm rather than test our core beliefs. Essentially, we are
more likely to connect to others sharing certain similarity
with us. This phenomenon has been observed not only in
the real world, but also in online systems [4]. Homophily
leads to behavior correlation between connected friends. In
other words, friends in a social network tend to behave similarly.
Take marketing as an example, if our friends buy
something, there’s better-than-average chance we’ll buy it
too.

SOCIAL DIMENSIONS

Connections in social media are not homogeneous. People
can connect to their family, colleagues, college classmates,
or some buddies met online. Some of these relations are
helpful to determine the targeted behavior (labels) but not
necessarily always so true. For instance, Figure 1 shows the
contacts of the first author on Facebook. The densely-knit
group on the right side are mostly his college classmates,
while the upper left corner shows his connections at his graduate
school. Meanwhile, at the bottom left are some of his
high-school friends. While it seems reasonable to infer that
his college classmates and friends in graduate school are very
likely to be interested in IT gadgets based on the fact that
the user is a fan of IT gadget (as most of them are majoring
in computer science), it does not make sense to propagate
this preference to his high-school friends. In a nutshell, people
are involved in different affiliations and connections are
emergent results of those affiliations. These affiliations have
to be differentiated for behavior prediction.

ALGORITHM—EDGECLUSTER

In this section, we first show one toy example to illustrate
the intuition of our proposed edge-centric clustering scheme
EdgeCluster, and then present one feasible solution to handle
large-scale networks.

Edge-Centric View

As mentioned earlier, the social dimensions extracted based
on modularity maximization are the top eigenvectors of a
modularity matrix. Though the network is sparse, the social
dimensions become dense, begging for abundant memory
space. Let’s look at the toy network in Figure 2. The
column of modularity maximization in Table 2 shows the
top eigenvector of the modularity matrix. Clearly, none of
the entries is zero. This becomes a serious problem when
the network expands into millions of actors and a reasonable
large number of social dimensions need to be extracted. The
eigenvector computation is impractical in this case. Hence,
it is essential to develop some approach such that the extracted
social dimensions are sparse.
The social dimensions according to modularity maximization
or other soft clustering scheme tend to assign a non-zero
score for each actor with respect to each affiliation. However,
it seems reasonable that the number of affiliations one
user can participate in is upperbounded by the number of
connections. Consider one extreme case that an actor has
only one connection. It is expected that he is probably active
in only one affiliation. It is not necessary to assign a nonzero
score for each affiliation. Assuming each connection
represents one dominant affiliation, we expect the number
of affiliations of one actor is no more than his connections.

K-means Variant

As mentioned above, edge-centric clustering essentially
treats each edge as one data instance with its ending nodes
being features. Then a typical k-means clustering algorithm
can be applied to find out disjoint partitions.
One concern with this scheme is that the total number of
edges might be too huge. Owning to the power law distribution
of node degrees presented in social networks, the total
number of edges is normally linear, rather than square, with
respect to the number of nodes in the network. That is,
m = O(n). This can be verified via the properties of power
law distribution.

Baseline Methods

The edge-centric clustering (or EdgeCluster) is used to
extract social dimensions on all the data sets. We adopt
cosine similarity while performing the clustering. Based on
cross validation, the dimensionality is set to 5000, 10000, and
1000 for BlogCatalog, Flickr, and YouTube, respectively. A
linear SVM classifier is exploited for discriminative learning.
In particular, we compare our proposed sparse social dimensions
with the social dimensions extracted according to
modularity maximization (denoted as ModMax) [18]. We
study how the sparsity in social dimensions affects the prediction
performance as well as the scalability. For completeness,
we also include the performance of some representative
methods: wvRN, NodeCluster and MAJORITY. We
briefly discuss these methods next.

RELATEDWORK

Within-network classification [11] refers to the classification
when data instances are presented in a network format.
The data instances in the network are not independently
identically distributed (i.i.d.) as in conventional data mining.
To capture the correlation between labels of neighboring
data objects, typically a Markov dependency assumption is
assumed. That is, the labels of one node depend on the labels
(or attributes) of its neighbors. Normally, a relational
classifier is constructed based on the relational features of
labeled data.

CONCLUSIONS AND FUTUREWORK

In this work, we examine whether or not we can predict
the online behavior of users in social media, given the behavior
information of some actors in the network. Since the
connections in a social network represent various kinds of relations,
a framework based on social dimensions is employed.
In the framework.