03-07-2014, 11:20 AM
Creating Evolving User Behavior Profiles Automatically
Creating Evolving User Behavior.pdf (Size: 1.42 MB / Downloads: 56)
Abstract
Knowledge about computer users is very beneficial for assisting them, predicting their future actions or detecting
masqueraders. In this paper, a new approach for creating and recognizing automatically the behavior profile of a computer user is
presented. In this case, a computer user behavior is represented as the sequence of the commands she/he types during her/his work.
This sequence is transformed into a distribution of relevant subsequences of commands in order to find out a profile that defines its
behavior. Also, because a user profile is not necessarily fixed but rather it evolves/changes, we propose an evolving method to keep up
to date the created profiles using an Evolving Systems approach. In this paper, we combine the evolving classifier with a trie-based
user profiling to obtain a powerful self-learning online scheme. We also develop further the recursive formula of the potential of a data
point to become a cluster center using cosine distance, which is provided in the Appendix. The novel approach proposed in this paper
can be applicable to any problem of dynamic/evolving user behavior modeling where it can be represented as a sequence of actions or
events. It has been evaluated on several real data streams
INTRODUCTION
RECOGNIZING the behavior of others in real time is a
significant aspect of different human tasks in many
different environments. When this process is carried out by
software agents or robots, it is known as user modeling. The
recognition of users can be very beneficial for assisting them
or predicting their future actions. Most existing techniques
for user recognition assume the availability of handcrafted
user profiles, which encode the a-priori known behavioral
repertoire of the observed user. However, the construction
of effective user profiles is a difficult problem for different
reasons: human behavior is often erratic, and sometimes
humans behave differently because of a change in their
goals. This last problem makes necessary that the user
profiles we create evolve.
There exists several definitions for user profile [1]. It can
be defined as the description of the user interests,
characteristics, behaviors, and preferences. User profiling
is the practice of gathering, organizing, and interpreting the
user profile information. In recent years, significant work
has been carried out for profiling users, but most of the user
profiles do not change according to the environment and
new goals of the user. An example of how to create these
static profiles is proposed in a previous work [2].
BACKGROUND AND RELATED WORK
Different techniques have been used to find out relevant
information related to the human behavior in many different
areas. The literature in this field is vast; Macedo et al. [5]
propose a system (WebMemex) that provides recommended
information based on the captured history of navigation
from a list of known users. Pepyne et al. [6] describe a method
using queuing theory and logistic regression modeling
methods for profiling computer users based on simple
temporal aspects of their behavior. In this case, the goal is
to create profiles for very specialized groups of users, who
would be expected to use their computers in a very similar
way. Gody and Amandi [7] present a technique to generate
readable user profiles that accurately capture interests by
observing their behavior on the web.
There is a lot of work focusing on user profiling in a
specific environment, but it is not clear that they can be
transferred to other environments. However, the approach
we propose in this paper can be used in any domain in
which a user behavior can be represented as a sequence of
actions or events. Because sequences play a crucial role in
human skill learning and reasoning [8], the problem of user
profile classification is examined as a problem of sequence
classification. According to this aspect, Horman and
Kaminka [9] present a learner with unlabeled sequential
data that discover meaningful patterns of sequential
THE PROPOSED APPROACH
This section introduces the proposed approach for automatic
clustering, classifier design, and classification of the
behavior profiles of users. The novel evolving user behavior
classifier is based on Evolving Fuzzy Systems and it takes into
account the fact that the behavior of any user is not fixed,
but is rather changing. Although the proposed approach
can be applied to any behavior represented by a sequence of
events, we detail it using a command-line interface (UNIX
commands) environment.
In order to classify an observed behavior, our approach, as
many other agent modeling methods [31], creates a library
which contains the different expected behaviors. However,
in our case, this library is not a prefixed one, but is evolving,
learning from the observations of the users real behaviors
and, moreover, it starts to be filled in “from scratch” by
assigning temporarily to the library the first observed user as
a prototype. The library, called Evolving-Profile-Library
(EPLib), is continuously changing, evolving influenced by
the changing user behaviors observed in the environment.
Thus, the proposed approach includes at each step the
following two main actions
CONSTRUCTION OF THE USER BEHAVIOR PROFILE
In order to construct a user behavior profile in online mode
from a data stream, we need to extract an ordered sequence
of recognized events; in this case, UNIX commands. These
commands are inherently sequential, and this sequentiality
is considered in the modeling process. According to this
aspect and based on the work done in [2], in order to get the
most representative set of subsequences from a sequence,
we propose the use of a trie data structure [32]. This
structure was also used in [33] to classify different
sequences and in [34], [35] to classify the behavior patterns
of a RoboCup soccer simulation team.
The construction of a user profile from a single sequence
of commands is done by a three step process
Storage of the Subsequences in a trie
The subsequences of commands are stored in a trie data
structure. When a new model needs to be constructed, we
create an empty trie, and insert each subsequence of events
into it, such that all possible subsequences are accessible
and explicitly represented. Every trie node represents an
event appearing at the end of a subsequence, and the nodes
children represent the events that have appeared following
this event. Also, each node keeps track of the number of
times a command has been recorded into it. When a new
subsequence is inserted into a trie, the existing nodes are
modified and/or new nodes are created. As the dependencies
of the commands are relevant in the user profile, the
subsequence suffixes (subsequences that extend to the end
of the given sequence) are also inserted.
Considering the previous example, the first subsequence
(fls ! date ! lsg) is added as the first branch of the empty
trie (Fig. 1a). Each node is labeled with the number 1 which
indicates that the command has been inserted in the node
once (in Fig. 1, this number is enclosed in square brackets).
Then, the suffixes of the subsequence (fdate ! lsg and {ls})
are also inserted (Fig. 1b). Finally, after inserting the three
subsequences and its corresponding suffixes, the completed
trie is obtained
Creation of the User Profile
Once the trie is created, the subsequences that characterize
the user profile and its relevance are calculated by
traversing the trie. For this purpose, frequency-based
methods are used. In particular, in EVABCD, to evaluate
the relevance of a subsequence, its relative frequency or
support [36] is calculated. In this case, the support of a
subsequence is defined as the ratio of the number of times
the subsequence has been inserted into the trie a
EVOLVING UNIX USER CLASSIFIER
A classifier is a mapping from the feature space to the class
label space. In the proposed classifier, the feature space is
defined by distributions of subsequences of events. On the
other hand, the class label space is represented by the most
representative distributions. Thus, a distribution in the class
label space represents a specific behavior which is one of
the prototypes of the EPLib. The prototypes are not fixed
and evolve taking into account the new information
collected online from the data stream—this is what makes
the classifier Evolving. The number of these prototypes is
not prefixed but it depends on the homogeneity of the
observed behaviors. The following sections describes how a
user behavior is represented by the proposed classifier, and
how this classifier is created in an evolving manner
User Behavior Representation
EVABCD receives observations in real time from the
environment to analyze. In our case, these observations
are UNIX commands and they are converted into the
corresponding distribution of subsequences online. In order
to classify a UNIX user behavior, these distributions must
be represented in a data space. For this reason, each
distribution is considered as a data vector that defines a
point that can be represented in the data space.
The data space in which we can represent these points
should consist of n dimensions, where n is the number of
the different subsequences that could be observed. It means
that we should know all the different subsequences of the
environment a priori. However, this value is unknown and
the creation of this data space from the beginning is not
efficient. For this reason, in EVABCD, the dimension of the
data space also evolves, it is incrementally growing
Calculating the Potential of a Data Sample
As in [12], a prototype is a data sample (a behavior
represented by a distribution of subsequences of commands)
that represents several samples which represent a
certain class.
The classifier is initialized with the first data sample,
which is stored in EPLib. Then, each data sample is
classified to one of the prototypes (classes) defined in the
classifier. Finally, based on the potential of the new data
sample to become a prototype [37], it could form a new
prototype or replace an existing one.
The potential (P) of the kth data sample ðxkÞ is calculated
by (1) which represents a function of the accumulated
distance between a sample and all the other k 1 samples
in the data space [12]. The result of this function represents
the density of the data that surrounds a certain data sample
EXPERIMENTAL SETUP AND RESULTS
In order to evaluate EVABCD in the UNIX environment, we
use a data set with the UNIX commands typed by 168 real
users and labeled in four different groups. Therefore, in
these experiments, we use supervised learning. The explained
process is applied for each of the four group
(classes) and one or more prototypes will be created for
each group. EVABCD is applied in these experiments
considering the data set as pseudo-online streams. However,
only using data sets in an offline mode, the proposed
classifier can be compared with other incremental and
nonincremental classifiers.
Data Set
In these experiments, we use the command-line data
collected by Greenberg [41] using UNIX csh command
interpreter. This data are identified in four target groups
which represent a total of 168 male and female users with a
wide cross section of computer experience and needs.
Salient features of each group are described below
. Novice programmers. The users of this group had little
or no previous exposure to programming, operating
systems, or UNIX-like command-based interfaces.
These users spent most of their time learning how to
program and use the basic system facilities
Comparison with Other Classifiers
In order to evaluate the performance of EVABCD, it is
compared with several different types of classifiers. This
experiment focuses on using the well-established batch
classifiers described as follows:
. The C5.0 algorithm [42] is a commercial version of
the C4.5 [43] decision-tree-based classifier.
. The Naive Bayes classifier (NB) is a simple
probabilistic classifier based on applying Bayes
theorem with strong (naive) independence assumptions
[44]. In this case, the numeric estimator
precision values are chosen based on analysis of
the training data. For this reason, the classifier is noincremental
[45].
. The K Nearest Neighbor classifier (kNN) is an
instance-based learning technique for classifying
objects based on closest training examples in the
feature space [46]. The parameter k affects the
performance of the kNN classifier significantly. In
these experiments, the value with which better
results are obtained is k ¼ 1; thus, a sample is
assigned to the class of its nearest neighbor.
However, using this value, the corresponding
classifier may not be robust enough to the noise in
the data sample.
. The AdaBoost (adaptive boosting) classifier [47]
combines multiple weak learners in order to form
an efficient and strong classifier. Each weak classifier
Experimental Design
In order to measure the performance of the proposed
classifier using the above data set, the well-established
technique of 10-fold cross validation is chosen. Thus, all
the users (training set) are divided into 10 disjoint subsets
with equal size. Each of the 10 subsets is left out in turn
for evaluation. It should be emphasized that EVABCD
does not need to work in this mode; this is done mainly
in order to have comparable results with the established
offline techniques.
The number of UNIX commands typed by a user and
used for creating his/her profile is very relevant in the
classification process. When EVABCD is carried out in the
field, the behavior of a user is classified (and the evolving
behavior library updated) after she/he types a limited
number of commands. In order to show the relevance of
this value using the data set already described, we consider
different number of UNIX commands for creating the
classifier: 100, 500, and 1.000 commands per user.
CONCLUSIONS
In this paper, we propose a generic approach, EVABCD, to
model and classify user behaviors from a sequence of
events. The underlying assumption in this approach is that
the data collected from the corresponding environment can
be transformed into a sequence of events. This sequence is
segmented and stored in a trie and the relevant subsequences
are evaluated by using a frequency-based
method. Then, a distribution of relevant subsequences is
10
created. However, as a user behavior is not fixed but rather
it changes and evolves, the proposed classifier is able to
keep up to date the created profiles using an Evolving
Systems approach. EVABCD is one pass, noniterative,
recursive, and it has the potential to be used in an
interactive mode; therefore, it is computationally very
efficient and fast. In addition, its structure is simple and
interpretable.
The proposed evolving classifier is evaluated in an
environment in which each user behavior is represented as
a sequence of UNIX commands. Although EVABCD has
been developed to be used online, the experiments have
been performed using a batch data set in order to compare
the performance to established (incremental and nonincremental)
classifiers. The test results with a data set of
168 real UNIX users demonstrates that, using an appropriate
subsequence length, EVABCD can perform almost
as well as other well-established offline classifiers in terms
of correct classification on validation data. However,
taking into account that EVABCD is able to adapt
extremely quickly to new data, and that this classifier
can cope with huge amounts of data in a real environment
which changes rapidly, the proposed approach is the most
suitable alternative.