12-06-2014, 04:20 PM
Clustering and Sequential Pattern Mining of
Online Collaborative Learning Data
INTRODUCTION
Group work is commonplace in many aspects of life,
particularly in the workplace where there are many
situations which require small groups of people to work
together to achieve a goal. For example, a task that requires
a complex combination of skills may only be possible
if a group of people, each offering different skills,
can work together. To take just one other example, it may
be necessary to draw on the combined efforts of a group
to achieve a task in the time available. However, it is often
difficult to make a group operate effectively, with high
productivity and satisfaction within the group about its
operation. Reflecting the importance of group work, there
has been a huge body of research on how to make groups
more effective and how to help group members build
relevant skills. In one meta-analysis of this body of work,
a set of five key factors and three enablers has been identified
[1]. For example, this work points both to the importance
of leadership as one of the five key factors and
to the effectiveness of training in leadership.
The importance of group work skills is reflected in
education systems, where students are given opportunities
to develop these valuable skills. Often, and increasingly,
such groups are supported by software tools. This
may be in the context of distance learning, where the
groups are distributed and the members must use software
to support their collaboration. In addition, even
when student groups work in the same classroom or
campus, they may be supported by a range of online
tools, such as chat, message boards and wikis. For small
groups that need to collaborate on substantial tasks over
several weeks, such tools can amass huge amounts of information
and generate large electronic traces of their
activity. This has the potential to reveal a great deal about
the group activity and the effectiveness of the group.
Our goal is to improve the teaching of group work
skills and the facilitation of effective teamwork by small
groups working on substantial projects over several
weeks, by exploiting the electronic traces of group activity.
Our approach is to analyse these traces to create mirroring
tools that enable the group members, their teachers
or facilitators to see useful indicators of the health and
progress of their group. We consider it important that our
work should be in the context of standard, state-of-the-art
tools for supporting groups. This means that we should
be able to exploit the data from a range of tools and media
that are valuable for small group management. These
include wikis, issues tracking systems and version control
software. The key contribution of our work is an improved
understanding of how to use data mining to build
mirroring tools that can help small long-term teams improve
their group work skills.
The emerging research community of Educational
Data Mining [2] exploits data from learners' interaction
with e-learning tools, particularly web-based learning
environments. The recognition of the huge potential value
of such data has led to a series of ten workshops and a
new conference [3]. There have been recent promising
results using a range of techniques [4-7]. There is good
reason for this new research area, primarily because it
needs to deal with issues that differ from those that had
previously had most attention in the wider data mining
and machine learning research. For example, educational
data presents several difficulties for the data mining algorithms
as it is temporal, noisy, correlated, incomplete and
may lack enough samples for some tasks. In addition,
there is a need for understandable and scrutable presentations
of the data mining results appropriate for
non-data-mining-savvy users. This area is establishing the new
requirements for effective mining and analysis of learning
data. This paper continues this exploration of foundations
for this area, taking account of the particular demands of
one important class of educational context.
CSCL is an established and active research area. However,
much of the focus of that community is based upon
the value of collaboration for improved learning across
many disciplines. This is rather different from our focus.
So, for example, the CSCL community has done considerable
work on the use of discussion boards. This is relevant
to our work in that it does explore ways to improve participation
rates as in the work of Cheng and Vassileva [8].
They created an adaptive rewards system, based on
group and individual models of learners. This had elements
of mirroring but significantly differs from our goal
of supporting small groups for whom learning group
work skills is one of the learning objectives and the group
work is the key focus.
Some research has brought together CSCL and data
mining. Notably, Talavera and Gaudioso [9] applied clustering
to student interaction data to build profiles of student
behaviours. The context of the study was a course
teaching the use of the Internet and the data was collected
using a learning management system from three main
sources: forums, email and chat. Their goal was to support
evaluation of collaborative activities and although
only preliminary results were presented their work confirmed
the potential of data mining to extract useful patterns
and get insight into collaboration profiles. Soller [10,
11] analysed conversation data where the goal was
knowledge sharing: a student presents and explains new
knowledge to peers; peers attempt to understand it. Hidden
Markov models and multidimensional scaling were
successfully applied to analyse the knowledge sharing
conversations. However, Soller required group members
to use a special interface using sentence starters, based on
Speech Act Theory. The requirement for a special interface,
limited to a single collaboration medium, with user
classified utterances has characterised other work, such as
Barros and Verdejo [12] whose DEGREE system enabled
students to submit text proposals, co-edit and refine
them, until agreement was reached. By contrast, we
wanted to ensure that the learners used collections of conventional
collaboration tools in an authentic manner, as
they are intended to be used to support group work: we
did not want to add interface restrictions or additional
activities for learners as a support for the data mining.
These goals ensure the potential generality of the tools we
want to create. It also means that we can explore use of a
range of collaboration tools, not just a single medium
such as chat.
The notion of mirroring has been discussed in a similar
context to ours [13]. In the current state of research,
effective mirroring is a realistic starting
point. Moreover, it has the potential to overcome some of
the inherent limitations of data mining that does not
make use of a deep model of the group task and the complex
character of each particular group. So, it offers promise
for powerful and useful tools that are more generic,
able to be used by many different groups working on different
tasks. We have already found that mirroring of
simple overall information about a group is valuable [14].
The work on social translucence [15, 16] has also shown
the value of mirroring for helping members of groups to
realise how they are affecting the group and to alter their
behaviour. Our experience with these tools has pointed to
their particular power in the context of long-term small
groups: the mirrored information serves as a valuable starting
point for discussing group work, as part of the
facilitation process, and as an excellent basis
for exploring the information within the collaboration
environment.
The paper is organised as follows. The next section
states our goals of mining group logs, identifies the main
stakeholders and how they can benefit from the extracted
patterns. Section 3 describes in more detail the context of
our study: the learner population, the TRAC online system
and the nature of the data collected. Section 4 presents the
initial data exploration performed and discusses its limitations.
Then the actual data mining is presented, with Section
5 describing the clustering work and Section 6 presenting
the frequent sequential pattern mining. We discuss
the results, problems encountered, and how the discovered
patterns can be used to improve teaching and
learning. Section 7 concludes the paper.
GOALS OF MINING GROUP WORK LOGS
We set our primary goal for the data mining as providing
mirroring tools that would be useful for helping improve
the learning about group work. This goal is realistic in the
context of the highly complex and variable nature of long-term,
small group activity, especially where the learners
undertake a diverse range of tasks, such as creating a
software system for an authentic client. Our mirroring
goal means that we aim to extract patterns and other information
from the group logs and present it together
with desired patterns to the people involved, so that they
can interpret it, making use of their own knowledge of
the group tasks and activities.
To underpin our work, we have used the Big Five theory
of group work [1]. It is based on a broad meta-analysis
of research on small group interaction, drawing
on the large body of literature reporting studies of various
aspects of group work and determinants of success. It
has established five key factors: leadership, mutual performance
monitoring, backup behaviour, adaptability and team orientation.
Backup behaviour involves actions like reallocating
work between members as their different loads and
progress become recognised. Adaptability is a broader
form of changing plans as new information about internal
group and external issues is identified. Team orientation
covers aspects such as commitment to the group as a
whole. It also has identified three supporting mechanisms:
shared mental models, especially shared understanding
of how the group should operate; mutual trust; and
closed loop communication, which means that, regardless of
the medium, a person communicating a message receives
feedback about it and confirms this. This theory provides
a language with which to discuss group work and guides
our data mining.
Given our goal, it is important to distinguish the key
stakeholders because the information relevant to each is
somewhat different. We distinguish four classes of stakeholders:
• individual learner: each has a good knowledge of
their own goals and activities but may be unaware
of what others in their group have been doing and
how well they have been performing as a team
member and what they should be doing to be more
effective in their allocated roles;
• individual group: the group as a whole is aware of
some aspects of their performance but is less aware
of how they could improve their performance and
how well they are doing on the various dimensions
of the Big Five elements;
• group facilitator: this person works with the groups,
meeting them regularly and helping them see how
to improve their performance. This person is more
knowledgeable about group processes and has an
outsider view of the group. However, they need
help in seeing just what the group members have
been doing and how they have been interacting;
• course co-ordinator: this person needs to teach the
group skills and to monitor the progress of all the
groups. They have least knowledge of the details of
the individual groups and are most in need of support
in seeing a big picture overview of the large
amounts of log data to understand what the groups
are doing.
We were able to refine the goals of mirroring into the
following three sub-goals:
• timely problem identification: All stakeholders should
be keen to know about indicators of problems in the
group work, especially if these indicators can be
provided in time for remedial action to have a significant
effect. In particular, if the group facilitator
can see patterns that are suggestive of potential
problems in some key aspects, such as leadership or
effective closed loop communication, they can discuss
these issues with the group and work with
them to find ways to improve the learning about
group work and to ensure the success of the group.
• support for self-monitoring: This is particularly important
for the individual. For example, the leader
should have distinctive behaviours and we would
like to provide high level mined results reflecting
the effectiveness of their interaction, as a leader;
• improved understanding of how effective groups make
use of the online collaboration tools: this is most important
for co-ordinators as it can inform their teaching
and organisation of the learning environment.
We will refer to our identified stakeholders and the
sub-goals of the data mining in the discussion of the data
mining and the value of different results for the different
stakeholders.
CLUSTERING
As shown in the previous section, simple statistical exploration
of the data was quite limited. The results suggested
the need to consider multiple data attributes simultaneously.
Clustering allows us to use multiple attributes to
identify similar groups in an unsupervised fashion. In
addition, it provides the opportunity to mine the data at
the level of individual learners (i.e. to find groups of similar
learners) and then to examine the composition of each
group.
An application of clustering in an educational setting is
presented in [4], where students using an intelligent tutoring
system were clustered according to the types of
mistakes made. The authors suggested that through the
use of clustering, teachers could identify different types of
learners and apply different remedial methods. A similar
goal can be transferred to the current context, with clustering
possibly identifying different styles of groups
which may benefit from different styles of intervention.
However, it must be noted that with a small number of
groups, such analysis could be performed by the teachers
alone, without the aid of clustering results. Therefore, our
primary goal was simply to assess whether our data contained
features which could be translated through clustering
into meaningful information about groups and individual
learners.
Limitations of Clustering
The main limitation was the small data sample, especially
in the first task, clustering of groups. Although the data
contained more than 15000 events, we had only 7 groups
and 43 students. Nevertheless, we think that the collected
data and selected attributes allowed for uncovering useful
patterns characterising the work of stronger and
weaker students as discussed above. The follow-up interviews
were very helpful for interpreting and validating
the patterns.
How to select the most appropriate clustering algorithm
and how to set its parameters is another important
issue. There are methods for determining a good number
of clusters and evaluating the clustering quality in terms
of cohesion and separation of the clusters found [20]. We
believe that in this application the expert knowledge of
the course co-ordinators and facilitators is essential to
find a meaningful number of clusters and extract meaningful
characteristics, and then use them on new cohorts. For
larger datasets, hierarchical clustering may not be applicable
due to its high time and memory requirements; k-means
may still be a good choice, especially some of its
modifications, such as bisecting k-means [20], which is
less sensitive to initialization and is also more efficient.
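As a rough illustration of the bisecting k-means idea mentioned above, the following sketch starts with all points in one cluster and repeatedly splits one cluster in two with basic 2-means until k clusters exist. It is not the implementation used in this work: choosing the cluster with the largest sum of squared errors (SSE, a cohesion measure) to split, the number of trial splits, the fixed random seed and all function names are assumptions made for the example.

```python
import random
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]

def _dist2(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def _centroid(points: List[Point]) -> Point:
    """Component-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(col) / n for col in zip(*points))

def sse(points: List[Point]) -> float:
    """Sum of squared distances to the centroid (lower means more cohesive)."""
    c = _centroid(points)
    return sum(_dist2(p, c) for p in points)

def two_means(points: List[Point], rng: random.Random, iters: int = 20):
    """Split `points` into two clusters with basic 2-means."""
    centres = rng.sample(points, 2)  # two distinct positions as initial centres
    halves: Tuple[List[Point], List[Point]] = ([], [])
    for _ in range(iters):
        halves = ([], [])
        for p in points:
            nearest = 0 if _dist2(p, centres[0]) <= _dist2(p, centres[1]) else 1
            halves[nearest].append(p)
        if not halves[0] or not halves[1]:
            break  # degenerate split (e.g. duplicate points)
        new_centres = [_centroid(h) for h in halves]
        if new_centres == centres:
            break  # converged
        centres = new_centres
    return halves

def bisecting_kmeans(points: List[Point], k: int, trials: int = 5, seed: int = 0):
    """Return up to k clusters obtained by repeated bisection."""
    rng = random.Random(seed)
    clusters = [list(points)]
    while len(clusters) < k:
        candidates = [i for i, c in enumerate(clusters) if len(c) > 1]
        if not candidates:
            break  # nothing left that can be split
        worst = max(candidates, key=lambda i: sse(clusters[i]))
        # Try several random bisections and keep the one with the lowest total SSE.
        splits = [two_means(clusters[worst], rng) for _ in range(trials)]
        splits = [s for s in splits if s[0] and s[1]]
        if not splits:
            break
        best = min(splits, key=lambda s: sse(s[0]) + sse(s[1]))
        clusters.pop(worst)
        clusters.extend(best)
    return clusters

# Toy data: three well-separated blobs of made-up attribute vectors.
pts = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1),
       (5.0, 5.0), (5.1, 5.2), (5.2, 4.9),
       (10.0, 0.0), (10.1, 0.2), (9.9, 0.1)]
clusters = bisecting_kmeans(pts, k=3)
for c in clusters:
    print(len(c), round(sse(c), 3))
```

Because each step only runs 2-means on a single cluster and keeps the best of several trial splits, the result depends less on one unlucky initialisation than plain k-means does, which is the property the text refers to.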
6 SEQUENTIAL PATTERN MINING
An important aspect of our data which is ignored by mining
techniques such as clustering is the timing of events.
We believe that certain sequences of events distinguish the
better groups from the weaker ones. In particular, we
expected that we should be able to use these to gain indications
of closed loop communication, one of the enablers
in the Big Five Theory. Such a sequence may represent a
characteristic team interaction on a specific resource, or
group members displaying specific work patterns across
the three aspects of TRAC. A data mining technique
which considers this temporal aspect is sequential pattern
mining [25]. It finds sequential patterns that occur in a
dataset with at least a minimal level of frequency called
support [26]. Sequential pattern mining has been previously
used in e-learning, although for different goals than
ours: to support personalised course delivery based on
the learner characteristics [7] and to recommend sequences
of resources for users to view in order to learn
about a given topic [27]. We first present the algorithm
we used and then the data pre-processing we applied.
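To make the notion of support concrete, the sketch below computes the fraction of sequences in a dataset that contain a given pattern as an ordered, not necessarily contiguous, subsequence. This follows the common textbook definition and only illustrates the measure; it is not the mining algorithm used in this work, and the function names and toy event codes are assumptions.

```python
from typing import List, Sequence

def contains_subsequence(sequence: Sequence[str], pattern: Sequence[str]) -> bool:
    """True if the items of `pattern` occur in `sequence` in order (gaps allowed)."""
    it = iter(sequence)
    # `item in it` consumes the iterator up to the first match, so order is preserved.
    return all(item in it for item in pattern)

def support(dataset: List[Sequence[str]], pattern: Sequence[str]) -> float:
    """Fraction of sequences in `dataset` that contain `pattern`."""
    if not dataset:
        return 0.0
    hits = sum(contains_subsequence(seq, pattern) for seq in dataset)
    return hits / len(dataset)

# Toy dataset: each sequence is a list of event types (T = ticket, S = SVN, W = wiki).
toy_dataset = [
    ["T", "W", "S", "T"],
    ["W", "W", "T"],
    ["T", "S", "S"],
    ["S", "W"],
]
print(support(toy_dataset, ["T", "S"]))  # 2 of the 4 sequences contain T followed by S -> 0.5
```

A pattern would be reported as frequent when this value reaches the chosen minimum support threshold.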
Abstraction of raw data
The raw data for each group is first transformed into a list
of events, which are defined as:
Event = {EventType, Resource, Author, Time}, where
EventType is either T (for Ticket), S (for SVN) or W (for
wiki), Resource is the identifier of the ticket number,
source code file or wiki page, Author is the name of the
user who performed the action and Time is the absolute
time when the event occurred.
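The event definition above could be represented as a small record type, as in the sketch below. The field names follow the definition, but the class itself, the validation, the input row format (tuples with ISO 8601 timestamps) and the sample values are assumptions for illustration only.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Iterable, List, Tuple

# Event type codes as defined in the text.
EVENT_TYPES = {"T": "Ticket", "S": "SVN", "W": "Wiki"}

@dataclass(frozen=True)
class Event:
    event_type: str   # "T", "S" or "W"
    resource: str     # ticket number, source code file or wiki page
    author: str       # user who performed the action
    time: datetime    # absolute time of the event

    def __post_init__(self) -> None:
        if self.event_type not in EVENT_TYPES:
            raise ValueError(f"unknown event type: {self.event_type!r}")

def to_event_list(rows: Iterable[Tuple[str, str, str, str]]) -> List[Event]:
    """Convert raw (type, resource, author, ISO timestamp) rows to a time-ordered event list."""
    events = [Event(t, r, a, datetime.fromisoformat(ts)) for t, r, a, ts in rows]
    return sorted(events, key=lambda e: e.time)

# Hypothetical raw rows; real TRAC exports will look different.
rows = [
    ("W", "Design", "alice", "2006-03-02T10:15:00"),
    ("T", "12", "bob", "2006-03-01T09:00:00"),
]
events = to_event_list(rows)
for e in events:
    print(e.event_type, e.resource, e.author, e.time.isoformat())
```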
Generation of a Dataset of Sequences
The original sequence obtained for each group was from
1416 to 3395 events long. We then needed to break down
this long sequence into several meaningful sequences to
form a dataset of sequences of events. We considered the
following three ways.
• A sequence per resource, where a separate sequence is
obtained for the events on each ticket, wiki page,
and SVN file. Therefore, the number of sequences in
the dataset (for a group) will be equal to the number
of resources used.
• A sequence per group session, where sessions are
formed by cutting up the group’s event list where
gaps (of no activity) of a minimum length of time
occur (we used 7 hours). A related sequence formation
method is a sequence per author session: the event
list for each group member is extracted and then
sessions are formed as above.
• A sequence per task, where the task is defined by a
ticket. The task sequence includes: 1) all ticket
events on that ticket, 2) all SVN and wiki events referring
to the ticket and occurring between the
ticket open and close dates, and 3) all events on
SVN and wiki pages referred to by the ticket and
occurring between the ticket open and close dates.
Therefore, the number of sequences in the dataset
(for a group) will be equal to the number of tickets
for the group.
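The first two sequence formation methods can be sketched as follows. This is a minimal illustration, not the pre-processing code used in the study: the `Event` tuple, the function names and the sample events are assumptions, and resources are keyed by (event type, identifier) so that a ticket and a wiki page with the same identifier do not collide. The per-task method is omitted because it needs ticket open and close dates and cross-references that are not part of the event tuple.

```python
from collections import defaultdict
from datetime import datetime, timedelta
from typing import Dict, List, NamedTuple, Tuple

class Event(NamedTuple):
    event_type: str   # "T", "S" or "W"
    resource: str
    author: str
    time: datetime

def sequences_per_resource(events: List[Event]) -> Dict[Tuple[str, str], List[Event]]:
    """One time-ordered sequence per ticket, wiki page or SVN file."""
    by_resource: Dict[Tuple[str, str], List[Event]] = defaultdict(list)
    for e in sorted(events, key=lambda e: e.time):
        by_resource[(e.event_type, e.resource)].append(e)
    return dict(by_resource)

def sequences_per_session(events: List[Event],
                          min_gap: timedelta = timedelta(hours=7)) -> List[List[Event]]:
    """Cut the event list wherever the gap between consecutive events is at least `min_gap`."""
    sessions: List[List[Event]] = []
    for e in sorted(events, key=lambda e: e.time):
        if sessions and e.time - sessions[-1][-1].time < min_gap:
            sessions[-1].append(e)      # still within the current session
        else:
            sessions.append([e])        # gap long enough: start a new session
    return sessions

def sequences_per_author_session(events: List[Event],
                                 min_gap: timedelta = timedelta(hours=7)):
    """Apply the session cut separately to each group member's events."""
    by_author: Dict[str, List[Event]] = defaultdict(list)
    for e in events:
        by_author[e.author].append(e)
    return {a: sequences_per_session(evs, min_gap) for a, evs in by_author.items()}

# Made-up events for one group.
t0 = datetime(2006, 3, 1, 9, 0)
sample = [
    Event("T", "12", "alice", t0),
    Event("W", "Design", "bob", t0 + timedelta(hours=1)),
    Event("T", "12", "bob", t0 + timedelta(hours=2)),
    Event("S", "main.py", "alice", t0 + timedelta(hours=10)),  # 8-hour gap: new group session
]
per_resource = sequences_per_resource(sample)
group_sessions = sequences_per_session(sample)
author_sessions = sequences_per_author_session(sample)
print(len(per_resource), [len(s) for s in group_sessions])
```

With the 7-hour threshold, the last event opens a second group session, and alice's two events fall into two separate author sessions because they are 10 hours apart.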
Kalina Yacef is a Senior Lecturer in the Computer
Human Adaptive Interaction (CHAI)
research lab, at the University of Sydney, Australia.
She received her PhD in Computer
Science from University of Paris V in 1999.
Her research interests span the fields
of Artificial Intelligence and Data Mining, Personalisation
and Computer Human Adapted
Interaction with a strong focus on Education
applications. Her work focuses on mining
users’ data for building smart, personalised
solutions as well as on the creation of novel and adaptive interfaces
for supporting users’ tasks. She regularly serves on the program
committees of international conferences in the fields of Artificial Intelligence
in Education and Educational Data Mining and she is the
editor of the new Journal on Educational Data Mining.
Osmar R. Zaïane received an MSc in electronics
from the University of Paris XI, France, in
1989 and an MSc in Computing Science from
Laval University, Canada, in 1992. He received
his PhD in Computing Science from Simon Fraser
University in 1999 specializing in data mining.
He is an Associate Professor at the University
of Alberta with research interest in novel
data mining techniques and currently focuses on
e-learning as well as Health Informatics applications.
He regularly serves on the program committees
of international conferences in the field of knowledge discovery
and data mining and was the program co-chair for the IEEE international
conference on data mining ICDM’2007. He is the
editor-in-chief of ACM SIGKDD Explorations and Associate Editor of
Knowledge and Information Systems.