Plagiarism and Collusion Detection using the Smith-Waterman Algorithm
Abstract
We investigate the use of variants of the Smith-Waterman algorithm
to locate similarities in texts and in program source code, with a view
to their application in the detection of plagiarism and collusion. The
Smith-Waterman algorithm is a classical tool in the identification and
quantification of local similarities in biological sequences, but we
demonstrate that somewhat different issues arise in this different context,
and that these factors can be exploited to yield significant speed-up in
practice. We include empirical evidence to indicate the practicality of the
approach and to illustrate the efficiency gains.
Introduction and background
Plagiarism and collusion
Plagiarism and collusion in students' assessed work are issues of increasing
concern to the academic community as a whole. By plagiarism we mean the
submission of part or all of another person's work as if it were one's own,
without the knowledge of the author, and with intention to deceive. Collusion,
on the other hand, is the submission of work as one's own when (at least some
of) that work has been done partly or wholly by another person, and that other
person is party to the deception.
A variety of reasons have been proposed for the apparent increase in these
forms of cheating, particularly the availability of electronic communication and
internet access. But it is not our purpose here to pursue the reasons and the
implications, but rather to consider the issue of detection, and in particular, to describe one approach, seemingly not previously discussed in the literature,
that has been successfully employed to identify likely cases of collusion.
Collusion detection
The method that we propose can be applied to any form of textual material, such
as essays, reports, etc. Unlike many existing techniques for collusion detection,
it does not depend on statistical properties, such as counts of particular words,
but rather on structural similarities between (parts of) texts. A special case
involves collusion in computer programming assignments, which has been an
area of major concern to computer science educators over many years. The only
difference in the approach to this special case is the way in which the source
material is parsed. Ordinary textual material will be parsed as a sequence of
words, where the term `word' is given an appropriate precise meaning, whereas
a computer program will be parsed as a sequence of lexical entities of the
particular programming language. In either case, a student's submission is
ultimately represented as a sequence of symbols over some finite alphabet; in
the case of plain text the alphabet size is effectively the number of words
available in the language, whereas in the case of computer programs it is
(more or less) the number of different lexical entities.
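As an illustration of this parsing step (the precise definition of a word
below, alphanumeric runs with case folded, is our assumption, not necessarily
the paper's), plain text can be reduced to a symbol sequence as follows:

```python
import re

def tokenize(text):
    """Reduce plain text to a sequence of word symbols.
    The word definition (lower-cased alphanumeric runs) is an
    illustrative assumption, not the paper's precise one."""
    return re.findall(r"[a-z0-9]+", text.lower())

# Two submissions become symbol sequences over a finite alphabet:
x = tokenize("The cat sat on the mat.")
y = tokenize("A cat sat on a mat!")
# x == ['the', 'cat', 'sat', 'on', 'the', 'mat']
```

For program source, the same role would be played by the lexer of the relevant
programming language, emitting token kinds instead of words.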
Motivation and outline of the method
The Smith-Waterman algorithm [8] is a classical method of comparing two
strings with a view to identifying highly similar sections within them. It is
widely used in finding good near-matches, or so-called local alignments, within
biological sequences [3].
The basis of the method is a dynamic programming scheme, which we now
describe. Throughout, we denote the lengths of the given strings X and Y by
m and n respectively. If we imagine a portion X' of string X aligned with a
portion Y' of string Y, we wish to allocate a score that, in some sense,
represents the `goodness of fit' between X' and Y'. Each matching symbol
should make a positive contribution to that score, and each symbol that has
to be inserted, deleted or substituted to transform X' to Y' should make a
negative contribution.
Let h be the (positive) contribution made by a symbol `hit', d the (negative)
contribution made by a symbol insertion or deletion (an `indel'), and r the
(negative) contribution made by replacing one symbol by another. A more
general model is typically used in computational biology. Rather than a fixed
positive score for a hit and a fixed negative score for a replacement, a
scoring matrix is used, giving appropriate scores for all possible hits and
replacements. In addition, a more complex model for `gaps' is often used, the
so-called affine gap model, which imposes a constant cost for opening a gap in
the alignment and a different constant cost for extending that gap. Our
methodology could be extended to that more general model, but in our context
it is not clear how we would construct a scoring matrix, nor what appropriate
gap costs would be, so we describe only the simpler model.
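The simple model corresponds to the standard Smith-Waterman recurrence, which
can be sketched as follows; the parameter values shown are placeholders, not
values the paper prescribes:

```python
def smith_waterman(X, Y, h=2, d=-1, r=-1):
    """Best local-alignment score between sequences X and Y under the
    simple model: hit score h (positive), indel penalty d and
    replacement penalty r (both negative). H[i][j] is the best score
    of an alignment ending at X[i-1], Y[j-1]; entries are floored at
    zero so that poor prefixes are discarded."""
    m, n = len(X), len(Y)
    H = [[0] * (n + 1) for _ in range(m + 1)]
    best = 0
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = h if X[i - 1] == Y[j - 1] else r
            H[i][j] = max(0,
                          H[i - 1][j - 1] + sub,  # hit or replacement
                          H[i - 1][j] + d,        # deletion (indel)
                          H[i][j - 1] + d)        # insertion (indel)
            best = max(best, H[i][j])
    return best

# Identical sequences score h per symbol:
smith_waterman("abc", "abc")  # 6 with h=2
```

The full algorithm would also record where the best-scoring cell lies and
trace back through H to recover the aligned portions themselves; only the
score computation is sketched here.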
Application of the algorithm
The question arises as to how we can most effectively and efficiently use our
variant of the Smith-Waterman algorithm to obtain and present an appropriate
set of matches between two source documents.
Firstly, we have found that the most useful form of output from the algorithm
consists of annotated versions of the original two documents, with matching
sections suitably highlighted (see Figure 10 later in the paper for an example
of such a display). Hence we would not want any of the symbols that contribute
to one match to contribute to another; in other words, we would want to
discard overlapping matches, since any overlap between matches would cause
confusion in such a display. In reality, significant overlaps between matches
are unlikely, since their existence would imply a degree of repetition within
one or both of the original documents, but we certainly cannot discount the
possibility. Very small overlaps may well arise commonly, and our methodology
must be able to deal with this.
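One simple way to enforce non-overlapping matches, sketched here as an
illustrative assumption rather than the paper's exact scheme, is to take
candidate matches in decreasing score order and keep each one only if its
spans in both documents are disjoint from every match already kept:

```python
def select_non_overlapping(matches):
    """Greedily keep highest-scoring matches whose spans in both
    documents are disjoint from every match already accepted.
    A match is (score, (x_start, x_end), (y_start, y_end)) with
    half-open [start, end) intervals -- an assumed representation."""
    def overlaps(a, b):
        return a[0] < b[1] and b[0] < a[1]
    kept = []
    for m in sorted(matches, key=lambda m: -m[0]):
        _, xs, ys = m
        if all(not overlaps(xs, k[1]) and not overlaps(ys, k[2])
               for k in kept):
            kept.append(m)
    return kept

matches = [(10, (0, 5), (3, 8)), (7, (4, 9), (20, 25)), (6, (12, 15), (0, 3))]
select_non_overlapping(matches)
# keeps the score-10 and score-6 matches; the score-7 match overlaps in X
```

A variant that trims a few symbols off one match, rather than rejecting it
outright, would handle the very small overlaps mentioned above.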