08-09-2016, 10:05 AM
1453712152-PLAGIARISM.docx.pdf (Size: 371.49 KB / Downloads: 4)
INTRODUCTION
Along with the blossom of open source projects comes
the convenience for software plagiarism. Suppose a company
needs to implement a large software product, which,
if done from the scratch, could be timeconsuming. The
developers, if less selfdisciplined, may be tempted to find
a counterpart in some open source projects, rip off the interface
components, like I/O and Graphical User Interfaces
(GUIs), and finally fit the essential components (i.e., core
parts) into their own project with serious disguises. Because
only core parts are plagiarized, which accounts for a
small portion of the whole project, and heavy disguises are
applied, the plagiarism, which we call corepart plagiarism,
is hard to notice. In this paper, we study how to detect
corepart plagiarism both accurately and efficiently. In the
above example, the open source project from which code is
copied is the original program, and the company’s project is
called a plagiarism suspect.
.BACKGROUND
A program dependence graph (PDG) is a graph
representation
of the source code of a procedure [4]. Basic statements,
like variable declarations, assignments, and procedure
calls,
are represented by program vertices in PDGs. Each vertex
has one and only one type, and several important types are
listed in Table 1, which also illustrates how source code is
decomposed and mapped to program vertices. The data
and
control dependencies between statements are represented
by
edges between program vertices in PDGs.
Definition 1
(Control Dependency Edge). There
is a control dependency edge from a “control” vertex to a
second program vertex if the truth of the condition controls
whether the second vertex will be executed.
Definition 2
(Data Dependency Edge). There is a
data dependency edge from program vertex v1 to v2 if
there
is some variable var such that:
• v1 may be assigned to var, either directly or indirectly
through pointers.
• v2 may use the value in var, either directly or indirectly
through pointers.
Plagiarisms disguises
ORIGINAL PROGRAM
A procedure in a program, called join
01 static void
02 make_blank (struct line *blank, int count)
03 {
04 int i;
05 unsigned char *buffer;
06 struct field *fields;
07 blank>nfields = count;
08 blank>buf.size = blank>buf.length = count + 1;
09 blank>buf.buffer = (char*) xmalloc (blank>buf.size);
10 buffer = (unsigned char *) blank>buf.buffer;
11 blank>fields = fields =
(struct field *) xmalloc (sizeof (struct field) * count);
12 for (i = 0; i < count; i++){
Review of Plagiarism Detection
Stringbased: Each statement is treated as a string, and
a program is represented as a sequence of strings.
Two programs are compared to find sequences of same strings [1].
Because blanks and comments are discarded, this kind of algorithms are robust to format
alteration, but fragile to identifier renaming.
ASTbased [Baxter et al. 1998, Kontogiannis et al. 1995]
A program is represented as an Abstract Syntax Tree (AST)
Fragile to statement reordering, control replacement and code insertion
Tokenbased [Kamiya et al. 2002, Prechelt et al. 2002]
Variables of the same type are mapped to the same token
A program is represented as a token string
Fingerprint of token strings is used for robustness [Schleimer et al. 2003]
Partially robust to statement reordering, control replacement and code insertion
Representatives: Moss and JPlag