21-06-2012, 04:46 PM
A Comprehensive Survey of Data Mining-based Fraud Detection Research
INTRODUCTION & MOTIVATION
Data mining is about finding insights that are statistically
reliable, previously unknown, and actionable from data (Elkan,
2001). This data must be available, relevant, adequate, and clean.
In addition, the data mining problem must be well defined, not
solvable by query and reporting tools alone, and guided by a data
mining process model (Lavrac et al., 2004).
The term fraud here refers to the abuse of a profit organisation's
system without necessarily leading to direct legal consequences.
In a competitive environment, fraud can become a business-critical
problem if it is prevalent and the prevention procedures are not
fail-safe. Fraud detection, as part of overall fraud control,
automates and helps reduce the manual parts of a screening/checking
process, and has become one of the most established industry and
government applications of data mining.
It is impossible to be absolutely certain about the legitimacy of,
and intention behind, an application or transaction. Given this
reality, the most cost-effective option is to tease out possible
evidence of fraud from the available data using mathematical
algorithms.
Performance Measures
Most fraud departments place a monetary value on predictions to
maximise cost savings/profit in line with their policies. They can
define either explicit cost models (Phua et al., 2004; Chan et al.,
1999; Fawcett and Provost, 1997) or benefit models (Fan et al.,
2004; Wang et al., 2003).
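An explicit cost model typically weighs the expected savings from catching a fraud against the fixed cost of investigating an alert. The sketch below is an illustrative assumption, not a model taken from any of the cited papers; all function names and numbers are invented for illustration.

```python
# Hypothetical explicit cost model: flag a transaction for review only when
# the expected savings from catching fraud outweigh the investigation cost.
# (Illustrative sketch; names and default values are assumptions.)

def expected_savings(p_fraud: float, amount: float, investigation_cost: float) -> float:
    """Expected monetary benefit of investigating one transaction."""
    return p_fraud * amount - investigation_cost

def should_investigate(p_fraud: float, amount: float,
                       investigation_cost: float = 10.0) -> bool:
    """Investigate only when the expected benefit is positive."""
    return expected_savings(p_fraud, amount, investigation_cost) > 0.0
```

Under such a model, a low-probability alert on a small transaction is deliberately ignored, because investigating it would cost more than it could save.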
Cahill et al. (2002) suggest scoring an instance (a phone call) by
its similarity to known fraud examples (fraud styles) divided by
its dissimilarity to known legal examples (the legitimate
telecommunications account).
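One concrete reading of such a score is a likelihood-ratio-style measure: similarity to known fraud examples divided by similarity to known legitimate examples, so that divergence from legal behaviour raises the score. The Gaussian kernel and function names below are illustrative assumptions, not the actual formulation of Cahill et al.

```python
import math

def similarity(x, examples):
    """Mean Gaussian-kernel similarity of feature vector x to a set of examples.
    (Illustrative kernel choice; not from the survey.)"""
    def kernel(a, b):
        return math.exp(-sum((ai - bi) ** 2 for ai, bi in zip(a, b)))
    return sum(kernel(x, e) for e in examples) / len(examples)

def fraud_score(x, fraud_examples, legit_examples, eps=1e-9):
    """Higher when x resembles known fraud and diverges from legitimate behaviour."""
    return similarity(x, fraud_examples) / (similarity(x, legit_examples) + eps)
```

A call whose features sit close to known fraud styles and far from the account's usual behaviour receives a high score and can be ranked for review.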
Hybrid Approaches with Labelled Data
Popular supervised algorithms such as neural networks, Bayesian
networks, and decision trees have been combined or applied in a
sequential fashion to improve results. Chan et al. (1999) use
naive Bayes, C4.5, CART, and RIPPER as base classifiers and
stacking to combine them. They also examine bridging incompatible
data sets from different companies and pruning base classifiers.
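Stacking of the kind described can be sketched with scikit-learn. The original base learners do not all have stock equivalents, so this sketch substitutes GaussianNB for naive Bayes and DecisionTreeClassifier for C4.5/CART (RIPPER has no standard scikit-learn implementation); the synthetic data and parameters are assumptions for illustration.

```python
# Hedged sketch of stacking in the spirit of Chan et al. (1999), using
# scikit-learn stand-ins for the original base learners.
from sklearn.datasets import make_classification
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

# Synthetic, class-imbalanced data standing in for fraud/non-fraud labels.
X, y = make_classification(n_samples=500, weights=[0.9, 0.1], random_state=0)

stack = StackingClassifier(
    estimators=[("nb", GaussianNB()),
                ("tree", DecisionTreeClassifier(max_depth=5, random_state=0))],
    final_estimator=LogisticRegression(),  # meta-learner combines base predictions
    cv=5,  # base-level predictions for the meta-learner come from cross-validation
)
stack.fit(X, y)
```

After fitting, `stack.predict_proba(X)` yields class probabilities that could feed a cost model of the kind discussed under Performance Measures.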
Semi-supervised Approaches with Only Legal (Non-fraud) Data
Kim et al. (2003) implement a novel fraud detection method in
five steps: first, generate rules randomly using the Apriori
association-rules algorithm, increasing diversity with a calendar
schema; second, apply the rules to a database of known legitimate
transactions and discard any rule that matches this data; third,
use the remaining rules to monitor the actual system and discard
any rule that detects no anomalies; fourth, replicate any rule
that detects anomalies, adding tiny random mutations; and fifth,
retain the successful rules. This system has been, and is
currently being, tested for internal fraud by employees within
the retail transaction processing system.
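The filtering steps of this loop can be sketched as follows. Rules are modeled here as simple attribute-value patterns; the rule format, data, and helper names are invented for illustration and are not from Kim et al.

```python
# Illustrative sketch of steps 2-3 of the rule-filtering loop above.
# (Rule representation and data are assumptions, not from the paper.)

def matches(rule, txn):
    """A rule fires when every attribute-value pair appears in the transaction."""
    return all(txn.get(attr) == val for attr, val in rule.items())

def filter_against_legit(rules, legit_txns):
    """Step 2: discard any rule that matches known legitimate transactions."""
    return [r for r in rules if not any(matches(r, t) for t in legit_txns)]

def monitor(rules, live_txns):
    """Step 3: keep only rules that detect at least one anomaly in live data."""
    return [r for r in rules if any(matches(r, t) for t in live_txns)]
```

Steps 4 and 5 would then replicate the surviving rules with small random mutations and retain those that keep firing, evolving the rule population over time.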
Critique of Methods and Techniques
In most real-world fraud detection scenarios, the choice of data
mining techniques depends more on practical issues (operational
requirements, resource constraints, and management commitment to
fraud reduction) than on the technical issues posed by the data.
Other novel commercial fraud detection techniques include
graph-theoretic anomaly detection and Inductive Logic
Programming. There has not been any empirical evaluation of
commercial data mining tools for fraud detection since Abbott
et al. (1998).
Only seven studies report methods that have been (or were)
implemented as actual fraud detection systems: in insurance
(Major and Riedinger, 2002; Cox, 1995), credit card (Dorronsoro
et al., 1997; Ghosh and Reilly, 1994), and telecommunications
(Cortes et al., 2003; Cahill et al., 2002; Cox, 1997). Few fraud
detection studies explicitly utilise temporal information, and
virtually none use spatial information.