
A Survey on Unsupervised Outlier Detection in High-Dimensional Numerical Data



Abstract: High-dimensional data in Euclidean space pose special challenges to data mining algorithms. These challenges are
often indiscriminately subsumed under the term ‘curse of dimensionality’, more concrete aspects being the so-called ‘distance
concentration effect’, the presence of irrelevant attributes concealing relevant information, or simply efficiency issues. In just the last few years, the task of unsupervised outlier detection has found new specialized solutions for tackling high-dimensional
data in Euclidean space. These approaches fall mainly into two categories, namely those that consider subspaces (subsets of attributes) for the definition of outliers and those that do not. The former specifically address the presence of irrelevant attributes;
the latter account for irrelevant attributes implicitly at best, being more concerned with general issues of
efficiency and effectiveness. Nevertheless, both types of specialized outlier detection algorithms tackle challenges specific to
high-dimensional data. In this survey article, we discuss some important aspects of the ‘curse of dimensionality’ in detail and
survey specialized algorithms for outlier detection from both categories. © 2012 Wiley Periodicals, Inc. Statistical Analysis and Data
Mining, 2012



INTRODUCTION
High-dimensional data poses special challenges for data
mining in general and outlier detection in particular.
Though in recent years, several surveys on outlier detection
have been published (see refs. [1–8], to name a few),
the difficulties in high-dimensional data and specialized
approaches in this area have not been sketched in any of
those (though, notably, a recent textbook edition sketches
three example approaches [9]). In fact, most approaches to
this problem have been proposed only in the past
two or three years. Since the development of unsupervised
methods for outlier detection in high-dimensional data in
Euclidean space appears to be an emerging topic, this
survey focuses on it. We hope to help
researchers working in this area to become aware of other approaches, to understand the advancements and the lasting
problems, and to better identify the true achievements
of the various approaches. Though many variants of
outlier detection exist, such as supervised versus
unsupervised or specialized approaches for specific data
types (such as item sets, time series, sequences, categorical
data, see refs. 7,10 for an overview), here we focus on
unsupervised approaches for numerical data in Euclidean
space.
Although the infamous ‘curse of dimensionality’ has been
credited for many problems and has indiscriminately been
used as a motivation for many new approaches, we
should try to understand the problems occurring in high-dimensional
data in more detail. For example, there is
a widespread mistaken belief that every point in high-dimensional
space is an outlier. This statement, misleading to say
the least, has been suggested as a motivation
for the first approach specialized to outlier detection in subspaces of high-dimensional data [11], referring
superficially to a fundamental paper on the ‘curse of
dimensionality’ by Beyer et al. [12]. Alas, as we will
discuss in the following section, it is not as simple
as that.
Indeed, this often-cited but less often well-understood
study [12] has recently been reconsidered by several
researchers independently, in different research areas. We
will, therefore, begin our survey by inspecting truths
and myths associated with the ‘curse of dimensionality’,
study a couple of its effects that may be relevant for
outlier detection, and discuss the findings of the renewed
interest in this more than 10-year-old study (Section 2).
Afterwards, we will discuss different families of outlier
detection approaches concerned with high-dimensional
data: first, approaches that treat the issues of efficiency
and effectiveness in high-dimensional data without specific
interest in a definition of outliers with respect to subspaces
of the data (Section 3), and second, those that search
for outliers specifically in subspaces of the data space
(Section 4). In Section 5, we will comment on some open-source
tools providing implementations of outlier detection
algorithms, and remark on the difficulties of understanding
and evaluating the results of outlier detection algorithms.
Finally, in Section 6, we conclude the paper.



THE CURSE OF DIMENSIONALITY
The ‘curse of dimensionality’ has motivated a lot of
research in the area of databases due to its impact
on similarity search [13–21]. Yet there remain open
questions and unresolved issues, and even the known results
have received less attention in research than they deserve. Here
we discuss some effects and influential characteristics of
high-dimensional data and relate these to issues for outlier
detection in high-dimensional data.
2.1. Concentration of Distances—Concentration of Outlier Scores
Let us recall the basic statements of the fundamental
study of Beyer et al. [12]. Their key result states the
following:
Assumption: The ratio of the variance of the length of any
point vector (denoted by X_d) to the length of the
mean point vector (denoted by E[X_d]) converges
to zero with increasing data dimensionality.
This assumption covers a broad range of data
distributions and distance measures (generally: all
Lp-norms with p ≥ 1).
Consequence: The proportional difference between the
farthest-point distance Dmax and the closest-point distance
Dmin (the relative contrast) vanishes.
Formally:

If lim_{d→∞} var(X_d / E[X_d]) = 0, then (Dmax − Dmin) / Dmin → 0.
Intuitively, under the given assumption the relative contrast
between near and far neighbors does diminish as the
dimensionality increases. This concentration effect of the
distance measure hence reduces the usability of the measure
for discrimination between near and far neighbors [22].
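
The concentration effect is easy to observe empirically. The following minimal Python sketch (not from the surveyed paper; the function name, sample sizes, and dimensionalities are our own illustrative choices) estimates the relative contrast (Dmax − Dmin)/Dmin of Euclidean distances from a random query point to uniformly distributed data of increasing dimensionality:

```python
# Minimal sketch: relative contrast (Dmax - Dmin) / Dmin of Euclidean distances
# from one random query to uniform data, as dimensionality grows.
import numpy as np

def relative_contrast(n_points: int, dim: int, rng: np.random.Generator) -> float:
    """Return (Dmax - Dmin) / Dmin for one random query against uniform data."""
    data = rng.random((n_points, dim))            # uniform sample in [0, 1]^dim
    query = rng.random(dim)                       # independent query point
    dists = np.linalg.norm(data - query, axis=1)  # L2 distances to the query
    return (dists.max() - dists.min()) / dists.min()

rng = np.random.default_rng(0)
for d in (2, 10, 100, 1000):
    contrasts = [relative_contrast(1_000, d, rng) for _ in range(10)]
    print(f"d={d:5d}  mean relative contrast ≈ {np.mean(contrasts):.3f}")
# The printed contrast shrinks toward 0 as d grows, illustrating the effect.
```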
The influence of different values for p in Lp-norms on
the concentration effect has been studied in refs. 23,24.
In ref. 23, the authors showed by means of an analytic
argument that L1 and L2 are the only integer norms useful
for higher dimensions. In addition, they studied the use of
projections for discrimination, the effectiveness of which
depended on localized dissimilarity measures that did not
satisfy the symmetry and triangle inequality conditions of
distance metrics. In ref. 24, fractional Lp distance measures
(with 0 < p < 1) have been studied in a similar context.
The authors provide evidence supporting the contention
that smaller values of p offer better results in higher
dimensional settings. Their result, however, is valid only
for uniformly distributed data [25]. The concentration of
the cosine similarity has been studied in ref. 26.
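
As a rough illustration of these findings (the helper name and parameter choices below are ours, not taken from refs. 23 or 24), the following sketch compares the relative contrast under different Minkowski exponents p, including fractional values p < 1, which no longer satisfy the metric axioms:

```python
# Sketch: relative contrast under different Minkowski exponents p on uniform data.
import numpy as np

def minkowski_to_query(data: np.ndarray, query: np.ndarray, p: float) -> np.ndarray:
    """Minkowski 'distances' with exponent p from every row of data to query."""
    return np.sum(np.abs(data - query) ** p, axis=1) ** (1.0 / p)

rng = np.random.default_rng(1)
dim, n = 100, 2_000
data = rng.random((n, dim))   # uniform sample in [0, 1]^dim
query = rng.random(dim)

for p in (0.3, 0.5, 1.0, 2.0, 4.0):
    d = minkowski_to_query(data, query, p)
    contrast = (d.max() - d.min()) / d.min()
    print(f"p = {p:3.1f}  relative contrast ≈ {contrast:.3f}")
# On uniform data, smaller p tends to yield a larger relative contrast,
# consistent with the observations reported for fractional Lp measures.
```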
This does, however, tell only part of the story. A
change in the value range of the distances is not
unexpected. The maximum distance from the origin in
the unit cube is d^(1/p) for Lp-norms, and the average distance
converges to d^(1/p) · √(1/3) with increasing d. Normalized by the
maximum distance, for example in the unit hypercube, this
comes down to √(1/3) ≈ 0.577. So at first sight, it might seem
feasible to counter these effects by rescaling the distances
appropriately.
The resulting effects can be seen in Fig. 1 (compare to
ref. 25, which shows similar results without normalization).
In this figure, we plot some characteristics of the
distribution of vector length values for vectors uniformly
distributed in the unit cube over increasing dimensionality
d. The observed length of each vector is normalized
by √(1/d). Plotted are the mean and standard deviation of the
observed lengths as well as the actually observed minimal
(maximal) length in each dimensionality. While the
observed distances now have a comparable average value,
there is still a loss in relative contrast. On this data set,
a sample of 10^5 instances, the curse (more explicitly, the
effect of distance concentration) is now clearly visible: the
standard deviation disappears already at 10 dimensions, and
the observed actual minimum and maximum shrink rapidly
with increasing dimensionality.
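
For readers who wish to reproduce the kind of statistics shown in Fig. 1, the following sketch computes the mean, standard deviation, minimum, and maximum of normalized vector lengths for uniform samples of increasing dimensionality. The sample size used here is smaller than the 10^5 instances mentioned above, purely to keep the example cheap:

```python
# Sketch of the experiment behind Fig. 1: Euclidean lengths of vectors drawn
# uniformly from the unit cube, normalized by sqrt(d), summarized per dimensionality.
import numpy as np

rng = np.random.default_rng(42)
n = 10_000  # sample size (assumption; the survey's figure uses 10^5 instances)

print(f"{'d':>5} {'mean':>8} {'std':>8} {'min':>8} {'max':>8}")
for d in (2, 5, 10, 50, 100, 500, 1000):
    points = rng.random((n, d))                             # uniform in [0, 1]^d
    lengths = np.linalg.norm(points, axis=1) / np.sqrt(d)   # normalized length
    print(f"{d:>5} {lengths.mean():8.4f} {lengths.std():8.4f} "
          f"{lengths.min():8.4f} {lengths.max():8.4f}")
# The mean settles near sqrt(1/3) ≈ 0.577, while the standard deviation and the
# spread between the observed minimum and maximum shrink as d grows.
```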