1 Outliers and Outlier Analysis
What Are Outliers?
Generally it is assumed that a statistical process is used to generate a set of data objects. An outlier
is a data object that deviates significantly from the rest of the objects, as if it were generated
by a different statistical process. In this chapter, we refer to data objects that are not outliers as “normal” or expected data, and to outliers as “abnormal” data.
Types of Outliers
In general, outliers can be classified into three categories, namely global outliers, contextual
(or conditional) outliers, and collective outliers.
Global Outliers
In a given data set, a data object is a global outlier if it deviates significantly from the rest of the data set. Global outliers are sometimes called point anomalies, and are the simplest type of outliers. Most outlier detection methods are aimed at finding global outliers.
Contextual (or conditional) outliers
In a given data set, a data object is a contextual outlier if it deviates significantly with respect to a specific context of the object. Contextual outliers are also known as conditional outliers because they are conditional on the selected context. Therefore, in contextual outlier detection, the context has to be specified as part of the problem definition. Generally, in contextual outlier detection, the attributes of the data objects in question are divided into two groups:
Contextual attributes: The contextual attributes of a data object define the object’s
context. For example, when judging whether a temperature reading is unusual, the contextual attributes may be date and location.
Behavioral attributes: These define the object’s characteristics, and are used to evaluate
whether the object is an outlier in the context to which it belongs. In the temperature example, the behavioral attributes may be the temperature, humidity, and pressure.
Collective Outliers
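A subset of data objects forms a collective outlier if the objects as a whole deviate significantly from the entire data set, even though the individual objects may not be outliers. For example, when many shipments are in transit, it is not unusual for a few of them to be delayed. But if 100 shipments are delayed together, they form a collective outlier as a group, even though no single delay would be notable on its own.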
2 Outlier Detection Methods
Supervised, Semi-Supervised, and Unsupervised Methods
If expert-labeled examples of normal and/or outlier objects can be obtained, they can be
used to build outlier detection models. The methods used can be divided into supervised
methods, semi-supervised methods, and unsupervised methods.
Supervised Methods
Supervised methods model data normality and abnormality. Domain experts examine
and label a sample of the underlying data.
Unsupervised Methods
In some application scenarios, objects labeled as “normal” or “outlier” are not available.
Thus, an unsupervised learning method has to be used.
Unsupervised outlier detection methods make an implicit assumption: The normal
objects are somewhat “clustered.” In other words, an unsupervised outlier detection
method expects that normal objects follow a pattern far more frequently than outliers.
Normal objects do not have to fall into one group sharing high similarity. Instead, they
can form multiple groups, where each group has distinct features. However, an outlier is
expected to occur far away in feature space from any of those groups of normal objects.
Semi-Supervised Methods
In many applications, although obtaining some labeled examples is feasible, the number
of such labeled examples is often small. We may encounter cases where only a small set
of the normal and/or outlier objects are labeled, but most of the data are unlabeled.
Semi-supervised outlier detection methods were developed to tackle such scenarios.
Semi-supervised outlier detection methods can be regarded as applications of semi-supervised
learning methods.
Statistical Methods, Proximity-Based Methods, and Clustering-Based Methods
Outlier detection methods make assumptions about outliers
versus the rest of the data. According to the assumptions made, we can categorize outlier
detection methods into three types: statistical methods, proximity-based methods, and
clustering-based methods.
Statistical Methods
Statistical methods (also known as model-based methods) make assumptions about
data normality. They assume that normal data objects are generated by a statistical
(stochastic) model, and that data not following the model are outliers.
Proximity-Based Methods
Proximity-based methods assume that an object is an outlier if the nearest neighbors
of the object are far away in feature space, that is, the proximity of the object to its
neighbors significantly deviates from the proximity of most of the other objects to their
neighbors in the same data set.
Clustering-Based Methods
Clustering-based methods assume that the normal data objects belong to large and
dense clusters, whereas outliers belong to small or sparse clusters, or do not belong to any cluster.
3 Statistical Approaches
Statistical approaches assume that the normal objects in a data set are generated by a stochastic process (a generative model). Consequently, normal objects
occur in regions of high probability for the stochastic model, and objects in the regions
of low probability are outliers.
The general idea behind statistical methods for outlier detection is to learn a generative
model fitting the given data set, and then identify those objects in low-probability
regions of the model as outliers. However, there are many different ways to learn generative
models. In general, statistical methods for outlier detection can be divided into two
major categories: parametric methods and nonparametric methods, according to how the
models are specified and learned.
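As a minimal sketch of a parametric method, suppose the data are assumed to follow a univariate normal distribution (an assumption made here for illustration only). The model is fitted by estimating the mean and standard deviation, and objects far from the mean in standard-deviation units fall in the low-probability region. The function name, the sample temperatures, and the threshold are hypothetical.

```python
import statistics

def gaussian_outliers(values, threshold=3.0):
    """Flag values that lie in the low-probability tails of a fitted normal model."""
    mu = statistics.mean(values)
    sigma = statistics.pstdev(values)  # maximum-likelihood estimate of the std. deviation
    if sigma == 0:
        return []  # all values identical: nothing deviates
    return [v for v in values if abs(v - mu) / sigma > threshold]

# Hypothetical temperature readings (degrees Celsius); 24.0 is the suspicious one.
temps = [24.0, 28.9, 28.9, 29.0, 29.1, 29.1, 29.2, 29.2, 29.3, 29.4]

# With only ten values, a z-score can never exceed 3 under this estimate,
# so a lower cutoff is used for this small sample.
print(gaussian_outliers(temps, threshold=2.5))
```

A nonparametric method would replace the fitted normal model with one learned from the data itself, such as a histogram or a kernel density estimate.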
4 Proximity-Based Approaches
Given a set of objects in feature space, a distance measure can be used to quantify the similarity between objects. Intuitively, objects that are far from others can be regarded
as outliers. Proximity-based approaches assume that the proximity of an outlier object
to its nearest neighbors significantly deviates from the proximity of the object to most
of the other objects in the data set.
There are two types of proximity-based outlier detection methods: distance-based
and density-based methods. A distance-based outlier detection method consults the
neighborhood of an object, which is defined by a given radius. An object is then considered
an outlier if its neighborhood does not have enough other points. A density-based
outlier detection method investigates the density of an object and that of its neighbors.
Here, an object is identified as an outlier if its density is relatively much lower than that
of its neighbors.
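The distance-based idea can be sketched as follows. This is a brute-force illustration that assumes Euclidean distance; the function name, the parameters `radius` and `min_neighbors`, and the sample points are hypothetical choices, not values given in the text.

```python
import math

def distance_based_outliers(points, radius, min_neighbors):
    """Return the points that have fewer than min_neighbors other points within radius."""
    outliers = []
    for i, p in enumerate(points):
        # Count the other points inside the radius-neighborhood of p.
        count = sum(
            1 for j, q in enumerate(points)
            if j != i and math.dist(p, q) <= radius
        )
        if count < min_neighbors:
            outliers.append(p)
    return outliers

# Four points close together and one far away.
points = [(1.0, 1.0), (1.2, 0.9), (0.9, 1.1), (1.1, 1.2), (8.0, 8.0)]
print(distance_based_outliers(points, radius=1.0, min_neighbors=2))
```

This version compares every pair of points, so its cost grows quadratically with the number of objects. A density-based method would instead compare each object's local density with the densities of its neighbors.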
5 Clustering-Based Approaches
The notion of outliers is highly related to that of clusters. Clustering-based approaches detect outliers by examining the relationship between objects and clusters. Intuitively,
an outlier is an object that belongs to a small and remote cluster, or does not belong to
any cluster.
This leads to three general approaches to clustering-based outlier detection. Consider
an object.
Does the object belong to any cluster? If not, then it is identified as an outlier.
Is there a large distance between the object and the cluster to which it is closest? If
yes, it is an outlier.
Is the object part of a small or sparse cluster? If yes, then all the objects in that cluster
are outliers.
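The three questions above can be sketched as three tests on an existing clustering. The sketch assumes that cluster labels have already been produced by some clustering algorithm, that the label -1 means "no cluster", and that distance to a cluster is measured to its centroid. The function name and the thresholds `max_dist` and `min_size` are hypothetical.

```python
import math
from collections import defaultdict

def clustering_based_outliers(points, labels, max_dist, min_size):
    """Apply the three clustering-based tests; labels[i] == -1 means no cluster."""
    clusters = defaultdict(list)
    for p, lab in zip(points, labels):
        if lab != -1:
            clusters[lab].append(p)
    # Centroid of each cluster: the coordinate-wise mean of its members.
    centroids = {
        lab: tuple(sum(coord) / len(members) for coord in zip(*members))
        for lab, members in clusters.items()
    }
    outliers = []
    for p, lab in zip(points, labels):
        if lab == -1:
            outliers.append(p)  # test 1: belongs to no cluster
        elif len(clusters[lab]) < min_size:
            outliers.append(p)  # test 3: member of a small cluster
        elif min(math.dist(p, c) for c in centroids.values()) > max_dist:
            outliers.append(p)  # test 2: far from even the closest cluster
    return outliers

points = [(0, 0), (0, 1), (1, 0), (1, 1), (4, 4), (10, 10), (10, 11), (20, 20)]
labels = [0, 0, 0, 0, 0, 1, 1, -1]
print(clustering_based_outliers(points, labels, max_dist=3.0, min_size=3))
```

Here cluster size stands in for "small or sparse"; a sparsity test would need an additional density measure per cluster.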
6 Classification-Based Approaches
Outlier detection can be treated as a classification problem if a training data set with class labels is available. The general idea of classification-based outlier detection methods is
to train a classification model that can distinguish normal data from outliers.
Consider a training set that contains samples labeled as “normal” and others labeled
as “outlier.” A classifier can then be constructed based on the training set.
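As a minimal sketch, any classifier can play this role. The example below uses a k-nearest-neighbor vote, which is only one possible choice and is not prescribed by the text. The training points, labels, and the function name are hypothetical.

```python
import math
from collections import Counter

def knn_classify(train, query, k=3):
    """Return the majority label among the k training points nearest to query."""
    nearest = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical labeled training set: (point, label) pairs.
train = [
    ((1.0, 1.0), "normal"), ((1.2, 0.8), "normal"),
    ((0.9, 1.1), "normal"), ((1.1, 1.3), "normal"),
    ((6.0, 6.0), "outlier"), ((6.5, 5.8), "outlier"), ((5.9, 6.4), "outlier"),
]

print(knn_classify(train, (1.0, 0.9)))
print(knn_classify(train, (6.2, 6.1)))
```

In real data, labeled outliers are usually far rarer than labeled normal objects, so a plain majority vote like this one tends to favor the "normal" class.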
7 Mining Contextual and Collective Outliers
An object in a given data set is a contextual outlier (or conditional outlier) if it deviates significantly with respect to a specific context of the object (Section 12.1). The
context is defined using contextual attributes. These depend heavily on the application,
and are often provided by users as part of the contextual outlier detection task.
Contextual attributes can include spatial attributes, time, network locations, and sophisticated
structured attributes. In addition, behavioral attributes define characteristics of
the object, and are used to evaluate whether the object is an outlier in the context to
which it belongs.
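One simple way to use the two kinds of attributes, sketched below, is to group objects by their contextual attribute and then apply a conventional outlier test to the behavioral attribute within each group. The sketch assumes a single categorical context (month) and a single numeric behavioral attribute (temperature), and uses a z-score test. The data, the function name, and the threshold are hypothetical.

```python
import statistics
from collections import defaultdict

def contextual_outliers(records, threshold=2.0):
    """records: (context, value) pairs. Flag values unusual within their own context."""
    groups = defaultdict(list)
    for context, value in records:
        groups[context].append(value)
    flagged = []
    for context, value in records:
        values = groups[context]
        mu = statistics.mean(values)
        sigma = statistics.pstdev(values)
        # Compare the behavioral value only with values from the same context.
        if sigma > 0 and abs(value - mu) / sigma > threshold:
            flagged.append((context, value))
    return flagged

# Hypothetical daily temperatures (degrees Celsius) with month as the context.
records = [
    ("Jan", -2), ("Jan", 0), ("Jan", 1), ("Jan", -1), ("Jan", 2), ("Jan", 0), ("Jan", 28),
    ("Jul", 27), ("Jul", 29), ("Jul", 28), ("Jul", 30), ("Jul", 26), ("Jul", 28), ("Jul", 29),
]
print(contextual_outliers(records))
```

A reading of 28 is flagged in January but not in July, which is the point of conditioning on the context.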
Mining Collective Outliers
A group of data objects forms a collective outlier if the objects as a whole deviate significantly
from the entire data set, even though each individual object in the group may
not be an outlier. To detect collective outliers, we have to examine the
structure of the data set, that is, the relationships between multiple data objects. This
makes the problem more difficult than conventional and contextual outlier detection.
“How can we explore the data set structure?” This typically depends on the nature
of the data. For outlier detection in temporal data (e.g., time series and sequences), we
explore the structures formed by time, which occur in segments of the time series or subsequences.
To detect collective outliers in spatial data, we explore local areas. Similarly,
in graph and network data, we explore subgraphs. Each of these structures is inherent to
its respective data type.
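For temporal data, one way to examine subsequences is to slide a fixed-length window over the series and look for windows whose aggregate is unusual compared with all other windows. The sketch below assumes a binary series (1 means a delayed order on that day, a made-up encoding) and a z-score test on window sums. The function name, window length, and threshold are hypothetical.

```python
import statistics

def collective_outlier_windows(series, window, threshold=2.0):
    """Return start indices of windows whose sum deviates strongly from other windows."""
    sums = [sum(series[s:s + window]) for s in range(len(series) - window + 1)]
    mu = statistics.mean(sums)
    sigma = statistics.pstdev(sums)
    if sigma == 0:
        return []  # every window looks the same
    return [s for s, total in enumerate(sums) if abs(total - mu) / sigma > threshold]

# A delay every third day is routine in this made-up series ...
series = [1 if i % 3 == 0 else 0 for i in range(30)]
# ... but days 12 to 16 are all delayed, forming an unusual run.
for i in (13, 14, 16):
    series[i] = 1

print(collective_outlier_windows(series, window=5))
```

No single value in the series is unusual, since a 1 occurs regularly. Only the run of consecutive delays stands out when the values are considered together.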
8 Outlier Detection in High-Dimensional Data
In some applications, we may need to detect outliers in high-dimensional data. The dimensionality curse poses huge challenges for effective outlier detection. As the dimensionality
increases, the distance between objects may be heavily dominated by noise.
That is, the distance and similarity between two points in a high-dimensional space
may not reflect the real relationship between the points. Consequently, conventional
outlier detection methods, which mainly use proximity or density to identify outliers,
deteriorate as dimensionality increases.
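A small experiment can illustrate why distances become less informative. The sketch below assumes points drawn uniformly at random from the unit hypercube (an artificial setting chosen for illustration) and measures how much farther the farthest point is from a query point than the nearest one, relative to the nearest distance. The function name and parameters are hypothetical.

```python
import math
import random

def relative_contrast(dim, n_points=200, seed=0):
    """(max distance - min distance) / min distance from a random query point."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    query = [rng.random() for _ in range(dim)]
    dists = [
        math.dist(query, [rng.random() for _ in range(dim)])
        for _ in range(n_points)
    ]
    return (max(dists) - min(dists)) / min(dists)

for dim in (2, 10, 100, 1000):
    print(dim, round(relative_contrast(dim), 3))
```

As the number of dimensions grows, the contrast shrinks: the nearest and farthest points end up at nearly the same distance, so a proximity-based test has little to separate outliers from normal objects.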
Angle-based outlier detection (ABOD)
Excerpts from the Book
Data Mining Concepts and Techniques
Third Edition
Jiawei Han
University of Illinois at Urbana–Champaign
Micheline Kamber, Jian Pei
Simon Fraser University