## 1 Outliers and Outlier Analysis

What Are Outliers?

It is generally assumed that a set of data objects is generated by some statistical process. An outlier is a data object that deviates significantly from the rest of the objects, as if it were generated by a different statistical process. In this chapter, we refer to data objects that are not outliers as “normal” or expected data, and to outliers as “abnormal” data.

Types of Outliers

In general, outliers can be classified into three categories, namely global outliers, contextual (or conditional) outliers, and collective outliers.

Global Outliers

In a given data set, a data object is a global outlier if it deviates significantly from the rest of the data set. Global outliers are sometimes called point anomalies, and are the simplest type of outliers. Most outlier detection methods are aimed at finding global outliers.

Contextual (or Conditional) Outliers

In a given data set, a data object is a contextual outlier if it deviates significantly with respect to a specific context of the object. Contextual outliers are also known as conditional outliers because they are conditional on the selected context. Therefore, in contextual outlier detection, the context has to be specified as part of the problem definition. Generally, in contextual outlier detection, the attributes of the data objects in question are divided into two groups:

Contextual attributes: The contextual attributes of a data object define the object’s context. For example, when judging whether a temperature reading is unusual, the contextual attributes may be date and location.

Behavioral attributes: These define the object’s characteristics, and are used to evaluate whether the object is an outlier in the context to which it belongs. In the temperature example, the behavioral attributes may be the temperature, humidity, and pressure.
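The split between contextual and behavioral attributes can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `contextual_outliers`, the z-score test, the threshold of 2.0, and the temperature readings are all hypothetical and not taken from the book.

```python
from collections import defaultdict
from statistics import mean, pstdev

def contextual_outliers(records, context_keys, behavior_key, threshold=2.0):
    """Flag records whose behavioral value is unusual within their own context."""
    # Group records by the values of their contextual attributes.
    groups = defaultdict(list)
    for rec in records:
        context = tuple(rec[k] for k in context_keys)
        groups[context].append(rec)

    flagged = []
    for members in groups.values():
        values = [m[behavior_key] for m in members]
        mu, sigma = mean(values), pstdev(values)
        if sigma == 0:
            continue  # all values identical: nothing deviates in this context
        # Evaluate the behavioral attribute only against the same context.
        for m in members:
            if abs(m[behavior_key] - mu) / sigma > threshold:
                flagged.append(m)
    return flagged

# Hypothetical readings: 28 degrees C is ordinary in July but unusual in January.
readings = (
    [{"month": "Jan", "city": "Toronto", "temp_c": t}
     for t in (-6, -4, -5, -7, -3, -5, -6, 28)]
    + [{"month": "Jul", "city": "Toronto", "temp_c": t}
       for t in (26, 27, 28, 29, 27, 28, 26, 28)]
)
outliers = contextual_outliers(readings, ["month", "city"], "temp_c")
```

Here the same value, 28, appears in both months, but only the January reading is flagged, because the comparison is made within each context rather than across the whole data set.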

Collective Outliers

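Consider a shipping company with many orders in transit. A single delayed shipment is not unusual, so it is not an outlier by itself. But if 100 shipments are delayed together, the group as a whole forms a collective outlier, even though each delay on its own would be considered normal.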

## 2 Outlier Detection Methods

Supervised, Semi-Supervised, and Unsupervised Methods

If expert-labeled examples of normal and/or outlier objects can be obtained, they can be used to build outlier detection models. The methods used can be divided into supervised methods, semi-supervised methods, and unsupervised methods.

Supervised Methods

Supervised methods model data normality and abnormality. Domain experts examine and label a sample of the underlying data.

Unsupervised Methods

In some application scenarios, objects labeled as “normal” or “outlier” are not available. Thus, an unsupervised learning method has to be used.

Unsupervised outlier detection methods make an implicit assumption: The normal objects are somewhat “clustered.” In other words, an unsupervised outlier detection method expects that normal objects follow a pattern far more frequently than outliers. Normal objects do not have to fall into one group sharing high similarity. Instead, they can form multiple groups, where each group has distinct features. However, an outlier is expected to occur far away in feature space from any of those groups of normal objects.

Semi-Supervised Methods

In many applications, although obtaining some labeled examples is feasible, the number of such labeled examples is often small. We may encounter cases where only a small set of the normal and/or outlier objects are labeled, but most of the data are unlabeled. Semi-supervised outlier detection methods were developed to tackle such scenarios.

Semi-supervised outlier detection methods can be regarded as applications of semi-supervised learning methods.

Statistical Methods, Proximity-Based Methods, and Clustering-Based Methods

Outlier detection methods make assumptions about outliers versus the rest of the data. According to the assumptions made, we can categorize outlier detection methods into three types: statistical methods, proximity-based methods, and clustering-based methods.

Statistical Methods

Statistical methods (also known as model-based methods) make assumptions about data normality. They assume that normal data objects are generated by a statistical (stochastic) model, and that data not following the model are outliers.

Proximity-Based Methods

Proximity-based methods assume that an object is an outlier if the nearest neighbors of the object are far away in feature space, that is, the proximity of the object to its neighbors significantly deviates from the proximity of most of the other objects to their neighbors in the same data set.

Clustering-Based Methods

Clustering-based methods assume that the normal data objects belong to large and dense clusters, whereas outliers belong to small or sparse clusters, or do not belong to any cluster.

## 3 Statistical Approaches

Statistical approaches assume that the normal objects in a data set are generated by a stochastic process (a generative model). Consequently, normal objects occur in regions of high probability for the stochastic model, and objects in the regions of low probability are outliers.

The general idea behind statistical methods for outlier detection is to learn a generative model fitting the given data set, and then identify those objects in low-probability regions of the model as outliers. However, there are many different ways to learn generative models. In general, statistical methods for outlier detection can be divided into two major categories: parametric methods and nonparametric methods, according to how the models are specified and learned.
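As a minimal sketch of the parametric case, the following Python fits a univariate normal distribution to the data and flags values that fall in its low-probability tails. The choice of a normal model, the three-standard-deviation cutoff, the function name `gaussian_outliers`, and the sample values are assumptions made for illustration only.

```python
from statistics import mean, stdev

def gaussian_outliers(values, max_z=3.0):
    """Fit a normal model (mean and standard deviation) and flag
    values that lie more than max_z standard deviations from the mean."""
    mu = mean(values)
    sigma = stdev(values)  # sample standard deviation
    if sigma == 0:
        return []  # no spread, so no value is improbable under the model
    return [v for v in values if abs(v - mu) / sigma > max_z]

# Hypothetical measurements: nineteen values close to 10 and one value of 25.
values = [10.1, 9.8, 10.0, 10.2, 9.9, 10.0, 10.3, 9.7, 10.1, 9.9,
          10.0, 10.2, 9.8, 10.1, 9.9, 10.0, 10.2, 9.8, 10.0, 25.0]
flagged = gaussian_outliers(values)
```

Note that the outlier itself inflates the estimated mean and standard deviation, so on very small samples a fixed cutoff like this may never trigger; more robust estimates are one possible remedy.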


## 4 Proximity-Based Approaches

Given a set of objects in feature space, a distance measure can be used to quantify the similarity between objects. Intuitively, objects that are far from others can be regarded as outliers. Proximity-based approaches assume that the proximity of an outlier object to its nearest neighbors significantly deviates from the proximity of the object to most of the other objects in the data set.

There are two types of proximity-based outlier detection methods: distance-based and density-based methods. A distance-based outlier detection method consults the neighborhood of an object, which is defined by a given radius. An object is then considered an outlier if its neighborhood does not have enough other points. A density-based outlier detection method investigates the density of an object and that of its neighbors. Here, an object is identified as an outlier if its density is relatively much lower than that of its neighbors.
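The distance-based idea can be sketched as follows: count how many other objects fall within a given radius of each object, and flag the object when that share is too small. The function name, the Euclidean distance, and the parameters `radius` and `min_fraction` are illustrative assumptions; the density-based variant is not shown.

```python
from math import dist

def distance_based_outliers(points, radius, min_fraction):
    """Flag points whose radius-neighborhood contains less than
    min_fraction of the other points in the data set."""
    n = len(points)
    if n < 2:
        return []
    outliers = []
    for i, p in enumerate(points):
        # Count the other points that lie within the given radius of p.
        neighbors = sum(
            1 for j, q in enumerate(points) if j != i and dist(p, q) <= radius
        )
        if neighbors / (n - 1) < min_fraction:
            outliers.append(p)
    return outliers

# Hypothetical 2-D data: five points close together and one far away.
points = [(1.0, 1.0), (1.2, 0.9), (0.9, 1.1), (1.1, 1.2), (0.8, 0.9), (5.0, 5.0)]
outliers = distance_based_outliers(points, radius=1.0, min_fraction=0.5)
```

This naive version compares every pair of points, so its cost grows quadratically with the number of objects; it is meant to show the definition, not an efficient algorithm.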

## 5 Clustering-Based Approaches

The notion of outliers is highly related to that of clusters. Clustering-based approaches detect outliers by examining the relationship between objects and clusters. Intuitively, an outlier is an object that belongs to a small and remote cluster, or does not belong to any cluster.

This leads to three general approaches to clustering-based outlier detection. Consider an object.

Does the object belong to any cluster? If not, then it is identified as an outlier.

Is there a large distance between the object and the cluster to which it is closest? If yes, it is an outlier.

Is the object part of a small or sparse cluster? If yes, then all the objects in that cluster are outliers.
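The three questions above can be sketched as checks applied to the output of some clustering step. The sketch assumes the cluster labels are already available (with `None` meaning the object was not assigned to any cluster), represents each cluster by the mean of its members, and uses hypothetical thresholds `max_distance` and `min_cluster_size`; none of these choices come from the book.

```python
from collections import defaultdict
from math import dist
from statistics import mean

def clustering_based_outliers(points, labels, max_distance, min_cluster_size):
    """Apply the three clustering-based checks to already clustered points."""
    # Collect the members of each cluster; a label of None means "no cluster".
    members = defaultdict(list)
    for p, label in zip(points, labels):
        if label is not None:
            members[label].append(p)
    # Represent each cluster by its centroid (per-dimension mean).
    centroids = {
        label: tuple(mean(axis) for axis in zip(*pts))
        for label, pts in members.items()
    }

    flagged = []
    for i, (p, label) in enumerate(zip(points, labels)):
        if label is None:
            flagged.append((i, "no cluster"))
        elif len(members[label]) < min_cluster_size:
            flagged.append((i, "small cluster"))
        elif min(dist(p, c) for c in centroids.values()) > max_distance:
            flagged.append((i, "far from closest cluster"))
    return flagged

# Hypothetical points and labels, as if produced by an earlier clustering run.
points = [(0.0, 0.0), (0.2, 0.1), (0.1, 0.3), (0.3, 0.2), (3.0, 0.0),
          (9.0, 9.0), (9.1, 9.2), (20.0, 20.0)]
labels = [0, 0, 0, 0, 0, 1, 1, None]
flagged = clustering_based_outliers(points, labels,
                                    max_distance=2.0, min_cluster_size=3)
```

In this example the object at index 4 is far from its cluster's centroid, the two objects in cluster 1 belong to a small cluster, and the last object belongs to no cluster at all.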

## 6 Classification-Based Approaches

Outlier detection can be treated as a classification problem if a training data set with class labels is available. The general idea of classification-based outlier detection methods is to train a classification model that can distinguish normal data from outliers.

Consider a training set that contains samples labeled as “normal” and others labeled as “outlier.” A classifier can then be constructed based on the training set.
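To make this concrete, here is a small sketch that builds a classifier from labeled samples and uses it to label new objects. A k-nearest-neighbor vote is used purely because it is short to write; the text does not prescribe a particular classifier, and the function name `knn_classify` and the training samples are hypothetical.

```python
from collections import Counter
from math import dist

def knn_classify(train, query, k=3):
    """Label a query point by majority vote among its k nearest
    training samples. train is a list of (features, label) pairs."""
    nearest = sorted(train, key=lambda item: dist(item[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical training set with samples labeled "normal" or "outlier".
train = [
    ((1.0, 1.0), "normal"), ((1.2, 0.8), "normal"), ((0.9, 1.1), "normal"),
    ((1.1, 1.3), "normal"), ((0.8, 0.9), "normal"),
    ((8.0, 8.0), "outlier"), ((9.0, 7.5), "outlier"), ((8.5, 9.0), "outlier"),
]

label_far = knn_classify(train, (8.4, 8.2))
label_near = knn_classify(train, (1.0, 1.2))
```

A new object that lies near the labeled outliers is classified as an outlier, while one that lies among the normal samples is classified as normal.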

## 7 Mining Contextual and Collective Outliers

An object in a given data set is a contextual outlier (or conditional outlier) if it deviates significantly with respect to a specific context of the object (Section 1). The context is defined using contextual attributes. These depend heavily on the application, and are often provided by users as part of the contextual outlier detection task. Contextual attributes can include spatial attributes, time, network locations, and sophisticated structured attributes. In addition, behavioral attributes define characteristics of the object, and are used to evaluate whether the object is an outlier in the context to which it belongs.

Mining Collective Outliers

A group of data objects forms a collective outlier if the objects as a whole deviate significantly from the entire data set, even though each individual object in the group may not be an outlier. To detect collective outliers, we have to examine the structure of the data set, that is, the relationships between multiple data objects. This makes the problem more difficult than conventional and contextual outlier detection.

“How can we explore the data set structure?” This typically depends on the nature of the data. For outlier detection in temporal data (e.g., time series and sequences), we explore the structures formed by time, which occur in segments of the time series or subsequences. To detect collective outliers in spatial data, we explore local areas. Similarly, in graph and network data, we explore subgraphs. Each of these structures is inherent to its respective data type.
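For temporal data, one simple way to explore segments is a sliding window. The sketch below is an assumption-laden illustration rather than a method from the book: it scores each fixed-length window by its sum, then applies a z-score test to the window sums. The series is hypothetical, with 1 standing for a delayed shipment and 0 for an on-time one.

```python
from statistics import mean, pstdev

def zscore_outlier_indices(values, threshold):
    """Return the indices of values more than `threshold` standard
    deviations away from the mean of `values`."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return []
    return [i for i, v in enumerate(values) if abs(v - mu) / sigma > threshold]

def collective_outlier_windows(series, window, threshold=2.5):
    """Return the start positions of windows whose sum is unusual
    compared with the sums of all other windows."""
    sums = [sum(series[s:s + window]) for s in range(len(series) - window + 1)]
    return zscore_outlier_indices(sums, threshold)

# Occasional delays are normal; six delays in a row are not.
series = [0, 0, 1, 0] * 5 + [1] * 6 + [0, 0, 1, 0] * 5

point_outliers = zscore_outlier_indices(series, 2.5)          # per-object test
window_starts = collective_outlier_windows(series, window=5)  # per-segment test
```

The per-object test flags nothing, since a single delay is common in this series, while the per-segment test flags the windows that lie entirely inside the run of consecutive delays. The window length and threshold would need tuning for real data.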

## 8 Outlier Detection in High-Dimensional Data

In some applications, we may need to detect outliers in high-dimensional data. The dimensionality curse poses huge challenges for effective outlier detection. As the dimensionality increases, the distance between objects may be heavily dominated by noise. That is, the distance and similarity between two points in a high-dimensional space may not reflect the real relationship between the points. Consequently, conventional outlier detection methods, which mainly use proximity or density to identify outliers, deteriorate as dimensionality increases.
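One commonly cited symptom of this problem is that distances become harder to tell apart as dimensionality grows. The following sketch, which assumes uniformly random points and a hypothetical helper named `relative_contrast`, compares the gap between the farthest and nearest neighbor of a query point in a low-dimensional and a high-dimensional space.

```python
import random
from math import dist

def relative_contrast(dim, n_points=200, seed=0):
    """Return (farthest - nearest) / nearest for the distances from one
    random query point to n_points random points in `dim` dimensions."""
    rng = random.Random(seed)
    query = [rng.random() for _ in range(dim)]
    distances = [
        dist(query, [rng.random() for _ in range(dim)])
        for _ in range(n_points)
    ]
    return (max(distances) - min(distances)) / min(distances)

low_dim_contrast = relative_contrast(2)
high_dim_contrast = relative_contrast(500)
```

With data like this, the contrast in 500 dimensions comes out far smaller than in 2 dimensions, which suggests why a proximity-based test that works in a few dimensions can lose its discriminating power in many.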

Angle-based outlier detection (ABOD)


Excerpts from the Book

## Data Mining Concepts and Techniques

Third Edition

Jiawei Han

University of Illinois at Urbana–Champaign

Micheline Kamber, Jian Pei

Simon Fraser University
