Thursday, April 14, 2016

Data Mining - Cluster Analysis: Basic Concepts and Methods

To be rewritten to integrate the material, which is currently in the form of excerpts.

Partitioning methods - Hierarchical methods - Density-based methods - Grid-based methods

Clustering is the process of grouping a set of data objects into multiple groups or clusters
so that objects within a cluster have high similarity, but are very dissimilar to objects
in other clusters. Dissimilarities and similarities are assessed based on the attribute values
describing the objects and often involve distance measures. Clustering as a data
mining tool has its roots in many application areas such as biology, security, business
intelligence, and Web search.

1 Cluster Analysis Basics

What Is Cluster Analysis?
Cluster analysis or simply clustering is the process of partitioning a set of data objects
(or observations) into subsets. Each subset is a cluster, such that objects in a cluster
are similar to one another, yet dissimilar to objects in other clusters. The set of clusters
resulting from a cluster analysis can be referred to as a clustering.

Clustering is also called data segmentation in some applications because clustering
partitions large data sets into groups according to their similarity. Clustering can
also be used for outlier detection, where outliers (values that are “far away” from any
cluster) may be more interesting than common cases.

Data clustering is under vigorous development. Contributing areas of research
include data mining, statistics, machine learning, spatial database technology, information
retrieval, Web search, biology, marketing, and many other application areas. Owing
to the huge amounts of data collected in databases, cluster analysis has recently become
a highly active topic in data mining research.

As a branch of statistics, cluster analysis has been extensively studied, with the
main focus on distance-based cluster analysis. Cluster analysis tools based on k-means,
k-medoids, and several other methods also have been built into many statistical analysis
software packages or systems, such as S-Plus, SPSS, and SAS. In machine learning, recall
that classification is known as supervised learning because the class label information is
given, that is, the learning algorithm is supervised in that it is told the class membership
of each training tuple. Clustering is known as unsupervised learning because the
class label information is not present. For this reason, clustering is a form of learning
by observation, rather than learning by examples. In data mining, efforts have focused
on finding methods for efficient and effective cluster analysis in large databases. Active
themes of research focus on the scalability of clustering methods, the effectiveness of
methods for clustering complex shapes (e.g., nonconvex) and types of data (e.g., text,
graphs, and images), high-dimensional clustering techniques (e.g., clustering objects
with thousands of features), and methods for clustering mixed numerical and nominal
data in large databases.

Requirements for Cluster Analysis

The following are typical requirements of clustering in data mining.
Scalability: Many clustering algorithms work well on small data sets containing fewer
than several hundred data objects; however, a large database may contain millions or
even billions of objects, particularly in Web search scenarios. Clustering on only a
sample of a given large data set may lead to biased results. Therefore, highly scalable
clustering algorithms are needed.
Ability to deal with different types of attributes: Many algorithms are designed to
cluster numeric (interval-based) data. However, applications may require clustering
other data types, such as binary, nominal (categorical), and ordinal data, or mixtures
of these data types. Recently, more and more applications need clustering techniques
for complex data types such as graphs, sequences, images, and documents.
Discovery of clusters with arbitrary shape: Many clustering algorithms determine
clusters based on Euclidean or Manhattan distance measures (Chapter 2). Algorithms
based on such distance measures tend to find spherical clusters with similar size and
density. However, a cluster could be of any shape. Consider sensors, for example,
which are often deployed for environment surveillance. Cluster analysis on sensor
readings can detect interesting phenomena. We may want to use clustering to find
the frontier of a running forest fire, which is often not spherical. It is important to
develop algorithms that can detect clusters of arbitrary shape.

Requirements for domain knowledge to determine input parameters: Many clustering
algorithms require users to provide domain knowledge in the form of input
parameters such as the desired number of clusters. Consequently, the clustering
results may be sensitive to such parameters. Parameters are often hard to determine,
especially for high-dimensionality data sets and where users have yet to grasp a deep
understanding of their data. Requiring the specification of domain knowledge not
only burdens users, but also makes the quality of clustering difficult to control.
Ability to deal with noisy data: Most real-world data sets contain outliers and/or
missing, unknown, or erroneous data. Sensor readings, for example, are often
noisy—some readings may be inaccurate due to the sensing mechanisms, and some
readings may be erroneous due to interferences from surrounding transient objects.
Clustering algorithms can be sensitive to such noise and may produce poor-quality
clusters. Therefore, we need clustering methods that are robust to noise.
Incremental clustering and insensitivity to input order: In many applications,
incremental updates (representing newer data) may arrive at any time. Some clustering
algorithms cannot incorporate incremental updates into existing clustering
structures and, instead, have to recompute a new clustering from scratch. Clustering
algorithms may also be sensitive to the input data order. That is, given a set
of data objects, clustering algorithms may return dramatically different clusterings
depending on the order in which the objects are presented. Incremental clustering
algorithms and algorithms that are insensitive to the input order are needed.

Capability of clustering high-dimensionality data: A data set can contain numerous
dimensions or attributes. When clustering documents, for example, each keyword
can be regarded as a dimension, and there are often thousands of keywords. Most
clustering algorithms are good at handling low-dimensional data such as data sets
involving only two or three dimensions. Finding clusters of data objects in a high-dimensional
space is challenging, especially considering that such data can be very
sparse and highly skewed.
Constraint-based clustering: Real-world applications may need to perform clustering
under various kinds of constraints. Suppose that your job is to choose the
locations for a given number of new automatic teller machines (ATMs) in a city. To
decide upon this, you may cluster households while considering constraints such as
the city’s rivers and highway networks and the types and number of customers per
cluster. A challenging task is to find data groups with good clustering behavior that
satisfy specified constraints.
Interpretability and usability: Users want clustering results to be interpretable,
comprehensible, and usable. That is, clustering may need to be tied in with specific
semantic interpretations and applications. It is important to study how an
application goal may influence the selection of clustering features and clustering methods.

Overview of Basic Clustering Methods

Partitioning methods: Given a set of n objects, a partitioning method constructs k
partitions of the data, where each partition represents a cluster and k ≤ n. That is, it
divides the data into k groups such that each group must contain at least one object.
In other words, partitioning methods conduct one-level partitioning on data sets.
The basic partitioning methods typically adopt exclusive cluster separation. That is,
each object must belong to exactly one group.

Most partitioning methods are distance-based and use an iterative relocation technique that attempts to improve the partitioning by moving
objects from one group to another. The general criterion of a good partitioning is
that objects in the same cluster are “close” or related to each other, whereas objects
in different clusters are “far apart” or very different.

Hierarchical methods: A hierarchical method creates a hierarchical decomposition of
the given set of data objects. A hierarchical method can be classified as being either
agglomerative or divisive, based on how the hierarchical decomposition is formed.
The agglomerative approach, also called the bottom-up approach, starts with each
object forming a separate group. It successively merges the objects or groups close
to one another, until all the groups are merged into one (the topmost level of the
hierarchy), or a termination condition holds. The divisive approach, also called the
top-down approach, starts with all the objects in the same cluster. In each successive
iteration, a cluster is split into smaller clusters, until eventually each object is in one
cluster, or a termination condition holds.
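The agglomerative (bottom-up) approach can be sketched in a few lines of Python. This is a toy illustration on 1-D points, not any particular published algorithm: it uses single-link (closest-pair) distance between clusters, which is one common choice among several, and stops when the requested number of clusters remains.

```python
def single_link(a, b):
    # distance between two clusters = distance of their closest pair of objects
    return min(abs(p - q) for p in a for q in b)

def agglomerative(points, num_clusters):
    """Bottom-up clustering sketch: start with each object in its own
    cluster and repeatedly merge the two closest clusters until the
    termination condition (the desired cluster count) holds."""
    clusters = [[p] for p in points]
    while len(clusters) > num_clusters:
        # find the pair of clusters with the smallest single-link distance
        i, j = min(((i, j) for i in range(len(clusters))
                    for j in range(i + 1, len(clusters))),
                   key=lambda ij: single_link(clusters[ij[0]], clusters[ij[1]]))
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters
```

A divisive method would run the same process in reverse, splitting clusters top-down instead of merging bottom-up.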

Density-based methods: Some clustering methods have been developed based on the notion of density. Their general idea
is to continue growing a given cluster as long as the density (number of objects or
data points) in the “neighborhood” exceeds some threshold. For example, for each
data point within a given cluster, the neighborhood of a given radius has to contain
at least a minimum number of points. Such a method can be used to filter out noise
or outliers and discover clusters of arbitrary shape.

Grid-based methods: Grid-based methods quantize the object space into a finite
number of cells that form a grid structure. All the clustering operations are performed
on the grid structure (i.e., on the quantized space). The main advantage of
this approach is its fast processing time, which is typically independent of the number
of data objects and dependent only on the number of cells in each dimension in
the quantized space.

2 Partitioning Methods

Formally, given a data set, D, of n objects, and k, the number of clusters to form, a
partitioning algorithm organizes the objects into k partitions (k ≤ n), where each partition
represents a cluster. The clusters are formed to optimize an objective partitioning
criterion, such as a dissimilarity function based on distance, so that the objects within a
cluster are “similar” to one another and “dissimilar” to objects in other clusters in terms
of the data set attributes.

k-Means: A Centroid-Based Technique
Suppose a data set, D, contains n objects in Euclidean space. Partitioning methods distribute
the objects in D into k clusters, C1, ..., Ck, that is, Ci ⊂ D and Ci ∩ Cj = ∅ for
i ≠ j (1 ≤ i, j ≤ k). An objective function is used to assess the partitioning quality so that
objects within a cluster are similar to one another but dissimilar to objects in other
clusters. That is, the objective function aims for high intracluster similarity and low
intercluster similarity.
A centroid-based partitioning technique uses the centroid of a cluster, Ci, to represent
that cluster. Conceptually, the centroid of a cluster is its center point. The centroid can
be defined in various ways, such as by the mean or medoid of the objects (or points)
assigned to the cluster.
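The centroid-based idea can be sketched with a minimal k-means loop. This is a simplified illustration, not a production implementation: it works on 2-D tuples, seeds the centroids with the first k points for determinism (real implementations typically use random initialization with multiple restarts), and runs a fixed number of iterations rather than testing for convergence.

```python
def kmeans(points, k, iters=20):
    """Minimal k-means sketch on 2-D points: assign each point to its
    nearest centroid (squared Euclidean distance), then recompute each
    centroid as the mean of its cluster, and repeat."""
    centroids = points[:k]          # deterministic seeding for illustration
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            # index of the nearest centroid
            i = min(range(k),
                    key=lambda i: (p[0] - centroids[i][0]) ** 2
                                + (p[1] - centroids[i][1]) ** 2)
            clusters[i].append(p)
        # recompute centroids as cluster means; keep the old centroid
        # if a cluster happens to be empty
        centroids = [(sum(p[0] for p in c) / len(c),
                      sum(p[1] for p in c) / len(c)) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return centroids, clusters
```

Note that the resulting centroids are virtual points (means), not necessarily objects of the data set, which is exactly what the k-medoids method below changes.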

k-Medoids: A Representative Object-Based Technique
The k-means algorithm is sensitive to outliers because such objects are far away from the
majority of the data, and thus, when assigned to a cluster, they can dramatically distort
the mean value of the cluster.

Instead of taking the mean value of the objects in a cluster as a reference point, we can
pick actual objects to represent the clusters, using one representative object per cluster.
Each remaining object is assigned to the cluster whose representative object is
the most similar. The partitioning method is then performed based on the principle of
minimizing the sum of the dissimilarities between each object p and its corresponding
representative object. This is the basis for the k-medoids method, which groups n
objects into k clusters by minimizing the absolute error.
The Partitioning Around Medoids (PAM) algorithm is a popular
realization of k-medoids clustering.
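The swap-based idea behind PAM can be sketched naively as follows. This is a simplified illustration rather than the full PAM procedure (which maintains incremental swap costs for efficiency); the Manhattan distance used here is an arbitrary choice for the dissimilarity.

```python
from itertools import product

def dist(p, q):
    # Manhattan distance between 2-D points (an illustrative choice)
    return abs(p[0] - q[0]) + abs(p[1] - q[1])

def total_cost(points, medoids):
    # absolute-error criterion: each object contributes its distance
    # to its closest medoid
    return sum(min(dist(p, m) for m in medoids) for p in points)

def pam(points, k):
    """Naive PAM sketch: seed with the first k objects as medoids, then
    greedily swap a medoid for a non-medoid whenever the swap lowers
    the total absolute error, until no improving swap exists."""
    medoids = list(points[:k])
    improved = True
    while improved:
        improved = False
        for i, h in product(range(k), points):
            if h in medoids:
                continue
            candidate = medoids[:i] + [h] + medoids[i + 1:]
            if total_cost(points, candidate) < total_cost(points, medoids):
                medoids, improved = candidate, True
    return medoids
```

Because the representatives are always actual data objects, a single far-away outlier cannot drag a representative off the data the way it drags a mean.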

3 Hierarchical Methods

While partitioning methods meet the basic clustering requirement of organizing a set of
objects into a number of exclusive groups, in some situations we may want to partition
our data into groups at different levels such as in a hierarchy. A hierarchical clustering
method works by grouping data objects into a hierarchy or “tree” of clusters

Employees may be clustered into major groups such as executives, managers, and
staff. One can further partition these groups into smaller subgroups. For instance, the
general group of staff can be further divided into subgroups of senior officers, officers,
and trainees. All these groups form a hierarchy. We can easily summarize or characterize
the data that are organized into a hierarchy, which can be used to find, say, the average
salary of managers and of officers.

A promising direction for improving the clustering quality of hierarchical methods
is to integrate hierarchical clustering with other clustering techniques, resulting in
multiple-phase (or multiphase) clustering. Two such methods are BIRCH and Chameleon. BIRCH begins by partitioning objects hierarchically
using tree structures, where the leaf or low-level nonleaf nodes can be
viewed as “microclusters” depending on the resolution scale. It then applies other clustering algorithms to perform macroclustering on the microclusters. Chameleon
 explores dynamic modeling in hierarchical clustering.

A tree structure called a dendrogram is commonly used to represent the process of
hierarchical clustering.

Distance Measures in Algorithmic Methods
Whether using an agglomerative method or a divisive method, a core need is to measure
the distance between two clusters, where each cluster is generally a set of objects.
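For illustration, four commonly used inter-cluster distance measures (minimum, maximum, mean, and average distance) can be written directly; this sketch uses 1-D numeric objects so that |p - q| serves as the object-level distance.

```python
def minimum_distance(ci, cj):
    # single linkage: distance between the closest pair of objects
    return min(abs(p - q) for p in ci for q in cj)

def maximum_distance(ci, cj):
    # complete linkage: distance between the farthest pair of objects
    return max(abs(p - q) for p in ci for q in cj)

def mean_distance(ci, cj):
    # distance between the two cluster means
    return abs(sum(ci) / len(ci) - sum(cj) / len(cj))

def average_distance(ci, cj):
    # average distance over all cross-cluster object pairs
    return sum(abs(p - q) for p in ci for q in cj) / (len(ci) * len(cj))
```

An agglomerative algorithm using minimum distance is often called a single-linkage method, and one using maximum distance a complete-linkage method; the choice strongly affects the shape of the clusters found.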

BIRCH: Multiphase Hierarchical Clustering
Using Clustering Feature Trees
Balanced Iterative Reducing and Clustering using Hierarchies (BIRCH) is designed for
clustering a large amount of numeric data by integrating hierarchical clustering (at the
initial microclustering stage) and other clustering methods such as iterative partitioning
(at the later macroclustering stage). It overcomes the two difficulties in agglomerative
clustering methods: (1) scalability and (2) the inability to undo what was done in the
previous step.

A CF-tree is a height-balanced tree that stores the clustering features for a hierarchical
clustering. BIRCH applies a multiphase clustering technique. The
primary phases are:
Phase 1: BIRCH scans the database to build an initial in-memory CF-tree, which
can be viewed as a multilevel compression of the data that tries to preserve the data’s
inherent clustering structure.
Phase 2: BIRCH applies a (selected) clustering algorithm to cluster the leaf nodes of
the CF-tree, which removes sparse clusters as outliers and groups dense clusters into
larger ones.

Chameleon: Multiphase Hierarchical Clustering
Using Dynamic Modeling
Chameleon is a hierarchical clustering algorithm that uses dynamic modeling to determine
the similarity between pairs of clusters. In Chameleon, cluster similarity is assessed
based on (1) how well connected objects are within a cluster and (2) the proximity of
clusters. That is, two clusters are merged if their interconnectivity is high and they are
close together. Thus, Chameleon does not depend on a static, user-supplied model and
can automatically adapt to the internal characteristics of the clusters being merged. The
merge process facilitates the discovery of natural and homogeneous clusters and applies
to all data types as long as a similarity function can be specified.

Probabilistic Hierarchical Clustering

Probabilistic hierarchical clustering aims to overcome some of the disadvantages of
algorithmic hierarchical clustering methods by using probabilistic models to measure
distances between clusters.

In practice, we can assume that the data generative models adopt common distribution
functions, such as Gaussian distribution or Bernoulli distribution, which are
governed by parameters. The task of learning a generative model is then reduced to
finding the parameter values for which the model best fits the observed data set.
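As a small worked example of fitting such a generative model: for a univariate Gaussian, the maximum-likelihood parameter values are simply the sample mean and the (biased) sample variance, and a higher log-likelihood means the model fits the observed data better. This is a generic statistics sketch, not code from the book.

```python
from math import log, pi

def fit_gaussian(xs):
    """Maximum-likelihood parameters of a univariate Gaussian: the
    sample mean and the (biased) sample variance best fit the data."""
    mu = sum(xs) / len(xs)
    var = sum((x - mu) ** 2 for x in xs) / len(xs)
    return mu, var

def log_likelihood(xs, mu, var):
    # log of the probability density the Gaussian model assigns to the data
    return sum(-0.5 * log(2 * pi * var) - (x - mu) ** 2 / (2 * var)
               for x in xs)
```

In probabilistic hierarchical clustering, two clusters are favored for merging when modeling their union with one such distribution loses little likelihood compared with modeling them separately.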

4 Density-Based Methods

Partitioning and hierarchical methods are designed to find spherical-shaped clusters.
They have difficulty finding clusters of arbitrary shape such as the “S” shape and oval
clusters in Figure 10.13. Given such data, they would likely inaccurately identify convex
regions, where noise or outliers are included in the clusters.
To find clusters of arbitrary shape, alternatively, we can model clusters as dense
regions in the data space, separated by sparse regions. This is the main strategy behind
density-based clustering methods, which can discover clusters of nonspherical shape.

DBSCAN: Density-Based Clustering Based on Connected
Regions with High Density
“How can we find dense regions in density-based clustering?” The density of an object o
can be measured by the number of objects close to o. DBSCAN (Density-Based Spatial
Clustering of Applications with Noise) finds core objects, that is, objects that have dense
neighborhoods. It connects core objects and their neighborhoods to form dense regions
as clusters.
“How does DBSCAN quantify the neighborhood of an object?” A user-specified parameter
ε > 0 is used to specify the radius of a neighborhood we consider for every object.
The ε-neighborhood of an object o is the space within a radius ε centered at o.
Due to the fixed neighborhood size parameterized by ε, the density of a neighborhood
can be measured simply by the number of objects in the neighborhood.

OPTICS: Ordering Points to Identify
the Clustering Structure
Although DBSCAN can cluster objects given input parameters such as ε (the maximum
radius of a neighborhood) and MinPts (the minimum number of points required
in the neighborhood of a core object), it encumbers users with the responsibility of
selecting parameter values that will lead to the discovery of acceptable clusters. This is
a problem associated with many other clustering algorithms. Such parameter settings
are usually empirically set and difficult to determine, especially for real-world, high-dimensional
data sets. Most algorithms are sensitive to these parameter values: Slightly
different settings may lead to very different clusterings of the data. Moreover, real-world,
high-dimensional data sets often have very skewed distributions, such that their intrinsic
clustering structure may not be characterized by global density parameters.
OPTICS computes an ordering of all objects in a given database and, for each object
in the database, stores the core-distance and a suitable reachability-distance. OPTICS
maintains a list called OrderSeeds to generate the output ordering. Objects in OrderSeeds
are sorted by the reachability-distance from their respective closest core objects,
that is, by the smallest reachability-distance of each object.
OPTICS begins with an arbitrary object from the input database as the current
object, p. It retrieves the ε-neighborhood of p, determines the core-distance, and sets
the reachability-distance to undefined.

DENCLUE: Clustering Based on Density
Distribution Functions
Density estimation is a core issue in density-based clustering methods. DENCLUE
(DENsity-based CLUstEring) is a clustering method based on a set of density distribution
functions.

5 Grid-Based Methods

The clustering methods discussed so far are data-driven—they partition the set of
objects and adapt to the distribution of the objects in the embedding space. Alternatively,
a grid-based clustering method takes a space-driven approach by partitioning
the embedding space into cells independent of the distribution of the input objects

STING: STatistical INformation Grid
STING is a grid-based multiresolution clustering technique in which the embedding
spatial area of the input objects is divided into rectangular cells. The space can be divided
in a hierarchical and recursive way. Several levels of such rectangular cells correspond to
different levels of resolution and form a hierarchical structure: Each cell at a high level
is partitioned to form a number of cells at the next lower level. Statistical information
regarding the attributes in each grid cell, such as the mean, maximum, and minimum
values, is precomputed and stored as statistical parameters. These statistical parameters
are useful for query processing and for other data analysis tasks.

CLIQUE: An Apriori-like Subspace Clustering Method

CLIQUE (CLustering In QUEst) is a simple grid-based method for finding density-based
clusters in subspaces. CLIQUE partitions each dimension into nonoverlapping
intervals, thereby partitioning the entire embedding space of the data objects into cells.
It uses a density threshold to identify dense cells and sparse ones. A cell is dense if the
number of objects mapped to it exceeds the density threshold.
The main strategy behind CLIQUE for identifying a candidate search space uses the
monotonicity of dense cells with respect to dimensionality. This is based on the Apriori
property used in frequent pattern and association rule mining.
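The grid quantization and density-threshold step can be sketched as follows. The fixed cell width and 2-D data are simplifications for illustration; CLIQUE itself works per-subspace and generates candidate dense cells level by level, Apriori-style.

```python
from collections import Counter

def dense_cells(points, width, threshold):
    """Quantize each 2-D point into a grid cell of the given width and
    keep the cells whose object count exceeds the density threshold."""
    counts = Counter((int(x // width), int(y // width)) for x, y in points)
    return {cell for cell, n in counts.items() if n > threshold}

def dense_intervals(points, dim, width, threshold):
    # 1-D projection onto a single dimension; by the Apriori-like
    # monotonicity property, every dense 2-D cell must project onto a
    # dense interval in each of its dimensions, so dense intervals
    # prune the candidate search space
    counts = Counter(int(p[dim] // width) for p in points)
    return {c for c, n in counts.items() if n > threshold}
```

The pruning works in one direction only: a dense 1-D interval need not contain any dense 2-D cell, but every dense 2-D cell projects onto dense intervals, so candidates outside those intervals can be skipped.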

6 Evaluation of Clustering

Assessing clustering tendency. In this task, for a given data set, we assess whether a
nonrandom structure exists in the data. Blindly applying a clustering method on a
data set will return clusters; however, the clusters mined may be misleading. Clustering
analysis on a data set is meaningful only when there is a nonrandom structure in
the data.
Determining the number of clusters in a data set. A few algorithms, such as k-means,
require the number of clusters in a data set as a parameter. Moreover, the number
of clusters can be regarded as an interesting and important summary statistic of a
data set. Therefore, it is desirable to estimate this number even before a clustering
algorithm is used to derive detailed clusters.
Measuring clustering quality. After applying a clustering method on a data set, we
want to assess how good the resulting clusters are. A number of measures can be used.
Some methods measure how well the clusters fit the data set, while others measure
how well the clusters match the ground truth, if such truth is available. There are also
measures that score clusterings and thus can compare two sets of clustering results
on the same data set.
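One widely used intrinsic measure of how well clusters fit the data is the silhouette coefficient; here is a minimal sketch for 1-D clusters (using |p - q| as the object distance, an illustrative simplification).

```python
def silhouette(clusters):
    """Average silhouette coefficient. For each object o:
    a(o) = mean distance to the other objects in its own cluster,
    b(o) = minimum over the other clusters of the mean distance to
           that cluster's objects,
    s(o) = (b - a) / max(a, b), in [-1, 1].
    Values near 1 mean o sits deep inside a well-separated cluster;
    negative values mean o is closer to another cluster than its own."""
    scores = []
    for ci, c in enumerate(clusters):
        for o in c:
            if len(c) == 1:
                scores.append(0.0)     # common convention for singletons
                continue
            # sum includes |o - o| = 0, so dividing by len(c) - 1 gives
            # the mean distance to the *other* members
            a = sum(abs(o - p) for p in c) / (len(c) - 1)
            b = min(sum(abs(o - q) for q in other) / len(other)
                    for cj, other in enumerate(clusters) if cj != ci)
            scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)
```

A compact, well-separated clustering scores close to 1, while a clustering that cuts across the natural groups scores near or below 0, so the measure can rank alternative clusterings of the same data set.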

Next Chapter

Data Mining Recent Trends and Research Frontiers

Excerpts From the Book

Data Mining Concepts and Techniques
Third Edition
Jiawei Han
University of Illinois at Urbana–Champaign
Micheline Kamber, Jian Pei
Simon Fraser University
