
Clustering is the process of grouping a set of data objects into multiple groups or clusters

so that objects within a cluster have high similarity, but are very dissimilar to objects

in other clusters. Dissimilarities and similarities are assessed based on the attribute values

describing the objects and often involve distance measures. Clustering as a data

mining tool has its roots in many application areas such as biology, security, business

intelligence, and Web search.

What Is Cluster Analysis?

Cluster analysis or simply clustering is the process of partitioning a set of data objects

(or observations) into subsets. Each subset is a cluster, such that objects in a cluster

are similar to one another, yet dissimilar to objects in other clusters. The set of clusters

resulting from a cluster analysis can be referred to as a clustering.

Clustering is also called data segmentation in some applications because clustering

partitions large data sets into groups according to their similarity. Clustering can

also be used for outlier detection, where outliers (values that are “far away” from any

cluster) may be more interesting than common cases.

Data clustering is under vigorous development. Contributing areas of research

include data mining, statistics, machine learning, spatial database technology, information

retrieval, Web search, biology, marketing, and many other application areas. Owing

to the huge amounts of data collected in databases, cluster analysis has recently become

a highly active topic in data mining research.

As a branch of statistics, cluster analysis has been extensively studied, with the

main focus on distance-based cluster analysis. Cluster analysis tools based on k-means,

k-medoids, and several other methods also have been built into many statistical analysis

software packages or systems, such as S-Plus, SPSS, and SAS. In machine learning, recall

that classification is known as supervised learning because the class label information is

given, that is, the learning algorithm is supervised in that it is told the class membership

of each training tuple. Clustering is known as unsupervised learning because the

class label information is not present. For this reason, clustering is a form of learning

by observation, rather than learning by examples. In data mining, efforts have focused

on finding methods for efficient and effective cluster analysis in large databases. Active

themes of research focus on the scalability of clustering methods, the effectiveness of

methods for clustering complex shapes (e.g., nonconvex) and types of data (e.g., text,

graphs, and images), high-dimensional clustering techniques (e.g., clustering objects

with thousands of features), and methods for clustering mixed numerical and nominal

data in large databases.

Requirements for Cluster Analysis

The following are typical requirements of clustering in data mining.

Scalability: Many clustering algorithms work well on small data sets containing fewer

than several hundred data objects; however, a large database may contain millions or

even billions of objects, particularly in Web search scenarios. Clustering on only a

sample of a given large data set may lead to biased results. Therefore, highly scalable

clustering algorithms are needed.

Ability to deal with different types of attributes: Many algorithms are designed to

cluster numeric (interval-based) data. However, applications may require clustering

other data types, such as binary, nominal (categorical), and ordinal data, or mixtures

of these data types. Recently, more and more applications need clustering techniques

for complex data types such as graphs, sequences, images, and documents.

Discovery of clusters with arbitrary shape: Many clustering algorithms determine

clusters based on Euclidean or Manhattan distance measures (Chapter 2). Algorithms

based on such distance measures tend to find spherical clusters with similar size and

density. However, a cluster could be of any shape. Consider sensors, for example,

which are often deployed for environment surveillance. Cluster analysis on sensor

readings can detect interesting phenomena. We may want to use clustering to find

the frontier of a running forest fire, which is often not spherical. It is important to

develop algorithms that can detect clusters of arbitrary shape.

Requirements for domain knowledge to determine input parameters: Many clustering

algorithms require users to provide domain knowledge in the form of input

parameters such as the desired number of clusters. Consequently, the clustering

results may be sensitive to such parameters. Parameters are often hard to determine,

especially for high-dimensionality data sets and where users have yet to grasp a deep

understanding of their data. Requiring the specification of domain knowledge not

only burdens users, but also makes the quality of clustering difficult to control.

Ability to deal with noisy data: Most real-world data sets contain outliers and/or

missing, unknown, or erroneous data. Sensor readings, for example, are often

noisy—some readings may be inaccurate due to the sensing mechanisms, and some

readings may be erroneous due to interferences from surrounding transient objects.

Clustering algorithms can be sensitive to such noise and may produce poor-quality

clusters. Therefore, we need clustering methods that are robust to noise.

Incremental clustering and insensitivity to input order: In many applications,

incremental updates (representing newer data) may arrive at any time. Some clustering

algorithms cannot incorporate incremental updates into existing clustering

structures and, instead, have to recompute a new clustering from scratch. Clustering

algorithms may also be sensitive to the input data order. That is, given a set

of data objects, clustering algorithms may return dramatically different clusterings

depending on the order in which the objects are presented. Incremental clustering

algorithms and algorithms that are insensitive to the input order are needed.

Capability of clustering high-dimensionality data: A data set can contain numerous

dimensions or attributes. When clustering documents, for example, each keyword

can be regarded as a dimension, and there are often thousands of keywords. Most

clustering algorithms are good at handling low-dimensional data such as data sets

involving only two or three dimensions. Finding clusters of data objects in a high-dimensional

space is challenging, especially considering that such data can be very

sparse and highly skewed.

Constraint-based clustering: Real-world applications may need to perform clustering

under various kinds of constraints. Suppose that your job is to choose the

locations for a given number of new automatic teller machines (ATMs) in a city. To

decide upon this, you may cluster households while considering constraints such as

the city’s rivers and highway networks and the types and number of customers per

cluster. A challenging task is to find data groups with good clustering behavior that

satisfy specified constraints.

Interpretability and usability: Users want clustering results to be interpretable,

comprehensible, and usable. That is, clustering may need to be tied in with specific

semantic interpretations and applications. It is important to study how an

application goal may influence the selection of clustering features and clustering

methods.

Overview of Basic Clustering Methods

Partitioning methods: Given a set of n objects, a partitioning method constructs k

partitions of the data, where each partition represents a cluster and k ≤ n. That is, it

divides the data into k groups such that each group must contain at least one object.

In other words, partitioning methods conduct one-level partitioning on data sets.

The basic partitioning methods typically adopt exclusive cluster separation. That is,

each object must belong to exactly one group.

Most partitioning methods are distance based. Given k, the number of partitions to construct, a partitioning method creates an initial partitioning and then uses an iterative relocation technique that attempts to improve the partitioning by moving

objects from one group to another. The general criterion of a good partitioning is

that objects in the same cluster are “close” or related to each other, whereas objects

in different clusters are “far apart” or very different.

Hierarchical methods: A hierarchical method creates a hierarchical decomposition of

the given set of data objects. A hierarchical method can be classified as being either

agglomerative or divisive, based on how the hierarchical decomposition is formed.

The agglomerative approach, also called the bottom-up approach, starts with each

object forming a separate group. It successively merges the objects or groups close

to one another, until all the groups are merged into one (the topmost level of the

hierarchy), or a termination condition holds. The divisive approach, also called the

top-down approach, starts with all the objects in the same cluster. In each successive

iteration, a cluster is split into smaller clusters, until eventually each object is in one

cluster, or a termination condition holds.

Density-based methods: Some clustering methods have been developed based on the notion of density. Their general idea

is to continue growing a given cluster as long as the density (number of objects or

data points) in the “neighborhood” exceeds some threshold. For example, for each

data point within a given cluster, the neighborhood of a given radius has to contain

at least a minimum number of points. Such a method can be used to filter out noise

or outliers and discover clusters of arbitrary shape.

Grid-based methods: Grid-based methods quantize the object space into a finite

number of cells that form a grid structure. All the clustering operations are performed

on the grid structure (i.e., on the quantized space). The main advantage of

this approach is its fast processing time, which is typically independent of the number

of data objects and dependent only on the number of cells in each dimension in

the quantized space.

Formally, given a data set, D, of n objects, and k, the number of clusters to form, a

partitioning algorithm organizes the objects into k partitions (k ≤ n), where each partition

represents a cluster. The clusters are formed to optimize an objective partitioning

criterion, such as a dissimilarity function based on distance, so that the objects within a

cluster are “similar” to one another and “dissimilar” to objects in other clusters in terms

of the data set attributes.

k-Means: A Centroid-Based Technique

Suppose a data set, D, contains n objects in Euclidean space. Partitioning methods distribute

the objects in D into k clusters, C1,...,Ck

, that is, Ci ⊂ D and Ci ∩ Cj = ∅ for

1 ≤ i, j ≤ k and i ≠ j. An objective function is used to assess the partitioning quality so that

objects within a cluster are similar to one another but dissimilar to objects in other

clusters. That is, the objective function aims for high intracluster similarity and low

intercluster similarity.

A centroid-based partitioning technique uses the centroid of a cluster, Ci

, to represent

that cluster. Conceptually, the centroid of a cluster is its center point. The centroid can

be defined in various ways such as by the mean or medoid of the objects (or points)

assigned to the cluster.
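As an illustration, the assign-then-update loop of k-means can be sketched in a few lines of Python. This is a toy, not a production implementation: initial centroids are naively taken as the first k points (real implementations choose them more carefully), and distances are squared Euclidean.

```python
def kmeans(points, k, iters=20):
    """Minimal k-means sketch: assign each point to its nearest
    centroid, then recompute each centroid as its cluster's mean."""
    centroids = [tuple(p) for p in points[:k]]   # naive initialization
    clusters = [[] for _ in range(k)]
    for _ in range(iters):
        # Assignment step: each point joins the nearest centroid's cluster.
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k),
                    key=lambda i: sum((a - b) ** 2 for a, b in zip(p, centroids[i])))
            clusters[i].append(p)
        # Update step: each centroid becomes the mean of its cluster.
        centroids = [
            tuple(sum(xs) / len(pts) for xs in zip(*pts)) if pts else centroids[i]
            for i, pts in enumerate(clusters)
        ]
    return centroids, clusters

points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (8.0, 8.0), (8.2, 7.9), (7.8, 8.1)]
centroids, clusters = kmeans(points, k=2)
```

On this well-separated toy data the loop converges to one centroid per group within a few iterations.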

k-Medoids: A Representative Object-Based Technique

The k-means algorithm is sensitive to outliers because such objects are far away from the

majority of the data, and thus, when assigned to a cluster, they can dramatically distort

the mean value of the cluster.

Instead of taking the mean value of the objects in a cluster as a reference point, we can

pick actual objects to represent the clusters, using one representative object per cluster.

Each remaining object is assigned to the cluster whose representative object is

the most similar. The partitioning method is then performed based on the principle of

minimizing the sum of the dissimilarities between each object p and its corresponding

representative object. This is the basis for the k-medoids method, which groups n

objects into k clusters by minimizing the absolute error.

The Partitioning Around Medoids (PAM) algorithm is a popular

realization of k-medoids clustering.

While partitioning methods organize a set of objects into a number of exclusive groups, in some situations we may want to partition

our data into groups at different levels such as in a hierarchy. A hierarchical clustering

method works by grouping data objects into a hierarchy or “tree” of clusters.

Employees may be clustered into major groups such as executives, managers, and

staff. One can further partition these groups into smaller subgroups. For instance, the

general group of staff can be further divided into subgroups of senior officers, officers,

and trainees. All these groups form a hierarchy. We can easily summarize or characterize

the data that are organized into a hierarchy, which can be used to find, say, the average

salary of managers and of officers.

A promising direction for improving the clustering quality of hierarchical methods

is to integrate hierarchical clustering with other clustering techniques, resulting in

multiple-phase (or multiphase) clustering. Two such methods are BIRCH and Chameleon. BIRCH begins by partitioning objects hierarchically

using tree structures, where the leaf or low-level nonleaf nodes can be

viewed as “microclusters” depending on the resolution scale. It then applies other clustering algorithms to perform macroclustering on the microclusters. Chameleon

explores dynamic modeling in hierarchical clustering.

A tree structure called a dendrogram is commonly used to represent the process of

hierarchical clustering.
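The bottom-up (agglomerative) process can be sketched as a loop that repeatedly merges the closest pair of clusters, recording each merge as one step of the dendrogram. This toy uses single (minimum) linkage and is quadratic per step, so it is illustrative only.

```python
def single_link(c1, c2):
    # Minimum Euclidean distance between any cross-cluster pair of points
    return min(
        sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5
        for p in c1 for q in c2
    )

def agglomerate(points):
    """Agglomerative sketch: start with singleton clusters and merge
    the closest pair until one cluster remains, recording each merge
    (the sequence of merges is exactly what a dendrogram depicts)."""
    clusters = [[p] for p in points]
    merges = []
    while len(clusters) > 1:
        i, j = min(
            ((i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters))),
            key=lambda ij: single_link(clusters[ij[0]], clusters[ij[1]]),
        )
        merges.append((clusters[i], clusters[j]))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return merges

merges = agglomerate([(0.0,), (0.1,), (5.0,), (5.2,)])
```

For n objects there are always n − 1 merges, one per internal node of the dendrogram.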

Distance Measures in Algorithmic Methods

Whether using an agglomerative method or a divisive method, a core need is to measure

the distance between two clusters, where each cluster is generally a set of objects.
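Four widely used inter-cluster distance measures (minimum, maximum, mean, and average distance) follow directly from their definitions; a sketch with Euclidean distance as the underlying object-level measure:

```python
def dmin(c1, c2, d):
    # Minimum distance (single linkage): closest cross-cluster pair
    return min(d(p, q) for p in c1 for q in c2)

def dmax(c1, c2, d):
    # Maximum distance (complete linkage): farthest cross-cluster pair
    return max(d(p, q) for p in c1 for q in c2)

def davg(c1, c2, d):
    # Average distance over all cross-cluster pairs
    return sum(d(p, q) for p in c1 for q in c2) / (len(c1) * len(c2))

def dmean(c1, c2, d):
    # Mean distance: distance between the two cluster means
    m1 = tuple(sum(xs) / len(c1) for xs in zip(*c1))
    m2 = tuple(sum(xs) / len(c2) for xs in zip(*c2))
    return d(m1, m2)

euclid = lambda p, q: sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5
c1 = [(0.0, 0.0), (0.0, 1.0)]
c2 = [(3.0, 0.0), (4.0, 0.0)]
```

Minimum distance tends to produce chained, elongated clusters, while maximum distance favors compact ones; mean and average distance are compromises between the two extremes.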

BIRCH: Multiphase Hierarchical Clustering

Using Clustering Feature Trees

Balanced Iterative Reducing and Clustering using Hierarchies (BIRCH) is designed for

clustering a large amount of numeric data by integrating hierarchical clustering (at the

initial microclustering stage) and other clustering methods such as iterative partitioning

(at the later macroclustering stage). It overcomes the two difficulties in agglomerative

clustering methods: (1) scalability and (2) the inability to undo what was done in the

previous step.

A CF-tree is a height-balanced tree that stores the clustering features for a hierarchical

clustering.

The primary phases of BIRCH are:

Phase 1: BIRCH scans the database to build an initial in-memory CF-tree, which

can be viewed as a multilevel compression of the data that tries to preserve the data’s

inherent clustering structure.

Phase 2: BIRCH applies a (selected) clustering algorithm to cluster the leaf nodes of

the CF-tree, which removes sparse clusters as outliers and groups dense clusters into

larger ones.
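The summary stored at each CF-tree node is the clustering feature CF = (n, LS, SS): the number of points, their linear sum, and their square sum. CFs are additive, which is what lets BIRCH merge microclusters incrementally without rescanning the underlying points; derived statistics such as the centroid follow directly from the CF. A small sketch:

```python
def cf(points):
    """Clustering feature of a set of d-dimensional points:
    (n, linear sum LS, square sum SS)."""
    n = len(points)
    ls = tuple(sum(xs) for xs in zip(*points))
    ss = sum(x * x for p in points for x in p)
    return n, ls, ss

def cf_merge(a, b):
    # CF additivity: the CF of a union of two disjoint point sets is
    # the componentwise sum of their CFs -- no rescan of points needed.
    return a[0] + b[0], tuple(x + y for x, y in zip(a[1], b[1])), a[2] + b[2]

def centroid(c):
    # The centroid is derivable from the CF alone: LS / n
    n, ls, _ = c
    return tuple(x / n for x in ls)

cf1 = cf([(1.0, 1.0), (3.0, 3.0)])
cf2 = cf([(2.0, 2.0)])
merged = cf_merge(cf1, cf2)
```

Merging the two CFs gives exactly the CF of the pooled points, which is the property BIRCH exploits when absorbing new objects into leaf microclusters.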

Chameleon: Multiphase Hierarchical Clustering

Using Dynamic Modeling

Chameleon is a hierarchical clustering algorithm that uses dynamic modeling to determine

the similarity between pairs of clusters. In Chameleon, cluster similarity is assessed

based on (1) how well connected objects are within a cluster and (2) the proximity of

clusters. That is, two clusters are merged if their interconnectivity is high and they are

close together. Thus, Chameleon does not depend on a static, user-supplied model and

can automatically adapt to the internal characteristics of the clusters being merged. The

merge process facilitates the discovery of natural and homogeneous clusters and applies

to all data types as long as a similarity function can be specified.

Probabilistic Hierarchical Clustering

Probabilistic hierarchical clustering aims to overcome some of the disadvantages of algorithmic hierarchical clustering

by using probabilistic models to measure distances between clusters.

In practice, we can assume that the data generative models adopt common distribution

functions, such as Gaussian distribution or Bernoulli distribution, which are

governed by parameters. The task of learning a generative model is then reduced to

finding the parameter values for which the model best fits the observed data set.
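For a Gaussian model, for example, the best-fitting parameters have a closed form: the maximum-likelihood estimates are simply the sample mean and the (biased) sample variance. A 1-D sketch:

```python
def fit_gaussian(xs):
    """Maximum-likelihood parameters of a 1-D Gaussian model:
    the sample mean and the (biased) sample variance."""
    n = len(xs)
    mu = sum(xs) / n
    var = sum((x - mu) ** 2 for x in xs) / n
    return mu, var

mu, var = fit_gaussian([2.0, 4.0, 6.0])
```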

Partitioning and hierarchical methods are designed to find spherical-shaped clusters. They have difficulty finding clusters of arbitrary shape such as the “S” shape and oval

clusters in Figure 10.13. Given such data, they would likely inaccurately identify convex

regions, where noise or outliers are included in the clusters.

To find clusters of arbitrary shape, alternatively, we can model clusters as dense

regions in the data space, separated by sparse regions. This is the main strategy behind

density-based clustering methods, which can discover clusters of nonspherical shape.

DBSCAN: Density-Based Clustering Based on Connected

Regions with High Density

“How can we find dense regions in density-based clustering?” The density of an object o

can be measured by the number of objects close to o. DBSCAN (Density-Based Spatial

Clustering of Applications with Noise) finds core objects, that is, objects that have dense

neighborhoods. It connects core objects and their neighborhoods to form dense regions

as clusters.

“How does DBSCAN quantify the neighborhood of an object?” A user-specified parameter

ε > 0 is used to specify the radius of a neighborhood we consider for every object.

The ε-neighborhood of an object o is the space within a radius ε centered at o.

Due to the fixed neighborhood size parameterized by ε, the density of a neighborhood

can be measured simply by the number of objects in the neighborhood.
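This idea can be sketched compactly: core objects (those with at least MinPts objects in their ε-neighborhood) seed clusters, which then grow by absorbing the neighborhoods of every core object reached; points absorbed by no cluster remain noise. The sketch below uses Euclidean ε-neighborhoods and a brute-force neighborhood scan (a real implementation would use a spatial index).

```python
def eps_neighborhood(points, o, eps):
    # All objects within Euclidean distance eps of o (including o itself)
    return [p for p in points
            if sum((a - b) ** 2 for a, b in zip(o, p)) <= eps ** 2]

def dbscan(points, eps, min_pts):
    """DBSCAN sketch: each unlabeled core object starts a cluster,
    which grows through the neighborhoods of every core object it
    reaches. Points never absorbed stay labeled None (noise)."""
    labels = {p: None for p in points}
    cid = 0
    for p in points:
        if labels[p] is not None:
            continue
        seeds = eps_neighborhood(points, p, eps)
        if len(seeds) < min_pts:
            continue                      # p is not a core object
        cid += 1                          # start a new cluster at core object p
        labels[p] = cid
        while seeds:
            q = seeds.pop()
            if labels[q] is None:
                labels[q] = cid
                nbrs = eps_neighborhood(points, q, eps)
                if len(nbrs) >= min_pts:  # q is also core: keep growing
                    seeds.extend(nbrs)
    return labels

pts = [(0.0, 0.0), (0.0, 0.5), (0.5, 0.0),
       (10.0, 10.0), (10.0, 10.5), (10.5, 10.0),
       (50.0, 50.0)]
labels = dbscan(pts, eps=1.0, min_pts=3)
```

Here the two dense groups become clusters 1 and 2, while the isolated point (50, 50) has too sparse a neighborhood to be absorbed and is left as noise.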

OPTICS: Ordering Points to Identify

the Clustering Structure

Although DBSCAN can cluster objects given input parameters such as ε (the maximum

radius of a neighborhood) and MinPts (the minimum number of points required

in the neighborhood of a core object), it encumbers users with the responsibility of

selecting parameter values that will lead to the discovery of acceptable clusters. This is

a problem associated with many other clustering algorithms. Such parameter settings

are usually empirically set and difficult to determine, especially for real-world, high-dimensional

data sets. Most algorithms are sensitive to these parameter values: Slightly

different settings may lead to very different clusterings of the data. Moreover, real-world,

high-dimensional data sets often have very skewed distributions such that their intrinsic

clustering structure may not be characterized well by global density parameters.

OPTICS computes an ordering of all objects in a given database and, for each object

in the database, stores the core-distance and a suitable reachability-distance. OPTICS

maintains a list called OrderSeeds to generate the output ordering. Objects in OrderSeeds

are sorted by the reachability-distance from their respective closest core objects,

that is, by the smallest reachability-distance of each object.

OPTICS begins with an arbitrary object from the input database as the current

object, p. It retrieves the ε-neighborhood of p, determines the core-distance, and sets

the reachability-distance to undefined.

DENCLUE: Clustering Based on Density

Distribution Functions

Density estimation is a core issue in density-based clustering methods. DENCLUE

(DENsity-based CLUstEring) is a clustering method based on a set of density distribution

functions.

The clustering methods discussed so far are data driven: they partition the set of

objects and adapt to the distribution of the objects in the embedding space. Alternatively,

a grid-based clustering method takes a space-driven approach by partitioning

the embedding space into cells independent of the distribution of the input objects.

STING: STatistical INformation Grid

STING is a grid-based multiresolution clustering technique in which the embedding

spatial area of the input objects is divided into rectangular cells. The space can be divided

in a hierarchical and recursive way. Several levels of such rectangular cells correspond to

different levels of resolution and form a hierarchical structure: Each cell at a high level

is partitioned to form a number of cells at the next lower level. Statistical information

regarding the attributes in each grid cell, such as the mean, maximum, and minimum

values, is precomputed and stored as statistical parameters. These statistical parameters

are useful for query processing and for other data analysis tasks.
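The precomputation step can be sketched as bucketing points into rectangular cells and storing per-cell summaries. The sketch below is a flat, single-level 2-D grid with a small set of statistics; STING itself maintains a multiresolution hierarchy of such cells and richer parameters (including distribution type).

```python
from collections import defaultdict

def cell_statistics(points, side):
    """STING-style precomputation sketch: map each 2-D point to a
    grid cell of side length `side` and store summary statistics
    (count, mean, min, max of x) per cell."""
    cells = defaultdict(list)
    for x, y in points:
        cells[(int(x // side), int(y // side))].append((x, y))
    stats = {}
    for key, pts in cells.items():
        xs = [p[0] for p in pts]
        stats[key] = {
            "count": len(pts),
            "mean_x": sum(xs) / len(xs),
            "min_x": min(xs),
            "max_x": max(xs),
        }
    return stats

stats = cell_statistics([(0.5, 0.5), (0.9, 0.2), (3.5, 3.5)], side=1.0)
```

Once such summaries exist, queries and clustering decisions can be answered from the cell statistics alone, independent of the number of underlying objects.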

CLIQUE: An Apriori-like Subspace Clustering Method

CLIQUE (CLustering In QUEst) is a simple grid-based method for finding density-based

clusters in subspaces. CLIQUE partitions each dimension into nonoverlapping

intervals, thereby partitioning the entire embedding space of the data objects into cells.

It uses a density threshold to identify dense cells and sparse ones. A cell is dense if the

number of objects mapped to it exceeds the density threshold.

The main strategy behind CLIQUE for identifying a candidate search space uses the

monotonicity of dense cells with respect to dimensionality. This is based on the Apriori

property used in frequent pattern and association rule mining.
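A toy sketch of this pruning for 2-D points: a 2-D cell can be dense only if both of its 1-D projections are dense, so only cells whose projections survive the 1-D pass need to be counted.

```python
from collections import Counter
from itertools import product

def dense_cells(points, width, threshold):
    """CLIQUE-style sketch for 2-D points: find dense 1-D intervals
    in each dimension, then count only those 2-D cells whose two
    1-D projections are both dense (Apriori-like monotonicity)."""
    dense_1d = []
    for d in range(2):
        counts = Counter(int(p[d] // width) for p in points)
        dense_1d.append({i for i, c in counts.items() if c >= threshold})
    dense_2d = set()
    for i, j in product(dense_1d[0], dense_1d[1]):   # candidate cells only
        n = sum(1 for p in points
                if int(p[0] // width) == i and int(p[1] // width) == j)
        if n >= threshold:
            dense_2d.add((i, j))
    return dense_1d, dense_2d

points = [(0.1, 0.2), (0.3, 0.4), (0.2, 0.1), (5.0, 0.3), (5.1, 9.0)]
d1, d2 = dense_cells(points, width=1.0, threshold=3)
```

Cells whose x-interval or y-interval is sparse are never even counted in 2-D, which is exactly the Apriori-style reduction of the candidate search space.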

Assessing clustering tendency. In this task, for a given data set, we assess whether a

nonrandom structure exists in the data. Blindly applying a clustering method on a

data set will return clusters; however, the clusters mined may be misleading. Clustering

analysis on a data set is meaningful only when there is a nonrandom structure in

the data.

Determining the number of clusters in a data set. A few algorithms, such as k-means,

require the number of clusters in a data set as a parameter. Moreover, the number

of clusters can be regarded as an interesting and important summary statistic of a

data set. Therefore, it is desirable to estimate this number even before a clustering

algorithm is used to derive detailed clusters.

Measuring clustering quality. After applying a clustering method on a data set, we

want to assess how good the resulting clusters are. A number of measures can be used.

Some methods measure how well the clusters fit the data set, while others measure

how well the clusters match the ground truth, if such truth is available. There are also

measures that score clusterings and thus can compare two sets of clustering results

on the same data set.
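As one simple example of a fit-based measure, the within-cluster sum of squared errors rewards tight clusters; it can compare two clusterings of the same data set. This is an illustrative choice of measure, not the only one.

```python
def sse(clusters):
    """Sum of squared distances from each object to its cluster's
    centroid: lower values mean the clusters fit the data tighter."""
    total = 0.0
    for pts in clusters:
        centroid = tuple(sum(xs) / len(pts) for xs in zip(*pts))
        total += sum(
            sum((a - b) ** 2 for a, b in zip(p, centroid)) for p in pts
        )
    return total

good = sse([[(0.0, 0.0), (0.0, 1.0)], [(9.0, 9.0), (9.0, 10.0)]])
bad = sse([[(0.0, 0.0), (9.0, 10.0)], [(0.0, 1.0), (9.0, 9.0)]])
```

The grouping that matches the data's two natural clumps scores far lower than the mixed-up grouping, so the measure ranks the two clusterings as expected.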


Excerpts from the book Data Mining: Concepts and Techniques, Third Edition, by Jiawei Han (University of Illinois at Urbana–Champaign), Micheline Kamber, and Jian Pei (Simon Fraser University).

## Partitioned Clusters - Hierarchical clusters - Grid based cluster - Density based Clusters

Clustering is the process of grouping a set of data objects into multiple groups or clusters

so that objects within a cluster have high similarity, but are very dissimilar to objects

in other clusters. Dissimilarities and similarities are assessed based on the attribute values

describing the objects and often involve distance measures.1 Clustering as a data

mining tool has its roots in many application areas such as biology, security, business

intelligence, and Web search.

## 1 Cluster Analysis Basics

What Is Cluster Analysis?

Cluster analysis or simply clustering is the process of partitioning a set of data objects

(or observations) into subsets. Each subset is a cluster, such that objects in a cluster

are similar to one another, yet dissimilar to objects in other clusters. The set of clusters

resulting from a cluster analysis can be referred to as a clustering.

Clustering is also called data segmentation in some applications because clustering

partitions large data sets into groups according to their similarity. Clustering can

also be used for outlier detection, where outliers (values that are “far away” from any

cluster) may be more interesting than common cases.

Data clustering is under vigorous development. Contributing areas of research

include data mining, statistics, machine learning, spatial database technology, information

retrieval, Web search, biology, marketing, and many other application areas. Owing

to the huge amounts of data collected in databases, cluster analysis has recently become

a highly active topic in data mining research.

As a branch of statistics, cluster analysis has been extensively studied, with the

main focus on distance-based cluster analysis. Cluster analysis tools based on k-means,

k-medoids, and several other methods also have been built into many statistical analysis

software packages or systems, such as S-Plus, SPSS, and SAS. In machine learning, recall

that classification is known as supervised learning because the class label information is

given, that is, the learning algorithm is supervised in that it is told the class membership

of each training tuple. Clustering is known as unsupervised learning because the

class label information is not present. For this reason, clustering is a form of learning

by observation, rather than learning by examples. In data mining, efforts have focused

on finding methods for efficient and effective cluster analysis in large databases. Active

themes of research focus on the scalability of clustering methods, the effectiveness of

methods for clustering complex shapes (e.g., nonconvex) and types of data (e.g., text,

graphs, and images), high-dimensional clustering techniques (e.g., clustering objects

with thousands of features), and methods for clustering mixed numerical and nominal

data in large databases.

Requirements for Cluster Analysis

The following are typical requirements of clustering in data mining.

Scalability: Many clustering algorithms work well on small data sets containing fewer

than several hundred data objects; however, a large database may contain millions or

even billions of objects, particularly in Web search scenarios. Clustering on only a

sample of a given large data set may lead to biased results. Therefore, highly scalable

clustering algorithms are needed.

Ability to deal with different types of attributes: Many algorithms are designed to

cluster numeric (interval-based) data. However, applications may require clustering

other data types, such as binary, nominal (categorical), and ordinal data, or mixtures

of these data types. Recently, more and more applications need clustering techniques

for complex data types such as graphs, sequences, images, and documents.

Discovery of clusters with arbitrary shape: Many clustering algorithms determine

clusters based on Euclidean or Manhattan distance measures (Chapter 2). Algorithms

based on such distance measures tend to find spherical clusters with similar size and

density. However, a cluster could be of any shape. Consider sensors, for example,

which are often deployed for environment surveillance. Cluster analysis on sensor

readings can detect interesting phenomena. We may want to use clustering to find

the frontier of a running forest fire, which is often not spherical. It is important to

develop algorithms that can detect clusters of arbitrary shape

Requirements for domain knowledge to determine input parameters: Many clustering

algorithms require users to provide domain knowledge in the form of input

parameters such as the desired number of clusters. Consequently, the clustering

results may be sensitive to such parameters. Parameters are often hard to determine,

especially for high-dimensionality data sets and where users have yet to grasp a deep

understanding of their data. Requiring the specification of domain knowledge not

only burdens users, but also makes the quality of clustering difficult to control.

Ability to deal with noisy data: Most real-world data sets contain outliers and/or

missing, unknown, or erroneous data. Sensor readings, for example, are often

noisy—some readings may be inaccurate due to the sensing mechanisms, and some

readings may be erroneous due to interferences from surrounding transient objects.

Clustering algorithms can be sensitive to such noise and may produce poor-quality

clusters. Therefore, we need clustering methods that are robust to noise.

Incremental clustering and insensitivity to input order: In many applications,

incremental updates (representing newer data) may arrive at any time. Some clustering

algorithms cannot incorporate incremental updates into existing clustering

structures and, instead, have to recompute a new clustering from scratch. Clustering

algorithms may also be sensitive to the input data order. That is, given a set

of data objects, clustering algorithms may return dramatically different clusterings

depending on the order in which the objects are presented. Incremental clustering

algorithms and algorithms that are insensitive to the input order are needed.

Capability of clustering high-dimensionality data: A data set can contain numerous

dimensions or attributes. When clustering documents, for example, each keyword

can be regarded as a dimension, and there are often thousands of keywords. Most

clustering algorithms are good at handling low-dimensional data such as data sets

involving only two or three dimensions. Finding clusters of data objects in a highdimensional

space is challenging, especially considering that such data can be very

sparse and highly skewed.

Constraint-based clustering: Real-world applications may need to perform clustering

under various kinds of constraints. Suppose that your job is to choose the

locations for a given number of new automatic teller machines (ATMs) in a city. To

decide upon this, you may cluster households while considering constraints such as

the city’s rivers and highway networks and the types and number of customers per

cluster. A challenging task is to find data groups with good clustering behavior that

satisfy specified constraints.

Interpretability and usability: Users want clustering results to be interpretable,

comprehensible, and usable. That is, clustering may need to be tied in with specific

semantic interpretations and applications. It is important to study how an

application goal may influence the selection of clustering features and clustering

methods

Overview of Basic Clustering Methods

Partitioning methods: Given a set of n objects, a partitioning method constructs k

partitions of the data, where each partition represents a cluster and k ≤ n. That is, it

divides the data into k groups such that each group must contain at least one object.

In other words, partitioning methods conduct one-level partitioning on data sets.

The basic partitioning methods typically adopt exclusive cluster separation. That is,

each object must belong to exactly one group.

iterative relocation technique that attempts to improve the partitioning by moving

objects from one group to another. The general criterion of a good partitioning is

that objects in the same cluster are “close” or related to each other, whereas objects

in different clusters are “far apart” or very different.

Hierarchical methods: A hierarchical method creates a hierarchical decomposition of

the given set of data objects. A hierarchical method can be classified as being either

agglomerative or divisive, based on how the hierarchical decomposition is formed.

The agglomerative approach, also called the bottom-up approach, starts with each

object forming a separate group. It successively merges the objects or groups close

to one another, until all the groups are merged into one (the topmost level of the

hierarchy), or a termination condition holds. The divisive approach, also called the

top-down approach, starts with all the objects in the same cluster. In each successive

iteration, a cluster is split into smaller clusters, until eventually each object is in one

cluster, or a termination condition holds.

Density-based methods: Some clustering methods have been developed based on the notion of density. Their general idea

is to continue growing a given cluster as long as the density (number of objects or

data points) in the “neighborhood” exceeds some threshold. For example, for each

data point within a given cluster, the neighborhood of a given radius has to contain

at least a minimum number of points. Such a method can be used to filter out noise

or outliers and discover clusters of arbitrary shape.

Grid-based methods: Grid-based methods quantize the object space into a finite

number of cells that form a grid structure. All the clustering operations are performed

on the grid structure (i.e., on the quantized space). The main advantage of

this approach is its fast processing time, which is typically independent of the number

of data objects and dependent only on the number of cells in each dimension in

the quantized space.

## 2 Partitioning Methods

Formally, given a data set, D, of n objects, and k, the number of clusters to form, a

partitioning algorithm organizes the objects into k partitions (k ≤ n), where each partition

represents a cluster. The clusters are formed to optimize an objective partitioning

criterion, such as a dissimilarity function based on distance, so that the objects within a

cluster are “similar” to one another and “dissimilar” to objects in other clusters in terms

of the data set attributes.

k-Means: A Centroid-Based Technique

Suppose a data set, D, contains n objects in Euclidean space. Partitioning methods distribute

the objects in D into k clusters, C1, ..., Ck, that is, Ci ⊂ D and Ci ∩ Cj = ∅ for 1 ≤ i, j ≤ k, i ≠ j. An objective function is used to assess the partitioning quality so that

objects within a cluster are similar to one another but dissimilar to objects in other

clusters. That is, the objective function aims for high intracluster similarity and low

intercluster similarity.

A centroid-based partitioning technique uses the centroid of a cluster, Ci, to represent that cluster. Conceptually, the centroid of a cluster is its center point. The centroid can be defined in various ways, such as by the mean or medoid of the objects (or points) assigned to the cluster.
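To make the centroid-based idea concrete, here is a minimal k-means sketch in plain Python (not the book's pseudocode): the data, function name, and parameter defaults are invented for illustration, Euclidean distance is assumed, and each centroid is the mean of its cluster.

```python
import math
import random

def kmeans(points, k, iters=20, seed=0):
    """Minimal k-means: repeatedly assign each point to its nearest
    centroid, then recompute each centroid as the mean of its cluster."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    clusters = [[] for _ in range(k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Keep the old centroid if a cluster ends up empty.
        centroids = [tuple(sum(xs) / len(xs) for xs in zip(*cl)) if cl
                     else centroids[i] for i, cl in enumerate(clusters)]
    return centroids, clusters

pts = [(1, 1), (1.5, 2), (1, 1.5), (8, 8), (8.5, 8), (9, 9)]
cents, cls = kmeans(pts, 2)
```

Each pass reassigns every object to its nearest centroid and then recomputes the centroids; on well-separated data such as this, a few iterations suffice.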

k-Medoids: A Representative Object-Based Technique

The k-means algorithm is sensitive to outliers because such objects are far away from the

majority of the data, and thus, when assigned to a cluster, they can dramatically distort

the mean value of the cluster.

To diminish this sensitivity, instead of taking the mean value of the objects in a cluster as a reference point, we can

pick actual objects to represent the clusters, using one representative object per cluster.

Each remaining object is assigned to the cluster whose representative object is most similar to it. The partitioning method is then performed based on the principle of

minimizing the sum of the dissimilarities between each object p and its corresponding

representative object. This is the basis for the k-medoids method, which groups n

objects into k clusters by minimizing the absolute error.

The Partitioning Around Medoids (PAM) algorithm is a popular

realization of k-medoids clustering.
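As a rough sketch of the PAM idea (a simplified greedy variant, not the published algorithm): swap a medoid with a non-medoid object whenever the swap strictly lowers the total dissimilarity. The data set and names below are invented for illustration; note that the distant point (25, 25) ends up assigned to a cluster rather than chosen as a medoid.

```python
import math
from itertools import product

def total_cost(points, medoids):
    """Absolute-error criterion: sum of each object's distance
    to its nearest medoid."""
    return sum(min(math.dist(p, m) for m in medoids) for p in points)

def pam(points, k):
    """Simplified greedy PAM sketch: swap a medoid with a non-medoid
    object whenever the swap strictly lowers the total cost."""
    medoids = list(points[:k])  # naive initialization
    improved = True
    while improved:
        improved = False
        for m, p in product(list(medoids), points):
            if p in medoids:
                continue
            candidate = [p if x == m else x for x in medoids]
            if total_cost(points, candidate) < total_cost(points, medoids):
                medoids = candidate
                improved = True
    return medoids

pts = [(0, 0), (0, 1), (1, 0), (1, 1),
       (10, 10), (10, 11), (11, 10), (25, 25)]
meds = pam(pts, 2)
```

Because medoids must be actual objects and the cost is a sum of distances rather than squared distances, a single faraway object shifts the result far less than it shifts a k-means centroid.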

## 3 Hierarchical Methods

While partitioning methods meet the basic clustering requirement of organizing a set of objects into a number of exclusive groups, in some situations we may want to partition

our data into groups at different levels such as in a hierarchy. A hierarchical clustering

method works by grouping data objects into a hierarchy or “tree” of clusters

For example, employees may be clustered into major groups such as executives, managers, and

staff. One can further partition these groups into smaller subgroups. For instance, the

general group of staff can be further divided into subgroups of senior officers, officers,

and trainees. All these groups form a hierarchy. We can easily summarize or characterize

the data that are organized into a hierarchy, which can be used to find, say, the average

salary of managers and of officers.

A promising direction for improving the clustering quality of hierarchical methods

is to integrate hierarchical clustering with other clustering techniques, resulting in

multiple-phase (or multiphase) clustering. Two such methods are BIRCH and Chameleon. BIRCH begins by partitioning objects hierarchically

using tree structures, where the leaf or low-level nonleaf nodes can be

viewed as “microclusters” depending on the resolution scale. It then applies other clustering algorithms to perform macroclustering on the microclusters. Chameleon

explores dynamic modeling in hierarchical clustering.

A tree structure called a dendrogram is commonly used to represent the process of

hierarchical clustering.
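A minimal agglomerative (bottom-up) sketch of the merging process, assuming Euclidean distance and single-link (minimum pairwise) cluster distance; the function name and data are illustrative only:

```python
import math

def agglomerative(points, k):
    """Bottom-up sketch: start with each object as its own cluster and
    repeatedly merge the two clusters with the smallest single-link
    (minimum pairwise) distance until only k clusters remain."""
    clusters = [[p] for p in points]
    while len(clusters) > k:
        best = None  # (distance, i, j) of the closest pair of clusters
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(math.dist(p, q)
                        for p in clusters[i] for q in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i].extend(clusters.pop(j))
    return clusters

pts = [(0, 0), (0, 1), (8, 8), (8, 9), (0.5, 0.5)]
merged = agglomerative(pts, 2)
```

Recording the sequence of merges (and the distance at which each occurs) is exactly the information a dendrogram displays.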

Distance Measures in Algorithmic Methods

Whether using an agglomerative method or a divisive method, a core need is to measure

the distance between two clusters, where each cluster is generally a set of objects.
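Common choices are the minimum, maximum, and average distances between the two object sets. A sketch of these three measures (illustrative helper names; single link corresponds to the minimum distance, complete link to the maximum):

```python
import math

def dist_min(ci, cj):
    """Single link: smallest distance over all cross-cluster pairs."""
    return min(math.dist(p, q) for p in ci for q in cj)

def dist_max(ci, cj):
    """Complete link: largest distance over all cross-cluster pairs."""
    return max(math.dist(p, q) for p in ci for q in cj)

def dist_avg(ci, cj):
    """Average link: mean distance over all cross-cluster pairs."""
    return sum(math.dist(p, q) for p in ci for q in cj) / (len(ci) * len(cj))

a = [(0, 0), (0, 2)]
b = [(3, 0), (5, 0)]
```

For any pair of clusters, the minimum distance lower-bounds the average, which in turn lower-bounds the maximum, so the three linkages can produce quite different merge orders.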

BIRCH: Multiphase Hierarchical Clustering Using Clustering Feature Trees

Balanced Iterative Reducing and Clustering using Hierarchies (BIRCH) is designed for

clustering a large amount of numeric data by integrating hierarchical clustering (at the

initial microclustering stage) and other clustering methods such as iterative partitioning

(at the later macroclustering stage). It overcomes the two difficulties in agglomerative

clustering methods: (1) scalability and (2) the inability to undo what was done in the

previous step.

A CF-tree is a height-balanced tree that stores the clustering features for a hierarchical

clustering.

The primary phases are:

Phase 1: BIRCH scans the database to build an initial in-memory CF-tree, which

can be viewed as a multilevel compression of the data that tries to preserve the data’s

inherent clustering structure.

Phase 2: BIRCH applies a (selected) clustering algorithm to cluster the leaf nodes of

the CF-tree, which removes sparse clusters as outliers and groups dense clusters into

larger ones.

Chameleon: Multiphase Hierarchical Clustering Using Dynamic Modeling

Chameleon is a hierarchical clustering algorithm that uses dynamic modeling to determine

the similarity between pairs of clusters. In Chameleon, cluster similarity is assessed

based on (1) how well connected objects are within a cluster and (2) the proximity of

clusters. That is, two clusters are merged if their interconnectivity is high and they are

close together. Thus, Chameleon does not depend on a static, user-supplied model and

can automatically adapt to the internal characteristics of the clusters being merged. The

merge process facilitates the discovery of natural and homogeneous clusters and applies

to all data types as long as a similarity function can be specified.

Probabilistic Hierarchical Clustering

Probabilistic hierarchical clustering aims to overcome some of the disadvantages of algorithmic hierarchical clustering by using probabilistic models to measure distances between clusters.

In practice, we can assume that the data generative models adopt common distribution

functions, such as Gaussian distribution or Bernoulli distribution, which are

governed by parameters. The task of learning a generative model is then reduced to

finding the parameter values for which the model best fits the observed data set.
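As a tiny worked example of this parameter-fitting view (a 1-D Gaussian on invented data): the maximum-likelihood parameters are simply the sample mean and variance, and the fitted model assigns the observed data a higher log-likelihood than any other choice of mean.

```python
import math

def fit_gaussian(xs):
    """Maximum-likelihood fit of a 1-D Gaussian: the best-fitting
    parameters are the sample mean and the (biased) sample variance."""
    mu = sum(xs) / len(xs)
    var = sum((x - mu) ** 2 for x in xs) / len(xs)
    return mu, var

def log_likelihood(xs, mu, var):
    """Log-probability the Gaussian model assigns to the data set."""
    return sum(-0.5 * math.log(2 * math.pi * var) - (x - mu) ** 2 / (2 * var)
               for x in xs)

data = [4.8, 5.1, 5.0, 4.9, 5.2]  # hypothetical 1-D cluster
mu, var = fit_gaussian(data)
```

In probabilistic hierarchical clustering, a quantity of this kind is what drives merge decisions: two clusters are merged when modeling their union with one distribution fits the data well.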

## 4 Density-Based Methods

Partitioning and hierarchical methods are designed to find spherical-shaped clusters. They have difficulty finding clusters of arbitrary shape such as the “S” shape and oval

clusters in Figure 10.13. Given such data, they would likely inaccurately identify convex

regions, where noise or outliers are included in the clusters.

To find clusters of arbitrary shape, alternatively, we can model clusters as dense

regions in the data space, separated by sparse regions. This is the main strategy behind

density-based clustering methods, which can discover clusters of nonspherical shape.

DBSCAN: Density-Based Clustering Based on Connected Regions with High Density

“How can we find dense regions in density-based clustering?” The density of an object o

can be measured by the number of objects close to o. DBSCAN (Density-Based Spatial

Clustering of Applications with Noise) finds core objects, that is, objects that have dense

neighborhoods. It connects core objects and their neighborhoods to form dense regions

as clusters.

“How does DBSCAN quantify the neighborhood of an object?” A user-specified parameter ε > 0 is used to specify the radius of a neighborhood we consider for every object. The ε-neighborhood of an object o is the space within a radius ε centered at o. Due to the fixed neighborhood size parameterized by ε, the density of a neighborhood can be measured simply by the number of objects in the neighborhood.
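A compact DBSCAN sketch under these definitions (plain Python, Euclidean distance; the naive O(n²) neighborhood search and the data are for illustration only): core objects seed clusters, density-reachable points join them, and everything else is labeled noise.

```python
import math

def dbscan(points, eps, min_pts):
    """Minimal DBSCAN sketch: a core object has at least min_pts points
    (itself included) in its eps-neighborhood; clusters grow outward
    from core objects, and unreachable points are labeled noise (-1)."""
    def neighbors(p):
        return [q for q in points if math.dist(p, q) <= eps]

    labels, cid = {}, 0
    for p in points:
        if p in labels:
            continue
        nbrs = neighbors(p)
        if len(nbrs) < min_pts:
            labels[p] = -1  # noise, for now; may become a border point
            continue
        labels[p] = cid
        frontier = nbrs
        while frontier:
            q = frontier.pop()
            if labels.get(q) == -1:
                labels[q] = cid  # noise reachable from a core: border point
            if q in labels:
                continue
            labels[q] = cid
            if len(neighbors(q)) >= min_pts:  # q is a core object: expand
                frontier.extend(neighbors(q))
        cid += 1
    return labels

pts = [(0, 0), (0, 1), (1, 0), (1, 1),
       (10, 10), (10, 11), (11, 10), (11, 11), (50, 50)]
labels = dbscan(pts, eps=1.5, min_pts=3)
```

Because clusters are grown by connectivity rather than by distance to a center, the same code discovers elongated or otherwise nonspherical dense regions.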

OPTICS: Ordering Points to Identify the Clustering Structure

Although DBSCAN can cluster objects given input parameters such as ε (the maximum radius of a neighborhood) and MinPts (the minimum number of points required

in the neighborhood of a core object), it encumbers users with the responsibility of

selecting parameter values that will lead to the discovery of acceptable clusters. This is

a problem associated with many other clustering algorithms. Such parameter settings

are usually empirically set and difficult to determine, especially for real-world, high-dimensional data sets. Most algorithms are sensitive to these parameter values: Slightly

different settings may lead to very different clusterings of the data. Moreover, real-world,

high-dimensional data sets often have very skewed distributions such that their intrinsic clustering structure may not be well characterized by a single set of global density parameters.

OPTICS computes an ordering of all objects in a given database and, for each object

in the database, stores the core-distance and a suitable reachability-distance. OPTICS

maintains a list called OrderSeeds to generate the output ordering. Objects in OrderSeeds

are sorted by the reachability-distance from their respective closest core objects,

that is, by the smallest reachability-distance of each object.

OPTICS begins with an arbitrary object from the input database as the current

object, p. It retrieves the ε-neighborhood of p, determines the core-distance, and sets

the reachability-distance to undefined.

DENCLUE: Clustering Based on Density Distribution Functions

Density estimation is a core issue in density-based clustering methods. DENCLUE

(DENsity-based CLUstEring) is a clustering method based on a set of density distribution

functions.

## 5 Grid-Based Methods

The clustering methods discussed so far are data-driven—they partition the set of objects and adapt to the distribution of the objects in the embedding space. Alternatively, a grid-based clustering method takes a space-driven approach by partitioning the embedding space into cells independent of the distribution of the input objects.

STING: STatistical INformation Grid

STING is a grid-based multiresolution clustering technique in which the embedding

spatial area of the input objects is divided into rectangular cells. The space can be divided

in a hierarchical and recursive way. Several levels of such rectangular cells correspond to

different levels of resolution and form a hierarchical structure: Each cell at a high level

is partitioned to form a number of cells at the next lower level. Statistical information

regarding the attributes in each grid cell, such as the mean, maximum, and minimum

values, is precomputed and stored as statistical parameters. These statistical parameters

are useful for query processing and for other data analysis tasks.

CLIQUE: An Apriori-like Subspace Clustering Method

CLIQUE (CLustering In QUEst) is a simple grid-based method for finding density-based clusters in subspaces. CLIQUE partitions each dimension into nonoverlapping

intervals, thereby partitioning the entire embedding space of the data objects into cells.

It uses a density threshold to identify dense cells and sparse ones. A cell is dense if the

number of objects mapped to it exceeds the density threshold.

The main strategy behind CLIQUE for identifying a candidate search space uses the

monotonicity of dense cells with respect to dimensionality. This is based on the Apriori

property used in frequent pattern and association rule mining.
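A sketch of the cell-counting step and the monotonicity property (illustrative names; equal-width intervals and a single density threshold are assumed): a cell is dense if it holds more points than the threshold, and every projection of a dense cell onto a subset of its dimensions must itself be dense.

```python
from collections import Counter

def cell(point, width):
    """Map a point to its grid cell by flooring each coordinate into an
    interval of the given width."""
    return tuple(int(c // width) for c in point)

def dense_cells(points, width, threshold):
    """Cells whose object count exceeds the density threshold."""
    counts = Counter(cell(p, width) for p in points)
    return {c for c, n in counts.items() if n > threshold}

pts = [(0.5, 0.5), (0.8, 0.2), (0.1, 0.9), (5.5, 5.5), (9.0, 1.0)]
dense_2d = dense_cells(pts, width=1.0, threshold=2)
# Apriori-style check: project onto the first dimension only.
dense_dim0 = dense_cells([(p[0],) for p in pts], width=1.0, threshold=2)
```

The monotonicity is what lets CLIQUE search bottom-up: candidate dense cells in k dimensions need only be generated from cells already found dense in k − 1 dimensions.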

## 6 Evaluation of Clustering

Assessing clustering tendency. In this task, for a given data set, we assess whether a

nonrandom structure exists in the data. Blindly applying a clustering method on a

data set will return clusters; however, the clusters mined may be misleading. Clustering

analysis on a data set is meaningful only when there is a nonrandom structure in

the data.

Determining the number of clusters in a data set. A few algorithms, such as k-means,

require the number of clusters in a data set as a parameter. Moreover, the number

of clusters can be regarded as an interesting and important summary statistic of a

data set. Therefore, it is desirable to estimate this number even before a clustering

algorithm is used to derive detailed clusters.

Measuring clustering quality. After applying a clustering method on a data set, we

want to assess how good the resulting clusters are. A number of measures can be used.

Some methods measure how well the clusters fit the data set, while others measure

how well the clusters match the ground truth, if such truth is available. There are also

measures that score clusterings and thus can compare two sets of clustering results

on the same data set.
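One widely used intrinsic measure of this kind is the silhouette coefficient. A sketch for a single object on invented two-cluster data (the full measure averages this score over all objects, and b is taken over the nearest other cluster, which with two clusters is the only other one):

```python
import math

def silhouette(p, own, other):
    """Silhouette of one object: a is its mean distance to the rest of
    its own cluster (which must hold at least two objects), b its mean
    distance to the other cluster; (b - a) / max(a, b) lies in [-1, 1],
    with higher values meaning the object is better placed."""
    a = sum(math.dist(p, q) for q in own if q != p) / (len(own) - 1)
    b = sum(math.dist(p, q) for q in other) / len(other)
    return (b - a) / max(a, b)

c1 = [(0, 0), (0, 1), (1, 0)]
c2 = [(10, 10), (10, 11)]
s = silhouette((0, 0), c1, c2)
```

Scores near 1 indicate compact, well-separated clusters; scores near 0 or below suggest the object sits between clusters or in the wrong one.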

Next Chapter

Data Mining Recent Trends and Research Frontiers

Excerpts From the Book

Data Mining Concepts and Techniques

Third Edition

Jiawei Han

University of Illinois at Urbana–Champaign

Micheline Kamber, Jian Pei

Simon Fraser University
