## Thursday, April 14, 2016

### Data Mining - Advanced Cluster Analysis

Probabilistic model-based clustering: A general framework
and a method for deriving clusters where each object is assigned a probability of
belonging to a cluster. Probabilistic model-based clustering is widely used in many
data mining applications such as text mining.

Clustering high-dimensional data: When the dimensionality is high, conventional
distance measures can be dominated by noise.
Clustering graph and network data: Graph and network data are increasingly popular
in applications such as online social networks, the World Wide Web, and digital
libraries.  key issues in clustering graph and network data, including similarity measurement and clustering methods are to be described.
Clustering with constraints:. In some applications, however, various constraints may exist.
These constraints may rise from background knowledge or spatial distribution of
the objects.

1 Probabilistic Model-Based Clustering

In this section, we
demonstrate the need for fuzzy or flexible cluster assignment in some applications, and
introduce a general method to compute probabilistic clusters and assignments.

Fuzzy Clusters
Given a set of objects, X = {x1,...,xn}, a fuzzy set S is a subset of X that allows each
object in X to have a membership degree between 0 and 1. Formally, a fuzzy set, S, can
be modeled as a function, FS:X → [0,1].

Probabilistic Model-Based Clusters
“Fuzzy clusters  provide the flexibility of allowing an object to participate
in multiple clusters. Is there a general framework to specify clusterings where objects may
participate in multiple clusters in a probabilistic way?”

Statistically, we can assume that a hidden category is a distribution over the data
space, which can be mathematically represented using a probability density function
(or distribution function). We call such a hidden category a probabilistic cluster. For a
probabilistic cluster, C, its probability density function, f , and a point, o, in the data
space, f (o) is the relative likelihood that an instance of C appears at o.

Given data set, D, and k, the number of clusters required, the task of probabilistic
model-based cluster analysis is to infer a set of k probabilistic clusters that is most likely to
generate D using this data generation process. An important question remaining is how
we can measure the likelihood that a set of k probabilistic clusters and their probabilities
will generate an observed data set

Expectation-Maximization Algorithm

It can easily be shown that k-means clustering is a special case of fuzzy clustering
The k-means algorithm iterates until the clustering cannot be improved.

Each iteration consists of two steps:
The expectation step (E-step): Given the current cluster centers, each object is assigned
to the cluster with a center that is closest to the object. Here, an object is expected to
belong to the closest cluster.
The maximization step (M-step): Given the cluster assignment, for each cluster, the
algorithm adjusts the center so that the sum of the distances from the objects
assigned to this cluster and the new center is minimized. That is, the similarity of
objects assigned to a cluster is maximized.

We can generalize this two-step method to tackle fuzzy clustering and probabilistic
model-based clustering. In general, an expectation-maximization (EM) algorithm is
a framework that approaches maximum likelihood or maximum a posteriori estimates
of parameters in statistical models. In the context of fuzzy or probabilistic model-based
clustering, an EM algorithm starts with an initial set of parameters and iterates until
the clustering cannot be improved, that is, until the clustering converges or the change
is sufficiently small (less than a preset threshold). Each iteration also consists of two
steps:

2 Clustering High-Dimensional Data

Methods for
high-dimensional data clustering can be divided into two categories: subspace clustering
methods and dimensionality reduction methods.

In some applications, a data object may be described by 10 or more attributes. Such
objects are referred to as a high-dimensional data space.

3 Clustering Graph and Network Data
Cluster analysis on graph and network data extracts valuable knowledge and information.
Such data are increasingly popular in many applications.

Similarity Measures

Geodesic Distance
A simple measure of the distance between two vertices in a graph is the shortest path
between the vertices. Formally, the geodesic distance between two vertices is the length
in terms of the number of edges of the shortest path between the vertices. For two
vertices that are not connected in a graph, the geodesic distance is defined as infinite

SimRank: Similarity Based on Random Walk
and Structural Context
For some applications, geodesic distance may be inappropriate in measuring the similarity
between vertices in a graph. Here we introduce SimRank, a similarity measure
based on random walk and on the structural context of the graph. In mathematics, a
random walk is a trajectory that consists of taking successive random steps.

## 4 Clustering with Constraints

Users often have background knowledge that they want to integrate into cluster analysis.
There may also be application-specific requirements. Such information can be modeled
as clustering constraints

Next Chapter
Data Mining Recent Trends and Research Frontiers

Excerpts from the Book

Data Mining Concepts and Techniques
Third Edition
Jiawei Han
University of Illinois at Urbana–Champaign
Micheline Kamber, Jian Pei
Simon Fraser University