Probabilistic model-based clustering: A general framework and a method for deriving clusters in which each object is assigned a probability of belonging to a cluster. Probabilistic model-based clustering is widely used in many data mining applications, such as text mining.

Clustering high-dimensional data: When the dimensionality is high, conventional

distance measures can be dominated by noise.
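This effect can be seen numerically. The following sketch (an illustration, not from the text) compares the relative contrast between the nearest and farthest neighbor of a random query point in a low- and a high-dimensional space:

```python
import numpy as np

# As dimensionality grows, the gap between the nearest and farthest
# neighbor shrinks relative to the nearest distance, so conventional
# distance measures lose contrast.
rng = np.random.default_rng(0)

def relative_contrast(dim, n=500):
    points = rng.random((n, dim))          # n random points in the unit cube
    query = rng.random(dim)                # a random query point
    d = np.linalg.norm(points - query, axis=1)
    return (d.max() - d.min()) / d.min()   # how distinguishable neighbors are

low = relative_contrast(2)      # plenty of contrast in 2-D
high = relative_contrast(1000)  # distances concentrate in 1000-D
```

With the fixed seed above, the contrast in 1000 dimensions is a small fraction of the contrast in 2 dimensions.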

Clustering graph and network data: Graph and network data are increasingly popular in applications such as online social networks, the World Wide Web, and digital libraries. Key issues in clustering graph and network data, including similarity measurement and clustering methods, are described.

Clustering with constraints: In some applications, various constraints may exist. These constraints may arise from background knowledge or from the spatial distribution of the objects.

1 Probabilistic Model-Based Clustering

In this section, we

demonstrate the need for fuzzy or flexible cluster assignment in some applications, and

introduce a general method to compute probabilistic clusters and assignments.

Fuzzy Clusters

Given a set of objects, X = {x1,...,xn}, a fuzzy set S is a subset of X that allows each

object in X to have a membership degree between 0 and 1. Formally, a fuzzy set, S, can

be modeled as a function, FS:X → [0,1].
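As a minimal sketch, FS can be any mapping into [0, 1]; the "large values" ramp below is an invented example, not from the text:

```python
# A fuzzy set S over X modeled as a membership function F_S: X -> [0, 1].
# The fuzzy set of "large" values is illustrative only.
def membership_large(x, low=0.0, high=10.0):
    """Degree to which x belongs to the fuzzy set of 'large' values."""
    if x <= low:
        return 0.0
    if x >= high:
        return 1.0
    return (x - low) / (high - low)  # linear ramp between the bounds

X = [0, 2.5, 5, 10, 12]
# Each object in X gets a membership degree between 0 and 1.
degrees = {x: membership_large(x) for x in X}
```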

Probabilistic Model-Based Clusters

“Fuzzy clusters provide the flexibility of allowing an object to participate

in multiple clusters. Is there a general framework to specify clusterings where objects may

participate in multiple clusters in a probabilistic way?”

Statistically, we can assume that a hidden category is a distribution over the data

space, which can be mathematically represented using a probability density function

(or distribution function). We call such a hidden category a probabilistic cluster. For a

probabilistic cluster, C, with probability density function, f , and a point, o, in the data space, f (o) is the relative likelihood that an instance of C appears at o.

Given a data set, D, and k, the number of clusters required, the task of probabilistic model-based cluster analysis is to infer a set of k probabilistic clusters that is most likely to generate D using this data generation process. An important question that remains is how we can measure the likelihood that a set of k probabilistic clusters and their probabilities will generate an observed data set.
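One common answer is the mixture likelihood: the probability of D is the product, over objects o, of the weighted sum of the cluster densities at o. The sketch below assumes each cluster is a 1-D Gaussian (an assumption for illustration; this excerpt does not fix the density family), with made-up weights, means, and standard deviations:

```python
import math

def gaussian_pdf(o, mean, std):
    # Relative likelihood f(o) that an instance of the cluster appears at o.
    return math.exp(-0.5 * ((o - mean) / std) ** 2) / (std * math.sqrt(2 * math.pi))

def log_likelihood(D, weights, means, stds):
    # log P(D) = sum over objects o of log( sum_j w_j * f_j(o) )
    total = 0.0
    for o in D:
        total += math.log(sum(w * gaussian_pdf(o, m, s)
                              for w, m, s in zip(weights, means, stds)))
    return total

D = [1.0, 1.2, 0.8, 5.0, 5.3, 4.9]
good = log_likelihood(D, [0.5, 0.5], [1.0, 5.0], [0.5, 0.5])
bad = log_likelihood(D, [0.5, 0.5], [10.0, 20.0], [0.5, 0.5])
# Clusters centered near the data explain it far better.
```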

Expectation-Maximization Algorithm

It can easily be shown that k-means clustering is a special case of fuzzy clustering. The k-means algorithm iterates until the clustering cannot be improved.

Each iteration consists of two steps:

The expectation step (E-step): Given the current cluster centers, each object is assigned

to the cluster with a center that is closest to the object. Here, an object is expected to

belong to the closest cluster.

The maximization step (M-step): Given the cluster assignment, for each cluster, the

algorithm adjusts the center so that the sum of the distances from the objects

assigned to this cluster and the new center is minimized. That is, the similarity of

objects assigned to a cluster is maximized.
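The two steps above can be sketched as follows (a minimal illustration for 2-D points; initialization and stopping are simplified):

```python
import math

def kmeans(points, k, iters=20):
    centers = list(points[:k])  # naive initialization: the first k objects
    for _ in range(iters):
        # E-step: assign each object to the cluster with the closest center.
        clusters = [[] for _ in range(k)]
        for p in points:
            j = min(range(k), key=lambda j: math.dist(p, centers[j]))
            clusters[j].append(p)
        # M-step: move each center to the mean of its assigned objects,
        # minimizing the sum of squared distances within the cluster.
        for j, members in enumerate(clusters):
            if members:
                centers[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return centers

points = [(0, 0), (0, 1), (1, 0), (9, 9), (9, 10), (10, 9)]
centers = kmeans(points, k=2)  # one center lands near each group
```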

We can generalize this two-step method to tackle fuzzy clustering and probabilistic

model-based clustering. In general, an expectation-maximization (EM) algorithm is

a framework that approaches maximum likelihood or maximum a posteriori estimates

of parameters in statistical models. In the context of fuzzy or probabilistic model-based

clustering, an EM algorithm starts with an initial set of parameters and iterates until

the clustering cannot be improved, that is, until the clustering converges or the change

is sufficiently small (less than a preset threshold). Each iteration also consists of two steps: the E-step computes the expected (probabilistic) cluster memberships of the objects given the current parameters, and the M-step re-estimates the parameters so that the expected likelihood is maximized.
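For probabilistic model-based clustering with Gaussian clusters, the two steps look like this (a hedged sketch for the 1-D, two-cluster case; the initialization and the data are illustrative):

```python
import math

def em_gmm(D, iters=50):
    # Initial parameter guesses: weights, means, variances (simplified).
    w = [0.5, 0.5]
    mu = [min(D), max(D)]
    var = [1.0, 1.0]
    for _ in range(iters):
        # E-step: expected (probabilistic) membership of each object,
        # given the current parameters.
        resp = []
        for o in D:
            p = [w[j] * math.exp(-(o - mu[j]) ** 2 / (2 * var[j]))
                 / math.sqrt(2 * math.pi * var[j]) for j in range(2)]
            z = sum(p)
            resp.append([pj / z for pj in p])
        # M-step: re-estimate parameters to maximize the expected likelihood.
        for j in range(2):
            nj = sum(r[j] for r in resp)
            w[j] = nj / len(D)
            mu[j] = sum(r[j] * o for r, o in zip(resp, D)) / nj
            var[j] = max(sum(r[j] * (o - mu[j]) ** 2
                             for r, o in zip(resp, D)) / nj, 1e-6)
    return w, mu, var

D = [1.0, 1.1, 0.9, 1.2, 5.0, 5.1, 4.9, 5.2]
w, mu, var = em_gmm(D)  # means converge near the two groups
```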

2 Clustering High-Dimensional Data

In some applications, a data object may be described by 10 or more attributes. Such a data space is referred to as high-dimensional.

Methods for clustering high-dimensional data can be divided into two categories: subspace clustering methods and dimensionality reduction methods.

3 Clustering Graph and Network Data

Cluster analysis on graph and network data extracts valuable knowledge and information.

Such data are increasingly popular in many applications.

Similarity Measures

Geodesic Distance

A simple measure of the distance between two vertices in a graph is the shortest path

between the vertices. Formally, the geodesic distance between two vertices is the length

in terms of the number of edges of the shortest path between the vertices. For two

vertices that are not connected in a graph, the geodesic distance is defined as infinite.
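A minimal sketch of geodesic distance via breadth-first search over an adjacency list (the graph itself is illustrative):

```python
from collections import deque

def geodesic_distance(adj, u, v):
    """Number of edges on a shortest path from u to v; infinite if disconnected."""
    dist = {u: 0}
    queue = deque([u])
    while queue:
        x = queue.popleft()
        if x == v:
            return dist[x]
        for y in adj.get(x, []):
            if y not in dist:           # first visit = shortest in edge count
                dist[y] = dist[x] + 1
                queue.append(y)
    return float("inf")                 # not connected: defined as infinite

adj = {"a": ["b"], "b": ["a", "c"], "c": ["b"], "d": []}
```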

SimRank: Similarity Based on Random Walk and Structural Context

For some applications, geodesic distance may be inappropriate in measuring the similarity

between vertices in a graph. Here we introduce SimRank, a similarity measure

based on random walk and on the structural context of the graph. In mathematics, a

random walk is a trajectory that consists of taking successive random steps.
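SimRank scores two vertices as similar if they are referenced by similar vertices: s(a, b) = 1 when a = b, and otherwise C / (|I(a)||I(b)|) times the sum of s(i, j) over in-neighbor pairs, computed by fixed-point iteration. A small sketch (the decay constant C = 0.8 and the toy graph are illustrative):

```python
def simrank(in_neighbors, C=0.8, iters=10):
    nodes = list(in_neighbors)
    # Base case: every vertex is fully similar to itself.
    s = {(a, b): 1.0 if a == b else 0.0 for a in nodes for b in nodes}
    for _ in range(iters):
        new = {}
        for a in nodes:
            for b in nodes:
                if a == b:
                    new[(a, b)] = 1.0
                    continue
                Ia, Ib = in_neighbors[a], in_neighbors[b]
                if not Ia or not Ib:
                    new[(a, b)] = 0.0   # no in-neighbors: similarity 0
                    continue
                total = sum(s[(i, j)] for i in Ia for j in Ib)
                new[(a, b)] = C * total / (len(Ia) * len(Ib))
        s = new
    return s

# Tiny citation-style graph, given as in-neighbor lists: u references v and w.
in_neighbors = {"u": [], "v": ["u"], "w": ["u"], "x": ["v", "w"]}
s = simrank(in_neighbors)
# v and w share the same in-neighbor, so they end up similar.
```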

4 Clustering with Constraints

Users often have background knowledge that they want to integrate into cluster analysis. There may also be application-specific requirements. Such information can be modeled as clustering constraints.
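This excerpt does not define the constraint types, but a common formulation (an assumption here) expresses background knowledge as must-link pairs (objects that must share a cluster) and cannot-link pairs (objects that must not). A clustering can then be checked against them:

```python
# Assumed formulation, not defined in this excerpt: must-link and
# cannot-link pairs over a cluster assignment (object -> cluster id).
def satisfies_constraints(assignment, must_link, cannot_link):
    for a, b in must_link:
        if assignment[a] != assignment[b]:
            return False                # must-link pair split across clusters
    for a, b in cannot_link:
        if assignment[a] == assignment[b]:
            return False                # cannot-link pair placed together
    return True

assignment = {"o1": 0, "o2": 0, "o3": 1}
ok = satisfies_constraints(assignment,
                           must_link=[("o1", "o2")],
                           cannot_link=[("o1", "o3")])
```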

Next Chapter: Data Mining Recent Trends and Research Frontiers

Excerpts from the book Data Mining: Concepts and Techniques, Third Edition, by Jiawei Han (University of Illinois at Urbana-Champaign), Micheline Kamber, and Jian Pei (Simon Fraser University).
