## 1 Pattern Mining: A Road Map

Based on pattern diversity, pattern mining can be classified using the following criteria:

Basic patterns: As discussed in Chapter 6, a frequent pattern may have several alternative forms, including a simple frequent pattern, a closed pattern, or a max-pattern. To review, a frequent pattern is a pattern (or itemset) that satisfies a minimum support threshold. A pattern p is a closed pattern if there is no proper superpattern p′ with the same support as p. Pattern p is a max-pattern if there exists no frequent superpattern of p. Frequent patterns can also be mapped into association rules, or other kinds of rules, based on interestingness measures. Sometimes we may also be interested in infrequent or rare patterns (i.e., patterns that occur rarely but are of critical importance) or negative patterns (i.e., patterns that reveal a negative correlation between items).
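To make the distinction concrete, here is a minimal sketch (hypothetical itemsets and support counts, not from the book) that checks whether a frequent itemset is closed or maximal:

```python
# Hypothetical frequent itemsets with their support counts (min_sup already applied).
freq = {
    frozenset("a"): 4, frozenset("b"): 5, frozenset("ab"): 4,
    frozenset("bc"): 2, frozenset("abc"): 2,
}

def is_closed(p):
    # Closed: no proper superpattern has exactly the same support.
    return not any(p < q and s == freq[p] for q, s in freq.items())

def is_max(p):
    # Maximal: no proper superpattern is frequent at all.
    return not any(p < q for q in freq)

for p, s in freq.items():
    print(sorted(p), s, "closed" if is_closed(p) else "-", "max" if is_max(p) else "-")
```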

Based on the abstraction levels involved in the rule set: If the rules reference items or attributes at different abstraction levels, we refer to the rule set mined as consisting of multilevel association rules. If, instead, the rules within a given set do not reference items or attributes at different abstraction levels, then the set contains single-level association rules.

Based on the number of dimensions involved in the rule or pattern: If the items or attributes in an association rule or pattern reference only one dimension, it is a single-dimensional association rule/pattern. If a rule/pattern references two or more dimensions, such as age, income, and buys, then it is a multidimensional association rule/pattern.

Based on the types of values handled in the rule or pattern: If a rule involves associations between the presence or absence of items, it is a Boolean association rule. If a rule describes associations between quantitative items or attributes, then it is a quantitative association rule. In these rules, quantitative values for items or attributes are partitioned into intervals.

Based on the constraints or criteria used to mine selective patterns: The patterns or rules to be discovered can be constraint-based (i.e., satisfying a set of user-defined constraints), approximate, compressed, near-match (i.e., those that tally the support count of the near or almost matching itemsets), top-k (i.e., the k most frequent itemsets for a user-specified value, k), redundancy-aware top-k (i.e., the top-k patterns with similar or redundant patterns excluded), and so on.

Alternatively, pattern mining can be classified with respect to the kinds of data and applications involved, using the following criteria:

Based on kinds of data and features to be mined: Given relational and data warehouse data, most people are interested in itemsets. Thus, frequent pattern mining in this context is essentially frequent itemset mining. In sequence data, we instead mine sequential patterns, that is, frequent subsequences (which are often separated by other events) in a sequence of ordered events.

We may also mine structural patterns, that is, frequent substructures, in a structured data set. Note that structure is a general concept that covers many different kinds of structural forms, such as directed graphs, undirected graphs, lattices, trees, sequences, sets, single items, or combinations of such structures. Single items are the simplest form of structure. Each element of a general pattern may contain a subsequence, a subtree, a subgraph, and so on, and such containment relationships can be defined recursively. Therefore, structural pattern mining can be considered as the most general form of frequent pattern mining.

Based on application domain-specific semantics: Both data and applications can be very diverse, and therefore the patterns to be mined can differ largely based on their domain-specific semantics. Various kinds of application data include spatial data, temporal data, spatiotemporal data, multimedia data (e.g., image, audio, and video data), text data, time-series data, DNA and biological sequences, software programs, chemical compound structures, web structures, sensor networks, social and information networks, biological networks, data streams, and so on. This diversity can lead to dramatically different pattern mining methodologies.

Based on data analysis usages: Frequent pattern mining often serves as an intermediate step for improved data understanding and more powerful data analysis. For example, it can be used as a feature extraction step for classification, which is often referred to as pattern-based classification. Similarly, pattern-based clustering has shown its strength at clustering high-dimensional data. For improved data understanding, patterns can be used for semantic annotation or contextual analysis. Pattern analysis can also be used in recommender systems, which recommend information items (e.g., books, movies, web pages) that are likely to be of interest to the user based on similar users’ patterns. Different analysis tasks may require mining rather different kinds of patterns as well.

## 2 Pattern Mining in Multilevel, Multidimensional Space

Mining Multilevel Associations

It is interesting to examine how to develop effective (useful and not trivial) methods for mining patterns at multiple abstraction levels, with sufficient flexibility for easy traversal among different abstraction spaces.

Association rules generated from mining data at multiple abstraction levels are called multiple-level or multilevel association rules. Multilevel association rules can be mined efficiently using concept hierarchies under a support-confidence framework. In general, a top-down strategy is employed, where counts are accumulated for the calculation of frequent itemsets at each concept level, starting at concept level 1 and working downward in the hierarchy toward the more specific concept levels, until no more frequent itemsets can be found. For each level, any algorithm for discovering frequent itemsets may be used, such as Apriori or its variations.

Using uniform minimum support for all levels (referred to as uniform support): The same minimum support threshold is used when mining at each abstraction level.

Using reduced minimum support at lower levels (referred to as reduced support): Each abstraction level has its own minimum support threshold. The deeper the abstraction level, the smaller the corresponding threshold.

Using item or group-based minimum support (referred to as group-based support): Because users or experts often have insight as to which groups are more important than others, it is sometimes more desirable to set up user-specific, item, or group-based minimal support thresholds when mining multilevel rules.
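As a concrete illustration of level-by-level mining under reduced support, here is a minimal sketch (hypothetical concept hierarchy, transactions, and thresholds, not the book's algorithm); the simple counting routine stands in for Apriori or any other frequent itemset miner:

```python
from collections import Counter
from itertools import combinations

# Hypothetical concept hierarchy: specific item -> more general concept at level 1.
hierarchy = {"IBM laptop": "laptop", "Dell laptop": "laptop",
             "HP printer": "printer", "Canon printer": "printer"}

transactions = [
    {"IBM laptop", "HP printer"},
    {"Dell laptop", "Canon printer"},
    {"IBM laptop", "Canon printer"},
]

def frequent_itemsets(txns, min_sup):
    """Toy stand-in for Apriori: count 1- and 2-itemsets and keep those meeting min_sup."""
    counts = Counter()
    for t in txns:
        for k in (1, 2):
            for itemset in combinations(sorted(t), k):
                counts[itemset] += 1
    return {p: c for p, c in counts.items() if c >= min_sup}

# Reduced support: the deeper (more specific) level uses the smaller threshold.
level1 = frequent_itemsets([{hierarchy[i] for i in t} for t in transactions], min_sup=3)
level2 = frequent_itemsets(transactions, min_sup=2)
print(level1)  # {('laptop',): 3, ('printer',): 3, ('laptop', 'printer'): 3}
print(level2)  # {('IBM laptop',): 2, ('Canon printer',): 2}
```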

Mining Multidimensional Associations

Following the terminology used in multidimensional databases, we refer to each distinct predicate in a rule as a dimension.

Instead of considering transactional data only, sales and related information are often linked with relational data or integrated into a data warehouse. Such data stores are multidimensional in nature. For instance, in addition to keeping track of the items purchased in sales transactions, a relational database may record other attributes associated with the items and/or transactions, such as the item description or the branch location of the sale. Additional relational information regarding the customers who purchased the items (e.g., customer age, occupation, credit rating, income, and address) may also be stored. Considering each database attribute or warehouse dimension as a predicate, we can therefore mine association rules containing multiple predicates, such as

age(X, “20...29”) ∧ occupation(X, “student”) ⇒ buys(X, “laptop”).

Association rules that involve two or more dimensions or predicates can be referred to as multidimensional association rules.
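In practice, one simple way to mine such rules is to treat each (attribute, value) pair as a pseudo-item and reuse ordinary single-dimensional itemset mining. The sketch below is a simplification with hypothetical customer records, not the book's procedure:

```python
from collections import Counter
from itertools import combinations

# Hypothetical customer records: each attribute-value pair becomes a pseudo-item,
# so a standard single-dimensional itemset miner can find multidimensional patterns.
records = [
    {"age": "20...29", "occupation": "student", "buys": "laptop"},
    {"age": "20...29", "occupation": "student", "buys": "laptop"},
    {"age": "30...39", "occupation": "engineer", "buys": "printer"},
]
transactions = [{f"{a}={v}" for a, v in r.items()} for r in records]

min_sup = 2
counts = Counter()
for t in transactions:
    for k in range(1, len(t) + 1):
        for itemset in combinations(sorted(t), k):
            counts[itemset] += 1
frequent = {p: c for p, c in counts.items() if c >= min_sup}

# A multidimensional rule: age=20...29 ∧ occupation=student ⇒ buys=laptop
antecedent = ("age=20...29", "occupation=student")
full = tuple(sorted(antecedent + ("buys=laptop",)))
confidence = frequent[full] / frequent[antecedent]
print(confidence)  # 1.0 on this toy data
```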

Mining Quantitative Association Rules

Quantitative association rules can be mined using (1) a data cube method, (2) a clustering-based method, and (3) a statistical analysis method to uncover exceptional behaviors.

Data Cube–Based Mining of Quantitative Associations

The transformed multidimensional data may be used to construct a data cube. Data cubes are well suited for the mining of multidimensional association rules: they store aggregates (e.g., counts) in multidimensional space, which is essential for computing the support and confidence of multidimensional association rules.
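To make the idea concrete, the following sketch (hypothetical cell counts, not from the book) stores aggregates in a small in-memory cuboid keyed by dimension values and derives a rule's support and confidence directly from those aggregates:

```python
# Hypothetical 3-D cuboid: (age interval, income interval, buys) -> transaction count.
cuboid = {
    ("20...29", "40K...50K", "laptop"): 60,
    ("20...29", "40K...50K", "printer"): 15,
    ("30...39", "40K...50K", "laptop"): 25,
}
total = 1000  # assumed total number of transactions

def count(**conds):
    """Aggregate counts of all cells matching the given dimension values."""
    dims = ("age", "income", "buys")
    return sum(c for cell, c in cuboid.items()
               if all(cell[dims.index(d)] == v for d, v in conds.items()))

# Rule: age(X, "20...29") ∧ income(X, "40K...50K") ⇒ buys(X, "laptop")
support = count(age="20...29", income="40K...50K", buys="laptop") / total
confidence = (count(age="20...29", income="40K...50K", buys="laptop")
              / count(age="20...29", income="40K...50K"))
print(support, confidence)  # 0.06 and 0.8 for these toy counts
```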

Mining Clustering-Based Quantitative Associations

Besides using discretization-based or data cube–based data sets to generate quantitative association rules, we can also generate quantitative association rules by clustering data in the quantitative dimensions.

A typical top-down approach for finding clustering-based quantitative frequent patterns is as follows. For each quantitative dimension, a standard clustering algorithm (e.g., k-means or a density-based clustering algorithm, as described in Chapter 10) can be applied to find clusters in this dimension that satisfy the minimum support threshold. For each cluster, we then examine the 2-D spaces generated by combining the cluster with a cluster or nominal value of another dimension to see if such a combination passes the minimum support threshold. If it does, we continue to search for clusters in this 2-D region and progress to even higher-dimensional combinations. The Apriori pruning principle still applies in this process: if, at any point, the support of a combination does not satisfy minimum support, its further partitioning or combination with other dimensions cannot satisfy it either.
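A minimal sketch of the first two steps, assuming scikit-learn is available and using a hypothetical age/buys data set with an assumed absolute support threshold:

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical data: one quantitative dimension (age) and one nominal dimension (buys).
ages = np.array([21, 22, 23, 24, 45, 46, 47, 48]).reshape(-1, 1)
buys = ["laptop", "laptop", "laptop", "printer", "printer", "printer", "printer", "laptop"]
min_sup = 3  # assumed absolute support threshold

# Step 1: cluster the quantitative dimension and keep clusters meeting min_sup.
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(ages)
for c in set(labels):
    members = [i for i, l in enumerate(labels) if l == c]
    if len(members) < min_sup:
        continue  # Apriori-style pruning: an infrequent cluster cannot grow further
    lo, hi = ages[members].min(), ages[members].max()
    # Step 2: combine the age cluster with each nominal value and re-check support.
    for item in set(buys):
        sup = sum(1 for i in members if buys[i] == item)
        if sup >= min_sup:
            print(f"age in [{lo}, {hi}] and buys={item}: support {sup}")
```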

Using Statistical Theory to Disclose Exceptional Behavior

An association rule under the new definition is a rule of the form

population subset ⇒ mean of values for the subset,

where the mean of the subset is significantly different from the mean of its complement in the database (and this is validated by an appropriate statistical test).
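For example, the following sketch (hypothetical wage values; SciPy assumed available) keeps a candidate rule only if a two-sample t-test finds the subset mean significantly different from that of its complement:

```python
from scipy import stats

# Hypothetical hourly wages: a population subset vs. its complement in the database.
wages_subset = [7.0, 7.5, 7.2, 7.9, 7.1, 7.4]      # e.g., the subset defined by the rule antecedent
wages_complement = [9.0, 9.5, 9.2, 9.8, 9.1, 9.4]  # everyone else

# Candidate rule: population subset ⇒ mean wage = 7.35; keep it only if the difference is significant.
t_stat, p_value = stats.ttest_ind(wages_subset, wages_complement, equal_var=False)
if p_value < 0.05:
    print(f"rule holds: subset mean {sum(wages_subset)/len(wages_subset):.2f}, p = {p_value:.4f}")
```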

Mining Rare Patterns and Negative Patterns

## 3 Constraint-Based Frequent Pattern Mining

Users often have a good sense of which “direction” of mining may lead to interesting patterns and the “form” of the patterns or rules they want to find. They may also have a sense of “conditions” for the rules, which would eliminate the discovery of certain rules that they know would not be of interest. Thus, a good heuristic is to have the users specify such intuition or expectations as constraints to confine the search space. This strategy is known as constraint-based mining. The constraints can include the following:

Knowledge type constraints: These specify the type of knowledge to be mined, such as association, correlation, classification, or clustering.

Data constraints: These specify the set of task-relevant data.

Dimension/level constraints: These specify the desired dimensions (or attributes) of the data, the abstraction levels, or the levels of the concept hierarchies to be used in mining.

Interestingness constraints: These specify thresholds on statistical measures of rule interestingness, such as support, confidence, and correlation.

Rule constraints: These specify the form of, or conditions on, the rules to be mined. Such constraints may be expressed as metarules (rule templates), as the maximum or minimum number of predicates that can occur in the rule antecedent or consequent, or as relationships among attributes, attribute values, and/or aggregates.

These constraints can be specified using a high-level declarative data mining query language and user interface.

Metarule-Guided Mining of Association Rules

“How are metarules useful?” Metarules allow users to specify the syntactic form of rules that they are interested in mining. The rule forms can be used as constraints to help improve the efficiency of the mining process. Metarules may be based on the analyst’s experience, expectations, or intuition regarding the data, or may be automatically generated based on the database schema.
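As a rough illustration, a metarule such as P1(X, Y) ∧ P2(X, W) ⇒ buys(X, "office software") acts as a template: only mined rules whose antecedent contains exactly two predicates and whose consequent is the fixed predicate instantiate it. The checker below is a hypothetical sketch of that filtering step:

```python
# Hypothetical rule representation: (antecedent_predicates, consequent_predicate),
# where each predicate is (name, value) and all predicates refer to the same customer X.
rules = [
    ((("age", "30...39"), ("income", "41K...60K")), ("buys", "office software")),
    ((("age", "20...29"),), ("buys", "laptop")),
]

def matches_metarule(rule, n_antecedents=2, consequent=("buys", "office software")):
    """Template: exactly n variable antecedent predicates and a fixed consequent."""
    antecedent, cons = rule
    return len(antecedent) == n_antecedents and cons == consequent

print([r for r in rules if matches_metarule(r)])  # keeps only the first rule
```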

Constraint-Based Pattern Generation: Pruning Pattern Space and Pruning Data Space

Rule constraints specify expected set/subset relationships of the variables in the mined rules, constant initiation of variables, and constraints on aggregate functions and other forms of constraints. Users typically employ their knowledge of the application or data to specify rule constraints for the mining task. These rule constraints may be used together with, or as an alternative to, metarule-guided mining. In this section, we examine rule constraints as to how they can be used to make the mining process more efficient.

Pruning Pattern Space with Pattern Pruning Constraints

Based on how a constraint may interact with the pattern mining process, there are five categories of pattern mining constraints: (1) antimonotonic, (2) monotonic, (3) succinct, (4) convertible, and (5) inconvertible.
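For instance, with nonnegative prices a constraint such as sum(I.price) ≤ 100 is antimonotonic: once an itemset violates it, every superset does too, so the itemset can be pruned immediately. The sketch below (hypothetical prices and threshold) grows candidates only from itemsets that still satisfy the constraint:

```python
# Hypothetical item prices; the rule constraint is sum(price) <= 100 (antimonotonic).
price = {"a": 30, "b": 45, "c": 60, "d": 80}
MAX_TOTAL = 100
items = sorted(price)

def satisfies(itemset):
    return sum(price[i] for i in itemset) <= MAX_TOTAL

# Grow candidates only from itemsets that still satisfy the constraint; a violating
# itemset is pruned together with all of its supersets (antimonotonicity).
frontier = [(i,) for i in items if satisfies((i,))]
survivors = list(frontier)
while frontier:
    nxt = [s + (i,) for s in frontier for i in items
           if i > s[-1] and satisfies(s + (i,))]
    survivors.extend(nxt)
    frontier = nxt
print(survivors)  # singletons plus ('a', 'b') and ('a', 'c') for these prices
```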

Pruning Data Space with Data Pruning Constraints

The second way of search space pruning in constraint-based frequent pattern mining is pruning data space. This strategy prunes pieces of data if they will not contribute to the subsequent generation of satisfiable patterns in the mining process. We consider two properties: data succinctness and data antimonotonicity.

Constraints are data-succinct if they can be used at the beginning of a pattern mining process to prune the data subsets that cannot satisfy the constraints.

Interestingly, many constraints are data-antimonotonic in the sense that, during the mining process, if a data entry cannot satisfy a data-antimonotonic constraint based on the current pattern, then it can be pruned.
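A minimal sketch (hypothetical prices, threshold, and transactions): suppose the constraint requires sum(I.price) ≥ 80 for any pattern we report. While a pattern is being grown, a transaction whose items can no longer reach that total cannot support any satisfying superpattern, so it is dropped from the projected data:

```python
# Hypothetical data-antimonotonic constraint on the mined pattern: sum(price) >= 80.
price = {"a": 10, "b": 20, "c": 50, "d": 65}
MIN_TOTAL = 80
transactions = [{"a", "b"}, {"a", "b", "c"}, {"a", "c", "d"}, {"b", "d"}]

current_pattern = {"a"}  # the pattern currently being grown

# Project onto transactions that still contain the pattern, then drop any transaction
# whose items, even taken all together, can no longer reach MIN_TOTAL: it cannot
# contribute to any satisfying superpattern of current_pattern.
projected = [t for t in transactions if current_pattern <= t]
kept = [t for t in projected if sum(price[i] for i in t) >= MIN_TOTAL]
print(kept)  # [{'a', 'b', 'c'}, {'a', 'c', 'd'}]
```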

## 4 Mining High-Dimensional Data and Colossal Patterns

The frequent pattern mining methods presented so far handle large data sets having a small number of dimensions. However, some applications may need to mine high-dimensional data (i.e., data with hundreds or thousands of dimensions).

Researchers have overcome this difficulty in two directions. One direction extends a pattern growth approach by further exploring the vertical data format to handle data sets with a large number of dimensions (also called features or items, e.g., genes) but a small number of rows (also called transactions or tuples, e.g., samples). This is useful in applications like the analysis of gene expressions in bioinformatics, for example, where we often need to analyze microarray data that contain a large number of genes (e.g., 10,000 to 100,000) but only a small number of samples (e.g., 100 to 1000). The other direction develops a new mining methodology, called Pattern-Fusion, which mines colossal patterns, that is, patterns of very long length.

Mining Colossal Patterns by Pattern-Fusion

Researchers are more interested in finding large patterns (e.g., long sequences) than finding small ones, since larger patterns usually carry more significant meaning. We call these large patterns colossal patterns, as distinguished from patterns with large support sets. Finding colossal patterns is challenging because incremental mining tends to get “trapped” by an explosive number of midsize patterns before it can even reach candidate patterns of large size.

A new mining strategy called Pattern-Fusion was developed, which fuses a small number of shorter frequent patterns into colossal pattern candidates. It thereby takes leaps in the pattern search space and avoids the pitfalls of both breadth-first and depth-first searches. This method finds a good approximation to the complete set of colossal frequent patterns.

The Pattern-Fusion method has the following major characteristics. First, it traverses the tree in a bounded-breadth way. Only a fixed number of patterns in a bounded-size candidate pool are used as starting nodes to search downward in the pattern tree. As such, it avoids the problem of exponential search space.

Second, Pattern-Fusion has the capability to identify “shortcuts” whenever possible. Each pattern’s growth is not performed with one-item addition, but with an agglomeration of multiple patterns in the pool. These shortcuts direct Pattern-Fusion much more rapidly down the search tree toward the colossal patterns.

The Pattern-Fusion method is outlined in the following two phases:

1. Initial Pool: Pattern-Fusion assumes an initial pool of small frequent patterns is available. This is the complete set of frequent patterns up to a small size (e.g., 3). This initial pool can be mined with any existing efficient mining algorithm.

2. Iterative Pattern-Fusion: Pattern-Fusion takes as input a user-specified parameter, K, which is the maximum number of patterns to be mined. The mining process is iterative. At each iteration, K seed patterns are randomly picked from the current pool. For each of these K seeds, we find all the patterns within a ball of a size specified by τ. All the patterns in each “ball” are then fused together to generate a set of superpatterns. These superpatterns form a new pool. If the pool contains more than K patterns, the next iteration begins with this pool for the new round of random drawing. As the support set of every superpattern shrinks with each new iteration, the iteration process terminates.

Note that Pattern-Fusion merges small subpatterns of a large pattern instead of incrementally expanding patterns with single items.
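The following is a highly simplified sketch of the fusion loop, not the published algorithm: the toy transactions, the pool bound K, the ball radius τ, and the use of pattern distance to form balls are all assumptions made for illustration.

```python
import random
from itertools import combinations

# Patterns are frozensets of items; a pattern's support set is the set of transactions containing it.
transactions = [frozenset("abcdefgh"), frozenset("abcdefg"), frozenset("abcdeh"),
                frozenset("abxy"), frozenset("cdxz")]
MIN_SUP, K, TAU = 2, 3, 0.6  # assumed parameters

def support_set(p):
    return {i for i, t in enumerate(transactions) if p <= t}

def distance(p, q):
    """Pattern distance: 1 - |T(p) ∩ T(q)| / |T(p) ∪ T(q)|."""
    tp, tq = support_set(p), support_set(q)
    return 1 - len(tp & tq) / len(tp | tq)

# Initial pool: all frequent patterns of size 1 or 2 (stand-in for any efficient miner).
items = sorted(set().union(*transactions))
pool = [frozenset(c) for k in (1, 2) for c in combinations(items, k)
        if len(support_set(frozenset(c))) >= MIN_SUP]

for _ in range(3):  # a few fusion iterations
    seeds = random.sample(pool, min(K, len(pool)))
    new_pool = []
    for seed in seeds:
        ball = [p for p in pool if distance(seed, p) <= TAU]  # patterns close to the seed
        fused = frozenset().union(*ball)                      # fuse the ball into a superpattern
        if len(support_set(fused)) >= MIN_SUP:
            new_pool.append(fused)
    if not new_pool:
        break
    pool = new_pool

print(max(pool, key=len))  # an approximate colossal pattern (varies with the random seeds)
```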

## 5 Mining Compressed or Approximate Patterns

To reduce the huge set of frequent patterns generated in mining while maintaining high-quality patterns, we can instead mine a compressed or approximate set of frequent patterns. Top-k most frequent closed patterns were proposed to make the mining process concentrate on only the set of the k most frequent patterns. Although interesting, they usually do not epitomize the k most representative patterns because of the uneven frequency distribution among itemsets. Constraint-based mining of frequent patterns incorporates user-specified constraints to filter out uninteresting patterns. Measures of pattern/rule interestingness and correlation can also be used to help confine the search to patterns/rules of interest.

Mining Compressed Patterns by Pattern Clustering

Clustering is the automatic process of grouping like objects together, so that objects within a cluster are similar to one another and dissimilar to objects in other clusters. In this case, the objects are frequent patterns. The frequent patterns are clustered using a tightness measure called δ-cluster. A representative pattern is selected for each cluster, thereby offering a compressed version of the set of frequent patterns.

Before we begin, let’s review some definitions. An itemset X is a closed frequent itemset in a data set D if X is frequent and there exists no proper super-itemset Y of X such that Y has the same support count as X in D. An itemset X is a maximal frequent itemset in data set D if X is frequent and there exists no super-itemset Y such that X ⊂ Y and Y is frequent in D.

We can use the following distance measure between closed patterns. Let P1 and P2 be two closed patterns. Their supporting transaction sets are T(P1) and T(P2), respectively. The pattern distance of P1 and P2, Pat_Dist(P1, P2), is defined as

Pat_Dist(P1, P2) = 1 − |T(P1) ∩ T(P2)| / |T(P1) ∪ T(P2)|.

Pattern distance is a valid distance metric defined on the set of transactions. Note that it incorporates the support information of patterns.
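A minimal sketch of this measure on hypothetical transactions (it is simply the Jaccard distance between the two supporting transaction sets):

```python
# Pattern distance between two closed patterns, computed from hypothetical transactions.
transactions = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b", "c"}, {"a", "b", "c", "d"}]

def supporting_set(pattern):
    return {i for i, t in enumerate(transactions) if pattern <= t}

def pat_dist(p1, p2):
    t1, t2 = supporting_set(p1), supporting_set(p2)
    return 1 - len(t1 & t2) / len(t1 | t2)

print(pat_dist({"a", "b"}, {"a", "b", "c"}))  # 1 - 2/3 ≈ 0.333 on this toy data
```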

The pattern compression problem can be defined as follows: given a transaction database, a minimum support min_sup, and the cluster quality measure δ, the pattern compression problem is to find a set of representative patterns R such that for each frequent pattern P (with respect to min_sup) there is a representative pattern P_r ∈ R (with respect to min_sup_r) that covers P, and the value of |R| is minimized.

Finding a minimum set of representative patterns is an NP-hard problem. However, efficient methods have been developed that reduce the number of closed frequent patterns generated by orders of magnitude with respect to the original collection of closed patterns. The methods succeed in finding a high-quality compression of the pattern set.

Extracting Redundancy-Aware Top-k Patterns

A small set of k representative patterns that have not only high significance but also low redundancy are called redundancy-aware top-k patterns.

A significance measure S is a function mapping a pattern p ∈ P to a real value such that S(p) is the degree of interestingness (or usefulness) of the pattern p. Given the significance measure S, the redundancy R between two patterns p and q is defined as R(p, q) = S(p) + S(q) − S(p, q). Subsequently, we have S(p|q) = S(p) − R(p, q).
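One common way to approximate redundancy-aware top-k selection is a greedy, marginal-gain procedure; the sketch below uses hypothetical significance and redundancy values and is not the book's exact algorithm:

```python
# Greedy redundancy-aware top-k: repeatedly pick the pattern with the highest
# significance after discounting its redundancy with the patterns already chosen.
significance = {"p1": 0.9, "p2": 0.85, "p3": 0.5, "p4": 0.45}
# Hypothetical pairwise redundancy R(p, q); assumed symmetric, 0 for unlisted pairs.
redundancy = {("p1", "p2"): 0.8, ("p3", "p4"): 0.4}

def R(p, q):
    return redundancy.get((p, q), redundancy.get((q, p), 0.0))

def top_k(k):
    chosen = []
    candidates = set(significance)
    while candidates and len(chosen) < k:
        # relative significance of p given the chosen set: S(p) minus its worst redundancy
        gain = lambda p: significance[p] - max((R(p, q) for q in chosen), default=0.0)
        best = max(candidates, key=gain)
        chosen.append(best)
        candidates.remove(best)
    return chosen

print(top_k(2))  # ['p1', 'p3']: p2 is heavily redundant with p1, so p3 wins
```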

## 6 Pattern Exploration and Application

Semantic Annotation of Frequent Patterns

“What is an appropriate semantic annotation for a frequent pattern?” We can define the basic task of semantic pattern annotation as follows:

1. Select context units and design a strength weight for each unit to model the contexts of frequent patterns.

2. Design similarity measures for the contexts of two patterns, and for a transaction and a pattern context.

3. For a given frequent pattern, extract the most significant context indicators, representative transactions, and semantically similar patterns to construct a structured annotation.
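As a small illustration of step 2, the sketch below models each pattern's context as a weighted vector over hypothetical context units and uses cosine similarity as the context similarity measure (one reasonable choice, not prescribed by the book):

```python
import math

# Hypothetical context models: pattern -> {context unit: strength weight}.
context = {
    "frequent_pattern_mining": {"apriori": 0.9, "support": 0.8, "fp-growth": 0.7},
    "association_rule":        {"support": 0.9, "confidence": 0.9, "apriori": 0.5},
}

def cosine(u, v):
    """Cosine similarity between two sparse weight vectors."""
    dot = sum(u[k] * v[k] for k in u.keys() & v.keys())
    norm = math.sqrt(sum(x * x for x in u.values())) * math.sqrt(sum(x * x for x in v.values()))
    return dot / norm

sim = cosine(context["frequent_pattern_mining"], context["association_rule"])
print(round(sim, 3))  # patterns with similar contexts become candidate annotations for each other
```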

Applications of Pattern Mining

Pattern mining is widely used for noise filtering and data cleaning as preprocessing in many data-intensive applications.

Pattern mining often helps in the discovery of inherent structures and clusters hidden in the data.

Frequent patterns can also be used effectively for subspace clustering in high-dimensional space.

Pattern analysis is useful in the analysis of spatiotemporal data, time-series data, image data, video data, and multimedia data.

Pattern mining has also been used for the analysis of sequence or structural data such as trees, graphs, subsequences, and networks.


Excerpts from the book *Data Mining: Concepts and Techniques*, Third Edition, by Jiawei Han (University of Illinois at Urbana–Champaign), Micheline Kamber, and Jian Pei (Simon Fraser University).
