1 Pattern Mining: A Road Map
Based on pattern diversity, pattern mining can be classified using the following
criteria:
Basic patterns: As discussed in Chapter 6, a frequent pattern may have several alternative
forms, including a simple frequent pattern, a closed pattern, or a max-pattern.
To review, a frequent pattern is a pattern (or itemset) that satisfies a minimum support
threshold. A pattern p is a closed pattern if there is no proper superpattern p′ with the
same support as p. Pattern p is a max-pattern if there exists no frequent superpattern
of p. Frequent patterns can also be mapped into association rules, or other kinds
of rules, based on interestingness measures. Sometimes we may also be interested in
infrequent or rare patterns (i.e., patterns that occur rarely but are of critical importance)
or negative patterns (i.e., patterns that reveal a negative correlation between items).
Based on the abstraction levels involved in the rule set: If the rules mined reference
items or attributes at different abstraction levels, we refer to the rule set mined as
consisting of multilevel association rules. If, instead, the rules within a given set do
not reference items or attributes at different abstraction levels, then the set contains
single-level association rules.
Based on the number of dimensions involved in the rule or pattern: If the items
or attributes in an association rule or pattern reference only one dimension, it is a
single-dimensional association rule/pattern.
If a rule/pattern references two or more dimensions, such as age, income, and buys,
then it is a multidimensional association rule/pattern.
Based on the types of values handled in the rule or pattern: If a rule involves associations
between the presence or absence of items, it is a Boolean association rule.
If a rule describes associations between quantitative items or attributes, then it
is a quantitative association rule. In these rules, quantitative values for items or
attributes are partitioned into intervals.
Based on the constraints or criteria used to mine selective patterns: The patterns
or rules to be discovered can be constraint-based (i.e., satisfying a set of user-defined
constraints), approximate, compressed, near-match (i.e., those that tally
the support count of the near or almost matching itemsets), top-k (i.e., the k most
frequent itemsets for a user-specified value, k), redundancy-aware top-k (i.e., the
top-k patterns with similar or redundant patterns excluded), and so on.
Alternatively, pattern mining can be classified with respect to the kinds of data and
applications involved, using the following criteria:
Based on kinds of data and features to be mined: Given relational and data warehouse
data, most people are interested in itemsets. Thus, frequent pattern mining
in this context is essentially frequent itemset mining. In other applications, we may
instead mine sequential patterns, that is, frequent subsequences (which are often
separated by some other events) in a sequence of ordered events.
We may also mine structural patterns, that is, frequent substructures, in a structured
data set. Note that structure is a general concept that covers many different
kinds of structural forms such as directed graphs, undirected graphs, lattices, trees,
sequences, sets, single items, or combinations of such structures. Single items are the
simplest form of structure. Each element of a general pattern may contain a subsequence,
a subtree, a subgraph, and so on, and such containment relationships can
be defined recursively. Therefore, structural pattern mining can be considered as the
most general form of frequent pattern mining.
Based on application domain-specific semantics: Both data and applications can be
very diverse, and therefore the patterns to be mined can differ largely based on their
domain-specific semantics. Various kinds of application data include spatial data,
temporal data, spatiotemporal data, multimedia data (e.g., image, audio, and video
data), text data, time-series data, DNA and biological sequences, software programs,
chemical compound structures, web structures, sensor networks, social and information
networks, biological networks, data streams, and so on. This diversity can lead
to dramatically different pattern mining methodologies.
Based on data analysis usages: Frequent pattern mining often serves as an intermediate
step for improved data understanding and more powerful data analysis. For
example, it can be used as a feature extraction step for classification, which is often
referred to as pattern-based classification. Similarly, pattern-based clustering has
shown its strength at clustering high-dimensional data. For improved data understanding,
patterns can be used for semantic annotation or contextual analysis. Pattern
analysis can also be used in recommender systems, which recommend information
items (e.g., books, movies, web pages) that are likely to be of interest to the user
based on similar users’ patterns. Different analysis tasks may require mining rather
different kinds of patterns as well.
2 Pattern Mining in Multilevel, Multidimensional Space
Mining Multilevel Associations
It is interesting to examine how
to develop effective (useful and not trivial) methods for mining patterns at multiple abstraction levels, with
sufficient flexibility for easy traversal among different abstraction spaces.
Association rules generated from mining data at multiple abstraction levels are
called multiple-level or multilevel association rules. Multilevel association rules can be
mined efficiently using concept hierarchies under a support-confidence framework. In
general, a top-down strategy is employed, where counts are accumulated for the calculation
of frequent itemsets at each concept level, starting at concept level 1 and working
downward in the hierarchy toward the more specific concept levels, until no more frequent
itemsets can be found. For each level, any algorithm for discovering frequent
itemsets may be used, such as Apriori or its variations.
Using uniform minimum support for all levels (referred to as uniform support):
The same minimum support threshold is used when mining at each abstraction level.
Using reduced minimum support at lower levels (referred to as reduced support):
Each abstraction level has its own minimum support threshold. The deeper the
abstraction level, the smaller the corresponding threshold.
Using item or group-based minimum support (referred to as group-based support):
Because users or experts often have insight as to which groups are more
important than others, it is sometimes more desirable to set up user-specific, item-based,
or group-based minimum support thresholds when mining multilevel rules.
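To make the top-down strategy concrete, here is a minimal Python sketch of multilevel mining under reduced support. The two-level concept hierarchy (PARENT), the toy transactions, and the per-level thresholds are illustrative assumptions, and the Apriori routine is deliberately unoptimized.

```python
# Toy two-level concept hierarchy (hypothetical): each specific item maps to
# its parent concept; level 1 is the most general level.
PARENT = {
    "skim milk": "milk", "2% milk": "milk",
    "wheat bread": "bread", "white bread": "bread",
}

def lift_to_level(transaction, level, max_level):
    """Generalize each item upward until it sits at the requested level."""
    items = set(transaction)
    for _ in range(max_level - level):
        items = {PARENT.get(i, i) for i in items}
    return frozenset(items)

def apriori(transactions, min_sup):
    """Plain level-wise Apriori: return {frequent itemset: support count}."""
    result = {}
    candidates = {frozenset([i]) for t in transactions for i in t}
    while candidates:
        size = len(next(iter(candidates)))
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        frequent = {s: n for s, n in counts.items() if n >= min_sup}
        result.update(frequent)
        candidates = {a | b for a in frequent for b in frequent if len(a | b) == size + 1}
    return result

def multilevel_frequent(transactions, thresholds):
    """Reduced support: thresholds[k-1] is the (smaller) threshold for level k."""
    max_level = len(thresholds)
    return {
        level: apriori([lift_to_level(t, level, max_level) for t in transactions], min_sup)
        for level, min_sup in enumerate(thresholds, start=1)
    }

txns = [{"skim milk", "wheat bread"}, {"2% milk", "white bread"}, {"2% milk"}]
print(multilevel_frequent(txns, thresholds=[3, 2]))  # level 1 finds {milk}; level 2 finds {2% milk}
```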
Mining Multidimensional Associations
Following the terminology used in multidimensional databases, we refer to each distinct
predicate in a rule as a dimension.
Instead of considering transactional data only, sales and related information are often
linked with relational data or integrated into a data warehouse. Such data stores are
multidimensional in nature. For instance, in addition to keeping track of the items purchased
in sales transactions, a relational database may record other attributes associated
with the items and/or transactions such as the item description or the branch location
of the sale. Additional relational information regarding the customers who purchased
the items (e.g., customer age, occupation, credit rating, income, and address) may also
be stored. Considering each database attribute or warehouse dimension as a predicate,
we can therefore mine association rules containing multiple predicates such as
age(X, “20...29”) ∧ occupation(X, “student”) ⇒ buys(X, “laptop”). (7.7)
Association rules that involve two or more dimensions or predicates can be referred
to as multidimensional association rules.
Mining Quantitative Association Rules
Quantitative association rules can be mined using several techniques, including (1) a
data cube method, (2) a clustering-based method, and (3) a statistical analysis method
to uncover exceptional behaviors.
Data Cube–Based Mining of Quantitative Associations
Once quantitative attributes have been discretized into intervals, the transformed multidimensional data may be used to construct a
data cube. Data cubes are well suited for the mining of multidimensional association
rules: They store aggregates (e.g., counts) in multidimensional space, which is essential
for computing the support and confidence of multidimensional association rules.
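As a minimal illustration, the sketch below treats a cuboid simply as a dictionary of cell counts over the dimensions age, occupation, and buys, and reads the support and confidence of rule (7.7) directly from those stored aggregates; all counts and cell values are hypothetical.

```python
# Hypothetical cuboid: cell -> aggregate count; None marks an aggregated-out
# ("*") dimension value, as in the (age, occupation, *) cell below.
N = 1000  # total number of base tuples
cube = {
    ("20...29", "student", "laptop"): 60,
    ("20...29", "student", None): 150,
}

# Rule (7.7): age(X, "20...29") AND occupation(X, "student") => buys(X, "laptop")
support = cube[("20...29", "student", "laptop")] / N                 # 0.06
confidence = (cube[("20...29", "student", "laptop")]
              / cube[("20...29", "student", None)])                  # 0.40
print(f"support = {support:.2f}, confidence = {confidence:.2f}")
```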
Mining Clustering-Based Quantitative Associations
Besides using discretization-based or data cube–based data sets to generate quantitative
association rules, we can also generate quantitative association rules by clustering
data in the quantitative dimensions.
A typical top-down approach for finding clustering-based quantitative frequent patterns
is as follows. For each quantitative dimension, a standard clustering algorithm
(e.g., k-means or a density-based clustering algorithm, as described in Chapter 10) can
be applied to find clusters in this dimension that satisfy the minimum support threshold.
For each cluster, we then examine the 2-D spaces generated by combining the cluster
with a cluster or nominal value of another dimension to see if such a combination passes
the minimum support threshold. If it does, we continue to search for clusters in this
2-D region and progress to even higher-dimensional combinations. The Apriori pruning
still applies in this process: If, at any point, the support of a combination does not
have minimum support, its further partitioning or combination with other dimensions
cannot have minimum support either.
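The sketch below illustrates this top-down process on hypothetical (age, income) tuples. A simple gap-based 1-D grouping stands in for k-means or a density-based algorithm from Chapter 10; the data, the max_gap parameter, and the threshold are illustrative assumptions.

```python
def cluster_1d(values, max_gap):
    """Group sorted values into clusters wherever a gap exceeds max_gap."""
    ordered = sorted(values)
    clusters, current = [], [ordered[0]]
    for v in ordered[1:]:
        if v - current[-1] <= max_gap:
            current.append(v)
        else:
            clusters.append(current)
            current = [v]
    clusters.append(current)
    return clusters

rows = [(21, 30), (22, 31), (23, 33), (45, 80), (46, 82), (47, 81)]  # (age, income)
min_sup = 3

# Step 1: clusters in the age dimension that pass min_sup.
age_clusters = [c for c in cluster_1d([a for a, _ in rows], max_gap=5) if len(c) >= min_sup]

# Step 2: combine each age cluster with income clusters to form 2-D regions.
# Apriori pruning applies: a 2-D region can never exceed the support of the
# 1-D cluster it extends, so we only search inside frequent age clusters.
for ac in age_clusters:
    incomes = [inc for age, inc in rows if age in ac]
    for ic in cluster_1d(incomes, max_gap=5):
        if len(ic) >= min_sup:
            print(f"age in [{min(ac)}, {max(ac)}] and income in [{min(ic)}, {max(ic)}] (support {len(ic)})")
```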
Using Statistical Theory to Disclose Exceptional Behavior
An association rule under the new definition is a rule of the form:
population subset ⇒ mean of values for the subset,
where the mean of the subset is significantly different from the mean of its complement
in the database (and this is validated by an appropriate statistical test).
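A minimal sketch of this formulation follows, with hypothetical income values and a two-sample t-test (via scipy) standing in as one reasonable choice of significance test.

```python
from scipy import stats

# Hypothetical rule: occupation = student => mean(income) of the subset.
student_income = [28, 30, 27, 31, 29, 26]   # population subset
other_income = [52, 60, 48, 55, 58, 50]     # its complement in the database

t_stat, p_value = stats.ttest_ind(student_income, other_income, equal_var=False)
if p_value < 0.05:
    subset_mean = sum(student_income) / len(student_income)
    complement_mean = sum(other_income) / len(other_income)
    print(f"rule validated: subset mean {subset_mean:.1f} vs "
          f"complement mean {complement_mean:.1f} (p = {p_value:.2g})")
```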
Mining Rare Patterns and Negative Patterns
3 Constraint-Based Frequent Pattern Mining
Often, users have a good sense of
which “direction” of mining may lead to interesting patterns and the “form” of the patterns
or rules they want to find. They may also have a sense of “conditions” for the rules,
which would eliminate the discovery of certain rules that they know would not be of
interest. Thus, a good heuristic is to have the users specify such intuition or expectations
as constraints to confine the search space. This strategy is known as constraint-based
mining. The constraints can include the following:
Knowledge type constraints: These specify the type of knowledge to be mined, such
as association, correlation, classification, or clustering.
Data constraints: These specify the set of task-relevant data.
Dimension/level constraints: These specify the desired dimensions (or attributes)
of the data, the abstraction levels, or the level of the concept hierarchies to be used in
mining.
Interestingness constraints: These specify thresholds on statistical measures of rule
interestingness such as support, confidence, and correlation.
Rule constraints: These specify the form of, or conditions on, the rules to be mined.
Such constraints may be expressed as metarules (rule templates), as the maximum or
minimum number of predicates that can occur in the rule antecedent or consequent,
or as relationships among attributes, attribute values, and/or aggregates.
These constraints can be specified using a high-level declarative data mining query
language and user interface.
Metarule-Guided Mining of Association Rules
“How are metarules useful?” Metarules allow users to specify the syntactic form of rules
that they are interested in mining. The rule forms can be used as constraints to help
improve the efficiency of the mining process. Metarules may be based on the analyst’s
experience, expectations, or intuition regarding the data or may be automatically
generated based on the database schema.
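As a minimal illustration, the sketch below filters hypothetical candidate rules against a metarule of the form P1(X, Y) ∧ P2(X, W) ⇒ buys(X, "office software"): only rules whose syntactic form matches the template are kept.

```python
# Each candidate rule is (antecedent predicates, consequent predicate);
# the rules themselves are hypothetical.
candidate_rules = [
    ((("age", "30...39"), ("income", "41K...60K")), ("buys", "office software")),
    ((("age", "20...29"),), ("buys", "laptop")),
]

def matches_metarule(rule):
    """Template: exactly two antecedent predicates, consequent fixed."""
    antecedent, consequent = rule
    return len(antecedent) == 2 and consequent == ("buys", "office software")

for rule in candidate_rules:
    if matches_metarule(rule):
        print("keep:", rule)   # only the first rule matches the metarule
```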
Constraint-Based Pattern Generation: Pruning Pattern Space and Pruning Data Space
Rule constraints specify expected set/subset relationships of the variables in the mined
rules, constant initiation of variables, and constraints on aggregate functions and other
forms of constraints. Users typically employ their knowledge of the application or
data to specify rule constraints for the mining task. These rule constraints may be
used together with, or as an alternative to, metarule-guided mining. In this section,
we examine how rule constraints can be used to make the mining process more efficient.
Pruning Pattern Space with Pattern Pruning Constraints
Based on how a constraint may interact with the pattern mining process, there are five
categories of pattern mining constraints: (1) antimonotonic, (2) monotonic, (3) succinct,
(4) convertible, and (5) inconvertible.
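As a minimal illustration of the first category, the sketch below pushes a hypothetical antimonotonic constraint, sum(price) ≤ budget, into an Apriori-style level-wise loop. Once an itemset violates such a constraint, no superset can satisfy it, so violators are pruned exactly like infrequent itemsets (support counting is omitted for brevity).

```python
price = {"pen": 2, "notebook": 5, "printer": 120, "ink": 40}  # hypothetical prices
budget = 50

def satisfies(itemset):
    """Antimonotonic constraint: sum(price) <= budget."""
    return sum(price[i] for i in itemset) <= budget

level = [frozenset([i]) for i in price if satisfies([i])]  # "printer" pruned immediately
while level:
    print(sorted(tuple(sorted(s)) for s in level))
    size = len(next(iter(level)))
    candidates = {a | b for a in level for b in level if len(a | b) == size + 1}
    # Supersets of any violator are never generated past this filter.
    level = [s for s in candidates if satisfies(s)]
```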
Pruning Data Space with Data Pruning Constraints
The second way of search space pruning in constraint-based frequent pattern mining
is pruning data space. This strategy prunes pieces of data if they will not contribute to
the subsequent generation of satisfiable patterns in the mining process. We consider two
properties: data succinctness and data antimonotonicity.
Constraints are data-succinct if they can be used at the beginning of a pattern mining
process to prune the data subsets that cannot satisfy the constraints.
Interestingly, many constraints are data-antimonotonic in the sense that during the
mining process, if a data entry cannot satisfy a data-antimonotonic constraint based on
the current pattern, then it can be pruned.
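The sketch below illustrates data-antimonotonic pruning under a hypothetical constraint sum(profit) ≥ target: while growing the current pattern, a transaction is dropped from the projected database as soon as an optimistic bound shows it can no longer help the pattern reach the target.

```python
profit = {"a": 10, "b": -5, "c": 30, "d": 2}  # hypothetical item profits
target = 35

def prune_data(transactions, current_pattern):
    """Keep only transactions that could still extend the pattern to the target."""
    base = sum(profit[i] for i in current_pattern)
    kept = []
    for t in transactions:
        if not current_pattern <= t:   # projected database: t must contain the pattern
            continue
        # Optimistic bound: only positive-profit remaining items can raise the total.
        bound = base + sum(max(profit[i], 0) for i in t - current_pattern)
        if bound >= target:            # data-antimonotonic check
            kept.append(t)
    return kept

txns = [frozenset("abd"), frozenset("acd"), frozenset("bc")]
print(prune_data(txns, frozenset("a")))  # only {'a', 'c', 'd'} can still reach 35
```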
4 Mining High-Dimensional Data and Colossal Patterns
The frequent pattern mining methods presented so far handle large data sets having
a small number of dimensions. However, some applications may need to mine
high-dimensional data (i.e., data with hundreds or thousands of dimensions).
Researchers have overcome this difficulty in two directions. One direction extends a
pattern growth approach by further exploring the vertical data format to handle data
sets with a large number of dimensions (also called features or items, e.g., genes) but
a small number of rows (also called transactions or tuples, e.g., samples). This is useful
in applications such as the analysis of gene expressions in bioinformatics,
where we often need to analyze microarray data that contain a large number of genes
(e.g., 10,000 to 100,000) but only a small number of samples (e.g., 100 to 1000). The
other direction develops a new mining methodology, called Pattern-Fusion, which mines
colossal patterns, that is, patterns of very long length.
Mining Colossal Patterns by Pattern-Fusion
Researchers are more interested in finding large patterns
(e.g., long sequences) than finding small ones since larger patterns usually carry more
significant meaning. We call these large patterns colossal patterns, as distinguished from
patterns with large support sets. Finding colossal patterns is challenging because incremental
mining tends to get “trapped” by an explosive number of midsize patterns before
it can even reach candidate patterns of large size.
A new mining strategy called Pattern-Fusion was developed, which fuses a small
number of shorter frequent patterns into colossal pattern candidates. It thereby takes
leaps in the pattern search space and avoids the pitfalls of both breadth-first and depth-
first searches. This method finds a good approximation to the complete set of colossal
frequent patterns.
The Pattern-Fusion method has the following major characteristics. First, it traverses
the tree in a bounded-breadth way. Only a fixed number of patterns in a bounded-size
candidate pool are used as starting nodes to search downward in the pattern tree. As
such, it avoids the problem of exponential search space.
Second, Pattern-Fusion has the capability to identify “shortcuts” whenever possible.
Each pattern’s growth is not performed with one-item addition, but with an agglomeration
of multiple patterns in the pool. These shortcuts direct Pattern-Fusion much more
rapidly down the search tree toward the colossal patterns.
The Pattern-Fusion method is outlined in the following two
phases:
1. Initial Pool: Pattern-Fusion assumes an initial pool of small frequent patterns is
available. This is the complete set of frequent patterns up to a small size (e.g., 3).
This initial pool can be mined with any existing efficient mining algorithm.
2. Iterative Pattern-Fusion: Pattern-Fusion takes as input a user-specified parameter,
K, which is the maximum number of patterns to be mined. The mining process is
iterative. At each iteration, K seed patterns are randomly picked from the current
pool. For each of these K seeds, we find all the patterns within a ball of a size specified
by τ. All the patterns in each “ball” are then fused together to generate a set of
superpatterns. These superpatterns form a new pool. If the pool contains more than
K patterns, the next iteration begins with this pool for the new round of random
drawing. As the support set of every superpattern shrinks with each new iteration,
the iteration process terminates.
Note that Pattern-Fusion merges small subpatterns of a large pattern instead of
incrementally expanding patterns with single items.
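A minimal sketch of one fusion iteration follows, on assumed toy data. It measures the ball around each seed with the Jaccard-style pattern distance on supporting transaction sets (defined formally in Section 5) and fuses each ball into a superpattern by union; tidsets, K, and τ are illustrative.

```python
import random

def pattern_distance(p, q, tidsets):
    """1 - |T(p) & T(q)| / |T(p) | T(q)| on supporting transaction sets."""
    a, b = tidsets[p], tidsets[q]
    return 1 - len(a & b) / len(a | b)

def fusion_round(pool, tidsets, K, tau):
    """One Pattern-Fusion iteration: K random seeds, ball of radius tau, fuse."""
    seeds = random.sample(pool, min(K, len(pool)))
    new_pool = set()
    for s in seeds:
        ball = [p for p in pool if pattern_distance(s, p, tidsets) <= tau]
        new_pool.add(frozenset().union(*ball))   # fuse the ball into a superpattern
    return list(new_pool)

# Each small frequent pattern maps to the ids of its supporting transactions.
tidsets = {
    frozenset("ab"): frozenset({1, 2, 3}),
    frozenset("bc"): frozenset({1, 2, 3, 4}),
    frozenset("xy"): frozenset({7, 8}),
}
# {a,b} and {b,c} are close (distance 0.25), so a seed from either fuses them
# into the candidate {a,b,c}; {x,y} stays on its own.
print(fusion_round(list(tidsets), tidsets, K=2, tau=0.5))
```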
5 Mining Compressed or Approximate Patterns
To reduce the huge set of frequent patterns generated in mining while maintaining
high-quality patterns, we can instead mine a compressed or approximate set of frequent
patterns. Top-k most frequent closed patterns were proposed to make the mining process
concentrate on only the set of k most frequent patterns. Although interesting, they usually
do not epitomize the k most representative patterns because of the uneven frequency
distribution among itemsets. Constraint-based mining of frequent patterns
incorporates user-specified constraints to filter out uninteresting patterns. Measures of
pattern/rule interestingness and correlation can also be used to help confine
the search to patterns/rules of interest.
Mining Compressed Patterns by Pattern Clustering
Clustering is the automatic process of grouping
like objects together, so that objects within a cluster are similar to one another and dissimilar
to objects in other clusters. In this case, the objects are frequent patterns. The
frequent patterns are clustered using a tightness measure called δ-cluster. A representative
pattern is selected for each cluster, thereby offering a compressed version of the set
of frequent patterns.
Before we begin, let’s review some definitions. An itemset X is a closed frequent
itemset in a data set D if X is frequent and there exists no proper super-itemset Y of X
such that Y has the same support count as X in D. An itemset X is a maximal frequent
itemset in data set D if X is frequent and there exists no super-itemset Y such that
X ⊂ Y and Y is frequent in D.
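The sketch below checks these two definitions directly, on a hypothetical collection of frequent itemsets and their support counts.

```python
# Hypothetical frequent itemsets with their support counts in D.
freq = {
    frozenset("a"): 4, frozenset("b"): 3,
    frozenset("ab"): 3, frozenset("abc"): 2,
}

def is_closed(x):
    """Closed: no proper super-itemset has the same support count."""
    return all(not (x < y and freq[y] == freq[x]) for y in freq)

def is_maximal(x):
    """Maximal: no proper super-itemset is frequent at all."""
    return all(not x < y for y in freq)

for x in freq:
    tags = [t for t, ok in [("closed", is_closed(x)), ("maximal", is_maximal(x))] if ok]
    print(sorted(x), freq[x], tags)   # e.g., {b} is neither; {a,b,c} is both
```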
We can use the following distance measure between closed patterns. Let P1 and P2 be
two closed patterns. Their supporting transaction sets are T(P1) and T(P2), respectively.
The pattern distance of P1 and P2, Pat_Dist(P1, P2), is defined as
Pat_Dist(P1, P2) = 1 − |T(P1) ∩ T(P2)| / |T(P1) ∪ T(P2)|.
Pattern distance is a valid distance metric defined on the set of transactions. Note that it
incorporates the support information of patterns.
The pattern compression problem can be defined as follows: Given a transaction database,
a minimum support min_sup, and the cluster quality measure δ, the pattern compression
problem is to find a set of representative patterns R such that for each frequent pattern
P (with respect to min_sup) there is a representative pattern Pr ∈ R (with respect to
min_sup_r, the minimum support for representative patterns) that covers P, and the
value of |R| is minimized.
Finding a minimum set of representative patterns is an NP-Hard problem. However,
efficient methods have been developed that reduce the number of closed frequent
patterns generated by orders of magnitude with respect to the original collection of
closed patterns. The methods succeed in finding a high-quality compression of the
pattern set.
Extracting Redundancy-Aware Top-k Patterns
A small set of k representative patterns that have not only high significance but also low redundancy
are called redundancy-aware top-k patterns.
A significance measure S is a function mapping a pattern p ∈ P to a real value such
that S(p) is the degree of interestingness (or usefulness) of the pattern p.
Given the significance measure S, the redundancy R between two patterns p and
q is defined as R(p, q) = S(p) + S(q) − S(p, q), where S(p, q) is the combined significance
of the two patterns. Subsequently, the relative significance of p given q is
S(p|q) = S(p) − R(p, q).
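One way to extract such patterns, sketched below with hypothetical significance values, is a greedy procedure: start from the most significant pattern and repeatedly add the pattern whose significance, discounted by its redundancy with the patterns already chosen, is largest. The discounting rule used here (subtracting the worst-case redundancy) is an illustrative choice rather than the only possibility.

```python
# Hypothetical significance values; S_joint[p, q] is the combined
# significance S(p, q) of a pair.
S = {"p1": 0.9, "p2": 0.85, "p3": 0.5}
S_joint = {("p1", "p2"): 1.0, ("p1", "p3"): 1.3, ("p2", "p3"): 1.2}

def redundancy(p, q):
    """R(p, q) = S(p) + S(q) - S(p, q)."""
    return S[p] + S[q] - S_joint[tuple(sorted((p, q)))]

def redundancy_aware_top_k(k):
    chosen = [max(S, key=S.get)]             # seed with the most significant pattern
    while len(chosen) < k:
        rest = [p for p in S if p not in chosen]
        # Marginal value: significance minus worst-case redundancy with the chosen set.
        gain = {p: S[p] - max(redundancy(p, q) for q in chosen) for p in rest}
        chosen.append(max(gain, key=gain.get))
    return chosen

# p2 nearly duplicates p1, so the less significant but novel p3 wins the
# second slot: ['p1', 'p3'].
print(redundancy_aware_top_k(2))
```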
6 Pattern Exploration and Application
Semantic Annotation of Frequent Patterns
“What is an appropriate semantic annotation for a frequent pattern?” We can define
the basic task of semantic pattern annotation as follows:
1. Select context units and design a strength weight for each unit to model the contexts
of frequent patterns.
2. Design similarity measures for the contexts of two patterns, and for a transaction and
a pattern context.
3. For a given frequent pattern, extract the most significant context indicators, representative
transactions, and semantically similar patterns to construct a structured
annotation.
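As a minimal sketch of steps 1 and 2, the snippet below models each pattern's context as a weighted vector of context units and compares two pattern contexts by cosine similarity; the context units and their strength weights are hypothetical placeholders (in practice they might be derived from co-occurrence statistics such as mutual information).

```python
import math

def cosine(u, v):
    """Cosine similarity between two sparse weight vectors (dicts)."""
    dot = sum(w * v[k] for k, w in u.items() if k in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

# Context model: pattern -> {context unit: strength weight}.
context = {
    "{frequent, pattern}": {"mining": 0.8, "Apriori": 0.6, "support": 0.7},
    "{sequential, pattern}": {"mining": 0.7, "GSP": 0.9, "support": 0.4},
}

sim = cosine(context["{frequent, pattern}"], context["{sequential, pattern}"])
print(f"context similarity = {sim:.2f}")  # high similarity suggests related semantics
```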
Applications of Pattern Mining
Pattern mining is widely used for noise filtering and data cleaning as preprocessing
in many data-intensive applications.
Pattern mining often helps in the discovery of inherent structures and clusters
hidden in the data.
Frequent patterns can also be used effectively for subspace clustering in high-dimensional
space.
Pattern analysis is useful in the analysis of spatiotemporal data, time-series data,
image data, video data, and multimedia data.
Pattern mining has also been used for the analysis of sequence or structural data
such as trees, graphs, subsequences, and networks.
Excerpts from the Book
Data Mining Concepts and Techniques
Third Edition
Jiawei Han
University of Illinois at Urbana–Champaign
Micheline Kamber, Jian Pei
Simon Fraser University