Thursday, April 14, 2016

Mining Frequent Patterns, Associations, and Correlations: Basic Concepts and Methods

1 Basic Concepts

Frequent pattern mining searches for recurring relationships in a given data set.

Market Basket Analysis:

This process
analyzes customer buying habits by finding associations between the different items that
customers place in their “shopping baskets” (Figure 6.1). The discovery of these associations
can help retailers develop marketing strategies by gaining insight into which items
are frequently purchased together by customers.

Rule support and confidence are two measures of rule interestingness. They respectively
reflect the usefulness and certainty of discovered rules. A support of 2% for
Rule (6.1) means that 2% of all the transactions under analysis show that computer
and antivirus software are purchased together. A confidence of 60% means that 60% of
the customers who purchased a computer also bought the software. Typically, association
rules are considered interesting if they satisfy both a minimum support threshold
and a minimum confidence threshold. These thresholds can be set by users or
domain experts. Additional analysis can be performed to discover interesting statistical
correlations between associated items.

Frequent Itemsets, Closed Itemsets,
and Association Rules


A set of items is referred to as an itemset. An itemset that contains k items is a
k-itemset.
In general, association rule mining can be viewed as a two-step process:
1. Find all frequent itemsets: By definition, each of these itemsets will occur at least as
frequently as a predetermined minimum support count, min sup.
2. Generate strong association rules from the frequent itemsets: By definition, these
rules must satisfy minimum support and minimum confidence.


2 Frequent Itemset Mining Methods


Apriori Algorithm: Finding Frequent Itemsets
by Confined Candidate Generation
Apriori is a seminal algorithm proposed by R. Agrawal and R. Srikant in 1994 for mining
frequent itemsets for Boolean association rules [AS94b]. The name of the algorithm
is based on the fact that the algorithm uses prior knowledge of frequent itemset properties,
as we shall see later. Apriori employs an iterative approach known as a level-wise
search, where k-itemsets are used to explore (k + 1)-itemsets. First, the set of frequent
1-itemsets is found by scanning the database to accumulate the count for each item, and
collecting those items that satisfy minimum support. The resulting set is denoted by L1.
Next, L1 is used to find L2, the set of frequent 2-itemsets, which is used to find L3, and
so on, until no more frequent k-itemsets can be found. The finding of each Lk requires
one full scan of the database.
To improve the efficiency of the level-wise generation of frequent itemsets, an
important property called the Apriori property is used to reduce the search space:
all nonempty subsets of a frequent itemset must also be frequent.

“How is the Apriori property used in the algorithm?” To understand this, let us look at
how Lk−1 is used to find Lk for k ≥ 2. A two-step process is followed, consisting of join
and prune actions.
1. The join step: To find Lk, a set of candidate k-itemsets, denoted Ck, is generated by
joining Lk−1 with itself.
2. The prune step: Ck is a superset of Lk; that is, its members may or may not be
frequent, but all of the frequent k-itemsets are included in Ck. To reduce the size of Ck,
the Apriori property is used as follows: any (k − 1)-itemset that is not frequent cannot
be a subset of a frequent k-itemset. Hence, if any (k − 1)-subset of a candidate
k-itemset is not in Lk−1, then the candidate cannot be frequent either and can be
removed from Ck. This subset testing can be done quickly by maintaining a hash tree
of all frequent itemsets.
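
To make the join and prune steps concrete, here is a minimal Python sketch of the level-wise loop. This is an illustrative implementation, not the textbook's pseudocode: it checks subsets directly instead of using a hash tree, and all function and variable names are my own.

from itertools import combinations

def apriori(transactions, min_sup):
    """Level-wise Apriori sketch: returns a dict mapping each frequent
    itemset (a frozenset) to its support count. min_sup is an absolute count."""
    # Find L1 by scanning the database once and counting each item
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    Lk = {iset: c for iset, c in counts.items() if c >= min_sup}
    frequent = dict(Lk)

    k = 2
    while Lk:
        # Join step: Lk-1 joined with itself gives candidate k-itemsets Ck
        prev = list(Lk)
        Ck = set()
        for i in range(len(prev)):
            for j in range(i + 1, len(prev)):
                union = prev[i] | prev[j]
                if len(union) == k:          # the two (k-1)-itemsets share k-2 items
                    Ck.add(union)
        # Prune step: drop any candidate that has an infrequent (k-1)-subset
        Ck = {c for c in Ck
              if all(frozenset(s) in Lk for s in combinations(c, k - 1))}
        # One full database scan to count the surviving candidates
        counts = dict.fromkeys(Ck, 0)
        for t in transactions:
            items = set(t)
            for c in Ck:
                if c <= items:
                    counts[c] += 1
        Lk = {c: n for c, n in counts.items() if n >= min_sup}
        frequent.update(Lk)
        k += 1
    return frequent

# Toy usage (made-up transactions):
db = [{"computer", "antivirus"}, {"computer", "antivirus", "printer"},
      {"computer", "printer"}, {"printer", "scanner"}]
print(apriori(db, min_sup=2))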


Generating Association Rules from Frequent Itemsets
Once the frequent itemsets from transactions in a database D have been found, it is
straightforward to generate strong association rules from them (where strong association
rules satisfy both minimum support and minimum confidence). This can be done
using Eq. (6.4) for confidence, which we show again here for completeness:
confidence(A ⇒ B) = P(B|A) = support count(A ∪ B) / support count(A).
The conditional probability is expressed in terms of itemset support count, where
support count(A ∪B) is the number of transactions containing the itemsets A ∪B, and
support count(A) is the number of transactions containing the itemset A. Based on this
equation, association rules can be generated as follows:
For each frequent itemset l, generate all nonempty subsets of l.
For every nonempty subset s of l, output the rule “s ⇒ (l − s)” if
support count(l) / support count(s) ≥ min conf, where min conf is the minimum
confidence threshold.
Because the rules are generated from frequent itemsets, each one automatically satisfies
the minimum support. Frequent itemsets can be stored ahead of time in hash tables
along with their counts so that they can be accessed quickly.
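
A minimal sketch of this rule-generation step, assuming the frequent itemsets and their counts come from something like the apriori sketch above. Every subset lookup succeeds because, by the Apriori property, all nonempty subsets of a frequent itemset are themselves frequent.

from itertools import combinations

def generate_rules(frequent, min_conf):
    """Generate strong rules s => (l - s) from a dict that maps frequent
    itemsets (frozensets) to their support counts."""
    rules = []
    for l, count_l in frequent.items():
        if len(l) < 2:
            continue
        for r in range(1, len(l)):                    # all nonempty proper subsets s of l
            for s in map(frozenset, combinations(l, r)):
                conf = count_l / frequent[s]          # support_count(l) / support_count(s)
                if conf >= min_conf:
                    rules.append((set(s), set(l - s), conf))
    return rules

# Example: rules with at least 60% confidence
# print(generate_rules(apriori(db, min_sup=2), min_conf=0.6))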

Improving the Efficiency of Apriori

Hash-based technique (hashing itemsets into corresponding buckets): A hash-based
technique can be used to reduce the size of the candidate k-itemsets, Ck, for k > 1.
For example, when scanning each transaction in the database to generate the frequent
1-itemsets, L1, we can generate all the 2-itemsets for each transaction, hash (i.e., map)
them into the different buckets of a hash table structure, and increase the corresponding
bucket counts (Figure 6.5). A 2-itemset with a corresponding bucket count in the
hash table that is below the support threshold cannot be frequent and thus should
be removed from the candidate set. Such a hash-based technique may substantially
reduce the number of candidate k-itemsets examined (especially when k = 2); see the
sketch after this list.
Transaction reduction (reducing the number of transactions scanned in future iterations):
A transaction that does not contain any frequent k-itemsets cannot contain any
frequent (k + 1)-itemsets. Therefore, such a transaction can be marked or removed
from further consideration because subsequent database scans for j-itemsets, where
j > k, will not need to consider such a transaction.
Partitioning (partitioning the data to find candidate itemsets): A partitioning technique
can be used that requires just two database scans to mine the frequent itemsets
(Figure 6.6). It consists of two phases. In phase I, the algorithm divides the transactions
of D into n nonoverlapping partitions. If the minimum relative support
threshold for transactions in D is min sup, then the minimum support count for a
partition is min sup × the number of transactions in that partition. For each partition,
all itemsets that are frequent within that partition (the local frequent itemsets) are
found. Any itemset that is frequent in D must be frequent in at least one partition, so
the collection of local frequent itemsets forms the candidate set; in phase II, a second
scan of D determines which of these candidates are globally frequent.
Sampling (mining on a subset of the given data): The basic idea of the sampling
approach is to pick a random sample S of the given data D, and then search for
frequent itemsets in S instead of D.

Dynamic itemset counting (adding candidate itemsets at different points during a
scan): A dynamic itemset counting technique was proposed in which the database
is partitioned into blocks marked by start points. In this variation, new candidate
itemsets can be added at any start point, unlike in Apriori, which determines new
candidate itemsets only immediately before each complete database scan.
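
The sketch referred to in the hash-based item above: while counting 1-itemsets, every 2-itemset of each transaction is hashed into a bucket, and only candidate pairs whose bucket count reaches the support threshold are kept. The number of buckets and the use of Python's built-in hash are arbitrary choices for illustration.

from itertools import combinations

def hash_prune_c2(transactions, min_sup, n_buckets=7):
    """Hash-based pruning of candidate 2-itemsets during the L1 scan."""
    buckets = [0] * n_buckets
    item_counts = {}
    for t in transactions:
        for item in t:
            item_counts[item] = item_counts.get(item, 0) + 1
        for pair in combinations(sorted(t), 2):        # all 2-itemsets in this transaction
            buckets[hash(pair) % n_buckets] += 1
    L1 = {i for i, c in item_counts.items() if c >= min_sup}
    # A 2-itemset whose bucket count is below min_sup cannot be frequent
    C2 = {pair for pair in combinations(sorted(L1), 2)
          if buckets[hash(pair) % n_buckets] >= min_sup}
    return L1, C2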


A Pattern-Growth Approach for Mining Frequent Itemsets

Frequent pattern growth, or simply FP-growth, adopts a divide-and-conquer
strategy as follows. First, it compresses the database representing frequent items into a
frequent pattern tree, or FP-tree, which retains the itemset association information. It
then divides the compressed database into a set of conditional databases (a special kind of
projected database), each associated with one frequent item or “pattern fragment,” and
mines each database separately. For each “pattern fragment,” only its associated data sets
need to be examined. Therefore, this approach may substantially reduce the size of the
data sets to be searched, along with the “growth” of patterns being examined.
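
As a minimal sketch of the first half of this approach (building the compressed FP-tree; the recursive mining of conditional databases is omitted, and all class and function names here are illustrative):

from collections import defaultdict

class FPNode:
    def __init__(self, item, parent):
        self.item, self.parent = item, parent
        self.count = 0
        self.children = {}

def build_fp_tree(transactions, min_sup):
    """Compress the transactions into an FP-tree: keep only frequent items,
    order each transaction by descending item support, and share common prefixes."""
    support = defaultdict(int)
    for t in transactions:
        for item in t:
            support[item] += 1
    support = {i: c for i, c in support.items() if c >= min_sup}

    root = FPNode(None, None)
    header = defaultdict(list)                    # item -> its nodes (node-link chains)
    for t in transactions:
        ordered = sorted((i for i in t if i in support),
                         key=lambda i: (-support[i], i))
        node = root
        for item in ordered:
            if item not in node.children:
                child = FPNode(item, node)
                node.children[item] = child
                header[item].append(child)
            node = node.children[item]
            node.count += 1
    return root, header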

Mining Frequent Itemsets Using the Vertical Data Format

Mining Closed and Max Patterns

A recommended methodology is to search for closed frequent itemsets directly during
the mining process. This requires us to prune the search space as soon as we can
identify the case of closed itemsets during mining. Pruning strategies include the
following:
Item merging: If every transaction containing a frequent itemset X also contains an itemset
Y but not any proper superset of Y , then X ∪Y forms a frequent closed itemset and there
is no need to search for any itemset containing X but no Y .

Sub-itemset pruning: If a frequent itemset X is a proper subset of an already found frequent
closed itemset Y and support count(X)=support count(Y), then X and all of X’s
descendants in the set enumeration tree cannot be frequent closed itemsets and thus can
be pruned.

Item skipping: In the depth-first mining of closed itemsets, at each level, there will be
a prefix itemset X associated with a header table and a projected database. If a local
frequent item p has the same support in several header tables at different levels, we can
safely prune p from the header tables at higher levels.
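
The strategies above prune during mining, which is what the text recommends for efficiency. For comparison, closed and maximal frequent itemsets can also be obtained by post-filtering an already mined set of frequent itemsets, as in this minimal (and less efficient) sketch:

def closed_and_max(frequent):
    """frequent maps frequent itemsets (frozensets) to support counts.
    Returns (closed, maximal) frequent itemsets by direct definition checks."""
    closed, maximal = {}, {}
    for x, sup_x in frequent.items():
        supersets = [y for y in frequent if x < y]        # proper frequent supersets of x
        if all(frequent[y] != sup_x for y in supersets):  # closed: no superset with equal support
            closed[x] = sup_x
        if not supersets:                                 # maximal: no frequent proper superset
            maximal[x] = sup_x
    return closed, maximal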

3 Which Patterns Are Interesting?—Pattern Evaluation Methods

From Association Analysis to Correlation Analysis

A correlation measure can be used to augment the support–confidence framework
for association rules. This leads
to correlation rules of the form
A ⇒ B [support, confidence, correlation]. (6.7)
That is, a correlation rule is measured not only by its support and confidence but also
by the correlation between itemsets A and B

Lift is a simple correlation measure that is given as follows. The occurrence of itemset
A is independent of the occurrence of itemset B if P(A ∪B) = P(A)P(B); otherwise,
itemsets A and B are dependent and correlated as events. This definition can easily be
extended to more than two itemsets. The lift between the occurrence of A and B can be
measured by computing
lift(A, B) = P(A ∪ B) / (P(A)P(B)). (6.8)
If the resulting value of Eq. (6.8) is less than 1, then the occurrence of A is negatively
correlated with the occurrence of B, meaning that the occurrence of one likely leads to
the absence of the other one. If the resulting value is greater than 1, then A and B are
positively correlated, meaning that the occurrence of one implies the occurrence of the
other. If the resulting value is equal to 1, then A and B are independent and there is no
correlation between them.
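
A minimal sketch of the lift computation from support counts; the counts in the usage line are made up for illustration.

def lift(sup_ab, sup_a, sup_b, n_transactions):
    """lift(A, B) = P(A u B) / (P(A) * P(B)), from absolute support counts."""
    return (sup_ab / n_transactions) / ((sup_a / n_transactions) * (sup_b / n_transactions))

# Hypothetical counts: A u B in 4,000 of 10,000 transactions, A in 6,000, B in 7,500
print(lift(4000, 6000, 7500, 10000))   # ~0.89 < 1: A and B are negatively correlated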

The second correlation measure is the χ2 measure.  To compute the χ2 value, we take the squared difference
between the observed and expected value for a slot (A and B pair) in the contingency
table, divided by the expected value. This amount is summed for all slots of the
contingency table.
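
A minimal sketch of this χ2 computation for a 2 × 2 contingency table over itemsets A and B; the observed counts for the four slots are supplied by the caller.

def chi_squared(n_ab, n_a_notb, n_nota_b, n_nota_notb):
    """Sum of (observed - expected)^2 / expected over the four (A, B) slots."""
    observed = [[n_ab, n_a_notb], [n_nota_b, n_nota_notb]]
    n = sum(sum(row) for row in observed)
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed[i][j] - expected) ** 2 / expected
    return chi2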

A Comparison of Pattern Evaluation Measures
The above discussion shows that instead of using the simple support–confidence framework
to evaluate frequent patterns, other measures, such as lift and χ2, often disclose
more intrinsic pattern relationships. How effective are these measures? Should we also
consider other alternatives?

Four more measures: all confidence, max confidence, Kulczynski, and cosine.


Given two itemsets, A and B, the all confidence measure of A and B is defined as
all conf(A, B) = sup(A ∪ B) / max{sup(A), sup(B)} = min{P(A|B), P(B|A)}, (6.9)
where max{sup(A), sup(B)} is the maximum support of the itemsets A and B. Thus,
all conf(A, B) is also the minimum confidence of the two association rules related to
A and B, namely, “A ⇒ B” and “B ⇒ A.”
Given two itemsets, A and B, the max confidence measure of A and B is defined as
max conf(A, B) = max{P(A|B),P(B|A)}. (6.10)
The max conf measure is the maximum confidence of the two association rules,
“A ⇒ B” and “B ⇒ A.”
Given two itemsets, A and B, the Kulczynski measure of A and B (abbreviated as
Kulc) is defined as
Kulc(A, B) = (1/2)(P(A|B) + P(B|A)). (6.11)
It was proposed in 1927 by Polish mathematician S. Kulczynski. It can be viewed as an
average of two confidence measures. That is, it is the average of two conditional probabilities:
the probability of itemset B given itemset A, and the probability of itemset A
given itemset B.
Finally, given two itemsets, A and B, the cosine measure of A and B is defined as
cosine(A, B) = P(A ∪ B) / √(P(A) × P(B)) = sup(A ∪ B) / √(sup(A) × sup(B)) = √(P(A|B) × P(B|A)). (6.12)
The cosine measure can be viewed as a harmonized lift measure: The two formulae are
similar except that for cosine, the square root is taken on the product of the probabilities
of A and B. This is an important difference, however, because by taking the square root,
the cosine value is only influenced by the supports of A, B, and A ∪B, and not by the
total number of transactions.
Each of these four measures has the following property: its value is influenced only
by the supports of A, B, and A ∪ B, or more exactly, by the conditional probabilities
P(A|B) and P(B|A), but not by the total number of transactions. Another common
property is that each measure ranges from 0 to 1, and the higher the value, the closer
the relationship between A and B. Together with lift and χ2, this gives six pattern
evaluation measures in total.

In summary, using only the support and confidence measures to mine associations
may generate a large number of rules, many of which can be uninteresting to
users. Instead, the support–confidence framework augmented with a pattern
interestingness measure helps focus the mining toward rules with strong pattern
relationships. The added measure substantially reduces the number of rules generated
and leads to the discovery of more meaningful rules. Among the four null-invariant
measures described (all confidence, max confidence, Kulc, and cosine), the authors
recommend using Kulc in conjunction with the imbalance ratio.
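
A minimal sketch computing the four null-invariant measures plus the imbalance ratio from the supports of A, B, and A ∪ B. The imbalance ratio formula, IR(A, B) = |sup(A) − sup(B)| / (sup(A) + sup(B) − sup(A ∪ B)), is not defined in this post and is taken from Han, Kamber, and Pei's textbook.

from math import sqrt

def null_invariant_measures(sup_a, sup_b, sup_ab):
    """All four measures plus IR from (absolute or relative) supports of A, B, and A u B."""
    p_a_given_b = sup_ab / sup_b           # P(A|B)
    p_b_given_a = sup_ab / sup_a           # P(B|A)
    return {
        "all_conf": sup_ab / max(sup_a, sup_b),
        "max_conf": max(p_a_given_b, p_b_given_a),
        "kulc":     0.5 * (p_a_given_b + p_b_given_a),
        "cosine":   sup_ab / sqrt(sup_a * sup_b),
        # Imbalance ratio: how skewed the supports of A and B are
        "IR":       abs(sup_a - sup_b) / (sup_a + sup_b - sup_ab),
    }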


Next Chapter: Advanced Patterns - Data Mining
Data Mining - Classification: Basic Concepts
Data Mining - Classification: Advanced Methods




