## 1 Basic Concepts

Frequent pattern mining searches for recurring relationships in a given data set. A classic example is market basket analysis: this process analyzes customer buying habits by finding associations between the different items that customers place in their “shopping baskets” (Figure 6.1). The discovery of these associations can help retailers develop marketing strategies by gaining insight into which items are frequently purchased together by customers.

Rule support and confidence are two measures of rule interestingness. They respectively reflect the usefulness and certainty of discovered rules. A support of 2% for Rule (6.1) means that 2% of all the transactions under analysis show that computer and antivirus software are purchased together. A confidence of 60% means that 60% of the customers who purchased a computer also bought the software. Typically, association rules are considered interesting if they satisfy both a minimum support threshold and a minimum confidence threshold. These thresholds can be set by users or domain experts. Additional analysis can be performed to discover interesting statistical correlations between associated items.

### Frequent Itemsets, Closed Itemsets, and Association Rules

A set of items is referred to as an itemset. An itemset that contains k items is a k-itemset.

In general, association rule mining can be viewed as a two-step process:

1. Find all frequent itemsets: By definition, each of these itemsets will occur at least as frequently as a predetermined minimum support count, min_sup.

2. Generate strong association rules from the frequent itemsets: By definition, these rules must satisfy minimum support and minimum confidence.

## 2 Frequent Itemset Mining Methods

### Apriori Algorithm: Finding Frequent Itemsets by Confined Candidate Generation

Apriori is a seminal algorithm proposed by R. Agrawal and R. Srikant in 1994 for mining frequent itemsets for Boolean association rules [AS94b]. The name of the algorithm is based on the fact that the algorithm uses prior knowledge of frequent itemset properties, as we shall see later. Apriori employs an iterative approach known as a level-wise search, where k-itemsets are used to explore (k + 1)-itemsets. First, the set of frequent 1-itemsets is found by scanning the database to accumulate the count for each item, and collecting those items that satisfy minimum support. The resulting set is denoted by L1. Next, L1 is used to find L2, the set of frequent 2-itemsets, which is used to find L3, and so on, until no more frequent k-itemsets can be found. The finding of each Lk requires one full scan of the database.

To improve the efficiency of the level-wise generation of frequent itemsets, an important property called the Apriori property is used to reduce the search space: all nonempty subsets of a frequent itemset must also be frequent. “How is the Apriori property used in the algorithm?” To understand this, let us look at how Lk−1 is used to find Lk for k ≥ 2. A two-step process is followed, consisting of join and prune actions.

1. The join step: To find Lk, a set of candidate k-itemsets is generated by joining Lk−1 with itself. This set of candidates is denoted Ck.

2. The prune step: Ck is a superset of Lk; that is, its members may or may not be frequent, but all of the frequent k-itemsets are included in Ck. To reduce the size of Ck, the Apriori property is used as follows: any (k − 1)-itemset that is not frequent cannot be a subset of a frequent k-itemset. Hence, if any (k − 1)-subset of a candidate k-itemset is not in Lk−1, then the candidate cannot be frequent either and so can be removed from Ck. This subset testing can be done quickly by maintaining a hash tree of all frequent itemsets.
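As an illustration, the level-wise search with the join and prune steps can be sketched in Python. This is a minimal in-memory version: the hash-tree subset test is omitted, and min_sup is taken as an absolute count.

```python
from itertools import combinations

def apriori(transactions, min_sup):
    """Minimal Apriori sketch: level-wise search with join and prune steps."""
    transactions = [frozenset(t) for t in transactions]
    # First scan: count items and collect the frequent 1-itemsets, L1
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    Lk = {s for s, c in counts.items() if c >= min_sup}
    frequent = {s: counts[s] for s in Lk}
    k = 2
    while Lk:
        # Join step: unions of two frequent (k-1)-itemsets that form a k-itemset
        candidates = {a | b for a in Lk for b in Lk if len(a | b) == k}
        # Prune step: every (k-1)-subset of a surviving candidate must be frequent
        candidates = {c for c in candidates
                      if all(frozenset(s) in Lk for s in combinations(c, k - 1))}
        # One full database scan to count the remaining candidates
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        Lk = {c for c, n in counts.items() if n >= min_sup}
        frequent.update((c, counts[c]) for c in Lk)
        k += 1
    return frequent
```

The loop ends exactly as the text describes: when no more frequent k-itemsets can be found, Lk is empty and the search stops.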

### Generating Association Rules from Frequent Itemsets

Once the frequent itemsets from transactions in a database D have been found, it is straightforward to generate strong association rules from them (where strong association rules satisfy both minimum support and minimum confidence). This can be done using Eq. (6.4) for confidence, which we show again here for completeness:

confidence(A ⇒ B) = P(B|A) = support_count(A ∪ B) / support_count(A). (6.4)

The conditional probability is expressed in terms of itemset support count, where support_count(A ∪ B) is the number of transactions containing the itemsets A ∪ B, and support_count(A) is the number of transactions containing the itemset A. Based on this equation, association rules can be generated as follows:

- For each frequent itemset l, generate all nonempty subsets of l.

- For every nonempty subset s of l, output the rule “s ⇒ (l − s)” if support_count(l) / support_count(s) ≥ min_conf, where min_conf is the minimum confidence threshold.

Because the rules are generated from frequent itemsets, each one automatically satisfies the minimum support. Frequent itemsets can be stored ahead of time in hash tables along with their counts so that they can be accessed quickly.
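A sketch of this rule-generation step, assuming the frequent itemsets and their support counts are already available in a dictionary keyed by frozenset (as an Apriori-style miner would produce):

```python
from itertools import combinations

def generate_rules(frequent, min_conf):
    """Generate strong rules s => (l - s) from frequent itemsets with counts."""
    rules = []
    for l, count_l in frequent.items():
        if len(l) < 2:
            continue  # a rule needs a nonempty antecedent and consequent
        for r in range(1, len(l)):
            for subset in combinations(l, r):
                s = frozenset(subset)
                conf = count_l / frequent[s]  # confidence = sup_count(l) / sup_count(s)
                if conf >= min_conf:
                    rules.append((s, l - s, conf))
    return rules
```

By the Apriori property, every nonempty subset s of a frequent itemset l is itself frequent, so the lookup `frequent[s]` always succeeds when the dictionary holds all frequent itemsets.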

### Improving the Efficiency of Apriori

Hash-based technique (hashing itemsets into corresponding buckets): A hash-based technique can be used to reduce the size of the candidate k-itemsets, Ck, for k > 1. For example, when scanning each transaction in the database to generate the frequent 1-itemsets, L1, we can generate all the 2-itemsets for each transaction, hash (i.e., map) them into the different buckets of a hash table structure, and increase the corresponding bucket counts (Figure 6.5). A 2-itemset with a corresponding bucket count in the hash table that is below the support threshold cannot be frequent and thus should be removed from the candidate set. Such a hash-based technique may substantially reduce the number of candidate k-itemsets examined (especially when k = 2).
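A rough sketch of this idea: the bucket count of a pair upper-bounds the pair's true count, so a pair in a light bucket can be discarded. The hash function and number of buckets here are arbitrary illustrative choices.

```python
from itertools import combinations

def hash_filter_pairs(transactions, min_sup, n_buckets=7):
    """Count every 2-itemset of every transaction into hash buckets
    during the scan that finds L1, and return a pruning test."""
    buckets = [0] * n_buckets

    def bucket_of(pair):
        return hash(frozenset(pair)) % n_buckets  # illustrative hash function

    for t in transactions:
        for pair in combinations(sorted(t), 2):
            buckets[bucket_of(pair)] += 1

    # A candidate 2-itemset whose bucket count is below min_sup cannot be frequent.
    def may_be_frequent(pair):
        return buckets[bucket_of(pair)] >= min_sup

    return may_be_frequent
```

Note the filter is conservative: hash collisions can only inflate a bucket count, so a truly frequent pair is never discarded, while some infrequent pairs may survive and be eliminated in the normal counting scan.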

Transaction reduction (reducing the number of transactions scanned in future iterations): A transaction that does not contain any frequent k-itemsets cannot contain any frequent (k + 1)-itemsets. Therefore, such a transaction can be marked or removed from further consideration because subsequent database scans for j-itemsets, where j > k, will not need to consider such a transaction.

Partitioning (partitioning the data to find candidate itemsets): A partitioning technique can be used that requires just two database scans to mine the frequent itemsets (Figure 6.6). It consists of two phases. In phase I, the algorithm divides the transactions of D into n nonoverlapping partitions. If the minimum relative support threshold for transactions in D is min_sup, then the minimum support count for a partition is min_sup × the number of transactions in that partition. For each partition, all the itemsets that are frequent within it (the local frequent itemsets) are found. Any itemset that is frequent in D must be frequent in at least one of the partitions, so the local frequent itemsets form a candidate set. In phase II, a second scan of D determines the actual support of each candidate and thereby the globally frequent itemsets.

Sampling (mining on a subset of the given data): The basic idea of the sampling approach is to pick a random sample S of the given data D, and then search for frequent itemsets in S instead of D.

Dynamic itemset counting (adding candidate itemsets at different points during a scan): A dynamic itemset counting technique was proposed in which the database is partitioned into blocks marked by start points. In this variation, new candidate itemsets can be added at any start point, unlike in Apriori, which determines new candidate itemsets only immediately before each complete database scan.

### A Pattern-Growth Approach for Mining Frequent Itemsets

The frequent pattern growth method, or simply FP-growth, adopts a divide-and-conquer strategy as follows. First, it compresses the database representing frequent items into a frequent pattern tree, or FP-tree, which retains the itemset association information. It then divides the compressed database into a set of conditional databases (a special kind of projected database), each associated with one frequent item or “pattern fragment,” and mines each database separately. For each “pattern fragment,” only its associated data sets need to be examined. Therefore, this approach may substantially reduce the size of the data sets to be searched, along with the “growth” of patterns being examined.
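The FP-tree itself takes some machinery to build, but the divide-and-conquer idea can be shown with a simpler, tree-less sketch in which each conditional (projected) database is materialized as a plain list. This forgoes the FP-tree's compression while keeping the pattern-growth recursion.

```python
def pattern_growth(transactions, min_sup, suffix=frozenset()):
    """Tree-less pattern-growth sketch: grow patterns by recursively mining
    the conditional (projected) database of each frequent item."""
    counts = {}
    for t in transactions:
        for item in t:
            counts[item] = counts.get(item, 0) + 1
    frequent = {}
    for item, c in counts.items():
        if c < min_sup:
            continue
        pattern = suffix | {item}
        frequent[pattern] = c
        # Conditional database: transactions containing `item`, keeping only
        # items that sort after it, so each itemset is enumerated exactly once.
        projected = [frozenset(i for i in t if i > item)
                     for t in transactions if item in t]
        frequent.update(pattern_growth(projected, min_sup, pattern))
    return frequent
```

Items are assumed to be mutually comparable (e.g., all strings), since the `i > item` ordering is what prevents duplicate patterns.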

### Mining Frequent Itemsets Using the Vertical Data Format

### Mining Closed and Max Patterns

A recommended methodology is to search for closed frequent itemsets directly during the mining process. This requires us to prune the search space as soon as we can identify the case of closed itemsets during mining. Pruning strategies include the following:

Item merging: If every transaction containing a frequent itemset X also contains an itemset Y but not any proper superset of Y, then X ∪ Y forms a frequent closed itemset and there is no need to search for any itemset containing X but not Y.

Sub-itemset pruning: If a frequent itemset X is a proper subset of an already found frequent closed itemset Y and support_count(X) = support_count(Y), then X and all of X’s descendants in the set enumeration tree cannot be frequent closed itemsets and thus can be pruned.

Item skipping: In the depth-first mining of closed itemsets, at each level, there will be a prefix itemset X associated with a header table and a projected database. If a local frequent item p has the same support in several header tables at different levels, we can safely prune p from the header tables at higher levels.

## 3 Which Patterns Are Interesting? Pattern Evaluation Methods

### From Association Analysis to Correlation Analysis

A correlation measure can be used to augment the support–confidence framework for association rules. This leads to correlation rules of the form

A ⇒ B [support, confidence, correlation]. (6.7)

That is, a correlation rule is measured not only by its support and confidence but also by the correlation between itemsets A and B.

Lift is a simple correlation measure that is given as follows. The occurrence of itemset A is independent of the occurrence of itemset B if P(A ∪ B) = P(A)P(B); otherwise, itemsets A and B are dependent and correlated as events. This definition can easily be extended to more than two itemsets. The lift between the occurrence of A and B can be measured by computing

lift(A, B) = P(A ∪ B) / (P(A)P(B)). (6.8)

If the resulting value of Eq. (6.8) is less than 1, then the occurrence of A is negatively correlated with the occurrence of B, meaning that the occurrence of one likely leads to the absence of the other one. If the resulting value is greater than 1, then A and B are positively correlated, meaning that the occurrence of one implies the occurrence of the other. If the resulting value is equal to 1, then A and B are independent and there is no correlation between them.
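Estimated from transaction counts, lift is straightforward to compute. A small sketch (recall that A ∪ B here denotes a transaction containing every item of both A and B):

```python
def lift(transactions, A, B):
    """Lift sketch: P(A ∪ B) / (P(A) · P(B)), estimated from a transaction list."""
    n = len(transactions)
    A, B = frozenset(A), frozenset(B)
    p_a = sum(1 for t in transactions if A <= t) / n
    p_b = sum(1 for t in transactions if B <= t) / n
    p_ab = sum(1 for t in transactions if (A | B) <= t) / n
    return p_ab / (p_a * p_b)
```

On the five transactions below, P(A) = P(B) = 0.6 and P(A ∪ B) = 0.4, so the lift is 0.4 / 0.36 ≈ 1.11, a mild positive correlation.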

The second correlation measure is the χ2 measure. To compute the χ2 value, we take the squared difference between the observed and expected value for a slot (A and B pair) in the contingency table, divided by the expected value. This amount is summed for all slots of the contingency table.
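For a 2 × 2 contingency table over items A and B, this computation can be sketched as follows; the cell arguments are transaction counts, and the expected values come from the row and column totals.

```python
def chi_square(n_ab, n_a_notb, n_nota_b, n_nota_notb):
    """Chi-square sketch over a 2x2 contingency table for items A and B:
    sum of (observed - expected)^2 / expected over all four slots."""
    n = n_ab + n_a_notb + n_nota_b + n_nota_notb
    row_a = n_ab + n_a_notb            # transactions containing A
    row_nota = n_nota_b + n_nota_notb  # transactions not containing A
    col_b = n_ab + n_nota_b            # transactions containing B
    col_notb = n_a_notb + n_nota_notb  # transactions not containing B
    chi2 = 0.0
    for obs, row, col in [(n_ab, row_a, col_b), (n_a_notb, row_a, col_notb),
                          (n_nota_b, row_nota, col_b), (n_nota_notb, row_nota, col_notb)]:
        exp = row * col / n  # expected count under independence
        chi2 += (obs - exp) ** 2 / exp
    return chi2
```

A perfectly independent table yields 0; stronger deviation from independence yields larger values.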

### A Comparison of Pattern Evaluation Measures

The above discussion shows that instead of using the simple support–confidence framework to evaluate frequent patterns, other measures, such as lift and χ2, often disclose more intrinsic pattern relationships. How effective are these measures? Should we also consider other alternatives? Four more measures are considered here: all_confidence, max_confidence, Kulczynski, and cosine.

Given two itemsets, A and B, the all_confidence measure of A and B is defined as

all_conf(A, B) = sup(A ∪ B) / max{sup(A), sup(B)} = min{P(A|B), P(B|A)}, (6.9)

where max{sup(A), sup(B)} is the maximum support of the itemsets A and B. Thus, all_conf(A, B) is also the minimum confidence of the two association rules related to A and B, namely, “A ⇒ B” and “B ⇒ A.”

Given two itemsets, A and B, the max_confidence measure of A and B is defined as

max_conf(A, B) = max{P(A|B), P(B|A)}. (6.10)

The max_conf measure is the maximum confidence of the two association rules, “A ⇒ B” and “B ⇒ A.”

Given two itemsets, A and B, the Kulczynski measure of A and B (abbreviated as Kulc) is defined as

Kulc(A, B) = (1/2)(P(A|B) + P(B|A)). (6.11)

It was proposed in 1927 by Polish mathematician S. Kulczynski. It can be viewed as an average of two confidence measures. That is, it is the average of two conditional probabilities: the probability of itemset B given itemset A, and the probability of itemset A given itemset B.

Finally, given two itemsets, A and B, the cosine measure of A and B is defined as

cosine(A, B) = P(A ∪ B) / √(P(A) × P(B)) = sup(A ∪ B) / √(sup(A) × sup(B)) = √(P(A|B) × P(B|A)). (6.12)

The cosine measure can be viewed as a harmonized lift measure: The two formulae are similar except that for cosine, the square root is taken on the product of the probabilities of A and B. This is an important difference, however, because by taking the square root, the cosine value is only influenced by the supports of A, B, and A ∪ B, and not by the total number of transactions.

Each of these four measures has the following property: its value is only influenced by the supports of A, B, and A ∪ B, or more exactly, by the conditional probabilities P(A|B) and P(B|A), but not by the total number of transactions. Another common property is that each measure ranges from 0 to 1, and the higher the value, the closer the relationship between A and B.
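Since all four measures depend only on the three supports, they can be computed together in a compact sketch (the supports may be absolute counts or relative supports; the ratios are unchanged either way):

```python
from math import sqrt

def null_invariant_measures(sup_a, sup_b, sup_ab):
    """The four null-invariant measures, from the supports of A, B, and A ∪ B."""
    p_b_given_a = sup_ab / sup_a  # confidence of A => B
    p_a_given_b = sup_ab / sup_b  # confidence of B => A
    return {
        "all_conf": sup_ab / max(sup_a, sup_b),  # = min of the two confidences
        "max_conf": max(p_a_given_b, p_b_given_a),
        "kulc": 0.5 * (p_a_given_b + p_b_given_a),
        "cosine": sup_ab / sqrt(sup_a * sup_b),  # = sqrt of their product
    }
```

A deliberately skewed case (sup(A) = 10, sup(B) = 2, sup(A ∪ B) = 2) shows how the measures spread apart: all_conf = 0.2, max_conf = 1.0, Kulc = 0.6, cosine ≈ 0.45.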

Together with lift and χ2, this gives six pattern evaluation measures in total. In summary, the use of only support and confidence measures to mine associations may generate a large number of rules, many of which can be uninteresting to users. Instead, the support–confidence framework augmented with a pattern interestingness measure helps focus the mining toward rules with strong pattern relationships. The added measure substantially reduces the number of rules generated and leads to the discovery of more meaningful rules. Among the four null-invariant measures described (all_confidence, max_confidence, Kulc, and cosine), the authors recommend using Kulc in conjunction with the imbalance ratio.
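As a sketch of this recommendation, both quantities follow directly from the three supports; the imbalance ratio IR(A, B) = |sup(A) − sup(B)| / (sup(A) + sup(B) − sup(A ∪ B)) gauges how skewed the two directional implications are (IR near 0 indicates a balanced case).

```python
def kulc_with_ir(sup_a, sup_b, sup_ab):
    """Kulczynski measure plus the imbalance ratio, from support counts."""
    kulc = 0.5 * (sup_ab / sup_a + sup_ab / sup_b)
    ir = abs(sup_a - sup_b) / (sup_a + sup_b - sup_ab)
    return kulc, ir
```

Reporting the pair lets a borderline Kulc value (near 0.5) be read in context: a high IR flags that the two rules A ⇒ B and B ⇒ A behave very differently even when their average looks unremarkable.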

