
## 1 Basic Concepts

What Is Classification?

Classification and numeric prediction are the two major types of prediction problems.

General Approach to Classification

“How does classification work?” Data classification is a two-step process, consisting of a learning step (where a classification model is constructed) and a classification step (where the model is used to predict class labels for given data).

In the first step, a classifier is built describing a predetermined set of data classes or concepts. This is the learning step (or training phase), where a classification algorithm builds the classifier by analyzing or “learning from” a training set made up of database tuples and their associated class labels. A tuple, X, is represented by an n-dimensional attribute vector, X = (x1, x2, ..., xn), depicting n measurements made on the tuple from n database attributes, respectively, A1, A2, ..., An. Each tuple, X, is assumed to belong to a predefined class as determined by another database attribute called the class label attribute. The class label attribute is discrete-valued and unordered. It is categorical (or nominal) in that each value serves as a category or class. The individual tuples making up the training set are referred to as training tuples and are randomly sampled from the database under analysis. In the context of classification, data tuples can be referred to as samples, examples, instances, data points, or objects.

Because the class label of each training tuple is provided, this step is also known as supervised learning (i.e., the learning of the classifier is “supervised” in that it is told to which class each training tuple belongs). It contrasts with unsupervised learning (or clustering), in which the class label of each training tuple is not known, and the number or set of classes to be learned may not be known in advance.

The accuracy of a classifier on a given test set is the percentage of test set tuples that are correctly classified by the classifier. The associated class label of each test tuple is compared with the learned classifier’s class prediction for that tuple.
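As an illustrative sketch (not from the book), the two-step process can be expressed in Python with a deliberately trivial learner; the data, the majority-class model, and the helper names are all invented for the example:

```python
from collections import Counter

def train(training_tuples):
    """Learning step: build a (deliberately trivial) classifier from labeled tuples."""
    majority = Counter(label for _, label in training_tuples).most_common(1)[0][0]
    return lambda X: majority            # predicts the majority class for any tuple

def accuracy(classifier, test_tuples):
    """Classification step: predict labels for test tuples and score the model."""
    correct = sum(1 for X, label in test_tuples if classifier(X) == label)
    return correct / len(test_tuples)

train_set = [((1, 2), "yes"), ((3, 4), "yes"), ((5, 6), "no")]
test_set = [((1, 1), "yes"), ((2, 2), "no")]
model = train(train_set)                  # step 1: learn from training tuples
print(accuracy(model, test_set))          # step 2: evaluate on a test set -> 0.5
```

Any real classifier replaces `train` with a proper learning algorithm; the two-step shape stays the same.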

## 2 Decision Tree Induction

Decision tree induction is the learning of decision trees from class-labeled training tuples. A decision tree is a flowchart-like tree structure, where each internal node (nonleaf node) denotes a test on an attribute, each branch represents an outcome of the test, and each leaf node (or terminal node) holds a class label. The topmost node in a tree is the root node.

“How are decision trees used for classification?” Given a tuple, X, for which the associated class label is unknown, the attribute values of the tuple are tested against the decision tree. A path is traced from the root to a leaf node, which holds the class prediction for that tuple. Decision trees can easily be converted to classification rules.

“Why are decision tree classifiers so popular?” The construction of decision tree classifiers does not require any domain knowledge or parameter setting, and therefore is appropriate for exploratory knowledge discovery. Decision trees can handle multidimensional data. Their representation of acquired knowledge in tree form is intuitive and generally easy for humans to assimilate. The learning and classification steps of decision tree induction are simple and fast. In general, decision tree classifiers have good accuracy, although successful use may depend on the data at hand. Decision tree induction algorithms have been used for classification in many application areas, such as medicine, manufacturing and production, financial analysis, astronomy, and molecular biology. Decision trees are the basis of several commercial rule induction systems.

The basic algorithm for inducing a decision tree from training tuples is given below.

The algorithm is called with three parameters: D, attribute list, and Attribute selection method. We refer to D as a data partition. Initially, it is the complete set of training tuples and their associated class labels. The parameter attribute list is a list of attributes describing the tuples. Attribute selection method specifies a heuristic procedure for selecting the attribute that “best” discriminates the given tuples according to class. This procedure employs an attribute selection measure such as information gain or the Gini index. Whether the tree is strictly binary is generally driven by the attribute selection measure. Some attribute selection measures, such as the Gini index, enforce the resulting tree to be binary. Others, like information gain, do not, therein allowing multiway splits (i.e., two or more branches to be grown from a node).

The tree starts as a single node, N, representing the training tuples in D (step 1).

Algorithm: Generate decision tree. Generate a decision tree from the training tuples of data partition, D.

Input:
- Data partition, D, which is a set of training tuples and their associated class labels;
- attribute list, the set of candidate attributes;
- Attribute selection method, a procedure to determine the splitting criterion that “best” partitions the data tuples into individual classes. This criterion consists of a splitting attribute and, possibly, either a split-point or splitting subset.

Output: A decision tree.

Method:
(1)  create a node N;
(2)  if tuples in D are all of the same class, C, then
(3)      return N as a leaf node labeled with the class C;
(4)  if attribute list is empty then
(5)      return N as a leaf node labeled with the majority class in D; // majority voting
(6)  apply Attribute selection method(D, attribute list) to find the “best” splitting criterion;
(7)  label node N with splitting criterion;
(8)  if splitting attribute is discrete-valued and multiway splits allowed then // not restricted to binary trees
(9)      attribute list ← attribute list − splitting attribute; // remove splitting attribute
(10) for each outcome j of splitting criterion
         // partition the tuples and grow subtrees for each partition
(11)     let Dj be the set of data tuples in D satisfying outcome j; // a partition
(12)     if Dj is empty then
(13)         attach a leaf labeled with the majority class in D to node N;
(14)     else attach the node returned by Generate decision tree(Dj, attribute list) to node N;
     endfor
(15) return N;

______________________________________

Basic algorithm for inducing a decision tree from training tuples.
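The pseudocode can be sketched as runnable Python. This is a minimal illustration, not the book's implementation: it uses information gain as the Attribute selection method, represents a tree node as a dict and a leaf as a plain label, and omits the empty-partition case (steps 12–13), which cannot arise here because branches are grown only for observed attribute values. The toy data set is invented.

```python
from collections import Counter
from math import log2

def majority_class(tuples):
    """Majority voting over (X, label) tuples (steps 5 and 13)."""
    return Counter(label for _, label in tuples).most_common(1)[0][0]

def entropy(tuples):
    """Shannon entropy of the class distribution in a partition."""
    n = len(tuples)
    counts = Counter(label for _, label in tuples)
    return -sum((c / n) * log2(c / n) for c in counts.values())

def best_attribute(tuples, attribute_list):
    """Attribute selection by information gain (ID3-style)."""
    base = entropy(tuples)
    def gain(a):
        parts = {}
        for X, label in tuples:
            parts.setdefault(X[a], []).append((X, label))
        return base - sum(len(p) / len(tuples) * entropy(p) for p in parts.values())
    return max(attribute_list, key=gain)

def generate_decision_tree(tuples, attribute_list):
    labels = {label for _, label in tuples}
    if len(labels) == 1:                        # steps 2-3: all tuples in one class
        return labels.pop()
    if not attribute_list:                      # steps 4-5: no attributes left
        return majority_class(tuples)
    a = best_attribute(tuples, attribute_list)  # step 6
    remaining = [x for x in attribute_list if x != a]    # step 9
    partitions = {}
    for X, label in tuples:                     # steps 10-11: one partition per outcome
        partitions.setdefault(X[a], []).append((X, label))
    return {"attribute": a,
            "branches": {outcome: generate_decision_tree(Dj, remaining)  # step 14
                         for outcome, Dj in partitions.items()}}

def classify(tree, X):
    """Trace a path from the root to a leaf holding the class prediction."""
    while isinstance(tree, dict):
        tree = tree["branches"].get(X[tree["attribute"]])
    return tree

# toy training set: attribute 0 alone determines the class
D = [(("sunny", "hot"), "no"), (("sunny", "mild"), "no"),
     (("rain", "hot"), "yes"), (("rain", "mild"), "yes")]
tree = generate_decision_tree(D, [0, 1])
print(classify(tree, ("rain", "cool")))  # -> yes
```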

Attribute Selection Measures

An attribute selection measure is a heuristic for selecting the splitting criterion that “best” separates a given data partition, D, of class-labeled training tuples into individual classes. If we were to split D into smaller partitions according to the outcomes of the splitting criterion, ideally each partition would be pure (i.e., all the tuples that fall into a given partition would belong to the same class). Conceptually, the “best” splitting criterion is the one that most closely results in such a scenario. Attribute selection measures are also known as splitting rules because they determine how the tuples at a given node are to be split.

Information Gain

ID3 uses information gain as its attribute selection measure. This measure is based on pioneering work by Claude Shannon on information theory, which studied the value or “information content” of messages. The expected information needed to classify a tuple in D is Info(D) = −Σi pi log2(pi), where pi is the probability that an arbitrary tuple in D belongs to class Ci. The information gain of an attribute A is Gain(A) = Info(D) − InfoA(D), where InfoA(D) is the expected information still required to classify a tuple after partitioning D on A.

The gain ratio is defined as

GainRatio(A) = Gain(A)/SplitInfoA(D),

where SplitInfoA(D) = −Σj (|Dj|/|D|) log2(|Dj|/|D|) represents the potential information generated by splitting D into partitions. The attribute with the maximum gain ratio is selected as the splitting attribute. Note, however, that as the split information approaches 0, the ratio becomes unstable. A constraint is added to avoid this, whereby the information gain of the test selected must be large: at least as great as the average gain over all tests examined.
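A small sketch (toy data invented for illustration) of computing information gain and gain ratio for a discrete attribute:

```python
from math import log2

def info(labels):
    """Info(D): expected information (entropy) of a list of class labels."""
    n = len(labels)
    return -sum((labels.count(c) / n) * log2(labels.count(c) / n)
                for c in set(labels))

def gain_and_gain_ratio(values, labels):
    """Information gain and gain ratio of a discrete attribute."""
    n = len(labels)
    partitions = {}
    for v, c in zip(values, labels):
        partitions.setdefault(v, []).append(c)
    info_a = sum(len(p) / n * info(p) for p in partitions.values())
    split_info = -sum(len(p) / n * log2(len(p) / n) for p in partitions.values())
    gain = info(labels) - info_a
    return gain, gain / split_info if split_info else float("inf")

# a toy attribute that perfectly separates two balanced classes
g, gr = gain_and_gain_ratio(["a", "a", "b", "b"], ["yes", "yes", "no", "no"])
print(round(g, 3), round(gr, 3))  # -> 1.0 1.0
```

Note the guard for split information of 0, the instability the text warns about.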

Gini Index

The Gini index is used in CART. Using the notation previously described, the Gini index measures the impurity of D, a data partition or set of training tuples, as

Gini(D) = 1 − Σi pi^2,

where pi is the probability that a tuple in D belongs to class Ci.
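The impurity formula is a one-liner in Python (example values invented):

```python
def gini(labels):
    """Gini(D) = 1 - sum of squared class probabilities in partition D."""
    n = len(labels)
    return 1 - sum((labels.count(c) / n) ** 2 for c in set(labels))

print(gini(["yes", "yes", "no", "no"]))  # maximally impure two-class set -> 0.5
print(gini(["yes", "yes", "yes"]))       # pure partition -> 0.0
```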

Tree Pruning

When a decision tree is built, many of the branches will reflect anomalies in the training data due to noise or outliers. Tree pruning methods address this problem of overfitting the data. Such methods typically use statistical measures to remove the least-reliable branches.

Scalability and Decision Tree Induction

“What if D, the disk-resident training set of class-labeled tuples, does not fit in memory? In other words, how scalable is decision tree induction?” The efficiency of existing decision tree algorithms, such as ID3, C4.5, and CART, has been well established for relatively small data sets. Efficiency becomes an issue of concern when these algorithms are applied to the mining of very large real-world databases. The pioneering decision tree algorithms that we have discussed so far have the restriction that the training tuples should reside in memory.

In data mining applications, very large training sets of millions of tuples are common. Several scalable decision tree induction methods have been introduced in recent studies. RainForest, for example, adapts to the amount of main memory available and applies to any decision tree induction algorithm. The method maintains an AVC-set (where “AVC” stands for “Attribute-Value, Classlabel”) for each attribute, at each tree node, describing the training tuples at the node.

BOAT (Bootstrapped Optimistic Algorithm for Tree construction) is a decision tree algorithm that takes a completely different approach to scalability: it is not based on the use of any special data structures. Instead, it uses a statistical technique known as “bootstrapping” (Section 8.5.4) to create several smaller samples (or subsets) of the given training data, each of which fits in memory. Each subset is used to construct a tree, resulting in several trees. The trees are examined and used to construct a new tree, T', that turns out to be “very close” to the tree that would have been generated if all the original training data had fit in memory.

Visual Mining for Decision Tree Induction

“Are there any interactive approaches to decision tree induction that allow us to visualize the data and the tree as it is being constructed? Can we use any knowledge of our data to help in building the tree?” Perception-based classification (PBC) is an interactive approach based on multidimensional visualization techniques that allows the user to incorporate background knowledge about the data when building a decision tree. By visually interacting with the data, the user is also likely to develop a deeper understanding of the data. The resulting trees tend to be smaller than those built using traditional decision tree induction methods and so are easier to interpret, while achieving about the same accuracy.

## 3 Bayes Classification Methods

“What are Bayesian classifiers?” Bayesian classifiers are statistical classifiers. They can predict class membership probabilities, such as the probability that a given tuple belongs to a particular class.

Naive Bayesian Classification
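The excerpt stops short of describing naive Bayes itself. As a hedged sketch of the idea (the toy data, the add-one smoothing choice, and the helper names are invented for illustration), a naive Bayesian classifier picks the class C maximizing P(C) · Πk P(xk | C), treating the attributes as conditionally independent given the class:

```python
from collections import Counter, defaultdict

def train_naive_bayes(data):
    """Learn class priors P(C) and per-attribute likelihoods P(x_k = v | C)."""
    priors = Counter(label for _, label in data)
    cond = defaultdict(Counter)          # (class, attribute index) -> value counts
    for X, c in data:
        for k, v in enumerate(X):
            cond[(c, k)][v] += 1
    def classify(X):
        scores = {}
        for c, nc in priors.items():
            p = nc / len(data)           # prior P(C)
            for k, v in enumerate(X):
                counts = cond[(c, k)]
                # Laplace (add-one) smoothing so unseen values don't zero out p
                p *= (counts[v] + 1) / (nc + len(counts) + 1)
            scores[c] = p
        return max(scores, key=scores.get)
    return classify

classify = train_naive_bayes([(("sunny", "hot"), "no"), (("sunny", "mild"), "no"),
                              (("rain", "hot"), "yes"), (("rain", "mild"), "yes")])
print(classify(("rain", "hot")))  # -> yes
```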

## 4 Rule-Based Classification

Using IF-THEN Rules for Classification

Rules are a good way of representing information or bits of knowledge. A rule-based classifier uses a set of IF-THEN rules for classification. An IF-THEN rule is an expression of the form

IF condition THEN conclusion.

A rule’s coverage is the percentage of tuples that are covered by the rule (i.e., their attribute values hold true for the rule’s antecedent). For a rule’s accuracy, we look at the tuples that it covers and see what percentage of them the rule can correctly classify.
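Coverage and accuracy of a rule are straightforward to compute; in this sketch the rule, the attribute names, and the data set are all invented for illustration:

```python
def rule_stats(antecedent, consequent, data):
    """Coverage and accuracy of an IF-THEN rule over (X, label) tuples."""
    covered = [(X, c) for X, c in data if antecedent(X)]   # antecedent holds
    coverage = len(covered) / len(data)
    accuracy = (sum(1 for _, c in covered if c == consequent) / len(covered)
                if covered else 0.0)
    return coverage, accuracy

D = [({"age": "youth", "student": "yes"}, "buys"),
     ({"age": "youth", "student": "no"}, "other"),
     ({"age": "senior", "student": "yes"}, "buys"),
     ({"age": "youth", "student": "yes"}, "buys")]
# rule: IF age = youth AND student = yes THEN class = buys
cov, acc = rule_stats(lambda X: X["age"] == "youth" and X["student"] == "yes", "buys", D)
print(cov, acc)  # -> 0.5 1.0
```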

The size ordering scheme assigns the highest priority to the triggering rule that has the “toughest” requirements, where toughness is measured by the rule antecedent size. That is, the triggering rule with the most attribute tests is fired.

The rule ordering scheme prioritizes the rules beforehand. The ordering may be class-based or rule-based. With class-based ordering, the classes are sorted in order of decreasing “importance,” such as by decreasing order of prevalence. That is, all the rules for the most prevalent (or most frequent) class come first, the rules for the next prevalent class come next, and so on. Alternatively, they may be sorted based on the misclassification cost per class. Within each class, the rules are not ordered; they don’t have to be, because they all predict the same class (and so there can be no class conflict!).

With rule-based ordering, the rules are organized into one long priority list, according to some measure of rule quality, such as accuracy, coverage, or size (number of attribute tests in the rule antecedent), or based on advice from domain experts. When rule ordering is used, the rule set is known as a decision list. With rule ordering, the triggering rule that appears earliest in the list has the highest priority, and so it gets to fire its class prediction for a given tuple, X. Any other rule that satisfies X is ignored. Most rule-based classification systems use a class-based rule-ordering strategy.

Rule Extraction from a Decision Tree

To extract rules from a decision tree, one rule is created for each path from the root to a leaf node. Each splitting criterion along a given path is logically ANDed to form the rule antecedent (“IF” part). The leaf node holds the class prediction, forming the rule consequent (“THEN” part).

Rule Induction Using a Sequential Covering Algorithm

IF-THEN rules can be extracted directly from the training data (i.e., without having to generate a decision tree first) using a sequential covering algorithm. The name comes from the notion that the rules are learned sequentially (one at a time), where each rule for a given class will ideally cover many of the class’s tuples (and hopefully none of the tuples of other classes). Sequential covering algorithms are the most widely used approach to mining disjunctive sets of classification rules, and form the topic of this subsection.

## 5 Model Evaluation and Selection

Metrics for Evaluating Classifier Performance

We begin with some terminology. Positive tuples are tuples of the main class of interest, and negative tuples are all other tuples. Given two classes, for example, the positive tuples may be computer buyers while the negative tuples are non-buyers of computers. There are four additional terms that are the “building blocks” used in computing many evaluation measures.

True positives (TP): These refer to the positive tuples that were correctly labeled by the classifier. Let TP be the number of true positives.

True negatives (TN): These are the negative tuples that were correctly labeled by the classifier. Let TN be the number of true negatives.

False positives (FP): These are the negative tuples that were incorrectly labeled as positive (e.g., tuples of class buys computer = no for which the classifier predicted buys computer = yes). Let FP be the number of false positives.

False negatives (FN): These are the positive tuples that were mislabeled as negative (e.g., tuples of class buys computer = yes for which the classifier predicted buys computer = no). Let FN be the number of false negatives.

The confusion matrix is a useful tool for analyzing how well your classifier can recognize tuples of different classes. TP and TN tell us when the classifier is getting things right, while FP and FN tell us when the classifier is getting things wrong.

Accuracy = (TP + TN)/(P + N)

In the pattern recognition literature, this is also referred to as the overall recognition rate of the classifier; that is, it reflects how well the classifier recognizes tuples of the various classes.

The error rate or misclassification rate of a classifier, M, is simply 1 − accuracy(M), where accuracy(M) is the accuracy of M. This can also be computed as

error rate = (FP + FN)/(P + N)

Sensitivity is also referred to as the true positive (recognition) rate (i.e., the proportion of positive tuples that are correctly identified), while specificity is the true negative rate (i.e., the proportion of negative tuples that are correctly identified). These measures are defined as

sensitivity = TP/P

specificity = TN/N

It can be shown that accuracy is a function of sensitivity and specificity:

accuracy = sensitivity × P/(P + N) + specificity × N/(P + N)
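All of these measures follow from the four counts; a small sketch (the counts are invented for the example) that also checks the identity above:

```python
def metrics(TP, TN, FP, FN):
    """Accuracy, error rate, sensitivity, and specificity from the four counts."""
    P, N = TP + FN, TN + FP            # actual positives and actual negatives
    accuracy = (TP + TN) / (P + N)
    error_rate = (FP + FN) / (P + N)
    sensitivity = TP / P               # true positive rate
    specificity = TN / N               # true negative rate
    return accuracy, error_rate, sensitivity, specificity

acc, err, sens, spec = metrics(TP=90, TN=50, FP=10, FN=30)
# accuracy is a weighted combination of sensitivity and specificity:
assert abs(acc - (sens * 120 / 180 + spec * 60 / 180)) < 1e-12
print(round(acc, 3), round(err, 3), round(sens, 3), round(spec, 3))  # -> 0.778 0.222 0.75 0.833
```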

In addition to accuracy-based measures, classifiers can also be compared with respect to the following additional aspects:

Speed: This refers to the computational costs involved in generating and using the given classifier.

Robustness: This is the ability of the classifier to make correct predictions given noisy data or data with missing values. Robustness is typically assessed with a series of synthetic data sets representing increasing degrees of noise and missing values.

Scalability: This refers to the ability to construct the classifier efficiently given large amounts of data. Scalability is typically assessed with a series of data sets of increasing size.

Interpretability: This refers to the level of understanding and insight that is provided by the classifier or predictor. Interpretability is subjective and therefore more difficult to assess. Decision trees and classification rules can be easy to interpret, yet their interpretability may diminish as they become more complex.

Holdout Method and Random Subsampling

The holdout method is what we have alluded to so far in our discussions about accuracy. In this method, the given data are randomly partitioned into two independent sets, a training set and a test set. Typically, two-thirds of the data are allocated to the training set, and the remaining one-third is allocated to the test set. The training set is used to derive the model. The model’s accuracy is then estimated with the test set.
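A minimal sketch of the holdout partition (the fraction and seed are illustrative defaults, not from the text):

```python
import random

def holdout_split(data, train_frac=2 / 3, seed=42):
    """Randomly partition data into independent training and test sets."""
    rng = random.Random(seed)
    shuffled = data[:]
    rng.shuffle(shuffled)
    cut = round(len(shuffled) * train_frac)   # typically two-thirds for training
    return shuffled[:cut], shuffled[cut:]

data = list(range(30))
train, test = holdout_split(data)
print(len(train), len(test))  # -> 20 10
```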

Cross-Validation

In k-fold cross-validation, the initial data are randomly partitioned into k mutually exclusive subsets or “folds,” D1, D2, ..., Dk, each of approximately equal size. Training and testing is performed k times. In iteration i, partition Di is reserved as the test set, and the remaining partitions are collectively used to train the model. That is, in the first iteration, subsets D2, ..., Dk collectively serve as the training set to obtain a first model, which is tested on D1; the second iteration is trained on subsets D1, D3, ..., Dk and tested on D2; and so on. Unlike the holdout and random subsampling methods, here each sample is used the same number of times for training and once for testing. For classification, the accuracy estimate is the overall number of correct classifications from the k iterations, divided by the total number of tuples in the initial data.
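The procedure can be sketched as follows; the trivial majority-vote learner and the toy data are invented to keep the example self-contained:

```python
import random
from collections import Counter

def k_fold_indices(n, k, seed=0):
    """Randomly partition indices 0..n-1 into k folds of roughly equal size."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def cross_validate(data, k, train_and_test):
    """In iteration i, fold Di is the test set; the other folds train the model."""
    folds = k_fold_indices(len(data), k)
    correct = 0
    for i in range(k):
        test = [data[j] for j in folds[i]]
        train = [data[j] for f in folds[:i] + folds[i + 1:] for j in f]
        correct += train_and_test(train, test)   # number correct on this fold
    return correct / len(data)                   # overall accuracy estimate

def majority_learner(train, test):
    """Trivial model: always predict the training set's majority label."""
    maj = Counter(c for _, c in train).most_common(1)[0][0]
    return sum(1 for _, c in test if c == maj)

data = [(i, "yes") for i in range(8)] + [(i, "no") for i in range(2)]
print(cross_validate(data, k=5, train_and_test=majority_learner))  # -> 0.8
```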

Bootstrap

Unlike the accuracy estimation methods just mentioned, the bootstrap method samples the given training tuples uniformly with replacement. That is, each time a tuple is selected, it is equally likely to be selected again and re-added to the training set. For instance, imagine a machine that randomly selects tuples for our training set. In sampling with replacement, the machine is allowed to select the same tuple more than once.

There are several bootstrap methods. A commonly used one is the .632 bootstrap, which works as follows. Suppose we are given a data set of d tuples. The data set is sampled d times, with replacement, resulting in a bootstrap sample or training set of d samples. It is very likely that some of the original data tuples will occur more than once in this sample. The data tuples that did not make it into the training set end up forming the test set. Suppose we were to try this out several times. As it turns out, on average, 63.2% of the original data tuples will end up in the bootstrap sample, and the remaining 36.8% will form the test set (hence, the name, .632 bootstrap).
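The 63.2% figure is easy to verify empirically (the data-set size and seed here are arbitrary):

```python
import random

def bootstrap_sample(d, seed=1):
    """Sample d tuple indices with replacement; the unsampled ones form the test set."""
    rng = random.Random(seed)
    train = [rng.randrange(d) for _ in range(d)]
    test = set(range(d)) - set(train)
    return train, test

d = 100_000
train, test = bootstrap_sample(d)
print(round(len(set(train)) / d, 3))  # distinct fraction in the sample, close to 0.632
```

The limit comes from (1 − 1/d)^d → e^−1 ≈ 0.368 being each tuple's chance of never being picked.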

Model Selection Using Statistical Tests of Significance

To determine whether Model 1 and Model 2 are significantly different, we compute t and select a significance level, sig. In practice, a significance level of 5% or 1% is typically used.

Comparing Classifiers Based on Cost–Benefit and ROC Curves

The true positives, true negatives, false positives, and false negatives are also useful in assessing the costs and benefits (or risks and gains) associated with a classification model. The cost associated with a false negative (such as incorrectly predicting that a cancerous patient is not cancerous) is far greater than the cost of a false positive (incorrectly yet conservatively labeling a noncancerous patient as cancerous). In such cases, we can outweigh one type of error over another by assigning a different cost to each. These costs may consider the danger to the patient, financial costs of resulting therapies, and other hospital costs. Similarly, the benefits associated with a true positive decision may be different than those of a true negative. Up to now, to compute classifier accuracy, we have assumed equal costs and essentially divided the sum of true positives and true negatives by the total number of test tuples.

## 6 Techniques to Improve Classification Accuracy

Traditional learning models assume that the data classes are well distributed. In many real-world data domains, however, the data are class-imbalanced, where the main class of interest is represented by only a few tuples. This is known as the class imbalance problem. We also study techniques for improving the classification accuracy of class-imbalanced data.

Introducing Ensemble Methods

Bagging, boosting, and random forests are examples of ensemble methods (Figure 8.21). An ensemble combines a series of k learned models (or base classifiers), M1, M2, ..., Mk, with the aim of creating an improved composite classification model, M∗. A given data set, D, is used to create k training sets, D1, D2, ..., Dk, where Di (1 ≤ i ≤ k) is used to generate classifier Mi. Given a new data tuple to classify, the base classifiers each vote by returning a class prediction. The ensemble returns a class prediction based on the votes of the base classifiers.

An ensemble tends to be more accurate than its base classifiers. For example, consider an ensemble that performs majority voting. That is, given a tuple X to classify, it collects the class label predictions returned from the base classifiers and outputs the class in the majority. The base classifiers may make mistakes, but the ensemble will misclassify X only if over half of the base classifiers are in error. Ensembles yield better results when there is significant diversity among the models. That is, ideally, there is little correlation among classifiers. The classifiers should also perform better than random guessing. Each base classifier can be allocated to a different CPU, and so ensemble methods are parallelizable.
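The majority-voting step itself is tiny; in this sketch the three base classifiers are invented stand-ins that disagree near the class boundary:

```python
from collections import Counter

def ensemble_predict(base_classifiers, X):
    """Majority voting: each base classifier votes; the class in the majority wins."""
    votes = Counter(m(X) for m in base_classifiers)
    return votes.most_common(1)[0][0]

# three hypothetical base classifiers that disagree near the class boundary
models = [lambda X: "yes" if X > 2 else "no",
          lambda X: "yes" if X > 4 else "no",
          lambda X: "yes" if X > 3 else "no"]
print(ensemble_predict(models, 5))  # all three vote yes -> yes
print(ensemble_predict(models, 3))  # votes yes, no, no  -> no
```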

Bagging

Given a set, D, of d tuples, bagging works as follows. For iteration i (i = 1, 2, ..., k), a training set, Di, of d tuples is sampled with replacement from the original set of tuples, D. Note that the term bagging stands for bootstrap aggregation. Each training set is a bootstrap sample, as described in Section 8.5.4. Because sampling with replacement is used, some of the original tuples of D may not be included in Di, whereas others may occur more than once. A classifier model, Mi, is learned for each training set, Di. To classify an unknown tuple, X, each classifier, Mi, returns its class prediction, which counts as one vote. The bagged classifier, M∗, counts the votes and assigns the class with the most votes to X. Bagging can be applied to the prediction of continuous values by taking the average value of each prediction for a given test tuple. The algorithm is summarized in Figure 8.23.

The bagged classifier often has significantly greater accuracy than a single classifier derived from D, the original training data.
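A compact sketch of the procedure: bootstrap samples, one model per sample, then voting. The nearest-example base learner and the one-dimensional toy data are invented for the example, not part of the book's presentation:

```python
import random
from collections import Counter

def bagging(D, k, learn, seed=0):
    """Learn k base classifiers, each from a bootstrap sample of D; vote to predict."""
    rng = random.Random(seed)
    models = []
    for _ in range(k):
        Di = [rng.choice(D) for _ in D]    # d tuples sampled with replacement
        models.append(learn(Di))           # model Mi learned from Di
    def bagged_classifier(X):              # M*: the class with the most votes wins
        return Counter(m(X) for m in models).most_common(1)[0][0]
    return bagged_classifier

def learn_nn(Di):
    """Hypothetical weak base learner: predict the label of the nearest sampled x."""
    def predict(X):
        return min(Di, key=lambda t: abs(t[0] - X))[1]
    return predict

D = [(x, "yes" if x > 5 else "no") for x in range(11)]
bagged = bagging(D, k=11, learn=learn_nn)
print(bagged(9), bagged(1))  # -> yes no
```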

Boosting and AdaBoost

In boosting, weights are assigned to each training tuple. A series of k classifiers is iteratively learned. After a classifier, Mi, is learned, the weights are updated to allow the subsequent classifier, Mi+1, to “pay more attention” to the training tuples that were misclassified by Mi. The final boosted classifier, M∗, combines the votes of each individual classifier, where the weight of each classifier’s vote is a function of its accuracy.

AdaBoost (short for Adaptive Boosting) is a popular boosting algorithm. Suppose we want to boost the accuracy of a learning method. We are given D, a data set of d class-labeled tuples, (X1, y1), (X2, y2), ..., (Xd, yd), where yi is the class label of tuple Xi. Initially, AdaBoost assigns each training tuple an equal weight of 1/d. Generating k classifiers for the ensemble requires k rounds through the rest of the algorithm. In round i, the tuples from D are sampled to form a training set, Di, of size d. Sampling with replacement is used: the same tuple may be selected more than once. Each tuple’s chance of being selected is based on its weight. A classifier model, Mi, is derived from the training tuples of Di. Its error is then calculated using Di as a test set. The weights of the training tuples are then adjusted according to how they were classified. If a tuple was incorrectly classified, its weight is increased. If a tuple was correctly classified, its weight is decreased.
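The reweighting step of one round can be sketched as follows. This uses the common exponential formulation of AdaBoost (the book's exact normalization differs slightly: it multiplies correctly classified tuples' weights by error(Mi)/(1 − error(Mi)) and renormalizes); the example counts are invented:

```python
import math

def update_weights(weights, misclassified, error):
    """One AdaBoost round: raise the weights of misclassified tuples, lower the rest.

    weights:       current tuple weights (summing to 1)
    misclassified: bool per tuple, from testing classifier Mi on the round's data
    error:         Mi's weighted error, assumed to satisfy 0 < error < 0.5
    """
    alpha = 0.5 * math.log((1 - error) / error)      # Mi's vote weight
    w = [wi * math.exp(alpha if bad else -alpha)
         for wi, bad in zip(weights, misclassified)]
    total = sum(w)
    return alpha, [wi / total for wi in w]           # renormalize to sum to 1

d = 4
alpha, w = update_weights([1 / d] * d, [True, False, False, False], error=0.25)
print(round(alpha, 3), [round(x, 3) for x in w])  # -> 0.549 [0.5, 0.167, 0.167, 0.167]
```

Note how the single misclassified tuple's weight jumps from 0.25 to 0.5, so the next round's sampling concentrates on it.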

Random Forests

Random forests can be built using bagging (Section 8.6.2) in tandem with random attribute selection. A training set, D, of d tuples is given. The general procedure to generate k decision trees for the ensemble is as follows. For each iteration, i (i = 1, 2, ..., k), a training set, Di, of d tuples is sampled with replacement from D. That is, each Di is a bootstrap sample of D (Section 8.5.4), so that some tuples may occur more than once in Di, while others may be excluded. Let F be the number of attributes to be used to determine the split at each node, where F is much smaller than the number of available attributes. To construct a decision tree classifier, Mi, randomly select, at each node, F attributes as candidates for the split at the node. The CART methodology is used to grow the trees. The trees are grown to maximum size and are not pruned. Random forests formed this way, with random input selection, are called Forest-RI.
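The extra ingredient beyond bagging is the per-node draw of F candidate attributes; a tiny sketch (the attribute names are invented):

```python
import random

def candidate_attributes(attribute_list, F, rng):
    """Forest-RI: at each node, draw F random attributes as split candidates."""
    return rng.sample(attribute_list, k=min(F, len(attribute_list)))

attrs = ["age", "income", "student", "credit", "region"]
cands = candidate_attributes(attrs, F=2, rng=random.Random(7))
print(cands)  # two randomly chosen candidate attributes
```

The best split is then chosen among these F candidates only, which decorrelates the trees in the forest.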

Another form of random forest, called Forest-RC, uses random linear combinations of the input attributes.

Random forests are comparable in accuracy to AdaBoost, yet are more robust to errors and outliers.

Improving Classification Accuracy of Class-Imbalanced Data

Given two-class data, the data are class-imbalanced if the main class of interest (the positive class) is represented by only a few tuples, while the majority of tuples represent the negative class. For multiclass-imbalanced data, the data distribution of each class differs substantially where, again, the main class or classes of interest are rare. The class imbalance problem is closely related to cost-sensitive learning, wherein the costs of errors, per class, are not equal. In medical diagnosis, for example, it is much more costly to falsely diagnose a cancerous patient as healthy (a false negative) than to misdiagnose a healthy patient as having cancer (a false positive). A false negative error could lead to the loss of life and therefore is much more expensive than a false positive error. Other applications involving class-imbalanced data include fraud detection, the detection of oil spills from satellite radar images, and fault monitoring.

General approaches for improving the classification accuracy of class-imbalanced data include (1) oversampling, (2) undersampling, (3) threshold moving, and (4) ensemble techniques.
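Oversampling, the first of these, can be sketched as follows; the balancing strategy (resample each minority class up to the size of the largest class) and the toy data are invented for the example:

```python
import random
from collections import Counter

def oversample(data, seed=0):
    """Resample minority-class tuples with replacement until all classes balance."""
    rng = random.Random(seed)
    by_class = {}
    for X, c in data:
        by_class.setdefault(c, []).append((X, c))
    target = max(len(v) for v in by_class.values())   # size of the largest class
    out = []
    for tuples in by_class.values():
        out += tuples + [rng.choice(tuples) for _ in range(target - len(tuples))]
    return out

data = [(i, "neg") for i in range(8)] + [(i, "pos") for i in range(2)]
balanced = oversample(data)
print(Counter(c for _, c in balanced))  # -> Counter({'neg': 8, 'pos': 8})
```

Undersampling works in the opposite direction, discarding majority-class tuples; threshold moving instead adjusts the decision threshold of a probabilistic classifier without resampling.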


Excerpts from the Book

## Data Mining: Concepts and Techniques, Third Edition

Jiawei Han, University of Illinois at Urbana–Champaign

Micheline Kamber and Jian Pei, Simon Fraser University
