Content to be rewritten to integrate various points.
1 Basic Concepts
What Is Classification?
Classification and numeric prediction are the two major types of prediction
General Approach to Classification
“How does classification work?” Data classification is a two-step process, consisting of a
learning step (where a classification model is constructed) and a classification step (where
the model is used to predict class labels for given data).
In the first step, a classifier is built describing a predetermined set of data classes or
concepts. This is the learning step (or training phase), where a classification algorithm
builds the classifier by analyzing or “learning from” a training set made up of database
tuples and their associated class labels. A tuple, X, is represented by an n-dimensional
attribute vector, X = (x1, x2,..., xn), depicting n measurements made on the tuple
from n database attributes, respectively, A1, A2,..., An.
1 Each tuple, X, is assumed to
belong to a predefined class as determined by another database attribute called the class
label attribute. The class label attribute is discrete-valued and unordered. It is categorical
(or nominal) in that each value serves as a category or class. The individual tuples
making up the training set are referred to as training tuples and are randomly sampled
from the database under analysis. In the context of classification, data tuples can be
referred to as samples, examples, instances, data points, or objects.
Because the class label of each training tuple is provided, this step is also known as
supervised learning (i.e., the learning of the classifier is “supervised” in that it is told
to which class each training tuple belongs). It contrasts with unsupervised learning (or
clustering), in which the class label of each training tuple is not known, and the number
or set of classes to be learned may not be known in advance.
The accuracy of a classifier on a given test set is the percentage of test set tuples that
are correctly classified by the classifier. The associated class label of each test tuple is compared
with the learned classifier’s class prediction for that tuple
2 Decision Tree InductionDecision tree induction is the learning of decision trees from class-labeled training
tuples. A decision tree is a flowchart-like tree structure, where each internal node (nonleaf
node) denotes a test on an attribute, each branch represents an outcome of the
test, and each leaf node (or terminal node) holds a class label. The topmost node in
a tree is the root node.
“How are decision trees used for classification?” Given a tuple, X, for which the associated
class label is unknown, the attribute values of the tuple are tested against the
decision tree. A path is traced from the root to a leaf node, which holds the class
prediction for that tuple. Decision trees can easily be converted to classification rules.
“Why are decision tree classifiers so popular?” The construction of decision tree classifiers
does not require any domain knowledge or parameter setting, and therefore is
appropriate for exploratory knowledge discovery. Decision trees can handle multidimensional
data. Their representation of acquired knowledge in tree form is intuitive and
generally easy to assimilate by humans. The learning and classification steps of decision
tree induction are simple and fast. In general, decision tree classifiers have good accuracy.
However, successful use may depend on the data at hand. Decision tree induction
algorithms have been used for classification in many application areas such as medicine,
manufacturing and production, financial analysis, astronomy, and molecular biology.
Decision trees are the basis of several commercial rule induction systems
Basic algorithm for inducing a decision tree from training tuples.
The algorithm is called with three parameters: D, attribute list, and Attribute
selection method. We refer to D as a data partition. Initially, it is the complete set
of training tuples and their associated class labels. The parameter attribute list is a
list of attributes describing the tuples. Attribute selection method specifies a heuristic
procedure for selecting the attribute that “best” discriminates the given tuples
according to class. This procedure employs an attribute selection measure such as
information gain or the Gini index. Whether the tree is strictly binary is generally
driven by the attribute selection measure. Some attribute selection measures, such as
the Gini index, enforce the resulting tree to be binary. Others, like information gain,
do not, therein allowing multiway splits (i.e., two or more branches to be grown from
The tree starts as a single node, N, representing the training tuples in D (step 1).3
Algorithm: Generate decision tree. Generate a decision tree from the training tuples of
data partition, D.
Data partition, D, which is a set of training tuples and their associated class labels;
attribute list, the set of candidate attributes;
Attribute selection method, a procedure to determine the splitting criterion that “best”
partitions the data tuples into individual classes. This criterion consists of a
splitting attribute and, possibly, either a split-point or splitting subset.
Output: A decision tree.
(1) create a node N;
(2) if tuples in D are all of the same class, C, then
(3) return N as a leaf node labeled with the class C;
(4) if attribute list is empty then
(5) return N as a leaf node labeled with the majority class in D; // majority voting
(6) apply Attribute selection method(D, attribute list) to find the “best” splitting criterion;
(7) label node N with splitting criterion;
(8) if splitting attribute is discrete-valued and
multiway splits allowed then // not restricted to binary trees
(9) attribute list ← attribute list − splitting attribute; // remove splitting attribute
(10) for each outcome j of splitting criterion
// partition the tuples and grow subtrees for each partition
(11) let Dj be the set of data tuples in D satisfying outcome j; // a partition
(12) if Dj
is empty then
(13) attach a leaf labeled with the majority class in D to node N;
(14) else attach the node returned by Generate decision tree(Dj
, attribute list) to node N;
(15) return N;
Basic algorithm for inducing a decision tree from training tuples.
Attribute Selection Measures
An attribute selection measure is a heuristic for selecting the splitting criterion that
“best” separates a given data partition, D, of class-labeled training tuples into individual
classes. If we were to split D into smaller partitions according to the outcomes of the
splitting criterion, ideally each partition would be pure (i.e., all the tuples that fall into a
given partition would belong to the same class). Conceptually, the “best” splitting criterion
is the one that most closely results in such a scenario. Attribute selection measures
are also known as splitting rules because they determine how the tuples at a given node
are to be split.
ID3 uses information gain as its attribute selection measure. This measure is based on
pioneering work by Claude Shannon on information theory, which studied the value or
“information content” of messages.
The gain ratio is defined as
GainRatio(A) = Gain(A)/SplitInfoA (D)
The attribute with the maximum gain ratio is selected as the splitting attribute. Note,
however, that as the split information approaches 0, the ratio becomes unstable. A constraint
is added to avoid this, whereby the information gain of the test selected must be
large—at least as great as the average gain over all tests examined.
The Gini index is used in CART. Using the notation previously described, the Gini index
measures the impurity of D, a data partition or set of training tuples, as
When a decision tree is built, many of the branches will reflect anomalies in the training
data due to noise or outliers. Tree pruning methods address this problem of overfitting
the data. Such methods typically use statistical measures to remove the least-reliable
Scalability and Decision Tree Induction
“What if D, the disk-resident training set of class-labeled tuples, does not fit in memory? In
other words, how scalable is decision tree induction?” The efficiency of existing decision
tree algorithms, such as ID3, C4.5, and CART, has been well established for relatively
small data sets. Efficiency becomes an issue of concern when these algorithms are applied
to the mining of very large real-world databases. The pioneering decision tree algorithms
that we have discussed so far have the restriction that the training tuples should reside
In data mining applications, very large training sets of millions of tuples are common
Several scalable decision tree induction methods have been introduced in recent studies.
RainForest, for example, adapts to the amount of main memory available and applies
to any decision tree induction algorithm. The method maintains an AVC-set (where
“AVC” stands for “Attribute-Value, Classlabel”) for each attribute, at each tree node,
describing the training tuples at the node.
BOAT (Bootstrapped Optimistic Algorithm for Tree construction) is a decision tree
algorithm that takes a completely different approach to scalability—it is not based on
the use of any special data structures. Instead, it uses a statistical technique known as
“bootstrapping” (Section 8.5.4) to create several smaller samples (or subsets) of the
given training data, each of which fits in memory. Each subset is used to construct a
tree, resulting in several trees. The trees are examined and used to construct a new tree,
T', that turns out to be “very close” to the tree that would have been generated if all the
original training data had fit in memory.
Visual Mining for Decision Tree Induction
“Are there any interactive approaches to decision tree induction that allow us to visualize
the data and the tree as it is being constructed? Can we use any knowledge of our
data to help in building the tree?”
(PBC) is an interactive approach based on multidimensional visualization techniques
and allows the user to incorporate background knowledge about the data when building
a decision tree. By visually interacting with the data, the user is also likely to develop a
deeper understanding of the data. The resulting trees tend to be smaller than those built
using traditional decision tree induction methods and so are easier to interpret, while
achieving about the same accuracy
3 Bayes Classification Methods
“What are Bayesian classifiers?” Bayesian classifiers are statistical classifiers. They can
predict class membership probabilities such as the probability that a given tuple belongs
to a particular class.
Naive Bayesian Classification
4 Rule-Based Classification
Using IF-THEN Rules for Classification
Rules are a good way of representing information or bits of knowledge. A rule-based
classifier uses a set of IF-THEN rules for classification. An IF-THEN rule is an expression
of the form
IF condition THEN conclusion.
That is, a rule’s coverage is the percentage of tuples that are covered by the rule (i.e., their
attribute values hold true for the rule’s antecedent). For a rule’s accuracy, we look at the
tuples that it covers and see what percentage of them the rule can correctly classify.
The size ordering scheme assigns the highest priority to the triggering rule that has
the “toughest” requirements, where toughness is measured by the rule antecedent size.
That is, the triggering rule with the most attribute tests is fired.
The rule ordering scheme prioritizes the rules beforehand. The ordering may be
class-based or rule-based. With class-based ordering, the classes are sorted in order of
decreasing “importance” such as by decreasing order of prevalence. That is, all the rules
for the most prevalent (or most frequent) class come first, the rules for the next prevalent
class come next, and so on. Alternatively, they may be sorted based on the misclassification
cost per class. Within each class, the rules are not ordered—they don’t have to be
because they all predict the same class (and so there can be no class conflict!).
With rule-based ordering, the rules are organized into one long priority list, according
to some measure of rule quality, such as accuracy, coverage, or size (number of
attribute tests in the rule antecedent), or based on advice from domain experts. When
rule ordering is used, the rule set is known as a decision list. With rule ordering, the triggering
rule that appears earliest in the list has the highest priority, and so it gets to fire its
class prediction. Any other rule that satisfies X is ignored. Most rule-based classification
systems use a class-based rule-ordering strategy.
Rule Extraction from a Decision Tree
To extract rules from a decision tree, one rule is created for each path from the root
to a leaf node. Each splitting criterion along a given path is logically ANDed to form the
rule antecedent (“IF” part). The leaf node holds the class prediction, forming the rule
consequent (“THEN” part).
Rule Induction Using a Sequential Covering Algorithm
IF-THEN rules can be extracted directly from the training data (i.e., without having to
generate a decision tree first) using a sequential covering algorithm. The name comes
from the notion that the rules are learned sequentially (one at a time), where each rule
for a given class will ideally cover many of the class’s tuples (and hopefully none of
the tuples of other classes). Sequential covering algorithms are the most widely used
approach to mining disjunctive sets of classification rules, and form the topic of this
5 Model Evaluation and Selection
Metrics for Evaluating Classifier Performance
Positive tuples are tuples of the main class of interest and negative tuples (all other tuples). Given two classes, for
example, the positive tuples may be computer buyers while the negative tuples are non-buyers of computers
There are four additional terms that are the “building blocks” used in computing many evaluation measures.
True positives (TP): These refer to the positive tuples that were correctly labeled by
the classifier. Let TP be the number of true positives.
True negatives(TN): These are the negative tuples that were correctly labeled by the
classifier. Let TN be the number of true negatives.
False positives (FP): These are the negative tuples that were incorrectly labeled as
positive (e.g., tuples of class buys computer = no for which the classifier predicted
buys computer = yes). Let FP be the number of false positives.
False negatives (FN): These are the positive tuples that were mislabeled as negative
(e.g., tuples of class buys computer = yes for which the classifier predicted
buys computer = no). Let FN be the number of false negatives.
The confusion matrix is a useful tool for analyzing how well your classifier can
recognize tuples of different classes. TP and TN tell us when the classifier is getting
things right, while FP and FN tell us when the classifier is getting things wrong.
Accuracy = [TP + TN]/[P + N]
In the pattern recognition literature, this is also referred to as the overall recognition
rate of the classifier, that is, it reflects how well the classifier recognizes tuples of the various
error rate or misclassification rate of a classifier, M, which
is simply 1 − accuracy(M), where accuracy(M) is the accuracy of M. This also can be
error rate = [FP + FN]/[P + N]
Sensitivity is also referred to as the true positive (recognition) rate (i.e., the proportion
of positive tuples that are correctly identified), while specificity is the true negative rate
(i.e., the proportion of negative tuples that are correctly identified). These measures are
sensitivity = TP/ P
specificity = TN/N
It can be shown that accuracy is a function of sensitivity and specificity:
accuracy = sensitivity*P/(P + N) + specificity*N(P + N)
In addition to accuracy-based measures, classifiers can also be compared with respect
to the following additional aspects:
Speed: This refers to the computational costs involved in generating and using the
Robustness: This is the ability of the classifier to make correct predictions given noisy
data or data with missing values. Robustness is typically assessed with a series of
synthetic data sets representing increasing degrees of noise and missing values.
Scalability: This refers to the ability to construct the classifier efficiently given large
amounts of data. Scalability is typically assessed with a series of data sets of increasing
Interpretability: This refers to the level of understanding and insight that is provided
by the classifier or predictor. Interpretability is subjective and therefore more difficult
to assess. Decision trees and classification rules can be easy to interpret, yet their
interpretability may diminish the more they become complex
Holdout Method and Random Subsampling
The holdout method is what we have alluded to so far in our discussions about accuracy.
In this method, the given data are randomly partitioned into two independent sets, a
training set and a test set. Typically, two-thirds of the data are allocated to the training
set, and the remaining one-third is allocated to the test set. The training set is used to
derive the model. The model’s accuracy is then estimated with the test set.
In k-fold cross-validation, the initial data are randomly partitioned into k mutually
exclusive subsets or “folds,” D1, D2,..., Dk
, each of approximately equal size. Training
and testing is performed k times. In iteration i, partition Di
is reserved as the test set,
and the remaining partitions are collectively used to train the model. That is, in the
first iteration, subsets D2,..., Dk collectively serve as the training set to obtain a first
model, which is tested on D1; the second iteration is trained on subsets D1, D3,..., Dk
and tested on D2; and so on. Unlike the holdout and random subsampling methods,
here each sample is used the same number of times for training and once for testing. For
classification, the accuracy estimate is the overall number of correct classifications from
the k iterations, divided by the total number of tuples in the initial data.
Unlike the accuracy estimation methods just mentioned, the bootstrap method samples
the given training tuples uniformly with replacement. That is, each time a tuple is
selected, it is equally likely to be selected again and re-added to the training set. For
instance, imagine a machine that randomly selects tuples for our training set. In sampling
with replacement, the machine is allowed to select the same tuple more than once.
There are several bootstrap methods. A commonly used one is the .632 bootstrap,
which works as follows. Suppose we are given a data set of d tuples. The data set is
sampled d times, with replacement, resulting in a bootstrap sample or training set of d
samples. It is very likely that some of the original data tuples will occur more than once
in this sample. The data tuples that did not make it into the training set end up forming
the test set. Suppose we were to try this out several times. As it turns out, on average,
63.2% of the original data tuples will end up in the bootstrap sample, and the remaining
36.8% will form the test set (hence, the name, .632 bootstrap).
Model Selection Using Statistical Tests of Significance
To determine whether Model 1 and Model 2 are significantly different, we compute t and select
a significance level, sig. In practice, a significance level of 5% or 1% is typically used.
Comparing Classifiers Based on Cost–Benefit
and ROC Curves
The true positives, true negatives, false positives, and false negatives are also useful in
assessing the costs and benefits (or risks and gains) associated with a classification
model. The cost associated with a false negative (such as incorrectly predicting that a
cancerous patient is not cancerous) is far greater than those of a false positive
(incorrectly yet conservatively labeling a noncancerous patient as cancerous). In such
cases, we can outweigh one type of error over another by assigning a different cost to
each. These costs may consider the danger to the patient, financial costs of resulting
therapies, and other hospital costs. Similarly, the benefits associated with a true positive
decision may be different than those of a true negative. Up to now, to compute classifier
accuracy, we have assumed equal costs and essentially divided the sum of true positives
and true negatives by the total number of test tuples.
6 Techniques to Improve Classification AccuracyTraditional learning models assume that the data classes are well distributed. In
many real-world data domains, however, the data are class-imbalanced, where the
main class of interest is represented by only a few tuples. This is known as the class
imbalance problem. We also study techniques for improving the classification accuracy
of class-imbalanced data.
Introducing Ensemble Methods
Bagging, boosting, and random forests are examples of ensemble methods (Figure 8.21).
An ensemble combines a series of k learned models (or base classifiers), M1, M2,..., Mk
with the aim of creating an improved composite classification model, M∗. A given data
set, D, is used to create k training sets, D1, D2,..., Dk
, where Di (1 ≤ i ≤ k − 1) is used
to generate classifier Mi
. Given a new data tuple to classify, the base classifiers each vote
by returning a class prediction. The ensemble returns a class prediction based on the
votes of the base classifiers.
An ensemble tends to be more accurate than its base classifiers. For example, consider
an ensemble that performs majority voting. That is, given a tuple X to classify, it
collects the class label predictions returned from the base classifiers and outputs the class
in majority. The base classifiers may make mistakes, but the ensemble will misclassify X
only if over half of the base classifiers are in error. Ensembles yield better results when
there is significant diversity among the models. That is, ideally, there is little correlation
among classifiers. The classifiers should also perform better than random guessing.
Each base classifier can be allocated to a different CPU and so ensemble methods are
Given a set, D, of d tuples, bagging works as follows. For iteration i(i = 1, 2,..., k),
a training set, Di
, of d tuples is sampled with replacement from the original set of
tuples, D. Note that the term bagging stands for bootstrap aggregation. Each training
set is a bootstrap sample, as described in Section 8.5.4. Because sampling with replacement
is used, some of the original tuples of D may not be included in Di
, whereas others
may occur more than once. A classifier model, Mi
, is learned for each training set, Di
To classify an unknown tuple, X, each classifier, Mi
, returns its class prediction, which
counts as one vote. The bagged classifier, M∗, counts the votes and assigns the class
with the most votes to X. Bagging can be applied to the prediction of continuous values
by taking the average value of each prediction for a given test tuple. The algorithm is
summarized in Figure 8.23.
The bagged classifier often has significantly greater accuracy than a single classifier
derived from D, the original training data.
Boosting and AdaBoost
In boosting, weights are also assigned to each training tuple. A series of k classifiers is
iteratively learned. After a classifier, Mi
, is learned, the weights are updated to allow the
subsequent classifier, Mi+1, to “pay more attention” to the training tuples that were misclassified
. The final boosted classifier, M∗, combines the votes of each individual
classifier, where the weight of each classifier’s vote is a function of its accuracy.
AdaBoost (short for Adaptive Boosting) is a popular boosting algorithm. Suppose
we want to boost the accuracy of a learning method. We are given D, a data set of
d class-labeled tuples, (X1, y1),(X2, y2),...,(Xd, yd), where yi
is the class label of tuple
. Initially, AdaBoost assigns each training tuple an equal weight of 1/d. Generating
k classifiers for the ensemble requires k rounds through the rest of the algorithm. In
round i, the tuples from D are sampled to form a training set, Di
, of size d. Sampling with replacement is used—the same tuple may be selected more than once. Each tuple’s
chance of being selected is based on its weight. A classifier model, Mi
, is derived from
the training tuples of Di
. Its error is then calculated using Di as a test set. The weights of
the training tuples are then adjusted according to how they were classified.
If a tuple was incorrectly classified, its weight is increased. If a tuple was correctly
classified, its weight is decreased.
Random forests can be built using bagging (Section 8.6.2) in tandem with random
attribute selection. A training set, D, of d tuples is given. The general procedure to generate
k decision trees for the ensemble is as follows. For each iteration, i(i = 1, 2,..., k),
a training set, Di
, of d tuples is sampled with replacement from D. That is, each Di
bootstrap sample of D (Section 8.5.4), so that some tuples may occur more than once
, while others may be excluded. Let F be the number of attributes to be used to
determine the split at each node, where F is much smaller than the number of available
attributes. To construct a decision tree classifier, Mi
, randomly select, at each node,
F attributes as candidates for the split at the node. The CART methodology is used to
grow the trees. The trees are grown to maximum size and are not pruned. Random
forests formed this way, with random input selection, are called Forest-RI.
Another form of random forest, called Forest-RC, uses random linear combinations
of the input attributes.
Random forests are comparable in accuracy to AdaBoost, yet are more robust to
errors and outliers.
Improving Classification Accuracy of Class-Imbalanced Data
Given two-class data, the data are class-imbalanced if the main class of interest (the
positive class) is represented by only a few tuples, while the majority of tuples represent
the negative class. For multiclass-imbalanced data, the data distribution of each class
differs substantially where, again, the main class or classes of interest are rare. The
class imbalance problem is closely related to cost-sensitive learning, wherein the costs of
errors, per class, are not equal. In medical diagnosis, for example, it is much more costly
to falsely diagnose a cancerous patient as healthy (a false negative) than to misdiagnose
a healthy patient as having cancer (a false positive). A false negative error could lead to
the loss of life and therefore is much more expensive than a false positive error. Other
applications involving class-imbalanced data include fraud detection, the detection of
oil spills from satellite radar images, and fault monitoring.
General approaches for improving the classification accuracy of class-imbalanced data. These approaches include (1) oversampling, (2) undersampling,
(3) threshold moving, and (4) ensemble techniques.
Next Chapter: Data Mining - Classification: Advanced Methods
Excerpts from the Book
Data Mining Concepts and TechniquesThird Edition
University of Illinois at Urbana–Champaign
Micheline Kamber, Jian Pei
Simon Fraser University