## 1 Bayesian Belief Networks

Concepts and Mechanisms
The naïve Bayesian classifier makes the assumption of class conditional independence,
that is, given the class label of a tuple, the values of the attributes are assumed to
be conditionally independent of one another. This simplifies computation. When the
assumption holds true, the naïve Bayesian classifier is the most accurate of all
classifiers. In practice, however, dependencies can exist between
variables. Bayesian belief networks specify joint conditional probability distributions.
They allow class conditional independencies to be defined between subsets of variables.
They provide a graphical model of causal relationships, on which learning can be performed.
Trained Bayesian belief networks can be used for classification. Bayesian belief
networks are also known as belief networks, Bayesian networks, and probabilistic
networks.

In the learning or training of a belief network,
a number of scenarios are possible. The network topology (or “layout” of nodes
and arcs) may be constructed by human experts or inferred from the data. The network
variables may be observable or hidden in all or some of the training tuples. The hidden
data case is also referred to as missing values or incomplete data.
Several algorithms exist for learning the network topology from the training data
given observable variables. The problem is one of discrete optimization.

If the network topology is known and the variables are observable, then training the
network is straightforward. It consists of computing the conditional probability table (CPT)
entries, as is similarly done when computing the probabilities involved in naïve Bayesian
classification.
When the network topology is given and some of the variables are hidden, there
are various methods to choose from for training the belief network.

A gradient descent strategy is used to search for the w_ijk values that best model the
data, based on the assumption that each possible setting of w_ijk is equally likely. Such
a strategy is iterative. It searches for a solution along the negative of the gradient (i.e.,
steepest descent) of a criterion function. We want to find the set of weights, W, that
maximize this function. To start with, the weights are initialized to random probability
values. The gradient descent method performs greedy hill-climbing in that, at each
iteration or step along the way, the algorithm moves toward what appears to be the
best solution at the moment, without backtracking. The weights are updated at each
iteration. Eventually, they converge to a local optimum solution.
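The iterative, greedy update described above can be sketched on a toy criterion function. This is an illustrative stand-in, not the actual CPT likelihood: the `gradient_descent` helper and the quadratic criterion below are assumptions chosen only to show the steepest-descent loop.

```python
# Toy illustration of gradient descent (steepest descent): iteratively step
# against the gradient of a criterion function, without backtracking, until
# the weights settle at a (local) optimum. The quadratic criterion here is a
# stand-in for the belief network's real criterion function.

def gradient_descent(grad, w, learning_rate=0.1, steps=100):
    """Greedy hill-climbing: each step moves toward the locally best solution."""
    for _ in range(steps):
        w = w - learning_rate * grad(w)   # step along the negative gradient
    return w

# Minimize f(w) = (w - 3)^2, whose gradient is 2(w - 3); the optimum is w = 3.
w_final = gradient_descent(lambda w: 2 * (w - 3), w=0.0)
```

The learning rate plays the same role as in the belief network setting: too large a step overshoots, too small a step converges slowly.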

## 2 Classification by Backpropagation

“What is backpropagation?” Backpropagation is a neural network learning algorithm.

A Multilayer Feed-Forward Neural Network
The backpropagation algorithm performs learning on a multilayer feed-forward neural
network. It iteratively learns a set of weights for prediction of the class label of tuples.
A multilayer feed-forward neural network consists of an input layer, one or more hidden
layers, and an output layer.

The units in the input layer are called input units. The units in the hidden layers and
output layer are sometimes referred to as neurodes, due to their symbolic biological
basis, or as output units.

Defining a Network Topology
“How can I design the neural network’s topology?” Before training can begin, the user
must decide on the network topology by specifying the number of units in the input
layer, the number of hidden layers (if more than one), the number of units in each
hidden layer, and the number of units in the output layer.
Normalizing the input values for each attribute measured in the training tuples will
help speed up the learning phase. Typically, input values are normalized so as to fall
between 0.0 and 1.0. Discrete-valued attributes may be encoded such that there is one
input unit per domain value. For example, if an attribute A has three possible or known
values, namely {a0, a1, a2}, then we may assign three input units to represent A. That
is, we may have, say, I0, I1, I2 as input units. Each unit is initialized to 0. If A = a0, then
I0 is set to 1 and the rest are 0. If A = a1, then I1 is set to 1 and the rest are 0, and
so on.
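The one-input-unit-per-value encoding just described can be sketched as follows; the `one_hot` helper name and the list representation are illustrative assumptions.

```python
# Encode a discrete attribute A with known values {a0, a1, a2} as three
# input units, exactly as described above: all units start at 0, and the
# unit matching the observed value is set to 1.

def one_hot(value, domain):
    units = [0] * len(domain)          # each input unit initialized to 0
    units[domain.index(value)] = 1     # set the matching unit to 1
    return units

domain_A = ["a0", "a1", "a2"]
encoded = one_hot("a1", domain_A)      # I0=0, I1=1, I2=0
```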
Neural networks can be used for both classification (to predict the class label of a
given tuple) and numeric prediction (to predict a continuous-valued output). For classification,
one output unit may be used to represent two classes (where the value 1
represents one class, and the value 0 represents the other).

Backpropagation
“How does backpropagation work?” Backpropagation learns by iteratively processing a
data set of training tuples, comparing the network’s prediction for each tuple with the
actual known target value. The target value may be the known class label of the training
tuple (for classification problems) or a continuous value (for numeric prediction). For
each training tuple, the weights are modified so as to minimize the mean-squared error
between the network’s prediction and the actual target value. These modifications are
made in the “backwards” direction (i.e., from the output layer) through each hidden
layer down to the first hidden layer (hence the name backpropagation). Although it is
not guaranteed, in general the weights will eventually converge, and the learning process
stops.
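The loop above can be sketched with a toy network. This is an illustrative miniature, not the textbook's full pseudocode: the bias handling, learning rate, network size, and the OR training task are all assumptions made for demonstration.

```python
import math, random

# Minimal backpropagation sketch: one hidden layer of sigmoid units, weights
# updated per training tuple in the "backwards" direction (output layer
# first, then the hidden layer) to reduce the squared prediction error.

random.seed(1)

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# Tiny network: 2 inputs (+ a constant bias input) -> 2 hidden units -> 1 output.
n_in, n_hid = 3, 3          # last slot in each layer is the bias input of 1
w_h = [[random.uniform(-1, 1) for _ in range(n_in)] for _ in range(2)]
w_o = [random.uniform(-1, 1) for _ in range(n_hid)]
lr = 0.5
data = [([0, 0], 0), ([0, 1], 1), ([1, 0], 1), ([1, 1], 1)]  # learn OR

def forward(x):
    xb = x + [1]                                              # append bias input
    h = [sigmoid(sum(w_h[j][i] * xb[i] for i in range(n_in))) for j in range(2)]
    hb = h + [1]                                              # bias for output layer
    o = sigmoid(sum(w_o[j] * hb[j] for j in range(n_hid)))
    return hb, o

def train_epoch():
    for x, t in data:
        xb = x + [1]
        hb, o = forward(x)
        err_o = o * (1 - o) * (t - o)                         # output-layer error
        err_h = [hb[j] * (1 - hb[j]) * err_o * w_o[j]         # error propagated back
                 for j in range(2)]
        for j in range(n_hid):
            w_o[j] += lr * err_o * hb[j]                      # update output weights
        for j in range(2):
            for i in range(n_in):
                w_h[j][i] += lr * err_h[j] * xb[i]            # then hidden weights

mse_before = sum((forward(x)[1] - t) ** 2 for x, t in data) / len(data)
for _ in range(5000):
    train_epoch()
mse_after = sum((forward(x)[1] - t) ** 2 for x, t in data) / len(data)
```

After training, the mean-squared error has dropped and the rounded outputs match the OR targets, illustrating the convergence behavior described in the text.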

Inside the Black Box: Backpropagation and Interpretability

Methods include extracting rules from networks and sensitivity
analysis.
Various algorithms for rule extraction have been proposed. The methods typically
impose restrictions regarding procedures used in training the given neural network, the
network topology, and the discretization of input values.
Fully connected networks are difficult to articulate. Hence, often the first step in
extracting rules from neural networks is network pruning. This consists of simplifying
the network structure by removing weighted links that have the least effect on the trained
network. For example, a weighted link may be deleted if such removal does not result in
a decrease in the classification accuracy of the network.
Once the trained network has been pruned, some approaches will then perform link,
unit, or activation value clustering. In one method, for example, clustering is used to
find the set of common activation values for each hidden unit in a given trained
two-layer neural network.

## 3 Support Vector Machines

A support vector machine (SVM) is an algorithm that
works as follows. It uses a nonlinear mapping to transform the original training data
into a higher dimension. Within this new dimension, it searches for the linear optimal
separating hyperplane (i.e., a “decision boundary” separating the tuples of one class
from another). With an appropriate nonlinear mapping to a sufficiently high dimension,
data from two classes can always be separated by a hyperplane. The SVM finds this
hyperplane using support vectors (“essential” training tuples) and margins (defined by
the support vectors). We will delve more into these new concepts later.
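The key idea, that a nonlinear mapping can make inseparable data linearly separable in a higher dimension, can be sketched on a toy example. The map x → (x, x²) and the perceptron used below are assumptions for illustration: a real SVM would instead solve for the maximum-margin hyperplane.

```python
# 1-D points labeled +1 on the outside and -1 in the middle cannot be split
# by any single threshold on x. After the nonlinear map x -> (x, x^2), a
# simple perceptron (a stand-in for the SVM optimizer) finds a separating
# line in the higher-dimensional space.

def phi(x):
    return [x, x * x, 1.0]   # nonlinear map to a higher dimension (+ bias term)

data = [(-3.0, 1), (-2.5, 1), (2.5, 1), (3.0, 1),   # outer points: class +1
        (-0.5, -1), (0.0, -1), (0.5, -1)]           # inner points: class -1

w = [0.0, 0.0, 0.0]
for _ in range(200):                                # perceptron updates in mapped space
    for x, y in data:
        z = phi(x)
        if y * sum(wi * zi for wi, zi in zip(w, z)) <= 0:
            w = [wi + y * zi for wi, zi in zip(w, z)]

separated = all(y * sum(wi * zi for wi, zi in zip(w, phi(x))) > 0
                for x, y in data)
```

In the mapped space the classes differ cleanly in the x² coordinate, so a linear decision boundary exists there even though none exists in the original space.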

The first
paper on support vector machines was presented in 1992 by Vladimir Vapnik and colleagues
Bernhard Boser and Isabelle Guyon, although the groundwork for SVMs has
been around since the 1960s (including early work by Vapnik and Alexei Chervonenkis
on statistical learning theory). Although the training time of even the fastest SVMs
can be extremely slow, they are highly accurate, owing to their ability to model complex
nonlinear decision boundaries. They are much less prone to overfitting than other
methods. The support vectors found also provide a compact description of the learned
model. SVMs can be used for numeric prediction as well as classification. They have
been applied to a number of areas, including handwritten digit recognition, object
recognition, and speaker identification, as well as benchmark time-series prediction
tests.

## 4 Classification Using Frequent Patterns

Associative Classification

Association
rules are mined in a two-step process consisting of frequent itemset mining followed by
rule generation. The first step searches for patterns of attribute–value pairs that occur
repeatedly in a data set, where each attribute–value pair is considered an item. The
resulting attribute–value pairs form frequent itemsets (also referred to as frequent patterns).
The second step analyzes the frequent itemsets to generate association rules. All
association rules must satisfy certain criteria regarding their “accuracy” (or confidence)
and the proportion of the data set that they actually represent (referred to as support).

In general, associative classification consists of the following steps:
1. Mine the data for frequent itemsets, that is, find commonly occurring attribute–value
pairs in the data.
2. Analyze the frequent itemsets to generate association rules per class, which satisfy
confidence and support criteria.
3. Organize the rules to form a rule-based classifier.
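The three steps above can be sketched on a toy data set. The attribute-value items, thresholds, and the brute-force itemset enumeration below are assumptions for illustration; a real system would use an efficient frequent pattern mining algorithm.

```python
from itertools import combinations

# Associative classification sketch: (1) enumerate candidate itemsets and
# keep those meeting minimum support, (2) form rules per class that meet
# minimum confidence, (3) organize the rules by confidence.

tuples = [
    ({"age=youth", "income=high"}, "buys=yes"),
    ({"age=youth", "income=high"}, "buys=yes"),
    ({"age=youth", "income=low"},  "buys=no"),
    ({"age=senior", "income=high"}, "buys=yes"),
    ({"age=senior", "income=low"},  "buys=no"),
]

def mine_rules(tuples, min_sup=0.2, min_conf=0.8):
    n = len(tuples)
    items = sorted({i for t, _ in tuples for i in t})
    rules = []
    for size in (1, 2):                            # candidate itemsets (step 1)
        for itemset in combinations(items, size):
            covered = [(t, c) for t, c in tuples if set(itemset) <= t]
            if len(covered) / n < min_sup:
                continue                           # fails minimum support
            for cls in {c for _, c in covered}:    # rules per class (step 2)
                conf = sum(c == cls for _, c in covered) / len(covered)
                if conf >= min_conf:
                    rules.append((itemset, cls, conf))
    return sorted(rules, key=lambda r: -r[2])      # organize the rules (step 3)

rules = mine_rules(tuples)
```

On this toy data, for example, the itemset {income=high} covers three tuples (support 0.6), all of class buys=yes, so the rule "IF income=high THEN buys=yes" is kept with confidence 1.0.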

Discriminative Frequent Pattern–Based Classification

The general framework for discriminative frequent pattern–based classification is as follows.
1. Feature generation: The data, D, are partitioned according to class label. Use frequent
itemset mining to discover frequent patterns in each partition, satisfying
minimum support. The collection of frequent patterns, F, makes up the feature
candidates.
2. Feature selection: Apply feature selection to F, resulting in FS, the set of selected
(more discriminating) frequent patterns. Information gain, Fisher score, or other
evaluation measures can be used for this step. Relevancy checking can also be
incorporated into this step to weed out redundant patterns. The data set D is transformed
to D′, where the feature space now includes the single features as well as the
selected frequent patterns, FS.
3. Learning of classification model: A classifier is built on the data set D′. Any
learning algorithm can be used as the classification model.

## 5 Lazy Learners (or Learning from Your Neighbors)

k-Nearest-Neighbor Classifiers
The k-nearest-neighbor method was first described in the early 1950s. The method is
labor intensive when given large training sets, and did not gain popularity until the
1960s when increased computing power became available. It has since been widely used
in the area of pattern recognition.
Nearest-neighbor classifiers are based on learning by analogy, that is, by comparing
a given test tuple with training tuples that are similar to it. The training tuples are
described by n attributes. Each tuple represents a point in an n-dimensional space. In
this way, all the training tuples are stored in an n-dimensional pattern space. When given
an unknown tuple, a k-nearest-neighbor classifier searches the pattern space for the k
training tuples that are closest to the unknown tuple. These k training tuples are the k
“nearest neighbors” of the unknown tuple.
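The nearest-neighbor search just described can be sketched directly; the Euclidean distance, k = 3, and majority vote below are conventional choices, assumed here for illustration.

```python
import math
from collections import Counter

# k-nearest-neighbor sketch: training tuples are stored as points in
# n-dimensional space; an unknown tuple is assigned the majority class of
# the k training tuples closest to it.

def knn_classify(train, x, k=3):
    nearest = sorted(train, key=lambda p: math.dist(p[0], x))[:k]  # k closest points
    labels = [label for _, label in nearest]                       # their class labels
    return Counter(labels).most_common(1)[0][0]                    # majority vote

train = [((1.0, 1.0), "A"), ((1.2, 0.8), "A"), ((0.9, 1.1), "A"),
         ((5.0, 5.0), "B"), ((5.2, 4.9), "B"), ((4.8, 5.1), "B")]
label = knn_classify(train, (1.1, 1.0))   # its nearest neighbors are all class "A"
```

Note that classification cost is paid at query time, which is why the method is called lazy and is labor intensive for large training sets.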

Case-Based Reasoning
Case-based reasoning (CBR) classifiers use a database of problem solutions to solve
new problems. Unlike nearest-neighbor classifiers, which store training tuples as points
in Euclidean space, CBR stores the tuples or “cases” for problem solving as complex
symbolic descriptions. Business applications of CBR include problem resolution for
customer service help desks, where cases describe product-related diagnostic problems.
CBR has also been applied to areas such as engineering and law, where cases are either
technical designs or legal rulings, respectively. Medical education is another area for
CBR, where patient case histories and treatments are used to help diagnose and treat
new patients.

## 6 Other Classification Methods

Genetic Algorithms
Genetic algorithms attempt to incorporate ideas of natural evolution. In general,
genetic learning starts as follows. An initial population is created consisting of randomly
generated rules. Each rule can be represented by a string of bits. As a simple example,
suppose that samples in a given training set are described by two Boolean attributes,
A1 and A2, and that there are two classes, C1 and C2. The rule “IF A1 AND NOT A2
THEN C2” can be encoded as the bit string “100,” where the two leftmost bits represent
attributes A1 and A2, respectively, and the rightmost bit represents the class. Similarly,
the rule “IF NOT A1 AND NOT A2 THEN C1” can be encoded as “001.” If an attribute
has k values, where k > 2, then k bits may be used to encode the attribute’s values.
Classes can be encoded in a similar fashion.
Based on the notion of survival of the fittest, a new population is formed to consist
of the fittest rules in the current population, as well as offspring of these rules. Typically,
the fitness of a rule is assessed by its classification accuracy on a set of training samples.
Offspring are created by applying genetic operators such as crossover and mutation.
In crossover, substrings from pairs of rules are swapped to form new pairs of rules. In
mutation, randomly selected bits in a rule’s string are inverted.
The process of generating new populations based on prior populations of rules continues
until a population, P, evolves where each rule in P satisfies a prespecified fitness
threshold.
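The generate-and-select loop above can be sketched end to end. The rule semantics below (a non-matching tuple is predicted as the other class), the population size, and the mutation rate are assumptions made so the toy example is self-contained; the encoding follows the bit-string scheme in the text.

```python
import random

# Genetic-algorithm sketch: a population of bit-string rules evolves by
# fitness-based survival, crossover (substring swap), and mutation (bit
# inversion). A rule [b1, b2, c] reads "IF A1 = b1 AND A2 = b2 THEN class c";
# fitness is classification accuracy on the training samples.

random.seed(4)
samples = [([a1, a2], 1 if (a1, a2) == (1, 0) else 0)
           for a1 in (0, 1) for a2 in (0, 1)]   # target: IF A1 AND NOT A2 THEN 1

def fitness(rule):
    correct = 0
    for x, c in samples:
        pred = rule[2] if x == rule[:2] else 1 - rule[2]
        correct += (pred == c)
    return correct / len(samples)

def crossover(r1, r2):
    cut = random.randint(1, 2)                   # swap substrings of a pair of rules
    return r1[:cut] + r2[cut:], r2[:cut] + r1[cut:]

def mutate(rule, p=0.1):
    return [1 - b if random.random() < p else b for b in rule]  # invert random bits

pop = [[random.randint(0, 1) for _ in range(3)] for _ in range(8)]
for _ in range(60):                              # generate new populations
    pop.sort(key=fitness, reverse=True)
    survivors = pop[:4]                          # the fittest rules survive
    children = []
    while len(children) < 4:
        c1, c2 = crossover(*random.sample(survivors, 2))
        children += [mutate(c1), mutate(c2)]
    pop = survivors + children[:4]

best = max(pop, key=fitness)
```

Because the fittest rules are carried over unchanged each generation, the best fitness never decreases, and the loop stops once the population satisfies the fitness threshold (here, after a fixed number of generations).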

Genetic algorithms are easily parallelizable and have been used for classification as
well as other optimization problems. In data mining, they may be used to evaluate the
fitness of other algorithms.

Rough Set Approach
Rough set theory can be used for classification to discover structural relationships within
imprecise or noisy data. It applies to discrete-valued attributes; continuous-valued
attributes must therefore be discretized before the approach can be used.

Rough set theory is based on the establishment of equivalence classes within the
given training data. All the data tuples forming an equivalence class are indiscernible,
that is, the samples are identical with respect to the attributes describing the data. Given
real-world data, it is common that some classes cannot be distinguished in terms of the
available attributes. Rough sets can be used to approximately or “roughly” define such
classes. A rough set definition for a given class, C, is approximated by two sets—a lower
approximation of C and an upper approximation of C. The lower approximation of C
consists of all the data tuples that, based on the knowledge of the attributes, are certain to
belong to C without ambiguity. The upper approximation of C consists of all the tuples
that, based on the knowledge of the attributes, cannot be described as not belonging to
C.
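The two approximations can be computed directly from the equivalence classes; the toy tuples below are assumptions, chosen so that two indiscernible tuples carry different labels and class "yes" is only roughly definable.

```python
from collections import defaultdict

# Rough set sketch: tuples sharing the same attribute values form an
# equivalence class. The lower approximation of class C keeps equivalence
# classes lying entirely inside C; the upper approximation keeps every
# equivalence class that overlaps C at all.

tuples = [(("a", 1), "yes"), (("a", 1), "yes"),
          (("b", 2), "yes"), (("b", 2), "no"),   # indiscernible, mixed labels
          (("c", 3), "no")]

groups = defaultdict(list)                        # equivalence classes by attributes
for attrs, label in tuples:
    groups[attrs].append(label)

def approximations(cls):
    lower = {a for a, labels in groups.items() if all(l == cls for l in labels)}
    upper = {a for a, labels in groups.items() if any(l == cls for l in labels)}
    return lower, upper

lower_yes, upper_yes = approximations("yes")
```

The gap between the two sets (here, the equivalence class ("b", 2)) is exactly the boundary region where membership in the class cannot be decided from the available attributes.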

Rough sets can also be used for attribute subset selection (or feature reduction, where
attributes that do not contribute to the classification of the given training data can be
identified and removed) and relevance analysis (where the contribution or significance
of each attribute is assessed with respect to the classification task). The problem of finding
the minimal subsets (reducts) of attributes that can describe all the concepts in
the given data set is NP-hard. However, algorithms to reduce the computation intensity
have been proposed. In one method, for example, a discernibility matrix is used that
stores the differences between attribute values for each pair of data tuples. Rather than
searching on the entire training set, the matrix is instead searched to detect redundant
attributes.

Fuzzy Set Approaches

Fuzzy set theory is also known as possibility theory. It was proposed by Lotfi Zadeh
in 1965 as an alternative to traditional two-value logic and probability theory. It lets
us work at a high abstraction level and offers a means for dealing with imprecise data
measurement. Most important, fuzzy set theory allows us to deal with vague or inexact
facts. For example, being a member of a set of high incomes is inexact (e.g., if \$50,000
is high, then what about \$49,000? or \$48,000?) Unlike the notion of traditional “crisp”
sets where an element belongs to either a set S or its complement, in fuzzy set theory,
elements can belong to more than one fuzzy set. For example, the income value \$49,000
belongs to both the medium and high fuzzy sets, but to differing degrees. Using fuzzy set
notation, this membership can be written as m_medium_income(\$49,000) = 0.15 and
m_high_income(\$49,000) = 0.96, where m denotes the membership function operating on the
fuzzy sets of medium_income and high_income, respectively. In fuzzy set theory, membership values
for a given element, x (e.g., for \$49,000), do not have to sum to 1. This is unlike
traditional probability theory, which is constrained by a summation axiom.
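Membership functions like those above can be sketched as simple piecewise-linear shapes. The trapezoidal form and all breakpoint values below are illustrative assumptions, not the values from the text.

```python
# Fuzzy membership sketch: an income can belong to both the medium and high
# fuzzy sets to differing degrees, and the degrees need not sum to 1.

def trapezoid(x, a, b, c, d):
    """Trapezoidal membership: rises over [a, b], full over [b, c], falls over [c, d]."""
    if x <= a or x >= d:
        return 0.0
    if b <= x <= c:
        return 1.0
    return (x - a) / (b - a) if x < b else (d - x) / (d - c)

# Assumed, illustrative fuzzy sets for income (breakpoints are made up):
def m_medium(x):
    return trapezoid(x, 20_000, 30_000, 45_000, 55_000)

def m_high(x):
    return trapezoid(x, 40_000, 60_000, 200_000, 250_000)

deg_medium = m_medium(49_000)   # partial membership in "medium" ...
deg_high = m_high(49_000)       # ... and, simultaneously, in "high"
```

With these assumed shapes, \$49,000 has nonzero membership in both sets, and the two degrees do not sum to 1, in contrast with a probability distribution.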

Given a tuple to classify, more than one fuzzy rule may apply. Each applicable rule
contributes a vote for membership in the categories. Typically, the truth values for each
predicted category are summed, and these sums are combined. Several procedures exist
for translating the resulting fuzzy output into a defuzzified or crisp value that is returned
by the system.
Fuzzy logic systems have been used in numerous areas for classification, including
market research, finance, health care, and environmental engineering.

## 7 Additional Topics Regarding Classification

Multiclass Classification
Some classification algorithms, such as support vector machines, are designed for binary
classification. How can we extend these algorithms to allow for multiclass classification
(i.e., classification involving more than two classes)?
A simple approach is one-versus-all (OVA). Given m classes, we train m binary classifiers,
one for each class. Classifier j is trained using tuples of class j as the positive class,
and the remaining tuples as the negative class. It learns to return a positive value for class
j and a negative value for the rest. To classify an unknown tuple, X, the set of classifiers
vote as an ensemble. For example, if classifier j predicts the positive class for X, then
class j gets one vote. If it predicts the negative class for X, then each of the classes except
j gets one vote. The class with the most votes is assigned to X.
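The voting scheme just described can be sketched with the binary classifiers stubbed out as simple threshold functions; those stubs, and the 1-D feature they score, are assumptions for illustration.

```python
from collections import Counter

# One-versus-all (OVA) sketch: classifier j returns a positive value for
# "its" class and a negative value otherwise; votes are tallied as in the
# text and the class with the most votes wins.

def ova_classify(classifiers, x):
    votes = Counter()
    for j, clf in enumerate(classifiers):
        if clf(x) > 0:
            votes[j] += 1                        # one vote for class j
        else:
            for k in range(len(classifiers)):    # every class except j gets a vote
                if k != j:
                    votes[k] += 1
    return votes.most_common(1)[0][0]

# Three stub binary classifiers over a 1-D feature (assumed for the demo):
# class 0 for x < 1, class 1 for 1 <= x < 2, class 2 for x >= 2.
classifiers = [lambda x: 1 if x < 1 else -1,
               lambda x: 1 if 1 <= x < 2 else -1,
               lambda x: 1 if x >= 2 else -1]
predicted = ova_classify(classifiers, 1.5)
```

For x = 1.5, classifier 1 votes for class 1 while classifiers 0 and 2 each vote against themselves, so class 1 collects the most votes.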

Semi-Supervised Classification
Semi-supervised classification uses labeled data and unlabeled data to build a classifier.
Let Xl = {(x1, y1), ..., (xl, yl)} be the set of labeled data and Xu = {xl+1, ..., xn} be the set
of unlabeled data. Here we describe a few examples of this approach for learning.
Self-training is the simplest form of semi-supervised classification. It first builds a
classifier using the labeled data. The classifier then tries to label the unlabeled data. The
tuple with the most confident label prediction is added to the set of labeled data, and the
process repeats (Figure 9.17). Although the method is easy to understand, a disadvantage
is that it may reinforce errors.
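The self-training loop can be sketched with a simple stand-in base learner. The nearest-centroid classifier, the 1-D data, and the use of negative distance as a confidence score are all assumptions made for illustration.

```python
import math

# Self-training sketch: train on the labeled data, label the single most
# confident unlabeled tuple, move it into the labeled set, and repeat.

labeled = [((1.0,), "A"), ((1.5,), "A"), ((8.0,), "B"), ((8.5,), "B")]
unlabeled = [(2.0,), (7.5,), (5.2,), (4.8,)]

def predict_with_confidence(labeled, x):
    centroids = {}
    for cls in {c for _, c in labeled}:            # nearest-centroid base classifier
        pts = [p for p, c in labeled if c == cls]
        centroids[cls] = tuple(sum(v) / len(pts) for v in zip(*pts))
    dists = {cls: math.dist(x, ctr) for cls, ctr in centroids.items()}
    cls = min(dists, key=dists.get)
    return cls, -dists[cls]                        # closer centroid = more confident

while unlabeled:
    # pick the unlabeled tuple whose predicted label is most confident
    best = max(unlabeled, key=lambda x: predict_with_confidence(labeled, x)[1])
    cls, _ = predict_with_confidence(labeled, best)
    labeled.append((best, cls))                    # add it to the labeled set
    unlabeled.remove(best)                         # and repeat on the rest
```

The sketch also exposes the weakness noted above: once a tuple is mislabeled, it shifts the centroids and can reinforce the error in later iterations.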
Cotraining is another form of semi-supervised classification, where two or more
classifiers teach each other. Each learner uses a different and ideally independent set
of features for each tuple. Consider web page data, for example, where attributes relating
to the images on the page may be used as one set of features, while attributes relating
to the corresponding text constitute another set of features for the same data.

Active Learning
Active learning is an iterative type of supervised learning that is suitable for situations
where data are abundant, yet the class labels are scarce or expensive to obtain. The learning
algorithm is active in that it can purposefully query a user (e.g., a human oracle) for
labels. The number of tuples used to learn a concept this way is often much smaller than
the number required in typical supervised learning.
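One common way the learner chooses what to ask is uncertainty sampling; that strategy, and the stand-in logistic scoring model below, are assumptions for illustration since the text does not fix a query strategy.

```python
import math

# Active-learning sketch (uncertainty sampling): from a pool of unlabeled
# tuples, query the oracle for the label of the tuple the current model is
# least certain about.

def probability_positive(x, threshold=0.5, scale=5.0):
    # stand-in model (assumed): a logistic score around a decision threshold
    return 1.0 / (1.0 + math.exp(-scale * (x - threshold)))

pool = [0.1, 0.45, 0.9, 0.52, 0.8]

def most_uncertain(pool):
    # uncertainty is highest where P(positive) is closest to 0.5
    return min(pool, key=lambda x: abs(probability_positive(x) - 0.5))

query = most_uncertain(pool)   # the tuple the learner asks the oracle to label
```

By spending each label request on the most informative tuple, the learner can reach a given accuracy with far fewer labels than passive supervised learning.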

Transfer Learning
Transfer learning aims to extract the knowledge from one or more source tasks and
apply the knowledge to a target task.

Transfer learning allows the distributions, tasks, and even the data domains used in
training and testing to be different. Transfer learning is analogous to the way humans
may apply their knowledge of a task to facilitate the learning of another task. For example,
if we know how to play the recorder, we may apply our knowledge of note reading
and music to simplify the task of learning to play the piano. Similarly, knowing Spanish
may make it easier to learn Italian.

Next Chapter

Data Mining: Recent Trends and Research Frontiers