## 1 Bayesian Belief Networks

Concepts and Mechanisms

The na¨ıve Bayesian classifier makes the assumption of class conditional independence,

that is, given the class label of a tuple, the values of the attributes are assumed to

be conditionally independent of one another. This simplifies computation. When the

assumption holds true, then the naïve Bayesian classifier is the most accurate in comparison

with all other classifiers. In practice, however, dependencies can exist between

variables. Bayesian belief networks specify joint conditional probability distributions.

They allow class conditional independencies to be defined between subsets of variables.

They provide a graphical model of causal relationships, on which learning can be performed.

Trained Bayesian belief networks can be used for classification. Bayesian belief

networks are also known as belief networks, Bayesian networks, and probabilistic

networks.

” In the learning or training of a belief network,

a number of scenarios are possible. The network topology (or “layout” of nodes

and arcs) may be constructed by human experts or inferred from the data. The network

variables may be observable or hidden in all or some of the training tuples. The hidden

data case is also referred to as missing values or incomplete data.

Several algorithms exist for learning the network topology from the training data

given observable variables. The problem is one of discrete optimization.

If the network topology is known and the variables are observable, then training the

network is straightforward. It consists of computing the CPT entries, as is similarly done

when computing the probabilities involved in na¨ıve Bayesian classification.

When the network topology is given and some of the variables are hidden, there

are various methods to choose from for training the belief network.

A gradient descent strategy is used to search for the wijk values that best model the

data, based on the assumption that each possible setting of wijk is equally likely. Such

a strategy is iterative. It searches for a solution along the negative of the gradient (i.e.,

steepest descent) of a criterion function. We want to find the set of weights, W, that

maximize this function. To start with, the weights are initialized to random probability

values. The gradient descent method performs greedy hill-climbing in that, at each

iteration or step along the way, the algorithm moves toward what appears to be the

best solution at the moment, without backtracking. The weights are updated at each

iteration. Eventually, they converge to a local optimum solution.

## 2 Classification by Backpropagation

“What is backpropagation?” Backpropagation is a neural network learning algorithm.A Multilayer Feed-Forward Neural Network

The backpropagation algorithm performs learning on a multilayer feed-forward neural

network. It iteratively learns a set of weights for prediction of the class label of tuples.

A multilayer feed-forward neural network consists of an input layer, one or more hidden

layers, and an output layer.

The units in the input layer are called input units. The units in the hidden layers and

output layer are sometimes referred to as neurodes, due to their symbolic biological

basis, or as output units.

Defining a Network Topology

“How can I design the neural network’s topology?” Before training can begin, the user

must decide on the network topology by specifying the number of units in the input

layer, the number of hidden layers (if more than one), the number of units in each

hidden layer, and the number of units in the output layer.

Normalizing the input values for each attribute measured in the training tuples will

help speed up the learning phase. Typically, input values are normalized so as to fall

between 0.0 and 1.0. Discrete-valued attributes may be encoded such that there is one

input unit per domain value. For example, if an attribute A has three possible or known

values, namely {a0, a1, a2}, then we may assign three input units to represent A. That

is, we may have, say, I0, I1, I2 as input units. Each unit is initialized to 0. If A = a0, then

I0 is set to 1 and the rest are 0. If A = a1, then I1 is set to 1 and the rest are 0, and

so on.

Neural networks can be used for both classification (to predict the class label of a

given tuple) and numeric prediction (to predict a continuous-valued output). For classification,

one output unit may be used to represent two classes (where the value 1

represents one class, and the value 0 represents the other).

Backpropagation

“How does backpropagation work?” Backpropagation learns by iteratively processing a

data set of training tuples, comparing the network’s prediction for each tuple with the

actual known target value. The target value may be the known class label of the training

tuple (for classification problems) or a continuous value (for numeric prediction). For

each training tuple, the weights are modified so as to minimize the mean-squared error

between the network’s prediction and the actual target value. These modifications are

made in the “backwards” direction (i.e., from the output layer) through each hidden

layer down to the first hidden layer (hence the name backpropagation). Although it is

not guaranteed, in general the weights will eventually converge, and the learning process

stops

Inside the Black Box: Backpropagation and Interpretability

Methods include extracting rules from networks and sensitivity

analysis.

Various algorithms for rule extraction have been proposed. The methods typically

impose restrictions regarding procedures used in training the given neural network, the

network topology, and the discretization of input values.

Fully connected networks are difficult to articulate. Hence, often the first step in

extracting rules from neural networks is network pruning. This consists of simplifying

the network structure by removing weighted links that have the least effect on the trained

network. For example, a weighted link may be deleted if such removal does not result in

a decrease in the classification accuracy of the network.

Once the trained network has been pruned, some approaches will then perform link,

unit, or activation value clustering. In one method, for example, clustering is used to

find the set of common activation values for each hidden unit in a given trained twolayer

neural network.

## 3 Support Vector Machines

SVM is an algorithm that

works as follows. It uses a nonlinear mapping to transform the original training data

into a higher dimension. Within this new dimension, it searches for the linear optimal

separating hyperplane (i.e., a “decision boundary” separating the tuples of one class

from another). With an appropriate nonlinear mapping to a sufficiently high dimension,

data from two classes can always be separated by a hyperplane. The SVM finds this

hyperplane using support vectors (“essential” training tuples) and margins (defined by

the support vectors). We will delve more into these new concepts later.

The first

paper on support vector machines was presented in 1992 by Vladimir Vapnik and colleagues

Bernhard Boser and Isabelle Guyon, although the groundwork for SVMs has

been around since the 1960s (including early work by Vapnik and Alexei Chervonenkis

on statistical learning theory). Although the training time of even the fastest SVMs

can be extremely slow, they are highly accurate, owing to their ability to model complex

nonlinear decision boundaries. They are much less prone to overfitting than other

methods. The support vectors found also provide a compact description of the learned

model. SVMs can be used for numeric prediction as well as classification. They have

been applied to a number of areas, including handwritten digit recognition, object

recognition, and speaker identification, as well as benchmark time-series prediction

tests.

## 4 Classification Using Frequent Patterns

Associative Classification

Association

rules are mined in a two-step process consisting of frequent itemset mining followed by

rule generation. The first step searches for patterns of attribute–value pairs that occur

repeatedly in a data set, where each attribute–value pair is considered an item. The

resulting attribute–value pairs form frequent itemsets (also referred to as frequent patterns).

The second step analyzes the frequent itemsets to generate association rules. All

association rules must satisfy certain criteria regarding their “accuracy” (or confidence)

and the proportion of the data set that they actually represent (referred to as support).

In general, associative classification consists of the following steps:

1. Mine the data for frequent itemsets, that is, find commonly occurring attribute–value

pairs in the data.

2. Analyze the frequent itemsets to generate association rules per class, which satisfy

confidence and support criteria.

3. Organize the rules to form a rule-based classifier.

Discriminative Frequent Pattern–Based Classification

The general framework for discriminative frequent pattern–based classification is as follows.

1. Feature generation: The data, D, are partitioned according to class label. Use frequent

itemset mining to discover frequent patterns in each partition, satisfying

minimum support. The collection of frequent patterns, F, makes up the feature

candidates.

2. Feature selection: Apply feature selection to F, resulting in FS, the set of selected

(more discriminating) frequent patterns. Information gain, Fisher score, or other

evaluation measures can be used for this step. Relevancy checking can also be

incorporated into this step to weed out redundant patterns. The data set D is transformed

to D0, where the feature space now includes the single features as well as the

selected frequent patterns, FS.

3. Learning of classification model: A classifier is built on the data set D

0. Any learning algorithm can be used as the classification model.

## 5 Lazy Learners (or Learning from Your Neighbors)

k-Nearest-Neighbor Classifiers

The k-nearest-neighbor method was first described in the early 1950s. The method is

labor intensive when given large training sets, and did not gain popularity until the

1960s when increased computing power became available. It has since been widely used

in the area of pattern recognition.

Nearest-neighbor classifiers are based on learning by analogy, that is, by comparing

a given test tuple with training tuples that are similar to it. The training tuples are

described by n attributes. Each tuple represents a point in an n-dimensional space. In

this way, all the training tuples are stored in an n-dimensional pattern space. When given

an unknown tuple, a k-nearest-neighbor classifier searches the pattern space for the k

training tuples that are closest to the unknown tuple. These k training tuples are the k

“nearest neighbors” of the unknown tuple.

Case-Based Reasoning

Case-based reasoning (CBR) classifiers use a database of problem solutions to solve

new problems. Unlike nearest-neighbor classifiers, which store training tuples as points

in Euclidean space, CBR stores the tuples or “cases” for problem solving as complex

symbolic descriptions. Business applications of CBR include problem resolution for

customer service help desks, where cases describe product-related diagnostic problems.

CBR has also been applied to areas such as engineering and law, where cases are either

technical designs or legal rulings, respectively. Medical education is another area for

CBR, where patient case histories and treatments are used to help diagnose and treat

new patients.

## 6 Other Classification Methods

Genetic Algorithms

Genetic algorithms attempt to incorporate ideas of natural evolution. In general,

genetic learning starts as follows. An initial population is created consisting of randomly

generated rules. Each rule can be represented by a string of bits. As a simple example,

suppose that samples in a given training set are described by two Boolean attributes,

A1 and A2, and that there are two classes, C1 and C2. The rule “IF A1 AND NOT A2

THEN C2” can be encoded as the bit string “100,” where the two leftmost bits represent

attributes A1 and A2, respectively, and the rightmost bit represents the class. Similarly,

the rule “IF NOT A1 AND NOT A2 THEN C1” can be encoded as “001.” If an attribute

has k values, where k > 2, then k bits may be used to encode the attribute’s values.

Classes can be encoded in a similar fashion.

Based on the notion of survival of the fittest, a new population is formed to consist

of the fittest rules in the current population, as well as offspring of these rules. Typically,

the fitness of a rule is assessed by its classification accuracy on a set of training samples.

Offspring are created by applying genetic operators such as crossover and mutation.

In crossover, substrings from pairs of rules are swapped to form new pairs of rules. In

mutation, randomly selected bits in a rule’s string are inverted.

The process of generating new populations based on prior populations of rules continues

until a population, P, evolves where each rule in P satisfies a prespecified fitness

threshold.

Genetic algorithms are easily parallelizable and have been used for classification as

well as other optimization problems. In data mining, they may be used to evaluate the

fitness of other algorithms.

Rough Set Approach

Rough set theory can be used for classification to discover structural relationships within

imprecise or noisy data. It applies to discrete-valued attributes. Continuous-valued

attributes must therefore be discretized before its use.

Rough set theory is based on the establishment of equivalence classes within the

given training data. All the data tuples forming an equivalence class are indiscernible,

that is, the samples are identical with respect to the attributes describing the data. Given

real-world data, it is common that some classes cannot be distinguished in terms of the

available attributes. Rough sets can be used to approximately or “roughly” define such

classes. A rough set definition for a given class, C, is approximated by two sets—a lower

approximation of C and an upper approximation of C. The lower approximation of C

consists of all the data tuples that, based on the knowledge of the attributes, are certain to

belong to C without ambiguity. The upper approximation of C consists of all the tuples

that, based on the knowledge of the attributes, cannot be described as not belonging to

C.

Rough sets can also be used for attribute subset selection (or feature reduction, where

attributes that do not contribute to the classification of the given training data can be

identified and removed) and relevance analysis (where the contribution or significance

of each attribute is assessed with respect to the classification task). The problem of finding

the minimal subsets (reducts) of attributes that can describe all the concepts in

the given data set is NP-hard. However, algorithms to reduce the computation intensity

have been proposed. In one method, for example, a discernibility matrix is used that

stores the differences between attribute values for each pair of data tuples. Rather than

searching on the entire training set, the matrix is instead searched to detect redundant

attributes.

Fuzzy Set Approaches

Fuzzy set theory is also known as possibility theory. It was proposed by Lotfi Zadeh

in 1965 as an alternative to traditional two-value logic and probability theory. It lets

us work at a high abstraction level and offers a means for dealing with imprecise data

measurement. Most important, fuzzy set theory allows us to deal with vague or inexact

facts. For example, being a member of a set of high incomes is inexact (e.g., if $50,000

is high, then what about $49,000? or $48,000?) Unlike the notion of traditional “crisp”

sets where an element belongs to either a set S or its complement, in fuzzy set theory,

elements can belong to more than one fuzzy set. For example, the income value $49,000

belongs to both the medium and high fuzzy sets, but to differing degrees. Using fuzzy set

notation, membership of a fuzzyset can be described like mmedium income($49,000) = 0.15 and mhigh income($49,000) = 0.96,

where m denotes the membership function, that is operating on the fuzzy sets of

medium income and high income, respectively. In fuzzy set theory, membership values

for a given element, x (e.g., for $49,000), do not have to sum to 1. This is unlike

traditional probability theory, which is constrained by a summation axiom.

Given a tuple to classify, more than one fuzzy rule may apply. Each applicable rule

contributes a vote for membership in the categories. Typically, the truth values for each

predicted category are summed, and these sums are combined. Several procedures exist

for translating the resulting fuzzy output into a defuzzified or crisp value that is returned

by the system.

Fuzzy logic systems have been used in numerous areas for classification, including

market research, finance, health care, and environmental engineering.

## 7 Additional Topics Regarding Classification

Multiclass Classification

Some classification algorithms, such as support vector machines, are designed for binary

classification. How can we extend these algorithms to allow for multiclass classification

(i.e., classification involving more than two classes)?

A simple approach is one-versus-all (OVA). Given m classes, we train m binary classifiers,

one for each class. Classifier j is trained using tuples of class j as the positive class,

and the remaining tuples as the negative class. It learns to return a positive value for class

j and a negative value for the rest. To classify an unknown tuple, X, the set of classifiers

vote as an ensemble. For example, if classifier j predicts the positive class for X, then

class j gets one vote. If it predicts the negative class for X, then each of the classes except

j gets one vote. The class with the most votes is assigned to X.

Semi-Supervised Classification

Semi-supervised classification uses labeled data and unlabeled data to build a classifier.

Let Xl = {(x1, y1),...,xl

, yl)} be the set of labeled data and Xu = {xl+1,...,xn} be the set

of unlabeled data. Here we describe a few examples of this approach for learning.

Self-training is the simplest form of semi-supervised classification. It first builds a

classifier using the labeled data. The classifier then tries to label the unlabeled data. The

tuple with the most confident label prediction is added to the set of labeled data, and the

process repeats (Figure 9.17). Although the method is easy to understand, a disadvantage

is that it may reinforce errors.

Cotraining is another form of semi-supervised classification, where two or more

classifiers teach each other. Each learner uses a different and ideally independent set

of features for each tuple. Consider web page data, for example, where attributes relating

to the images on the page may be used as one set of features, while attributes relating

to the corresponding text constitute another set of features for the same data.

Active Learning

Active learning is an iterative type of supervised learning that is suitable for situations

where data are abundant, yet the class labels are scarce or expensive to obtain. The learning

algorithm is active in that it can purposefully query a user (e.g., a human oracle) for

labels. The number of tuples used to learn a concept this way is often much smaller than

the number required in typical supervised learning.

Transfer Learning

Transfer learning aims to extract the knowledge from one or more source tasks and

apply the knowledge to a target task.

Transfer learning allows the distributions, tasks, and even the data domains used in

training and testing to be different. Transfer learning is analogous to the way humans

may apply their knowledge of a task to facilitate the learning of another task. For example,

if we know how to play the recorder, we may apply our knowledge of note reading

and music to simplify the task of learning to play the piano. Similarly, knowing Spanish

may make it easier to learn Italian.

Next Chapter

Data Mining Recent Trends and Research Frontiers

## No comments:

## Post a Comment