Wednesday, April 20, 2016

IoT related Education and Training Programmes - Video Lectures

Browse 100+ Books on Internet of Things

Prototyping Internet of Things Ideas and Networks - Book Excerpts


IoT Training in Bengaluru

Overview - Why IoT is so important

IoT - Business models involving  IoT Technology

Important Application Areas

Smart House and Smart City
Industrial Internet
Smart Cars
Home Healthcare

Business Rule Generation for IoT
3 layered architecture of Big Data — Physical (Sensors), Communication , and Data Intelligence

All about Sensors – Electronics

Basic function and architecture of a sensor — sensor body, sensor mechanism, sensor calibration, sensor maintenance, cost and pricing structure, legacy and modern sensor network — all the basics about the sensors
Development of sensor electronics — IoT vs legacy, and open source vs traditional PCB design style
Development of sensor communication protocols — history to modern days. Legacy protocols like
Modbus, relay, HART to modern day Zigbee, Zwave, X10,Bluetooth, ANT, etc.
Business driver for sensor deployment — FDA/EPA regulation, fraud/tempering detection, supervision, quality control and process management
Different Kind of Calibration Techniques — manual, automation, infield, primary and secondary calibration — and their implication in IoT
Powering options for sensors — battery, solar, Witricity, Mobile and PoE
Hands on training with single silicon and other sensors like temperature, pressure, vibration, magnetic field, power factor etc.
Fundamental of M2M communication — Sensor Network and Wireless protocol
What is a sensor network? What is ad-hoc network?
Wireless vs. Wireline network
WiFi- 802.11 families: N to S — application of standards and common vendors.
Zigbee and Zwave — advantage of low power mesh networking. Long distance Zigbee. Introduction to different Zigbee chips.
Bluetooth/BLE: Low power vs high power, speed of detection, class of BLE. Introduction of Bluetooth vendors & their review.
Creating network with Wireless protocols such as Piconet by BLE
Protocol stacks and packet structure for BLE and Zigbee
Other long distance RF communication link
LOS vs NLOS links
Capacity and throughput calculation
Application issues in wireless protocols — power consumption, reliability, PER, QoS, LOS
Hands on training with sensor network
1. PICO NET- BLE Base network
2. Zigbee network-master/slave communication
3. Data Hubs : MC and single computer ( like Beaglebone ) based datahub
Review of Electronics Platform, production and cost projection
PCB vs FPGA vs ASIC design-how to take decision
Prototyping electronics vs Production electronics
QA certificate for IoT- CE/CSA/UL/IEC/RoHS/IP65: What are those and when needed?
Basic introduction of multi-layer PCB design and its workflow
Electronics reliability-basic concept of FIT and early mortality rate
Environmental and reliability testing-basic concepts
Basic Open source platforms: Arduino, Raspberry Pi, Beaglebone, when needed?
RedBack, Diamond Back
Conceiving a new IoT product- Product requirement document for IoT
State of the present art and review of existing technology in the market place
Suggestion for new features and technologies based on market analysis and patent issues
Detailed technical specs for new products- System, software, hardware, mechanical, installation etc.
Packaging and documentation requirements
Servicing and customer support requirements
High level design (HLD) for understanding of product concept
Release plan for phase wise introduction of the new features
Skill set for the development team and proposed project plan -cost & duration
Target manufacturing price
Introduction to Mobile app platform for IoT
Protocol stack of Mobile app for IoT
Mobile to server integration –what are the factors to look out
What are the intelligent layer that can be introduced at Mobile app level ?
iBeacon in IoS
Window Azure
Linkafy Mobile platform for IoT
Machine learning for intelligent IoT
Introduction to Machine learning
Learning classification techniques
Bayesian Prediction-preparing training file
Support Vector Machine
Image and video analytic for IoT
Fraud and alert analytic through IoT
Bio –metric ID integration with IoT
Real Time Analytic/Stream Analytic
Scalability issues of IoT and machine learning
What are the architectural implementation of Machine learning for IoT
Analytic Engine for IoT
Insight analytic
Visualization analytic
Structured predictive analytic
Unstructured predictive analytic
Recommendation Engine
Pattern detection
Rule/Scenario discovery — failure, fraud, optimization
Root cause discovery
Security in IoT implementation
Why security is absolutely essential for IoT
Mechanism of security breach in IOT layer
Privacy enhancing technologies
Fundamental of network security
Encryption and cryptography implementation for IoT data
Security standard for available platform
European legislation for security in IoT platform
Secure booting
Device authentication
Firewalling and IPS
Updates and patches
Database implementation for IoT : Cloud based IoT platforms
SQL vs NoSQL-Which one is good for your IoT application
Open sourced vs. Licensed Database
Available M2M cloud platform
CISCO M2M platform
AT &T M2M platform
Google M2M platform
A few common IoT systems
Home automation
Energy optimization in Home
Smart Smoke alarm
BAC ( Blood alcohol monitoring ) for drug abusers under probation
Pet cam for Pet lovers
Wearable IOT
Mobile parking ticketing system
Indoor location tracking in Retail store
Home health care
Smart Sports Watch
Big Data for IoT
4V- Volume, velocity, variety and veracity of Big Data
Why Big Data is important in IoT
Big Data vs legacy data in IoT
Hadoop for IoT-when and why?
Storage technique for image, Geospatial and video data
Distributed database
Parallel computing basics for IoT

February 2016

PTC Academic Program Delivers Massive Open Online Courses (MOOCs) for IoT Learning
Company Launches Initial IoT MOOCs for Differentiated Educational Pathway

How It Works: Internet of Things
IBM Think Academy



The Future of IoT at Work
IBM Think Academy



IBM Watson IoT Platform Demo
IBM Internet of Things


The Internet of Things: Dr. John Barrett at TEDxCIT


TEDX Talks

Updated 20 Apr 2016, 24 March 2016

Thursday, April 14, 2016

Data Warehousing and Mining & Business Intelligence - Mumbai university Syllabus and Related Knols

Data Warehousing and Mining; Business Intelligence - Mumbai university Syllabus and Related Knols
Authors: Narayana Rao

Original URL:

Prerequisite: Data Base Management System

Objective: Today is the era characterized by Information Overload – Minimum
knowledge. Every business must rely extensively on data analysis to increase
productivity and survive competition. This course provides a comprehensive
introduction to data mining problems concepts with particular emphasis on business
intelligence applications.
The three main goals of the course are to enable students to:
1. Approach business problems data-analytically by identifying opportunities to
derive business value from data.
2. know the basics of data mining techniques and how they can be applied to extract
relevant business intelligence.

1. Introduction to Data Mining: Motivation for Data Mining, Data Mining-Definition
& Functionalities, Classification of DM systems, DM task primitives, Integration of a
Data Mining system with a Database or a Data Warehouse, Major issues in Data
2. Data Warehousing – (Overview Only): Overview of concepts like star schema, fact
and dimension tables, OLAP operations, From OLAP to Data Mining.
3. Data Preprocessing: Why? Descriptive Data Summarization, Data Cleaning:
Missing Values, Noisy Data, Data Integration and Transformation. Data Reduction:-
Data Cube Aggregation, Dimensionality reduction, Data Compression, Numerosity
Reduction, Data Discretization and Concept hierarchy generation for numerical and
categorical data.
4. Mining Frequent Patterns, Associations, and Correlations: Market Basket
Analysis, Frequent Itemsets, Closed Itemsets, and Association Rules, Frequent
Pattern Mining, Efficient and Scalable Frequent Itemset Mining Methods, The
Apriori Algorithm for finding Frequent Itemsets Using Candidate Generation,
Generating Association Rules from Frequent Itemsets, Improving the Efficiency of
Apriori, Frequent Itemsets without Candidate Generation using FP Tree, Mining
Multilevel Association Rules, Mining Multidimensional Association Rules, From
Association Mining to Correlation Analysis, Constraint-Based Association Mining.
5. Classification & Prediction: What is it? Issues regarding Classification and
• Classification methods: Decision tree, Bayesian Classification, Rule based
• Prediction: Linear and non linear regression
Accuracy and Error measures, Evaluating the accuracy of a Classifier or Predictor.
6. Cluster Analysis: What is it? Types of Data in cluster analysis, Categories of
clustering methods, Partitioning methods – K-Means, K-Mediods. Hierarchical
Clustering- Agglomerative and Divisive Clustering, BIRCH and ROCK methods,
DBSCAN, Outlier Analysis
7. Mining Stream and Sequence Data: What is stream data? Classification, Clustering
Association Mining in stream data. Mining Sequence Patterns in Transactional
8. Spatial Data and Text Mining: Spatial Data Cube Construction and Spatial OLAP,
Mining Spatial Association and Co-location Patterns, Spatial Clustering Methods,
Spatial Classification and Spatial Trend Analysis. Text Mining Text Data Analysis
and Information Retrieval, Dimensionality Reduction for Text, Text Mining
9. Web Mining: Web mining introduction, Web Content Mining, Web Structure
Mining, Web Usage mining, Automatic Classification of web Documents.
10. Data Mining for Business Intelligence Applications: Data mining for business
Applications like Balanced Scorecard, Fraud Detection, Clickstream Mining, Market
Segmentation, retail industry, telecommunications industry, banking & finance and
CRM etc.

Text Books:
1. Han, Kamber, "Data Mining Concepts and Techniques", Morgan Kaufmann 2nd
Notes for the chapters of this book available below.
2. P. N. Tan, M. Steinbach, Vipin Kumar, “Introduction to Data Mining”, Pearson

Reference Books:
1. MacLennan Jamie, Tang ZhaoHui and Crivat Bogdan, “Data Mining with Microsoft
SQL Server 2008”, Wiley India Edition.
2. G. Shmueli, N.R. Patel, P.C. Bruce, “Data Mining for Business Intelligence:
Concepts, Techniques, and Applications in Microsoft Office Excel with XLMiner”,
Wiley India.
3. Michael Berry and Gordon Linoff “Data Mining Techniques”, 2nd Edition Wiley
4. Alex Berson and Smith, “Data Mining and Data Warehousing and OLAP”, McGraw
Hill Publication.
5. E. G. Mallach, “Decision Support and Data Warehouse Systems", Tata McGraw Hill.
6. Michael Berry and Gordon Linoff “Mastering Data Mining- Art & science of CRM”,
Wiley Student Edition
7. Arijay Chaudhry & P. S. Deshpande, “Multidimensional Data Analysis and Data
Mining Dreamtech Press
8. Vikram Pudi & Radha Krishna, “Data Mining”, Oxford Higher Education.
9. Chakrabarti, S., “Mining the Web: Discovering knowledge from hypertext data”,
10. M. Jarke, M. Lenzerini, Y. Vassiliou, P. Vassiliadis (ed.), “Fundamentals of Data
Warehouses”, Springer-Verlag, 1999.
Term Work:
Term work shall consist of at least 10 experiments covering all topics Term work should
consist of at least 6 programming assignments and one mini project in Business
Intelligence and two assignments covering the topics of the syllabus. One written test is
also to be conducted.
Distribution of marks for term work shall be as follows:
1. Laboratory work (Experiments and Journal) 15 Marks
2. Test (at least one) 10 Marks
The final certification and acceptance of TW ensures the satisfactory Performance of
laboratory Work and Minimum Passing in the term work.

Suggested Experiment List
1. Students can learn to use WEKA open source data mining tool and run data mining
algorithms on datasets.
2. Program for Classification – Decision tree, Naïve Bayes using languages like JAVA
3. Program for Clustering – K-means, Agglomerative, Divisive using languages like
4. Program for Association Mining using languages like JAVA
5. Web mining
6. BI projects: any one of Balanced Scorecard, Fraud detection, Market Segmentation
7. Using any commercial BI tool like SQLServer 2008, Oracle BI, SPSS, Clementine,
and XLMiner etc.

Notes - Data Mining

Data Mining Concepts and Techniques
Third Edition
Jiawei Han
University of Illinois at Urbana–Champaign
Micheline Kamber, Jian Pei
Simon Fraser University

Introduction to Data Mining

Data Preprocessing for Data Mining

Data Warehousing and Online Analytical Processing - Chapter of Data Mining

Data Cube Technologies for Data Mining

Mining Frequent Patterns, Associations, and Correlations: Basic Concepts and Methods

Advanced Patterns - Data Mining

Data Mining - Classification: Basic Concepts

Data Mining - Classification: Advanced Methods

Data Mining Recent Trends and Research Frontiers

Updated 14 Apr 2016

First Published  2011

Data Mining Recent Trends and Research Frontiers

1 Mining Complex Data Types

Mining Sequence Data: Time-Series, Symbolic
Sequences, and Biological Sequences
A sequence is an ordered list of events. Sequences may be categorized into three groups,
based on the characteristics of the events they describe: (1) time-series data, (2) symbolic
sequence data, and (3) biological sequences. Let’s consider each type.
In time-series data, sequence data consist of long sequences of numeric data,
recorded at equal time intervals (e.g., per minute, per hour, or per day). Time-series
data can be generated by many natural and economic processes such as stock markets,
and scientific, medical, or natural observations.
Symbolic sequence data consist of long sequences of event or nominal data, which
typically are not observed at equal time intervals. For many such sequences, gaps (i.e.,
lapses between recorded events) do not matter much. Examples include customer shopping
sequences and web click streams, as well as sequences of events in science and
engineering and in natural and social developments.
Biological sequences include DNA and protein sequences. Such sequences are typically
very long, and carry important, complicated, but hidden semantic meaning. Here,
gaps are usually important.

2 Other Methodologies of Data Mining

Not covered so far in this book

Statistical Data Mining
The data mining techniques described in this book are primarily drawn from computer
science disciplines, including data mining, machine learning, data warehousing, and
algorithms. They are designed for the efficient handling of huge amounts of data that are
typically multidimensional and possibly of various complex types. There are, however,
many well-established statistical techniques for data analysis, particularly for numeric
data. These techniques have been applied extensively to scientific data (e.g., data from
experiments in physics, engineering, manufacturing, psychology, and medicine), as well
as to data from economics and the social sciences. Some of these techniques, such as
principal components analysis and clustering, have already been addressed in this book.
Some more methods are:

Regression: In general, these methods are used to predict the value of a response
(dependent) variable from one or more predictor (independent) variables, where the
variables are numeric. There are various forms of regression, such as linear, multiple,
weighted, polynomial, nonparametric, and robust (robust methods are useful
when errors fail to satisfy normalcy conditions or when the data contain significant
Generalized linear models: These models, and their generalization (generalized additive
models), allow a categorical (nominal) response variable (or some transformation
of it) to be related to a set of predictor variables in a manner similar to the modeling
of a numeric response variable using linear regression. Generalized linear models
include logistic regression and Poisson regression.
Analysis of variance: These techniques analyze experimental data for two or more
populations described by a numeric response variable and one or more categorical
variables (factors). In general, an ANOVA (single-factor analysis of variance) problem
involves a comparison of k population or treatment means to determine if at least two
of the means are different. More complex ANOVA problems also exist.
Mixed-effect models: These models are for analyzing grouped data—data that can
be classified according to one or more grouping variables. They typically describe
relationships between a response variable and some covariates in data grouped
according to one or more factors. Common areas of application include multilevel
data, repeated measures data, block designs, and longitudinal data.
Factor analysis: This method is used to determine which variables are combined to
generate a given factor. For example, for many psychiatric data, it is not possible to
measure a certain factor of interest directly (e.g., intelligence); however, it is often
possible to measure other quantities (e.g., student test scores) that reflect the factor
of interest. Here, none of the variables is designated as dependent.

Discriminant analysis: This technique is used to predict a categorical response variable.
Unlike generalized linear models, it assumes that the independent variables
follow a multivariate normal distribution. The procedure attempts to determine
several discriminant functions (linear combinations of the independent variables)
that discriminate among the groups defined by the response variable. Discriminant
analysis is commonly used in social sciences.
Survival analysis: Several well-established statistical techniques exist for survival
analysis. These techniques originally were designed to predict the probability that
a patient undergoing a medical treatment would survive at least to time t. Methods
for survival analysis, however, are also commonly applied to manufacturing settings
to estimate the life span of industrial equipment. Popular methods include KaplanMeier
estimates of survival, Cox proportional hazards regression models, and their
Quality control: Various statistics can be used to prepare charts for quality control,
such as Shewhart charts and CUSUM charts (both of which display group summary
statistics). These statistics include the mean, standard deviation, range, count,
moving average, moving standard deviation, and moving range.

Visual and Audio Data Mining
Visual data mining discovers implicit and useful knowledge from large data sets using
data and/or knowledge visualization techniques. The human visual system is controlled
by the eyes and brain, the latter of which can be thought of as a powerful, highly parallel
processing and reasoning engine containing a large knowledge base. Visual data mining
essentially combines the power of these components, making it a highly attractive and
effective tool for the comprehension of data distributions, patterns, clusters, and outliers
in data.
Visual data mining can be viewed as an integration of two disciplines: data visualization
and data mining. It is also closely related to computer graphics, multimedia systems,
human–computer interaction, pattern recognition, and high-performance computing.
In general, data visualization and data mining can be integrated in the following ways:

Audio data mining uses audio signals to indicate the patterns of data or the features
of data mining results. Although visual data mining may disclose interesting patterns
using graphical displays, it requires users to concentrate on watching patterns and identifying
interesting or novel features within them. This can sometimes be quite tiresome.
If patterns can be transformed into sound and music, then instead of watching pictures,
we can listen to pitchs, rhythm, tune, and melody to identify anything interesting
or unusual.

3 Data Mining Applications

In this book, we have studied principles and methods for mining relational data, data
warehouses, and complex data types. Because data mining is a relatively young discipline
with wide and diverse applications, there is still a nontrivial gap between general principles
of data mining and application-specific, effective data mining tools.

Data Mining for Financial Data Analysis
Most banks and financial institutions offer a wide variety of banking, investment, and
credit services (the latter include business, mortgage, and automobile loans and credit
cards). Some also offer insurance and stock investment services.

Financial data collected in the banking and financial industry are often relatively
complete, reliable, and of high quality, which facilitates systematic data analysis and data
mining. Here we present a few typical cases.
Design and construction of data warehouses for multidimensional data analysis
and data mining:

Loan payment prediction and customer credit policy analysis: Loan payment prediction
and customer credit analysis are critical to the business of a bank. Many
factors can strongly or weakly influence loan payment performance and customer
credit rating. Data mining methods, such as attribute selection and attribute relevance
ranking, may help identify important factors and eliminate irrelevant ones

Classification and clustering of customers for targeted marketing: Classification
and clustering methods can be used for customer group identification and targeted

Detection of money laundering and other financial crimes: To detect money laundering
and other financial crimes, it is important to integrate information from
multiple, heterogeneous databases (e.g., bank transaction databases and federal or
state crime history databases), as long as they are potentially related to the study.
Multiple data analysis tools can then be used to detect unusual patterns, such as large
amounts of cash flow at certain periods, by certain groups of customers. U

Data Mining for Retail and Telecommunication Industries
The retail industry is a well-fit application area for data mining, since it collects huge
amounts of data on sales, customer shopping history, goods transportation, consumption,
and service. The quantity of data collected continues to expand rapidly, especially
due to the increasing availability, ease, and popularity of business conducted on the Web,
or e-commerce. Today, most major chain stores also have web sites where customers
can make purchases online. Some businesses, such as (,
exist solely online, without any brick-and-mortar (i.e., physical) store locations. Retail
data provide a rich source for data mining.
Retail data mining can help identify customer buying behaviors, discover customer
shopping patterns and trends, improve the quality of customer service, achieve better
customer retention and satisfaction, enhance goods consumption ratios, design more
effective goods transportation and distribution policies, and reduce the cost of business.

Data Mining in Science and Engineering

Today, scientific data can be amassed at much higher speeds and lower costs.
This has resulted in the accumulation of huge volumes of high-dimensional data,
stream data, and heterogenous data, containing rich spatial and temporal information.
Consequently, scientific applications are shifting from the “hypothesize-and-test”
paradigm toward a “collect and store data, mine for new hypotheses, confirm with data or
experimentation” process. This shift brings about new challenges for data mining.

5 Data Mining Trends

The diversity of data, data mining tasks, and data mining approaches poses many challenging
research issues in data mining. The development of efficient and effective data
mining methods, systems and services, and interactive and integrated data mining environments
is a key area of study. The use of data mining techniques to solve large or
sophisticated application problems is an important task for data mining researchers
and data mining system and application developers.

Excerpts from the Book

Data Mining Concepts and Techniques
Third Edition
Jiawei Han
University of Illinois at Urbana–Champaign
Micheline Kamber, Jian Pei
Simon Fraser University

Outlier Detection - Outlier Analysis

1 Outliers and Outlier Analysis

What Are Outliers?

Generally it is assumed that a statistical process is used to generate a set of data objects. An outlier
is a data object that deviates significantly from the rest of the objects, as if it were generated
by a different statistical process. In this chapter, we refer to data objects that are not outliers as “normal” or expected data. Outliers are termed as “abnormal” data

Types of Outliers

In general, outliers can be classified into three categories, namely global outliers, contextual
(or conditional) outliers, and collective outliers.

Global Outliers
In a given data set, a data object is a global outlier if it deviates significantly from the rest of the data set. Global outliers are sometimes called point anomalies, and are the simplest type of outliers. Most outlier detection methods are aimed at finding global outliers.

Contextual (or conditional) outliers

In a given data set, a data object is a contextual outlier if it deviates significantly with respect to a specific context of the object. Contextual outliers are also known as conditional outliers because they are conditional on the selected context. Therefore, in contextual outlier detection, the context has to be specified as part of the problem definition. Generally, in contextual outlier detection, the attributes of the data objects in question are divided into two groups:

Contextual attributes: The contextual attributes of a data object define the object’s
context. In the temperature example, the contextual attributes may be date and location.

Behavioral attributes: These define the object’s characteristics, and are used to evaluate
whether the object is an outlier in the context to which it belongs. In the temperature example, the behavioral attributes may be the temperature, humidity,

Collective Outliers

When various shippers are transporting items, some may be delayed. But 100 of them got delayed there is a collective outlier responsible for this.

2 Outlier Detection Methods

Supervised, Semi-Supervised, and Unsupervised Methods
If expert-labeled examples of normal and/or outlier objects can be obtained, they can be
used to build outlier detection models. The methods used can be divided into supervised
methods, semi-supervised methods, and unsupervised methods.
Supervised Methods
Supervised methods model data normality and abnormality. Domain experts examine
and label a sample of the underlying data.

Unsupervised Methods
In some application scenarios, objects labeled as “normal” or “outlier” are not available.
Thus, an unsupervised learning method has to be used.
Unsupervised outlier detection methods make an implicit assumption: The normal
objects are somewhat “clustered.” In other words, an unsupervised outlier detection
method expects that normal objects follow a pattern far more frequently than outliers.
Normal objects do not have to fall into one group sharing high similarity. Instead, they
can form multiple groups, where each group has distinct features. However, an outlier is
expected to occur far away in feature space from any of those groups of normal objects

Semi-Supervised Methods
In many applications, although obtaining some labeled examples is feasible, the number
of such labeled examples is often small. We may encounter cases where only a small set
of the normal and/or outlier objects are labeled, but most of the data are unlabeled.
Semi-supervised outlier detection methods were developed to tackle such scenarios.
Semi-supervised outlier detection methods can be regarded as applications of semisupervised
learning methods

Statistical Methods, Proximity-Based Methods,
and Clustering-Based Methods
Outlier detection methods make assumptions about outliers
versus the rest of the data. According to the assumptions made, we can categorize outlier
detection methods into three types: statistical methods, proximity-based methods, and
clustering-based methods.

Statistical Methods
Statistical methods (also known as model-based methods) make assumptions of
data normality. They assume that normal data objects are generated by a statistical
(stochastic) model, and that data not following the model are outliers.

Proximity-Based Methods
Proximity-based methods assume that an object is an outlier if the nearest neighbors
of the object are far away in feature space, that is, the proximity of the object to its
neighbors significantly deviates from the proximity of most of the other objects to their
neighbors in the same data set.

Clustering-Based Methods
Clustering-based methods assume that the normal data objects belong to large and
dense clusters,

3 Statistical Approaches

They assume that the normal objects in a data set are
generated by a stochastic process (a generative model). Consequently, normal objects
occur in regions of high probability for the stochastic model, and objects in the regions
of low probability are outliers.
The general idea behind statistical methods for outlier detection is to learn a generative
model fitting the given data set, and then identify those objects in low-probability
regions of the model as outliers. However, there are many different ways to learn generative
models. In general, statistical methods for outlier detection can be divided into two
major categories: parametric methods and nonparametric methods, according to how the
models are specified and learned.
and pressure.

4 Proximity-Based Approaches

Given a set of objects in feature space, a distance measure can be used to quantify the
similarity between objects. Intuitively, objects that are far from others can be regarded
as outliers. Proximity-based approaches assume that the proximity of an outlier object
to its nearest neighbors significantly deviates from the proximity of the object to most
of the other objects in the data set.
There are two types of proximity-based outlier detection methods: distance-based
and density-based methods. A distance-based outlier detection method consults the
neighborhood of an object, which is defined by a given radius. An object is then considered
an outlier if its neighborhood does not have enough other points. A density-based
outlier detection method investigates the density of an object and that of its neighbors.
Here, an object is identified as an outlier if its density is relatively much lower than that
of its neighbors

5 Clustering-Based Approaches

The notion of outliers is highly related to that of clusters. Clustering-based approaches
detect outliers by examining the relationship between objects and clusters. Intuitively,
an outlier is an object that belongs to a small and remote cluster, or does not belong to
any cluster.
This leads to three general approaches to clustering-based outlier detection. Consider
an object.
Does the object belong to any cluster? If not, then it is identified as an outlier.
Is there a large distance between the object and the cluster to which it is closest? If
yes, it is an outlier.
Is the object part of a small or sparse cluster? If yes, then all the objects in that cluster
are outliers.

6 Classification-Based Approaches

Outlier detection can be treated as a classification problem if a training data set with class
labels is available. The general idea of classification-based outlier detection methods is
to train a classification model that can distinguish normal data from outliers.
Consider a training set that contains samples labeled as “normal” and others labeled
as “outlier.” A classifier can then be constructed based on the training set.

7 Mining Contextual and Collective Outliers

An object in a given data set is a contextual outlier (or conditional outlier) if it deviates
significantly with respect to a specific context of the object (Section 12.1). The
context is defined using contextual attributes. These depend heavily on the application,
and are often provided by users as part of the contextual outlier detection task.
Contextual attributes can include spatial attributes, time, network locations, and sophisticated
structured attributes. In addition, behavioral attributes define characteristics of
the object, and are used to evaluate whether the object is an outlier in the context to
which it belongs.

Mining Collective Outliers
A group of data objects forms a collective outlier if the objects as a whole deviate significantly
from the entire data set, even though each individual object in the group may
not be an outlier. To detect collective outliers, we have to examine the
structure of the data set, that is, the relationships between multiple data objects. This
makes the problem more difficult than conventional and contextual outlier detection.
“How can we explore the data set structure?” This typically depends on the nature
of the data. For outlier detection in temporal data (e.g., time series and sequences), we
explore the structures formed by time, which occur in segments of the time series or subsequences.
To detect collective outliers in spatial data, we explore local areas. Similarly,
in graph and network data, we explore subgraphs. Each of these structures is inherent to
its respective data type.

8 Outlier Detection in High-Dimensional Data

In some applications, we may need to detect outliers in high-dimensional data. The
dimensionality curse poses huge challenges for effective outlier detection. As the dimensionality
increases, the distance between objects may be heavily dominated by noise.
That is, the distance and similarity between two points in a high-dimensional space
may not reflect the real relationship between the points. Consequently, conventional
outlier detection methods, which mainly use proximity or density to identify outliers,
deteriorate as dimensionality increases.

Angle-based outlier detection (ABOD)

Next Chapter:  Data Mining Recent Trends and Research Frontiers

Excerpts from the Book

Data Mining Concepts and Techniques

Third Edition
Jiawei Han
University of Illinois at Urbana–Champaign
Micheline Kamber, Jian Pei
Simon Fraser University

Data Mining - Advanced Cluster Analysis

Probabilistic model-based clustering: A general framework
and a method for deriving clusters where each object is assigned a probability of
belonging to a cluster. Probabilistic model-based clustering is widely used in many
data mining applications such as text mining.

Clustering high-dimensional data: When the dimensionality is high, conventional
distance measures can be dominated by noise.
Clustering graph and network data: Graph and network data are increasingly popular
in applications such as online social networks, the World Wide Web, and digital
libraries.  key issues in clustering graph and network data, including similarity measurement and clustering methods are to be described.
Clustering with constraints:. In some applications, however, various constraints may exist.
These constraints may rise from background knowledge or spatial distribution of
the objects.

1 Probabilistic Model-Based Clustering

In this section, we
demonstrate the need for fuzzy or flexible cluster assignment in some applications, and
introduce a general method to compute probabilistic clusters and assignments.

Fuzzy Clusters
Given a set of objects, X = {x1,...,xn}, a fuzzy set S is a subset of X that allows each
object in X to have a membership degree between 0 and 1. Formally, a fuzzy set, S, can
be modeled as a function, FS:X → [0,1].

Probabilistic Model-Based Clusters
“Fuzzy clusters  provide the flexibility of allowing an object to participate
in multiple clusters. Is there a general framework to specify clusterings where objects may
participate in multiple clusters in a probabilistic way?”

Statistically, we can assume that a hidden category is a distribution over the data
space, which can be mathematically represented using a probability density function
(or distribution function). We call such a hidden category a probabilistic cluster. For a
probabilistic cluster, C, its probability density function, f , and a point, o, in the data
space, f (o) is the relative likelihood that an instance of C appears at o.

Given data set, D, and k, the number of clusters required, the task of probabilistic
model-based cluster analysis is to infer a set of k probabilistic clusters that is most likely to
generate D using this data generation process. An important question remaining is how
we can measure the likelihood that a set of k probabilistic clusters and their probabilities
will generate an observed data set

Expectation-Maximization Algorithm

It can easily be shown that k-means clustering is a special case of fuzzy clustering
The k-means algorithm iterates until the clustering cannot be improved.

Each iteration consists of two steps:
The expectation step (E-step): Given the current cluster centers, each object is assigned
to the cluster with a center that is closest to the object. Here, an object is expected to
belong to the closest cluster.
The maximization step (M-step): Given the cluster assignment, for each cluster, the
algorithm adjusts the center so that the sum of the distances from the objects
assigned to this cluster and the new center is minimized. That is, the similarity of
objects assigned to a cluster is maximized.

We can generalize this two-step method to tackle fuzzy clustering and probabilistic
model-based clustering. In general, an expectation-maximization (EM) algorithm is
a framework that approaches maximum likelihood or maximum a posteriori estimates
of parameters in statistical models. In the context of fuzzy or probabilistic model-based
clustering, an EM algorithm starts with an initial set of parameters and iterates until
the clustering cannot be improved, that is, until the clustering converges or the change
is sufficiently small (less than a preset threshold). Each iteration also consists of two

2 Clustering High-Dimensional Data

Methods for
high-dimensional data clustering can be divided into two categories: subspace clustering
methods and dimensionality reduction methods.

In some applications, a data object may be described by 10 or more attributes. Such
objects are referred to as a high-dimensional data space.

3 Clustering Graph and Network Data
Cluster analysis on graph and network data extracts valuable knowledge and information.
Such data are increasingly popular in many applications.

Similarity Measures

Geodesic Distance
A simple measure of the distance between two vertices in a graph is the shortest path
between the vertices. Formally, the geodesic distance between two vertices is the length
in terms of the number of edges of the shortest path between the vertices. For two
vertices that are not connected in a graph, the geodesic distance is defined as infinite

SimRank: Similarity Based on Random Walk
and Structural Context
For some applications, geodesic distance may be inappropriate in measuring the similarity
between vertices in a graph. Here we introduce SimRank, a similarity measure
based on random walk and on the structural context of the graph. In mathematics, a
random walk is a trajectory that consists of taking successive random steps.

4 Clustering with Constraints

Users often have background knowledge that they want to integrate into cluster analysis.
There may also be application-specific requirements. Such information can be modeled
as clustering constraints

Next Chapter
Data Mining Recent Trends and Research Frontiers

Excerpts from the Book

Data Mining Concepts and Techniques
Third Edition
Jiawei Han
University of Illinois at Urbana–Champaign
Micheline Kamber, Jian Pei
Simon Fraser University

Data Mining - Cluster Analysis: Basic Concepts and Methods

To be rewritten to integrate the material which is now in the form excerpts.

Partitioned Clusters - Hierarchical clusters - Grid based cluster - Density based Clusters

Clustering is the process of grouping a set of data objects into multiple groups or clusters
so that objects within a cluster have high similarity, but are very dissimilar to objects
in other clusters. Dissimilarities and similarities are assessed based on the attribute values
describing the objects and often involve distance measures.1 Clustering as a data
mining tool has its roots in many application areas such as biology, security, business
intelligence, and Web search.

1 Cluster Analysis Basics

What Is Cluster Analysis?
Cluster analysis or simply clustering is the process of partitioning a set of data objects
(or observations) into subsets. Each subset is a cluster, such that objects in a cluster
are similar to one another, yet dissimilar to objects in other clusters. The set of clusters
resulting from a cluster analysis can be referred to as a clustering.

Clustering is also called data segmentation in some applications because clustering
partitions large data sets into groups according to their similarity. Clustering can
also be used for outlier detection, where outliers (values that are “far away” from any
cluster) may be more interesting than common cases.

Data clustering is under vigorous development. Contributing areas of research
include data mining, statistics, machine learning, spatial database technology, information
retrieval, Web search, biology, marketing, and many other application areas. Owing
to the huge amounts of data collected in databases, cluster analysis has recently become
a highly active topic in data mining research.

As a branch of statistics, cluster analysis has been extensively studied, with the
main focus on distance-based cluster analysis. Cluster analysis tools based on k-means,
k-medoids, and several other methods also have been built into many statistical analysis
software packages or systems, such as S-Plus, SPSS, and SAS. In machine learning, recall
that classification is known as supervised learning because the class label information is
given, that is, the learning algorithm is supervised in that it is told the class membership
of each training tuple. Clustering is known as unsupervised learning because the
class label information is not present. For this reason, clustering is a form of learning
by observation, rather than learning by examples. In data mining, efforts have focused
on finding methods for efficient and effective cluster analysis in large databases. Active
themes of research focus on the scalability of clustering methods, the effectiveness of
methods for clustering complex shapes (e.g., nonconvex) and types of data (e.g., text,
graphs, and images), high-dimensional clustering techniques (e.g., clustering objects
with thousands of features), and methods for clustering mixed numerical and nominal
data in large databases.

Requirements for Cluster Analysis

The following are typical requirements of clustering in data mining.
Scalability: Many clustering algorithms work well on small data sets containing fewer
than several hundred data objects; however, a large database may contain millions or
even billions of objects, particularly in Web search scenarios. Clustering on only a
sample of a given large data set may lead to biased results. Therefore, highly scalable
clustering algorithms are needed.
Ability to deal with different types of attributes: Many algorithms are designed to
cluster numeric (interval-based) data. However, applications may require clustering
other data types, such as binary, nominal (categorical), and ordinal data, or mixtures
of these data types. Recently, more and more applications need clustering techniques
for complex data types such as graphs, sequences, images, and documents.
Discovery of clusters with arbitrary shape: Many clustering algorithms determine
clusters based on Euclidean or Manhattan distance measures (Chapter 2). Algorithms
based on such distance measures tend to find spherical clusters with similar size and
density. However, a cluster could be of any shape. Consider sensors, for example,
which are often deployed for environment surveillance. Cluster analysis on sensor
readings can detect interesting phenomena. We may want to use clustering to find
the frontier of a running forest fire, which is often not spherical. It is important to
develop algorithms that can detect clusters of arbitrary shape

Requirements for domain knowledge to determine input parameters: Many clustering
algorithms require users to provide domain knowledge in the form of input
parameters such as the desired number of clusters. Consequently, the clustering
results may be sensitive to such parameters. Parameters are often hard to determine,
especially for high-dimensionality data sets and where users have yet to grasp a deep
understanding of their data. Requiring the specification of domain knowledge not
only burdens users, but also makes the quality of clustering difficult to control.
Ability to deal with noisy data: Most real-world data sets contain outliers and/or
missing, unknown, or erroneous data. Sensor readings, for example, are often
noisy—some readings may be inaccurate due to the sensing mechanisms, and some
readings may be erroneous due to interferences from surrounding transient objects.
Clustering algorithms can be sensitive to such noise and may produce poor-quality
clusters. Therefore, we need clustering methods that are robust to noise.
Incremental clustering and insensitivity to input order: In many applications,
incremental updates (representing newer data) may arrive at any time. Some clustering
algorithms cannot incorporate incremental updates into existing clustering
structures and, instead, have to recompute a new clustering from scratch. Clustering
algorithms may also be sensitive to the input data order. That is, given a set
of data objects, clustering algorithms may return dramatically different clusterings
depending on the order in which the objects are presented. Incremental clustering
algorithms and algorithms that are insensitive to the input order are needed.

Capability of clustering high-dimensionality data: A data set can contain numerous
dimensions or attributes. When clustering documents, for example, each keyword
can be regarded as a dimension, and there are often thousands of keywords. Most
clustering algorithms are good at handling low-dimensional data such as data sets
involving only two or three dimensions. Finding clusters of data objects in a highdimensional
space is challenging, especially considering that such data can be very
sparse and highly skewed.
Constraint-based clustering: Real-world applications may need to perform clustering
under various kinds of constraints. Suppose that your job is to choose the
locations for a given number of new automatic teller machines (ATMs) in a city. To
decide upon this, you may cluster households while considering constraints such as
the city’s rivers and highway networks and the types and number of customers per
cluster. A challenging task is to find data groups with good clustering behavior that
satisfy specified constraints.
Interpretability and usability: Users want clustering results to be interpretable,
comprehensible, and usable. That is, clustering may need to be tied in with specific
semantic interpretations and applications. It is important to study how an
application goal may influence the selection of clustering features and clustering

Overview of Basic Clustering Methods

Partitioning methods: Given a set of n objects, a partitioning method constructs k
partitions of the data, where each partition represents a cluster and k ≤ n. That is, it
divides the data into k groups such that each group must contain at least one object.
In other words, partitioning methods conduct one-level partitioning on data sets.
The basic partitioning methods typically adopt exclusive cluster separation. That is,
each object must belong to exactly one group.

iterative relocation technique that attempts to improve the partitioning by moving
objects from one group to another. The general criterion of a good partitioning is
that objects in the same cluster are “close” or related to each other, whereas objects
in different clusters are “far apart” or very different.

Hierarchical methods: A hierarchical method creates a hierarchical decomposition of
the given set of data objects. A hierarchical method can be classified as being either
agglomerative or divisive, based on how the hierarchical decomposition is formed.
The agglomerative approach, also called the bottom-up approach, starts with each
object forming a separate group. It successively merges the objects or groups close
to one another, until all the groups are merged into one (the topmost level of the
hierarchy), or a termination condition holds. The divisive approach, also called the
top-down approach, starts with all the objects in the same cluster. In each successive
iteration, a cluster is split into smaller clusters, until eventually each object is in one
cluster, or a termination condition holds.

Density-based methods: Some  clustering methods have been developed based on the notion of density. Their general idea
is to continue growing a given cluster as long as the density (number of objects or
data points) in the “neighborhood” exceeds some threshold. For example, for each
data point within a given cluster, the neighborhood of a given radius has to contain
at least a minimum number of points. Such a method can be used to filter out noise
or outliers and discover clusters of arbitrary shape.

Grid-based methods: Grid-based methods quantize the object space into a finite
number of cells that form a grid structure. All the clustering operations are performed
on the grid structure (i.e., on the quantized space). The main advantage of
this approach is its fast processing time, which is typically independent of the number
of data objects and dependent only on the number of cells in each dimension in
the quantized space.

2 Partitioning Methods

Formally, given a data set, D, of n objects, and k, the number of clusters to form, a
partitioning algorithm organizes the objects into k partitions (k ≤ n), where each partition
represents a cluster. The clusters are formed to optimize an objective partitioning
criterion, such as a dissimilarity function based on distance, so that the objects within a
cluster are “similar” to one another and “dissimilar” to objects in other clusters in terms
of the data set attributes.

k-Means: A Centroid-Based Technique
Suppose a data set, D, contains n objects in Euclidean space. Partitioning methods distribute
the objects in D into k clusters, C1,...,Ck
, that is, Ci ⊂ D and Ci ∩Cj = ∅ for
(1 ≤ i,j ≤ k). An objective function is used to assess the partitioning quality so that
objects within a cluster are similar to one another but dissimilar to objects in other
clusters. This is, the objective function aims for high intracluster similarity and low
intercluster similarity.
A centroid-based partitioning technique uses the centroid of a cluster, Ci
, to represent
that cluster. Conceptually, the centroid of a cluster is its center point. The centroid can
be defined in various ways such as by the mean or medoid of the objects (or points)
assigned to the cluster

k-Medoids: A Representative Object-Based Technique
The k-means algorithm is sensitive to outliers because such objects are far away from the
majority of the data, and thus, when assigned to a cluster, they can dramatically distort
the mean value of the cluster.

we can
pick actual objects to represent the clusters, using one representative object per cluster.
Each remaining object is assigned to the cluster of which the representative object is
the most similar. The partitioning method is then performed based on the principle of
minimizing the sum of the dissimilarities between each object p and its corresponding
representative object

. This is the basis for the k-medoids method, which groups n
objects into k clusters by minimizing the absolute error.
The Partitioning Around Medoids (PAM) algorithm is a popular
realization of k-medoids clustering.

3 Hierarchical Methods

While partitioning methods meet the basic clustering requirement of organizing a set of
objects into a number of exclusive groups, in some situations we may want to partition
our data into groups at different levels such as in a hierarchy. A hierarchical clustering
method works by grouping data objects into a hierarchy or “tree” of clusters

Employees may be clusted nto major groups such as executives, managers, and
staff. One can further partition these groups into smaller subgroups. For instance, the
general group of staff can be further divided into subgroups of senior officers, officers,
and trainees. All these groups form a hierarchy. We can easily summarize or characterize
the data that are organized into a hierarchy, which can be used to find, say, the average
salary of managers and of officers.

A promising direction for improving the clustering quality of hierarchical methods
is to integrate hierarchical clustering with other clustering techniques, resulting in
multiple-phase (or multiphase) clustering. Two such methods are BIRCH and Chameleon. BIRCH begins by partitioning objects hierarchically
using tree structures, where the leaf or low-level nonleaf nodes can be
viewed as “microclusters” depending on the resolution scale. It then applies other clustering algorithms to perform macroclustering on the microclusters. Chameleon
 explores dynamic modeling in hierarchical clustering.

A tree structure called a dendrogram is commonly used to represent the process of
hierarchical clustering.

Distance Measures in Algorithmic Methods
Whether using an agglomerative method or a divisive method, a core need is to measure
the distance between two clusters, where each cluster is generally a set of objects.

BIRCH: Multiphase Hierarchical Clustering
Using Clustering Feature Trees
Balanced Iterative Reducing and Clustering using Hierarchies (BIRCH) is designed for
clustering a large amount of numeric data by integrating hierarchical clustering (at the
initial microclustering stage) and other clustering methods such as iterative partitioning
(at the later macroclustering stage). It overcomes the two difficulties in agglomerative
clustering methods: (1) scalability and (2) the inability to undo what was done in the
previous step.

A CF-tree is a height-balanced tree that stores the clustering features for a hierarchical

y. The
primary phases are
Phase 1: BIRCH scans the database to build an initial in-memory CF-tree, which
can be viewed as a multilevel compression of the data that tries to preserve the data’s
inherent clustering structure.
Phase 2: BIRCH applies a (selected) clustering algorithm to cluster the leaf nodes of
the CF-tree, which removes sparse clusters as outliers and groups dense clusters into
larger ones.

Chameleon: Multiphase Hierarchical Clustering
Using Dynamic Modeling
Chameleon is a hierarchical clustering algorithm that uses dynamic modeling to determine
the similarity between pairs of clusters. In Chameleon, cluster similarity is assessed
based on (1) how well connected objects are within a cluster and (2) the proximity of
clusters. That is, two clusters are merged if their interconnectivity is high and they are
close together. Thus, Chameleon does not depend on a static, user-supplied model and
can automatically adapt to the internal characteristics of the clusters being merged. The
merge process facilitates the discovery of natural and homogeneous clusters and applies
to all data types as long as a similarity function can be specified.

 Probabilistic Hierarchical Clustering

Probabilistic hierarchical clustering aims to overcome some of these disadvantages
by using probabilistic models to measure distances between clusters.

In practice, we can assume that the data generative models adopt common distribution
functions, such as Gaussian distribution or Bernoulli distribution, which are
governed by parameters. The task of learning a generative model is then reduced to
finding the parameter values for which the model best fits the observed data set.

4 Density-Based Methods

Partitioning and hierarchical methods are designed to find spherical-shaped clusters.
They have difficulty finding clusters of arbitrary shape such as the “S” shape and oval
clusters in Figure 10.13. Given such data, they would likely inaccurately identify convex
regions, where noise or outliers are included in the clusters.
To find clusters of arbitrary shape, alternatively, we can model clusters as dense
regions in the data space, separated by sparse regions. This is the main strategy behind
density-based clustering methods, which can discover clusters of nonspherical shape.

DBSCAN: Density-Based Clustering Based on Connected
Regions with High Density
“How can we find dense regions in density-based clustering?” The density of an object o
can be measured by the number of objects close to o. DBSCAN (Density-Based Spatial
Clustering of Applications with Noise) finds core objects, that is, objects that have dense
neighborhoods. It connects core objects and their neighborhoods to form dense regions
as clusters.
“How does DBSCAN quantify the neighborhood of an object?” A user-specified parameter
> 0 is used to specify the radius of a neighborhood we consider for every object.
The -neighborhood of an object o is the space within a radius centered at o.
Due to the fixed neighborhood size parameterized by , the density of a neighborhood
can be measured simply by the number of objects in the neighborhood.

OPTICS: Ordering Points to Identify
the Clustering Structure
Although DBSCAN can cluster objects given input parameters such as (the maximum
radius of a neighborhood) and MinPts (the minimum number of points required
in the neighborhood of a core object), it encumbers users with the responsibility of
selecting parameter values that will lead to the discovery of acceptable clusters. This is
a problem associated with many other clustering algorithms. Such parameter settings
are usually empirically set and difficult to determine, especially for real-world, highdimensional
data sets. Most algorithms are sensitive to these parameter values: Slightly
different settings may lead to very different clusterings of the data. Moreover, real-world,
high-dimensional data sets often have very skewed distributions such that their intrinsic

OPTICS computes an ordering of all objects in a given database and, for each object
in the database, stores the core-distance and a suitable reachability-distance. OPTICS
maintains a list called OrderSeeds to generate the output ordering. Objects in OrderSeeds
are sorted by the reachability-distance from their respective closest core objects,
that is, by the smallest reachability-distance of each object.
OPTICS begins with an arbitrary object from the input database as the current
object, p. It retrieves the -neighborhood of p, determines the core-distance, and sets
the reachability-distance to undefined.

DENCLUE: Clustering Based on Density
Distribution Functions
Density estimation is a core issue in density-based clustering methods. DENCLUE
(DENsity-based CLUstEring) is a clustering method based on a set of density distribution

5 Grid-Based Methods

The clustering methods discussed so far are data-driven—they partition the set of
objects and adapt to the distribution of the objects in the embedding space. Alternatively,
a grid-based clustering method takes a space-driven approach by partitioning
the embedding space into cells independent of the distribution of the input objects

STING: STatistical INformation Grid
STING is a grid-based multiresolution clustering technique in which the embedding
spatial area of the input objects is divided into rectangular cells. The space can be divided
in a hierarchical and recursive way. Several levels of such rectangular cells correspond to
different levels of resolution and form a hierarchical structure: Each cell at a high level
is partitioned to form a number of cells at the next lower level. Statistical information
regarding the attributes in each grid cell, such as the mean, maximum, and minimum
values, is precomputed and stored as statistical parameters. These statistical parameters
are useful for query processing and for other data analysis tasks.

CLIQUE: An Apriori-like Subspace Clustering Method

CLIQUE (CLustering In QUEst) is a simple grid-based method for finding densitybased
clusters in subspaces. CLIQUE partitions each dimension into nonoverlapping
intervals, thereby partitioning the entire embedding space of the data objects into cells.
It uses a density threshold to identify dense cells and sparse ones. A cell is dense if the
number of objects mapped to it exceeds the density threshold.
The main strategy behind CLIQUE for identifying a candidate search space uses the
monotonicity of dense cells with respect to dimensionality. This is based on the Apriori
property used in frequent pattern and association rule mining.

6 Evaluation of Clustering

Assessing clustering tendency. In this task, for a given data set, we assess whether a
nonrandom structure exists in the data. Blindly applying a clustering method on a
data set will return clusters; however, the clusters mined may be misleading. Clustering
analysis on a data set is meaningful only when there is a nonrandom structure in
the data.
Determining the number of clusters in a data set. A few algorithms, such as k-means,
require the number of clusters in a data set as the parameter. Moreover, the number
of clusters can be regarded as an interesting and important summary statistic of a
data set. Therefore, it is desirable to estimate this number even before a clustering
algorithm is used to derive detailed clusters.
Measuring clustering quality. After applying a clustering method on a data set, we
want to assess how good the resulting clusters are. A number of measures can be used.
Some methods measure how well the clusters fit the data set, while others measure
how well the clusters match the ground truth, if such truth is available. There are also
measures that score clusterings and thus can compare two sets of clustering results
on the same data set.

Next Chapter

Data Mining Recent Trends and Research Frontiers

Excerpts From the Book

Data Mining Concepts and Techniques
Third Edition
Jiawei Han
University of Illinois at Urbana–Champaign
Micheline Kamber, Jian Pei
Simon Fraser University

Data Mining - Classification: Advanced Methods

1 Bayesian Belief Networks

Concepts and Mechanisms
The na¨ıve Bayesian classifier makes the assumption of class conditional independence,
that is, given the class label of a tuple, the values of the attributes are assumed to
be conditionally independent of one another. This simplifies computation. When the
assumption holds true, then the naïve Bayesian classifier is the most accurate in comparison
with all other classifiers. In practice, however, dependencies can exist between
variables. Bayesian belief networks specify joint conditional probability distributions.
They allow class conditional independencies to be defined between subsets of variables.
They provide a graphical model of causal relationships, on which learning can be performed.
Trained Bayesian belief networks can be used for classification. Bayesian belief
networks are also known as belief networks, Bayesian networks, and probabilistic

” In the learning or training of a belief network,
a number of scenarios are possible. The network topology (or “layout” of nodes
and arcs) may be constructed by human experts or inferred from the data. The network
variables may be observable or hidden in all or some of the training tuples. The hidden
data case is also referred to as missing values or incomplete data.
Several algorithms exist for learning the network topology from the training data
given observable variables. The problem is one of discrete optimization.

If the network topology is known and the variables are observable, then training the
network is straightforward. It consists of computing the CPT entries, as is similarly done
when computing the probabilities involved in na¨ıve Bayesian classification.
When the network topology is given and some of the variables are hidden, there
are various methods to choose from for training the belief network.

A gradient descent strategy is used to search for the wijk values that best model the
data, based on the assumption that each possible setting of wijk is equally likely. Such
a strategy is iterative. It searches for a solution along the negative of the gradient (i.e.,
steepest descent) of a criterion function. We want to find the set of weights, W, that
maximize this function. To start with, the weights are initialized to random probability
values. The gradient descent method performs greedy hill-climbing in that, at each
iteration or step along the way, the algorithm moves toward what appears to be the
best solution at the moment, without backtracking. The weights are updated at each
iteration. Eventually, they converge to a local optimum solution.

2 Classification by Backpropagation

“What is backpropagation?” Backpropagation is a neural network learning algorithm.

A Multilayer Feed-Forward Neural Network
The backpropagation algorithm performs learning on a multilayer feed-forward neural
network. It iteratively learns a set of weights for prediction of the class label of tuples.
A multilayer feed-forward neural network consists of an input layer, one or more hidden
layers, and an output layer.

The units in the input layer are called input units. The units in the hidden layers and
output layer are sometimes referred to as neurodes, due to their symbolic biological
basis, or as output units.

Defining a Network Topology
“How can I design the neural network’s topology?” Before training can begin, the user
must decide on the network topology by specifying the number of units in the input
layer, the number of hidden layers (if more than one), the number of units in each
hidden layer, and the number of units in the output layer.
Normalizing the input values for each attribute measured in the training tuples will
help speed up the learning phase. Typically, input values are normalized so as to fall
between 0.0 and 1.0. Discrete-valued attributes may be encoded such that there is one
input unit per domain value. For example, if an attribute A has three possible or known
values, namely {a0, a1, a2}, then we may assign three input units to represent A. That
is, we may have, say, I0, I1, I2 as input units. Each unit is initialized to 0. If A = a0, then
I0 is set to 1 and the rest are 0. If A = a1, then I1 is set to 1 and the rest are 0, and
so on.
Neural networks can be used for both classification (to predict the class label of a
given tuple) and numeric prediction (to predict a continuous-valued output). For classification,
one output unit may be used to represent two classes (where the value 1
represents one class, and the value 0 represents the other).

“How does backpropagation work?” Backpropagation learns by iteratively processing a
data set of training tuples, comparing the network’s prediction for each tuple with the
actual known target value. The target value may be the known class label of the training
tuple (for classification problems) or a continuous value (for numeric prediction). For
each training tuple, the weights are modified so as to minimize the mean-squared error
between the network’s prediction and the actual target value. These modifications are
made in the “backwards” direction (i.e., from the output layer) through each hidden
layer down to the first hidden layer (hence the name backpropagation). Although it is
not guaranteed, in general the weights will eventually converge, and the learning process

Inside the Black Box: Backpropagation and Interpretability

Methods include extracting rules from networks and sensitivity
Various algorithms for rule extraction have been proposed. The methods typically
impose restrictions regarding procedures used in training the given neural network, the
network topology, and the discretization of input values.
Fully connected networks are difficult to articulate. Hence, often the first step in
extracting rules from neural networks is network pruning. This consists of simplifying
the network structure by removing weighted links that have the least effect on the trained
network. For example, a weighted link may be deleted if such removal does not result in
a decrease in the classification accuracy of the network.
Once the trained network has been pruned, some approaches will then perform link,
unit, or activation value clustering. In one method, for example, clustering is used to
find the set of common activation values for each hidden unit in a given trained twolayer
neural network.

3 Support Vector Machines

 SVM is an algorithm that
works as follows. It uses a nonlinear mapping to transform the original training data
into a higher dimension. Within this new dimension, it searches for the linear optimal
separating hyperplane (i.e., a “decision boundary” separating the tuples of one class
from another). With an appropriate nonlinear mapping to a sufficiently high dimension,
data from two classes can always be separated by a hyperplane. The SVM finds this
hyperplane using support vectors (“essential” training tuples) and margins (defined by
the support vectors). We will delve more into these new concepts later.

The first
paper on support vector machines was presented in 1992 by Vladimir Vapnik and colleagues
Bernhard Boser and Isabelle Guyon, although the groundwork for SVMs has
been around since the 1960s (including early work by Vapnik and Alexei Chervonenkis
on statistical learning theory). Although the training time of even the fastest SVMs
can be extremely slow, they are highly accurate, owing to their ability to model complex
nonlinear decision boundaries. They are much less prone to overfitting than other
methods. The support vectors found also provide a compact description of the learned
model. SVMs can be used for numeric prediction as well as classification. They have
been applied to a number of areas, including handwritten digit recognition, object
recognition, and speaker identification, as well as benchmark time-series prediction

4 Classification Using Frequent Patterns

Associative Classification

rules are mined in a two-step process consisting of frequent itemset mining followed by
rule generation. The first step searches for patterns of attribute–value pairs that occur
repeatedly in a data set, where each attribute–value pair is considered an item. The
resulting attribute–value pairs form frequent itemsets (also referred to as frequent patterns).
The second step analyzes the frequent itemsets to generate association rules. All
association rules must satisfy certain criteria regarding their “accuracy” (or confidence)
and the proportion of the data set that they actually represent (referred to as support).

In general, associative classification consists of the following steps:
1. Mine the data for frequent itemsets, that is, find commonly occurring attribute–value
pairs in the data.
2. Analyze the frequent itemsets to generate association rules per class, which satisfy
confidence and support criteria.
3. Organize the rules to form a rule-based classifier.

Discriminative Frequent Pattern–Based Classification

The general framework for discriminative frequent pattern–based classification is as follows.
1. Feature generation: The data, D, are partitioned according to class label. Use frequent
itemset mining to discover frequent patterns in each partition, satisfying
minimum support. The collection of frequent patterns, F, makes up the feature
2. Feature selection: Apply feature selection to F, resulting in FS, the set of selected
(more discriminating) frequent patterns. Information gain, Fisher score, or other
evaluation measures can be used for this step. Relevancy checking can also be
incorporated into this step to weed out redundant patterns. The data set D is transformed
to D0, where the feature space now includes the single features as well as the
selected frequent patterns, FS.
3. Learning of classification model: A classifier is built on the data set D
0. Any learning algorithm can be used as the classification model.

5 Lazy Learners (or Learning from Your Neighbors)

k-Nearest-Neighbor Classifiers
The k-nearest-neighbor method was first described in the early 1950s. The method is
labor intensive when given large training sets, and did not gain popularity until the
1960s when increased computing power became available. It has since been widely used
in the area of pattern recognition.
Nearest-neighbor classifiers are based on learning by analogy, that is, by comparing
a given test tuple with training tuples that are similar to it. The training tuples are
described by n attributes. Each tuple represents a point in an n-dimensional space. In
this way, all the training tuples are stored in an n-dimensional pattern space. When given
an unknown tuple, a k-nearest-neighbor classifier searches the pattern space for the k
training tuples that are closest to the unknown tuple. These k training tuples are the k
“nearest neighbors” of the unknown tuple.

Case-Based Reasoning
Case-based reasoning (CBR) classifiers use a database of problem solutions to solve
new problems. Unlike nearest-neighbor classifiers, which store training tuples as points
in Euclidean space, CBR stores the tuples or “cases” for problem solving as complex
symbolic descriptions. Business applications of CBR include problem resolution for
customer service help desks, where cases describe product-related diagnostic problems.
CBR has also been applied to areas such as engineering and law, where cases are either
technical designs or legal rulings, respectively. Medical education is another area for
CBR, where patient case histories and treatments are used to help diagnose and treat
new patients.

6 Other Classification Methods

Genetic Algorithms
Genetic algorithms attempt to incorporate ideas of natural evolution. In general,
genetic learning starts as follows. An initial population is created consisting of randomly
generated rules. Each rule can be represented by a string of bits. As a simple example,
suppose that samples in a given training set are described by two Boolean attributes,
A1 and A2, and that there are two classes, C1 and C2. The rule “IF A1 AND NOT A2
THEN C2” can be encoded as the bit string “100,” where the two leftmost bits represent
attributes A1 and A2, respectively, and the rightmost bit represents the class. Similarly,
the rule “IF NOT A1 AND NOT A2 THEN C1” can be encoded as “001.” If an attribute
has k values, where k > 2, then k bits may be used to encode the attribute’s values.
Classes can be encoded in a similar fashion.
Based on the notion of survival of the fittest, a new population is formed to consist
of the fittest rules in the current population, as well as offspring of these rules. Typically,
the fitness of a rule is assessed by its classification accuracy on a set of training samples.
Offspring are created by applying genetic operators such as crossover and mutation.
In crossover, substrings from pairs of rules are swapped to form new pairs of rules. In
mutation, randomly selected bits in a rule’s string are inverted.
The process of generating new populations based on prior populations of rules continues
until a population, P, evolves where each rule in P satisfies a prespecified fitness

Genetic algorithms are easily parallelizable and have been used for classification as
well as other optimization problems. In data mining, they may be used to evaluate the
fitness of other algorithms.

Rough Set Approach
Rough set theory can be used for classification to discover structural relationships within
imprecise or noisy data. It applies to discrete-valued attributes. Continuous-valued
attributes must therefore be discretized before its use.

Rough set theory is based on the establishment of equivalence classes within the
given training data. All the data tuples forming an equivalence class are indiscernible,
that is, the samples are identical with respect to the attributes describing the data. Given
real-world data, it is common that some classes cannot be distinguished in terms of the
available attributes. Rough sets can be used to approximately or “roughly” define such
classes. A rough set definition for a given class, C, is approximated by two sets—a lower
approximation of C and an upper approximation of C. The lower approximation of C
consists of all the data tuples that, based on the knowledge of the attributes, are certain to
belong to C without ambiguity. The upper approximation of C consists of all the tuples
that, based on the knowledge of the attributes, cannot be described as not belonging to

Rough sets can also be used for attribute subset selection (or feature reduction, where
attributes that do not contribute to the classification of the given training data can be
identified and removed) and relevance analysis (where the contribution or significance
of each attribute is assessed with respect to the classification task). The problem of finding
the minimal subsets (reducts) of attributes that can describe all the concepts in
the given data set is NP-hard. However, algorithms to reduce the computation intensity
have been proposed. In one method, for example, a discernibility matrix is used that
stores the differences between attribute values for each pair of data tuples. Rather than

searching on the entire training set, the matrix is instead searched to detect redundant

Fuzzy Set Approaches

Fuzzy set theory is also known as possibility theory. It was proposed by Lotfi Zadeh
in 1965 as an alternative to traditional two-value logic and probability theory. It lets
us work at a high abstraction level and offers a means for dealing with imprecise data
measurement. Most important, fuzzy set theory allows us to deal with vague or inexact
facts. For example, being a member of a set of high incomes is inexact (e.g., if $50,000
is high, then what about $49,000? or $48,000?) Unlike the notion of traditional “crisp”
sets where an element belongs to either a set S or its complement, in fuzzy set theory,
elements can belong to more than one fuzzy set. For example, the income value $49,000
belongs to both the medium and high fuzzy sets, but to differing degrees. Using fuzzy set
notation, membership of a fuzzyset can be described like mmedium income($49,000) = 0.15 and mhigh income($49,000) = 0.96,
where m denotes the membership function, that is operating on the fuzzy sets of
medium income and high income, respectively. In fuzzy set theory, membership values
for a given element, x (e.g., for $49,000), do not have to sum to 1. This is unlike
traditional probability theory, which is constrained by a summation axiom.

Given a tuple to classify, more than one fuzzy rule may apply. Each applicable rule
contributes a vote for membership in the categories. Typically, the truth values for each
predicted category are summed, and these sums are combined. Several procedures exist
for translating the resulting fuzzy output into a defuzzified or crisp value that is returned
by the system.
Fuzzy logic systems have been used in numerous areas for classification, including
market research, finance, health care, and environmental engineering.

7 Additional Topics Regarding Classification

Multiclass Classification
Some classification algorithms, such as support vector machines, are designed for binary
classification. How can we extend these algorithms to allow for multiclass classification
(i.e., classification involving more than two classes)?
A simple approach is one-versus-all (OVA). Given m classes, we train m binary classifiers,
one for each class. Classifier j is trained using tuples of class j as the positive class,
and the remaining tuples as the negative class. It learns to return a positive value for class
j and a negative value for the rest. To classify an unknown tuple, X, the set of classifiers
vote as an ensemble. For example, if classifier j predicts the positive class for X, then
class j gets one vote. If it predicts the negative class for X, then each of the classes except
j gets one vote. The class with the most votes is assigned to X.

Semi-Supervised Classification
Semi-supervised classification uses labeled data and unlabeled data to build a classifier.
Let Xl = {(x1, y1),...,xl
, yl)} be the set of labeled data and Xu = {xl+1,...,xn} be the set
of unlabeled data. Here we describe a few examples of this approach for learning.
Self-training is the simplest form of semi-supervised classification. It first builds a
classifier using the labeled data. The classifier then tries to label the unlabeled data. The
tuple with the most confident label prediction is added to the set of labeled data, and the
process repeats (Figure 9.17). Although the method is easy to understand, a disadvantage
is that it may reinforce errors.
Cotraining is another form of semi-supervised classification, where two or more
classifiers teach each other. Each learner uses a different and ideally independent set
of features for each tuple. Consider web page data, for example, where attributes relating
to the images on the page may be used as one set of features, while attributes relating
to the corresponding text constitute another set of features for the same data.

Active Learning
Active learning is an iterative type of supervised learning that is suitable for situations
where data are abundant, yet the class labels are scarce or expensive to obtain. The learning
algorithm is active in that it can purposefully query a user (e.g., a human oracle) for
labels. The number of tuples used to learn a concept this way is often much smaller than
the number required in typical supervised learning.

Transfer Learning
Transfer learning aims to extract the knowledge from one or more source tasks and
apply the knowledge to a target task.

Transfer learning allows the distributions, tasks, and even the data domains used in
training and testing to be different. Transfer learning is analogous to the way humans
may apply their knowledge of a task to facilitate the learning of another task. For example,
if we know how to play the recorder, we may apply our knowledge of note reading
and music to simplify the task of learning to play the piano. Similarly, knowing Spanish
may make it easier to learn Italian.

Next Chapter

Data Mining Recent Trends and Research Frontiers

Data Mining - Classification: Basic Concepts

Content to be rewritten to integrate various points.

1 Basic Concepts

What Is Classification?

Classification and numeric prediction are the two major types of prediction

General Approach to Classification
“How does classification work?” Data classification is a two-step process, consisting of a
learning step (where a classification model is constructed) and a classification step (where
the model is used to predict class labels for given data).

In the first step, a classifier is built describing a predetermined set of data classes or
concepts. This is the learning step (or training phase), where a classification algorithm
builds the classifier by analyzing or “learning from” a training set made up of database
tuples and their associated class labels. A tuple, X, is represented by an n-dimensional
attribute vector, X = (x1, x2,..., xn), depicting n measurements made on the tuple
from n database attributes, respectively, A1, A2,..., An.
1 Each tuple, X, is assumed to
belong to a predefined class as determined by another database attribute called the class
label attribute. The class label attribute is discrete-valued and unordered. It is categorical
(or nominal) in that each value serves as a category or class. The individual tuples
making up the training set are referred to as training tuples and are randomly sampled
from the database under analysis. In the context of classification, data tuples can be
referred to as samples, examples, instances, data points, or objects.
Because the class label of each training tuple is provided, this step is also known as
supervised learning (i.e., the learning of the classifier is “supervised” in that it is told
to which class each training tuple belongs). It contrasts with unsupervised learning (or
clustering), in which the class label of each training tuple is not known, and the number
or set of classes to be learned may not be known in advance.

The accuracy of a classifier on a given test set is the percentage of test set tuples that
are correctly classified by the classifier. The associated class label of each test tuple is compared
with the learned classifier’s class prediction for that tuple

2 Decision Tree Induction

Decision tree induction is the learning of decision trees from class-labeled training
tuples. A decision tree is a flowchart-like tree structure, where each internal node (nonleaf
node) denotes a test on an attribute, each branch represents an outcome of the
test, and each leaf node (or terminal node) holds a class label. The topmost node in
a tree is the root node.

“How are decision trees used for classification?” Given a tuple, X, for which the associated
class label is unknown, the attribute values of the tuple are tested against the
decision tree. A path is traced from the root to a leaf node, which holds the class
prediction for that tuple. Decision trees can easily be converted to classification rules.
“Why are decision tree classifiers so popular?” The construction of decision tree classifiers
does not require any domain knowledge or parameter setting, and therefore is
appropriate for exploratory knowledge discovery. Decision trees can handle multidimensional
data. Their representation of acquired knowledge in tree form is intuitive and
generally easy to assimilate by humans. The learning and classification steps of decision
tree induction are simple and fast. In general, decision tree classifiers have good accuracy.
However, successful use may depend on the data at hand. Decision tree induction
algorithms have been used for classification in many application areas such as medicine,
manufacturing and production, financial analysis, astronomy, and molecular biology.
Decision trees are the basis of several commercial rule induction systems

Basic algorithm for inducing a decision tree from training tuples.

The algorithm is called with three parameters: D, attribute list, and Attribute
selection method. We refer to D as a data partition. Initially, it is the complete set
of training tuples and their associated class labels. The parameter attribute list is a
list of attributes describing the tuples. Attribute selection method specifies a heuristic
procedure for selecting the attribute that “best” discriminates the given tuples
according to class. This procedure employs an attribute selection measure such as
information gain or the Gini index. Whether the tree is strictly binary is generally
driven by the attribute selection measure. Some attribute selection measures, such as
the Gini index, enforce the resulting tree to be binary. Others, like information gain,
do not, therein allowing multiway splits (i.e., two or more branches to be grown from
a node).
The tree starts as a single node, N, representing the training tuples in D (step 1).3

Algorithm: Generate decision tree. Generate a decision tree from the training tuples of
data partition, D.
Data partition, D, which is a set of training tuples and their associated class labels;
attribute list, the set of candidate attributes;
Attribute selection method, a procedure to determine the splitting criterion that “best”
partitions the data tuples into individual classes. This criterion consists of a
splitting attribute and, possibly, either a split-point or splitting subset.
Output: A decision tree.
(1) create a node N;
(2) if tuples in D are all of the same class, C, then
(3) return N as a leaf node labeled with the class C;
(4) if attribute list is empty then
(5) return N as a leaf node labeled with the majority class in D; // majority voting
(6) apply Attribute selection method(D, attribute list) to find the “best” splitting criterion;
(7) label node N with splitting criterion;
(8) if splitting attribute is discrete-valued and
multiway splits allowed then // not restricted to binary trees
(9) attribute list ← attribute list − splitting attribute; // remove splitting attribute
(10) for each outcome j of splitting criterion
// partition the tuples and grow subtrees for each partition
(11) let Dj be the set of data tuples in D satisfying outcome j; // a partition
(12) if Dj
is empty then
(13) attach a leaf labeled with the majority class in D to node N;
(14) else attach the node returned by Generate decision tree(Dj
, attribute list) to node N;
(15) return N;
Basic algorithm for inducing a decision tree from training tuples.

Attribute Selection Measures
An attribute selection measure is a heuristic for selecting the splitting criterion that
“best” separates a given data partition, D, of class-labeled training tuples into individual
classes. If we were to split D into smaller partitions according to the outcomes of the
splitting criterion, ideally each partition would be pure (i.e., all the tuples that fall into a
given partition would belong to the same class). Conceptually, the “best” splitting criterion
is the one that most closely results in such a scenario. Attribute selection measures
are also known as splitting rules because they determine how the tuples at a given node
are to be split.

Information Gain
ID3 uses information gain as its attribute selection measure. This measure is based on
pioneering work by Claude Shannon on information theory, which studied the value or
“information content” of messages.

The gain ratio is defined as
GainRatio(A) =  Gain(A)/SplitInfoA (D)

The attribute with the maximum gain ratio is selected as the splitting attribute. Note,
however, that as the split information approaches 0, the ratio becomes unstable. A constraint
is added to avoid this, whereby the information gain of the test selected must be
large—at least as great as the average gain over all tests examined.

Gini Index
The Gini index is used in CART. Using the notation previously described, the Gini index
measures the impurity of D, a data partition or set of training tuples, as

Tree Pruning
When a decision tree is built, many of the branches will reflect anomalies in the training
data due to noise or outliers. Tree pruning methods address this problem of overfitting
the data. Such methods typically use statistical measures to remove the least-reliable

Scalability and Decision Tree Induction
“What if D, the disk-resident training set of class-labeled tuples, does not fit in memory? In
other words, how scalable is decision tree induction?” The efficiency of existing decision
tree algorithms, such as ID3, C4.5, and CART, has been well established for relatively
small data sets. Efficiency becomes an issue of concern when these algorithms are applied
to the mining of very large real-world databases. The pioneering decision tree algorithms
that we have discussed so far have the restriction that the training tuples should reside
in memory.
In data mining applications, very large training sets of millions of tuples are common

Several scalable decision tree induction methods have been introduced in recent studies.
RainForest, for example, adapts to the amount of main memory available and applies
to any decision tree induction algorithm. The method maintains an AVC-set (where
“AVC” stands for “Attribute-Value, Classlabel”) for each attribute, at each tree node,
describing the training tuples at the node.

BOAT (Bootstrapped Optimistic Algorithm for Tree construction) is a decision tree
algorithm that takes a completely different approach to scalability—it is not based on
the use of any special data structures. Instead, it uses a statistical technique known as
“bootstrapping” (Section 8.5.4) to create several smaller samples (or subsets) of the
given training data, each of which fits in memory. Each subset is used to construct a
tree, resulting in several trees. The trees are examined and used to construct a new tree,
T', that turns out to be “very close” to the tree that would have been generated if all the
original training data had fit in memory.

Visual Mining for Decision Tree Induction
“Are there any interactive approaches to decision tree induction that allow us to visualize
the data and the tree as it is being constructed? Can we use any knowledge of our
data to help in building the tree?”
Perception-based classification
(PBC) is an interactive approach based on multidimensional visualization techniques
and allows the user to incorporate background knowledge about the data when building
a decision tree. By visually interacting with the data, the user is also likely to develop a
deeper understanding of the data. The resulting trees tend to be smaller than those built
using traditional decision tree induction methods and so are easier to interpret, while
achieving about the same accuracy

3 Bayes Classification Methods

“What are Bayesian classifiers?” Bayesian classifiers are statistical classifiers. They can
predict class membership probabilities such as the probability that a given tuple belongs
to a particular class.

Naive Bayesian Classification

4 Rule-Based Classification

Using IF-THEN Rules for Classification
Rules are a good way of representing information or bits of knowledge. A rule-based
classifier uses a set of IF-THEN rules for classification. An IF-THEN rule is an expression
of the form
IF condition THEN conclusion.

That is, a rule’s coverage is the percentage of tuples that are covered by the rule (i.e., their
attribute values hold true for the rule’s antecedent). For a rule’s accuracy, we look at the
tuples that it covers and see what percentage of them the rule can correctly classify.

The size ordering scheme assigns the highest priority to the triggering rule that has
the “toughest” requirements, where toughness is measured by the rule antecedent size.
That is, the triggering rule with the most attribute tests is fired.
The rule ordering scheme prioritizes the rules beforehand. The ordering may be
class-based or rule-based. With class-based ordering, the classes are sorted in order of
decreasing “importance” such as by decreasing order of prevalence. That is, all the rules
for the most prevalent (or most frequent) class come first, the rules for the next prevalent
class come next, and so on. Alternatively, they may be sorted based on the misclassification
cost per class. Within each class, the rules are not ordered—they don’t have to be
because they all predict the same class (and so there can be no class conflict!).
With rule-based ordering, the rules are organized into one long priority list, according
to some measure of rule quality, such as accuracy, coverage, or size (number of
attribute tests in the rule antecedent), or based on advice from domain experts. When
rule ordering is used, the rule set is known as a decision list. With rule ordering, the triggering
rule that appears earliest in the list has the highest priority, and so it gets to fire its
class prediction. Any other rule that satisfies X is ignored. Most rule-based classification
systems use a class-based rule-ordering strategy.

Rule Extraction from a Decision Tree
To extract rules from a decision tree, one rule is created for each path from the root
to a leaf node. Each splitting criterion along a given path is logically ANDed to form the
rule antecedent (“IF” part). The leaf node holds the class prediction, forming the rule
consequent (“THEN” part).

Rule Induction Using a Sequential Covering Algorithm
IF-THEN rules can be extracted directly from the training data (i.e., without having to
generate a decision tree first) using a sequential covering algorithm. The name comes
from the notion that the rules are learned sequentially (one at a time), where each rule
for a given class will ideally cover many of the class’s tuples (and hopefully none of
the tuples of other classes). Sequential covering algorithms are the most widely used
approach to mining disjunctive sets of classification rules, and form the topic of this

5 Model Evaluation and Selection

 Metrics for Evaluating Classifier Performance

some terminology.
Positive tuples are tuples of the main class of interest and negative tuples (all other tuples). Given two classes, for
example, the positive tuples may be computer buyers while the negative tuples are non-buyers of computers

There are four additional terms  that are the “building blocks” used in computing many evaluation measures.

True positives (TP): These refer to the positive tuples that were correctly labeled by
the classifier. Let TP be the number of true positives.
True negatives(TN): These are the negative tuples that were correctly labeled by the
classifier. Let TN be the number of true negatives.
False positives (FP): These are the negative tuples that were incorrectly labeled as
positive (e.g., tuples of class buys computer = no for which the classifier predicted
buys computer = yes). Let FP be the number of false positives.
False negatives (FN): These are the positive tuples that were mislabeled as negative
(e.g., tuples of class buys computer = yes for which the classifier predicted
buys computer = no). Let FN be the number of false negatives.

The confusion matrix is a useful tool for analyzing how well your classifier can
recognize tuples of different classes. TP and TN tell us when the classifier is getting
things right, while FP and FN tell us when the classifier is getting things wrong.

Accuracy =  [TP + TN]/[P + N]

In the pattern recognition literature, this is also referred to as the overall recognition
rate of the classifier, that is, it reflects how well the classifier recognizes tuples of the various

error rate or misclassification rate of a classifier, M, which
is simply 1 − accuracy(M), where accuracy(M) is the accuracy of M. This also can be
computed as
error rate = [FP + FN]/[P + N]

Sensitivity is also referred to as the true positive (recognition) rate (i.e., the proportion
of positive tuples that are correctly identified), while specificity is the true negative rate
(i.e., the proportion of negative tuples that are correctly identified). These measures are
defined as
sensitivity = TP/ P

specificity = TN/N

It can be shown that accuracy is a function of sensitivity and specificity:
accuracy = sensitivity*P/(P + N)   +  specificity*N(P + N)

In addition to accuracy-based measures, classifiers can also be compared with respect
to the following additional aspects:
Speed: This refers to the computational costs involved in generating and using the
given classifier.
Robustness: This is the ability of the classifier to make correct predictions given noisy
data or data with missing values. Robustness is typically assessed with a series of
synthetic data sets representing increasing degrees of noise and missing values.
Scalability: This refers to the ability to construct the classifier efficiently given large
amounts of data. Scalability is typically assessed with a series of data sets of increasing
Interpretability: This refers to the level of understanding and insight that is provided
by the classifier or predictor. Interpretability is subjective and therefore more difficult
to assess. Decision trees and classification rules can be easy to interpret, yet their
interpretability may diminish the more they become complex

Holdout Method and Random Subsampling
The holdout method is what we have alluded to so far in our discussions about accuracy.
In this method, the given data are randomly partitioned into two independent sets, a
training set and a test set. Typically, two-thirds of the data are allocated to the training
set, and the remaining one-third is allocated to the test set. The training set is used to
derive the model. The model’s accuracy is then estimated with the test set.

In k-fold cross-validation, the initial data are randomly partitioned into k mutually
exclusive subsets or “folds,” D1, D2,..., Dk
, each of approximately equal size. Training
and testing is performed k times. In iteration i, partition Di
is reserved as the test set,
and the remaining partitions are collectively used to train the model. That is, in the
first iteration, subsets D2,..., Dk collectively serve as the training set to obtain a first
model, which is tested on D1; the second iteration is trained on subsets D1, D3,..., Dk
and tested on D2; and so on. Unlike the holdout and random subsampling methods,
here each sample is used the same number of times for training and once for testing. For
classification, the accuracy estimate is the overall number of correct classifications from
the k iterations, divided by the total number of tuples in the initial data.

Unlike the accuracy estimation methods just mentioned, the bootstrap method samples
the given training tuples uniformly with replacement. That is, each time a tuple is
selected, it is equally likely to be selected again and re-added to the training set. For
instance, imagine a machine that randomly selects tuples for our training set. In sampling
with replacement, the machine is allowed to select the same tuple more than once.
There are several bootstrap methods. A commonly used one is the .632 bootstrap,
which works as follows. Suppose we are given a data set of d tuples. The data set is
sampled d times, with replacement, resulting in a bootstrap sample or training set of d
samples. It is very likely that some of the original data tuples will occur more than once
in this sample. The data tuples that did not make it into the training set end up forming
the test set. Suppose we were to try this out several times. As it turns out, on average,
63.2% of the original data tuples will end up in the bootstrap sample, and the remaining
36.8% will form the test set (hence, the name, .632 bootstrap).

Model Selection Using Statistical Tests of Significance

To determine whether Model 1 and Model 2 are significantly different, we compute t and select
a significance level, sig. In practice, a significance level of 5% or 1% is typically used.

Comparing Classifiers Based on Cost–Benefit
and ROC Curves
The true positives, true negatives, false positives, and false negatives are also useful in
assessing the costs and benefits (or risks and gains) associated with a classification

model. The cost associated with a false negative (such as incorrectly predicting that a
cancerous patient is not cancerous) is far greater than those of a false positive
(incorrectly yet conservatively labeling a noncancerous patient as cancerous). In such
cases, we can outweigh one type of error over another by assigning a different cost to
each. These costs may consider the danger to the patient, financial costs of resulting
therapies, and other hospital costs. Similarly, the benefits associated with a true positive
decision may be different than those of a true negative. Up to now, to compute classifier
accuracy, we have assumed equal costs and essentially divided the sum of true positives
and true negatives by the total number of test tuples.

6 Techniques to Improve Classification Accuracy

Traditional learning models assume that the data classes are well distributed. In
many real-world data domains, however, the data are class-imbalanced, where the
main class of interest is represented by only a few tuples. This is known as the class
imbalance problem. We also study techniques for improving the classification accuracy
of class-imbalanced data.

Introducing Ensemble Methods
Bagging, boosting, and random forests are examples of ensemble methods (Figure 8.21).
An ensemble combines a series of k learned models (or base classifiers), M1, M2,..., Mk
with the aim of creating an improved composite classification model, M∗. A given data
set, D, is used to create k training sets, D1, D2,..., Dk
, where Di (1 ≤ i ≤ k − 1) is used
to generate classifier Mi
. Given a new data tuple to classify, the base classifiers each vote
by returning a class prediction. The ensemble returns a class prediction based on the
votes of the base classifiers.
An ensemble tends to be more accurate than its base classifiers. For example, consider
an ensemble that performs majority voting. That is, given a tuple X to classify, it
collects the class label predictions returned from the base classifiers and outputs the class
in majority. The base classifiers may make mistakes, but the ensemble will misclassify X
only if over half of the base classifiers are in error. Ensembles yield better results when
there is significant diversity among the models. That is, ideally, there is little correlation
among classifiers. The classifiers should also perform better than random guessing.
Each base classifier can be allocated to a different CPU and so ensemble methods are


Given a set, D, of d tuples, bagging works as follows. For iteration i(i = 1, 2,..., k),
a training set, Di
, of d tuples is sampled with replacement from the original set of
tuples, D. Note that the term bagging stands for bootstrap aggregation. Each training
set is a bootstrap sample, as described in Section 8.5.4. Because sampling with replacement
is used, some of the original tuples of D may not be included in Di
, whereas others
may occur more than once. A classifier model, Mi
, is learned for each training set, Di
To classify an unknown tuple, X, each classifier, Mi
, returns its class prediction, which
counts as one vote. The bagged classifier, M∗, counts the votes and assigns the class
with the most votes to X. Bagging can be applied to the prediction of continuous values
by taking the average value of each prediction for a given test tuple. The algorithm is
summarized in Figure 8.23.
The bagged classifier often has significantly greater accuracy than a single classifier
derived from D, the original training data.

Boosting and AdaBoost

In boosting, weights are also assigned to each training tuple. A series of k classifiers is
iteratively learned. After a classifier, Mi
, is learned, the weights are updated to allow the
subsequent classifier, Mi+1, to “pay more attention” to the training tuples that were misclassified
by Mi
. The final boosted classifier, M∗, combines the votes of each individual
classifier, where the weight of each classifier’s vote is a function of its accuracy.
AdaBoost (short for Adaptive Boosting) is a popular boosting algorithm. Suppose
we want to boost the accuracy of a learning method. We are given D, a data set of
d class-labeled tuples, (X1, y1),(X2, y2),...,(Xd, yd), where yi
is the class label of tuple
. Initially, AdaBoost assigns each training tuple an equal weight of 1/d. Generating
k classifiers for the ensemble requires k rounds through the rest of the algorithm. In
round i, the tuples from D are sampled to form a training set, Di
, of size d. Sampling with replacement is used—the same tuple may be selected more than once. Each tuple’s
chance of being selected is based on its weight. A classifier model, Mi
, is derived from
the training tuples of Di
. Its error is then calculated using Di as a test set. The weights of
the training tuples are then adjusted according to how they were classified.
If a tuple was incorrectly classified, its weight is increased. If a tuple was correctly
classified, its weight is decreased.

Random Forests
Random forests can be built using bagging (Section 8.6.2) in tandem with random
attribute selection. A training set, D, of d tuples is given. The general procedure to generate
k decision trees for the ensemble is as follows. For each iteration, i(i = 1, 2,..., k),
a training set, Di
, of d tuples is sampled with replacement from D. That is, each Di
is a
bootstrap sample of D (Section 8.5.4), so that some tuples may occur more than once
in Di
, while others may be excluded. Let F be the number of attributes to be used to
determine the split at each node, where F is much smaller than the number of available
attributes. To construct a decision tree classifier, Mi
, randomly select, at each node,
F attributes as candidates for the split at the node. The CART methodology is used to
grow the trees. The trees are grown to maximum size and are not pruned. Random
forests formed this way, with random input selection, are called Forest-RI.

Another form of random forest, called Forest-RC, uses random linear combinations
of the input attributes.

Random forests are comparable in accuracy to AdaBoost, yet are more robust to
errors and outliers.

Improving Classification Accuracy of Class-Imbalanced Data
Given two-class data, the data are class-imbalanced if the main class of interest (the
positive class) is represented by only a few tuples, while the majority of tuples represent
the negative class. For multiclass-imbalanced data, the data distribution of each class
differs substantially where, again, the main class or classes of interest are rare. The
class imbalance problem is closely related to cost-sensitive learning, wherein the costs of
errors, per class, are not equal. In medical diagnosis, for example, it is much more costly
to falsely diagnose a cancerous patient as healthy (a false negative) than to misdiagnose
a healthy patient as having cancer (a false positive). A false negative error could lead to
the loss of life and therefore is much more expensive than a false positive error. Other
applications involving class-imbalanced data include fraud detection, the detection of
oil spills from satellite radar images, and fault monitoring.

General approaches for improving the classification accuracy of class-imbalanced data. These approaches include (1) oversampling, (2) undersampling,
(3) threshold moving, and (4) ensemble techniques.

Next Chapter:  Data Mining - Classification: Advanced Methods

Excerpts from the Book

Data Mining Concepts and Techniques

Third Edition
Jiawei Han
University of Illinois at Urbana–Champaign
Micheline Kamber, Jian Pei
Simon Fraser University