1 Mining Complex Data Types
Mining Sequence Data: Time-Series, Symbolic
Sequences, and Biological Sequences
A sequence is an ordered list of events. Sequences may be categorized into three groups,
based on the characteristics of the events they describe: (1) time-series data, (2) symbolic
sequence data, and (3) biological sequences. Let’s consider each type.
In time-series data, sequence data consist of long sequences of numeric data,
recorded at equal time intervals (e.g., per minute, per hour, or per day). Time-series
data can be generated by many natural and economic processes such as stock markets,
and scientific, medical, or natural observations.
Symbolic sequence data consist of long sequences of event or nominal data, which
typically are not observed at equal time intervals. For many such sequences, gaps (i.e.,
lapses between recorded events) do not matter much. Examples include customer shopping
sequences and web click streams, as well as sequences of events in science and
engineering and in natural and social developments.
Biological sequences include DNA and protein sequences. Such sequences are typically
very long, and carry important, complicated, but hidden semantic meaning. Here,
gaps are usually important.
2 Other Methodologies of Data MiningNot covered so far in this book
Statistical Data Mining
The data mining techniques described in this book are primarily drawn from computer
science disciplines, including data mining, machine learning, data warehousing, and
algorithms. They are designed for the efficient handling of huge amounts of data that are
typically multidimensional and possibly of various complex types. There are, however,
many well-established statistical techniques for data analysis, particularly for numeric
data. These techniques have been applied extensively to scientific data (e.g., data from
experiments in physics, engineering, manufacturing, psychology, and medicine), as well
as to data from economics and the social sciences. Some of these techniques, such as
principal components analysis and clustering, have already been addressed in this book.
Some more methods are:
Regression: In general, these methods are used to predict the value of a response
(dependent) variable from one or more predictor (independent) variables, where the
variables are numeric. There are various forms of regression, such as linear, multiple,
weighted, polynomial, nonparametric, and robust (robust methods are useful
when errors fail to satisfy normalcy conditions or when the data contain significant
Generalized linear models: These models, and their generalization (generalized additive
models), allow a categorical (nominal) response variable (or some transformation
of it) to be related to a set of predictor variables in a manner similar to the modeling
of a numeric response variable using linear regression. Generalized linear models
include logistic regression and Poisson regression.
Analysis of variance: These techniques analyze experimental data for two or more
populations described by a numeric response variable and one or more categorical
variables (factors). In general, an ANOVA (single-factor analysis of variance) problem
involves a comparison of k population or treatment means to determine if at least two
of the means are different. More complex ANOVA problems also exist.
Mixed-effect models: These models are for analyzing grouped data—data that can
be classified according to one or more grouping variables. They typically describe
relationships between a response variable and some covariates in data grouped
according to one or more factors. Common areas of application include multilevel
data, repeated measures data, block designs, and longitudinal data.
Factor analysis: This method is used to determine which variables are combined to
generate a given factor. For example, for many psychiatric data, it is not possible to
measure a certain factor of interest directly (e.g., intelligence); however, it is often
possible to measure other quantities (e.g., student test scores) that reflect the factor
of interest. Here, none of the variables is designated as dependent.
Discriminant analysis: This technique is used to predict a categorical response variable.
Unlike generalized linear models, it assumes that the independent variables
follow a multivariate normal distribution. The procedure attempts to determine
several discriminant functions (linear combinations of the independent variables)
that discriminate among the groups defined by the response variable. Discriminant
analysis is commonly used in social sciences.
Survival analysis: Several well-established statistical techniques exist for survival
analysis. These techniques originally were designed to predict the probability that
a patient undergoing a medical treatment would survive at least to time t. Methods
for survival analysis, however, are also commonly applied to manufacturing settings
to estimate the life span of industrial equipment. Popular methods include KaplanMeier
estimates of survival, Cox proportional hazards regression models, and their
Quality control: Various statistics can be used to prepare charts for quality control,
such as Shewhart charts and CUSUM charts (both of which display group summary
statistics). These statistics include the mean, standard deviation, range, count,
moving average, moving standard deviation, and moving range.
Visual and Audio Data Mining
Visual data mining discovers implicit and useful knowledge from large data sets using
data and/or knowledge visualization techniques. The human visual system is controlled
by the eyes and brain, the latter of which can be thought of as a powerful, highly parallel
processing and reasoning engine containing a large knowledge base. Visual data mining
essentially combines the power of these components, making it a highly attractive and
effective tool for the comprehension of data distributions, patterns, clusters, and outliers
Visual data mining can be viewed as an integration of two disciplines: data visualization
and data mining. It is also closely related to computer graphics, multimedia systems,
human–computer interaction, pattern recognition, and high-performance computing.
In general, data visualization and data mining can be integrated in the following ways:
Audio data mining uses audio signals to indicate the patterns of data or the features
of data mining results. Although visual data mining may disclose interesting patterns
using graphical displays, it requires users to concentrate on watching patterns and identifying
interesting or novel features within them. This can sometimes be quite tiresome.
If patterns can be transformed into sound and music, then instead of watching pictures,
we can listen to pitchs, rhythm, tune, and melody to identify anything interesting
3 Data Mining ApplicationsIn this book, we have studied principles and methods for mining relational data, data
warehouses, and complex data types. Because data mining is a relatively young discipline
with wide and diverse applications, there is still a nontrivial gap between general principles
of data mining and application-specific, effective data mining tools.
Data Mining for Financial Data Analysis
Most banks and financial institutions offer a wide variety of banking, investment, and
credit services (the latter include business, mortgage, and automobile loans and credit
cards). Some also offer insurance and stock investment services.
Financial data collected in the banking and financial industry are often relatively
complete, reliable, and of high quality, which facilitates systematic data analysis and data
mining. Here we present a few typical cases.
Design and construction of data warehouses for multidimensional data analysis
and data mining:
Loan payment prediction and customer credit policy analysis: Loan payment prediction
and customer credit analysis are critical to the business of a bank. Many
factors can strongly or weakly influence loan payment performance and customer
credit rating. Data mining methods, such as attribute selection and attribute relevance
ranking, may help identify important factors and eliminate irrelevant ones
Classification and clustering of customers for targeted marketing: Classification
and clustering methods can be used for customer group identification and targeted
Detection of money laundering and other financial crimes: To detect money laundering
and other financial crimes, it is important to integrate information from
multiple, heterogeneous databases (e.g., bank transaction databases and federal or
state crime history databases), as long as they are potentially related to the study.
Multiple data analysis tools can then be used to detect unusual patterns, such as large
amounts of cash flow at certain periods, by certain groups of customers. U
Data Mining for Retail and Telecommunication Industries
The retail industry is a well-fit application area for data mining, since it collects huge
amounts of data on sales, customer shopping history, goods transportation, consumption,
and service. The quantity of data collected continues to expand rapidly, especially
due to the increasing availability, ease, and popularity of business conducted on the Web,
or e-commerce. Today, most major chain stores also have web sites where customers
can make purchases online. Some businesses, such as Amazon.com (www.amazon.com),
exist solely online, without any brick-and-mortar (i.e., physical) store locations. Retail
data provide a rich source for data mining.
Retail data mining can help identify customer buying behaviors, discover customer
shopping patterns and trends, improve the quality of customer service, achieve better
customer retention and satisfaction, enhance goods consumption ratios, design more
effective goods transportation and distribution policies, and reduce the cost of business.
Data Mining in Science and Engineering
Today, scientific data can be amassed at much higher speeds and lower costs.
This has resulted in the accumulation of huge volumes of high-dimensional data,
stream data, and heterogenous data, containing rich spatial and temporal information.
Consequently, scientific applications are shifting from the “hypothesize-and-test”
paradigm toward a “collect and store data, mine for new hypotheses, confirm with data or
experimentation” process. This shift brings about new challenges for data mining.
5 Data Mining TrendsThe diversity of data, data mining tasks, and data mining approaches poses many challenging
research issues in data mining. The development of efficient and effective data
mining methods, systems and services, and interactive and integrated data mining environments
is a key area of study. The use of data mining techniques to solve large or
sophisticated application problems is an important task for data mining researchers
and data mining system and application developers.
Excerpts from the Book
Data Mining Concepts and Techniques
University of Illinois at Urbana–Champaign
Micheline Kamber, Jian Pei
Simon Fraser University