## 1 Mining Complex Data Types

Mining Sequence Data: Time-Series, Symbolic Sequences, and Biological Sequences

A sequence is an ordered list of events. Sequences may be categorized into three groups, based on the characteristics of the events they describe: (1) time-series data, (2) symbolic sequence data, and (3) biological sequences. Let’s consider each type.

Time-series data consist of long sequences of numeric data recorded at equal time intervals (e.g., per minute, per hour, or per day). Time-series data can be generated by many natural and economic processes, such as stock markets, and by scientific, medical, or natural observations.

Symbolic sequence data consist of long sequences of event or nominal data, which typically are not observed at equal time intervals. For many such sequences, gaps (i.e., lapses between recorded events) do not matter much. Examples include customer shopping sequences and web click streams, as well as sequences of events in science and engineering and in natural and social developments.

Biological sequences include DNA and protein sequences. Such sequences are typically very long, and carry important, complicated, but hidden semantic meaning. Here, gaps are usually important.
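The remark that gaps often do not matter in symbolic sequences can be illustrated with a small sketch. It checks whether a pattern of events occurs in a click stream in order, first allowing other events in between and then requiring the events to be adjacent. The function names and the click-stream data are made up for illustration; this is not a method taken from the book.

```python
def contains_with_gaps(sequence, pattern):
    """Return True if `pattern` occurs in `sequence` in order, gaps allowed."""
    remaining = iter(sequence)
    # `event in remaining` consumes the iterator up to the first match,
    # so later pattern events are only searched for after earlier ones.
    return all(event in remaining for event in pattern)


def contains_contiguous(sequence, pattern):
    """Return True if `pattern` occurs in `sequence` with no gaps."""
    n, m = len(sequence), len(pattern)
    return any(list(sequence[i:i + m]) == list(pattern) for i in range(n - m + 1))


# Hypothetical web click stream (illustrative data only).
clicks = ["home", "search", "laptop", "reviews", "cart", "checkout"]
pattern = ["search", "cart", "checkout"]

gapped = contains_with_gaps(clicks, pattern)
contiguous = contains_contiguous(clicks, pattern)
print(gapped, contiguous)
```

Here the pattern is found only when gaps are tolerated, which is the usual setting when mining shopping or click sequences.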

## 2 Other Methodologies of Data Mining

Not covered so far in this book

Statistical Data Mining

The data mining techniques described in this book are primarily drawn from computer science disciplines, including data mining, machine learning, data warehousing, and algorithms. They are designed for the efficient handling of huge amounts of data that are typically multidimensional and possibly of various complex types. There are, however, many well-established statistical techniques for data analysis, particularly for numeric data. These techniques have been applied extensively to scientific data (e.g., data from experiments in physics, engineering, manufacturing, psychology, and medicine), as well as to data from economics and the social sciences. Some of these techniques, such as principal components analysis and clustering, have already been addressed in this book.

Further statistical methods include:

Regression: In general, these methods are used to predict the value of a response (dependent) variable from one or more predictor (independent) variables, where the variables are numeric. There are various forms of regression, such as linear, multiple, weighted, polynomial, nonparametric, and robust (robust methods are useful when errors fail to satisfy normality conditions or when the data contain significant outliers).
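As a minimal sketch of the simplest case, the following fits a straight line with one predictor by ordinary least squares. The function name `fit_line` and the sample data are assumptions chosen for illustration; the other forms of regression listed above need more machinery than is shown here.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept (one predictor)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of cross-deviations and sum of squared deviations of x.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Illustrative data lying exactly on y = 2x + 1.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [3.0, 5.0, 7.0, 9.0, 11.0]

slope, intercept = fit_line(xs, ys)
predicted = slope * 6.0 + intercept  # predict the response at x = 6
print(slope, intercept, predicted)
```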

Generalized linear models: These models, and their generalization (generalized additive models), allow a categorical (nominal) response variable (or some transformation of it) to be related to a set of predictor variables in a manner similar to the modeling of a numeric response variable using linear regression. Generalized linear models include logistic regression and Poisson regression.

Analysis of variance: These techniques analyze experimental data for two or more populations described by a numeric response variable and one or more categorical variables (factors). In general, an ANOVA (single-factor analysis of variance) problem involves a comparison of k population or treatment means to determine if at least two of the means are different. More complex ANOVA problems also exist.
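The single-factor comparison of k means is usually carried out with an F statistic, the ratio of between-group to within-group variability. The sketch below computes that statistic for made-up data; the function name is an assumption, and the final step of comparing F against a critical value of the F distribution is not shown.

```python
def one_way_anova_f(groups):
    """Return the F statistic for a single-factor ANOVA over `groups`."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    means = [sum(g) / len(g) for g in groups]

    # Variability of the group means around the grand mean.
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    # Variability of observations around their own group mean.
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)

    ms_between = ss_between / (k - 1)
    ms_within = ss_within / (n_total - k)
    return ms_between / ms_within


# Three hypothetical treatment groups (illustrative data only).
groups = [[1.0, 2.0, 3.0], [2.0, 3.0, 4.0], [5.0, 6.0, 7.0]]
f_stat = one_way_anova_f(groups)
print(f_stat)
```

A large F suggests that at least two group means differ; how large is large enough depends on the degrees of freedom (k - 1 and N - k) and the chosen significance level.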

Mixed-effect models: These models are for analyzing grouped data, that is, data that can be classified according to one or more grouping variables. They typically describe relationships between a response variable and some covariates in data grouped according to one or more factors. Common areas of application include multilevel data, repeated measures data, block designs, and longitudinal data.

Factor analysis: This method is used to determine which variables are combined to generate a given factor. For example, for many kinds of psychiatric data, it is not possible to measure a certain factor of interest directly (e.g., intelligence); however, it is often possible to measure other quantities (e.g., student test scores) that reflect the factor of interest. Here, none of the variables is designated as dependent.

Discriminant analysis: This technique is used to predict a categorical response variable. Unlike generalized linear models, it assumes that the independent variables follow a multivariate normal distribution. The procedure attempts to determine several discriminant functions (linear combinations of the independent variables) that discriminate among the groups defined by the response variable. Discriminant analysis is commonly used in social sciences.

Survival analysis: Several well-established statistical techniques exist for survival analysis. These techniques originally were designed to predict the probability that a patient undergoing a medical treatment would survive at least to time t. Methods for survival analysis, however, are also commonly applied to manufacturing settings to estimate the life span of industrial equipment. Popular methods include Kaplan-Meier estimates of survival, Cox proportional hazards regression models, and their extensions.
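The Kaplan-Meier estimate can be sketched in a few lines. At each distinct event time, the running survival probability is multiplied by the fraction of at-risk subjects that did not fail at that time; censored subjects (those still alive or working when observation ended) count as at risk but not as failures. The function name and the data are illustrative assumptions.

```python
def kaplan_meier(durations, observed):
    """Return [(time, survival_probability)] at each distinct event time.

    durations[i] is the follow-up time of subject i; observed[i] is True if
    the event (death, failure) was seen, False if the subject was censored.
    """
    event_times = sorted({t for t, e in zip(durations, observed) if e})
    survival = 1.0
    curve = []
    for t in event_times:
        at_risk = sum(1 for d in durations if d >= t)
        events = sum(1 for d, e in zip(durations, observed) if d == t and e)
        survival *= 1.0 - events / at_risk
        curve.append((t, survival))
    return curve


# Hypothetical equipment lifetimes in months; False marks censored units.
durations = [2, 3, 3, 5, 8]
observed = [True, True, False, True, False]

curve = kaplan_meier(durations, observed)
for time, prob in curve:
    print(time, round(prob, 3))
```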

Quality control: Various statistics can be used to prepare charts for quality control, such as Shewhart charts and CUSUM charts (both of which display group summary statistics). These statistics include the mean, standard deviation, range, count, moving average, moving standard deviation, and moving range.
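To make the quality control item concrete, here is a simplified Shewhart-style check. It assumes control limits are set at the baseline mean plus or minus three sample standard deviations of the baseline group means; production charts normally derive limits from within-group ranges and tabulated constants, so treat this as a sketch only. All names and numbers are illustrative.

```python
from statistics import mean, stdev


def control_limits(baseline_means, width=3.0):
    """Return (lower, center, upper) limits from baseline group means."""
    center = mean(baseline_means)
    sigma = stdev(baseline_means)
    return center - width * sigma, center, center + width * sigma


def out_of_control(values, lower, upper):
    """Return the indices of values falling outside the control limits."""
    return [i for i, v in enumerate(values) if v < lower or v > upper]


# Hypothetical group means from a process known to be in control.
baseline = [10.1, 9.9, 10.0, 10.2, 9.8, 10.0, 10.1, 9.9]
lower, center, upper = control_limits(baseline)

# New group means to monitor; the third one drifts upward.
new_samples = [10.0, 10.1, 11.2, 9.9]
flagged = out_of_control(new_samples, lower, upper)
print(flagged)
```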

Visual and Audio Data Mining

Visual data mining discovers implicit and useful knowledge from large data sets using data and/or knowledge visualization techniques. The human visual system is controlled by the eyes and brain, the latter of which can be thought of as a powerful, highly parallel processing and reasoning engine containing a large knowledge base. Visual data mining essentially combines the power of these components, making it a highly attractive and effective tool for the comprehension of data distributions, patterns, clusters, and outliers in data.

Visual data mining can be viewed as an integration of two disciplines: data visualization and data mining. It is also closely related to computer graphics, multimedia systems, human–computer interaction, pattern recognition, and high-performance computing. In general, data visualization and data mining can be integrated in several ways.

Audio data mining uses audio signals to indicate the patterns of data or the features of data mining results. Although visual data mining may disclose interesting patterns using graphical displays, it requires users to concentrate on watching patterns and identifying interesting or novel features within them. This can sometimes be quite tiresome. If patterns can be transformed into sound and music, then instead of watching pictures, we can listen to pitches, rhythm, tune, and melody to identify anything interesting or unusual.
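One simple way to turn data into sound, offered here as an assumption rather than a method from the book, is to map each value linearly onto a range of MIDI note numbers and convert those to frequencies with the standard equal-temperament formula (note 69 is A4 at 440 Hz). The sketch stops at computing pitches; actually playing them would need an audio library, which is not shown.

```python
def to_midi_notes(values, low_note=48, high_note=84):
    """Map values linearly onto MIDI note numbers in [low_note, high_note]."""
    lo, hi = min(values), max(values)
    span = hi - lo
    if span == 0:
        # All values are equal: use a single mid-range note.
        return [(low_note + high_note) // 2 for _ in values]
    return [round(low_note + (v - lo) / span * (high_note - low_note)) for v in values]


def midi_to_hz(note):
    """Convert a MIDI note number to a frequency in hertz (A4 = 440 Hz)."""
    return 440.0 * 2 ** ((note - 69) / 12)


# Hypothetical sensor readings; the fourth value is an outlier.
readings = [12.0, 12.5, 12.2, 30.0, 12.4]
notes = to_midi_notes(readings)
freqs = [midi_to_hz(n) for n in notes]
print(notes)
```

With this mapping the outlier is heard as a note three octaves above the rest, which is the kind of unusual feature a listener could pick out without watching a display.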

## 3 Data Mining Applications

In this book, we have studied principles and methods for mining relational data, data warehouses, and complex data types. Because data mining is a relatively young discipline with wide and diverse applications, there is still a nontrivial gap between general principles of data mining and application-specific, effective data mining tools.

Data Mining for Financial Data Analysis

Most banks and financial institutions offer a wide variety of banking, investment, and credit services (the latter include business, mortgage, and automobile loans and credit cards). Some also offer insurance and stock investment services.

Financial data collected in the banking and financial industry are often relatively complete, reliable, and of high quality, which facilitates systematic data analysis and data mining. Here we present a few typical cases.

Design and construction of data warehouses for multidimensional data analysis and data mining:

Loan payment prediction and customer credit policy analysis: Loan payment prediction and customer credit analysis are critical to the business of a bank. Many factors can strongly or weakly influence loan payment performance and customer credit rating. Data mining methods, such as attribute selection and attribute relevance ranking, may help identify important factors and eliminate irrelevant ones.

Classification and clustering of customers for targeted marketing: Classification and clustering methods can be used for customer group identification and targeted marketing.

Detection of money laundering and other financial crimes: To detect money laundering and other financial crimes, it is important to integrate information from multiple, heterogeneous databases (e.g., bank transaction databases and federal or state crime history databases), as long as they are potentially related to the study. Multiple data analysis tools can then be used to detect unusual patterns, such as large amounts of cash flow at certain periods, by certain groups of customers.
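As a rough illustration of spotting large cash flows in certain periods by certain customer groups, the sketch below totals cash amounts per (group, period) and flags totals that exceed a multiple of the median total. The threshold rule, the function name, and every record in the data are invented for this example; real systems combine many more signals.

```python
from collections import defaultdict
from statistics import median


def flag_large_cash_flows(transactions, factor=5.0):
    """Return (group, period) keys whose total exceeds factor * median total."""
    totals = defaultdict(float)
    for group, period, amount in transactions:
        totals[(group, period)] += amount
    typical = median(totals.values())
    return sorted(key for key, total in totals.items() if total > factor * typical)


# Made-up cash transactions: (customer group, period, amount).
transactions = [
    ("retail", "P1", 1200.0),
    ("retail", "P2", 1100.0),
    ("small-business", "P1", 3000.0),
    ("small-business", "P2", 2800.0),
    ("new-accounts", "P1", 900.0),
    ("new-accounts", "P2", 48000.0),
    ("new-accounts", "P2", 47000.0),
]

flagged_flows = flag_large_cash_flows(transactions)
print(flagged_flows)
```

The median is used as the baseline because, unlike the mean, it is not pulled upward by the very totals the check is trying to find.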

Data Mining for Retail and Telecommunication Industries

The retail industry is a well-suited application area for data mining, since it collects huge amounts of data on sales, customer shopping history, goods transportation, consumption, and service. The quantity of data collected continues to expand rapidly, especially due to the increasing availability, ease, and popularity of business conducted on the Web, or e-commerce. Today, most major chain stores also have web sites where customers can make purchases online. Some businesses, such as Amazon.com (www.amazon.com), exist solely online, without any brick-and-mortar (i.e., physical) store locations. Retail data provide a rich source for data mining.

Retail data mining can help identify customer buying behaviors, discover customer shopping patterns and trends, improve the quality of customer service, achieve better customer retention and satisfaction, enhance goods consumption ratios, design more effective goods transportation and distribution policies, and reduce the cost of business.
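One elementary way to surface shopping patterns is to count how often pairs of items are bought together and keep the pairs whose support (the fraction of baskets containing both items) reaches a threshold. The sketch below does exactly that on invented baskets; the function name and threshold are assumptions, and it enumerates all pairs rather than using an optimized frequent-itemset algorithm.

```python
from collections import Counter
from itertools import combinations


def frequent_pairs(baskets, min_support):
    """Return {item_pair: support} for pairs meeting the support threshold."""
    counts = Counter()
    for basket in baskets:
        # Deduplicate and sort so each pair has one canonical ordering.
        for pair in combinations(sorted(set(basket)), 2):
            counts[pair] += 1
    n = len(baskets)
    return {pair: c / n for pair, c in counts.items() if c / n >= min_support}


# Hypothetical shopping baskets (illustrative data only).
baskets = [
    ["bread", "milk"],
    ["bread", "butter", "milk"],
    ["butter", "milk"],
    ["bread", "butter"],
    ["bread", "eggs", "milk"],
]

pairs = frequent_pairs(baskets, min_support=0.5)
print(pairs)
```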

Data Mining in Science and Engineering

Today, scientific data can be amassed at much higher speeds and lower costs. This has resulted in the accumulation of huge volumes of high-dimensional data, stream data, and heterogeneous data, containing rich spatial and temporal information. Consequently, scientific applications are shifting from the “hypothesize-and-test” paradigm toward a “collect and store data, mine for new hypotheses, confirm with data or experimentation” process. This shift brings about new challenges for data mining.

## 5 Data Mining Trends

The diversity of data, data mining tasks, and data mining approaches poses many challenging research issues in data mining. The development of efficient and effective data mining methods, systems and services, and interactive and integrated data mining environments is a key area of study. The use of data mining techniques to solve large or sophisticated application problems is an important task for data mining researchers and data mining system and application developers.

Excerpts from the Book

Data Mining: Concepts and Techniques

Third Edition

Jiawei Han

University of Illinois at Urbana–Champaign

Micheline Kamber, Jian Pei

Simon Fraser University
