Wednesday, April 13, 2016

Introduction to Data Mining

Data mining turns a large collection of data into knowledge. A search engine (e.g.,
Google) receives hundreds of millions of queries every day. Each query can be viewed
as a transaction where the user describes her or his information need.

Interestingly, some patterns found in user search queries can disclose invaluable knowledge.  It found a close relationship between the number of people who search for flu-related information and the number of people who actually have flu symptoms. A pattern emerges when all of the search queries related to flu are aggregated. Using aggregated Google search data, one can estimate flu activity up to two weeks faster than traditional systems can.  This example shows how data mining can turn a large
collection of data into knowledge that can help meet a current global challenge.

Many people treat data mining as a synonym for another popularly used term, knowledge discovery from data, or KDD, while others view data mining as merely an essential step in the process of knowledge discovery. The knowledge discovery process is  an iterative sequence of the following steps:

1. Data cleaning (to remove noise and inconsistent data)
2. Data integration (where multiple data sources may be combined)
3. Data selection (where data relevant to the analysis task are retrieved from the
4. Data transformation (where data are transformed and consolidated into forms
appropriate for mining by performing summary or aggregation operations)4
5. Data mining (an essential process where intelligent methods are applied to extract
data patterns)
6. Pattern evaluation (to identify the truly interesting patterns representing knowledge
based on interestingness measures)
7. Knowledge presentation (where visualization and knowledge representation techniques
are used to present mined knowledge to users)

Data mining functionalities are used to specify the kinds of patterns to be found in data mining tasks. In general, such tasks can be classified into two categories: descriptive and predictive. Descriptive mining tasks characterize properties of the data in a target data set. Predictive mining tasks
perform induction on the current data in order to make predictions.

Data characterization is a summarization of the general characteristics or features
of a target class of data. The data corresponding to the user-specified class are typically
collected by a query. For example, to study the characteristics of software products with
sales that increased by 10% in the previous year, the data related to such products can
be collected by executing an SQL query on the sales database.

The output of data characterization can be presented in various forms. Examples
include pie charts, bar charts, curves, multidimensional data cubes, and multidimensional
tables, including crosstabs. The resulting descriptions can also be presented as
generalized relations or in rule form (called characteristic rules).

Data discrimination is a comparison of the general features of the target class data
objects against the general features of objects from one or multiple contrasting classes.
The target and contrasting classes can be specified by a user, and the corresponding
data objects can be retrieved through database queries. For example, a user may want to
compare the general features of software products with sales that increased by 10% last
year against those with sales that decreased by at least 30% during the same period. The
methods used for data discrimination are similar to those used for data characterization.

Using data discrimination, the success factors can be identified by comparing successful people with unsuccessful people in any activity.

Frequent patterns, as the name suggests, are patterns that occur frequently in data.

There are many kinds of frequent patterns, including frequent itemsets, frequent subsequences (also known as sequential patterns), and frequent substructures. A frequent itemset typically refers to a set of items that often appear together in a transactional data set—for example, milk and bread, which are frequently bought together in grocery stores by many customers. A frequently occurring subsequence, such as the pattern that customers, tend to purchase first a laptop, followed by a digital camera, and then a memory card, is a (frequent) sequential pattern. A substructure can refer to different structural forms (e.g., graphs, trees, or lattices) that may be combined with itemsets or subsequences. If a substructure occurs frequently, it is called a (frequent) structured pattern. Mining frequent patterns leads to the discovery of interesting associations and correlations within data.

Classification and Regression for Predictive Analysis

Classification is the process of finding a model (or function) that describes and distinguishes
data classes or concepts. The model are derived based on the analysis of a set of training data (i.e., data objects for which the class labels are known). The model is used to predict the class label of objects for which the the class label is unknown.

Whereas classification predicts categorical (discrete, unordered) labels, regression
models continuous-valued functions. That is, regression is used to predict missing or
unavailable numerical data values rather than (discrete) class labels. The term prediction
refers to both numeric prediction and class label prediction. Regression analysis is a
statistical methodology that is most often used for numeric prediction, although other
methods exist as well. Regression also encompasses the identification of distribution
trends based on the available data.
Classification and regression may need to be preceded by relevance analysis, which
attempts to identify attributes that are significantly relevant to the classification and
regression process. Such attributes will be selected for the classification and regression
process. Other attributes, which are irrelevant, can then be excluded from consideration.

Cluster Analysis

Unlike classification and regression, which analyze class-labeled (training) data sets, clustering analyzes data objects without consulting class labels. In many cases, class labeled data may simply not exist at the beginning. Clustering can be used to generate class labels for a group of data. The objects are clustered or grouped based on the principle of maximizing the intraclass similarity and minimizing the interclass similarity. That is, clusters of objects are formed so that objects within a cluster have high similarity in comparison to one another, but are rather dissimilar to objects in other clusters. Each cluster so formed can be viewed as a class of objects, from which rules can be derived.

Outlier Analysis

A data set may contain objects that do not comply with the general behavior or model
of the data. These data objects are outliers. Many data mining methods discard outliers
as noise or exceptions. However, in some applications (e.g., fraud detection) the rare
events can be more interesting than the more regularly occurring ones. The analysis of

Excerpts from
Data Mining Concepts and Techniques
Third Edition
Jiawei Han
University of Illinois at Urbana–Champaign
Micheline Kamber
Jian Pei
Simon Fraser University

No comments:

Post a Comment