
## 1 Data Preprocessing: An Overview

Data Quality: Why Preprocess the Data?

Data have quality if they satisfy the requirements of the intended use. Many factors make up data quality, including accuracy, completeness, consistency, timeliness, believability, and interpretability.

Major Tasks in Data Preprocessing

In this section, we look at the major steps involved in data preprocessing, namely, data cleaning, data integration, data reduction, and data transformation.

Data cleaning routines work to “clean” the data by filling in missing values, smoothing noisy data, identifying or removing outliers, and resolving inconsistencies.

Suppose that you would like to include data from multiple sources in your analysis. This would involve integrating multiple databases, data cubes, or files (i.e., data integration).

Data reduction obtains a reduced representation of the data set that is much smaller in volume, yet produces the same (or almost the same) analytical results. Data reduction strategies include dimensionality reduction and numerosity reduction.

In dimensionality reduction, data encoding schemes are applied so as to obtain a reduced or “compressed” representation of the original data. Examples include data compression techniques (e.g., wavelet transforms and principal components analysis), attribute subset selection (e.g., removing irrelevant attributes), and attribute construction (e.g., where a small set of more useful attributes is derived from the original set).

In numerosity reduction, the data are replaced by alternative, smaller representations using parametric models (e.g., regression or log-linear models) or nonparametric models (e.g., histograms, clusters, sampling, or data aggregation).

Discretization and concept hierarchy generation are powerful tools for data mining in that they allow mining at multiple abstraction levels. Normalization, data discretization, and concept hierarchy generation are forms of data transformation. Data transformation operations are additional preprocessing procedures that contribute to the success of the mining process.

## 2 Data Cleaning

Missing Values

Alternatives for handling missing values include the following:

1. Ignore the tuple: This is usually done when the class label is missing (assuming the mining task involves classification). This method is not very effective, unless the tuple contains several attributes with missing values. It is especially poor when the percentage of missing values per attribute varies considerably. By ignoring the tuple, we do not make use of the remaining attributes’ values in the tuple. Such data could have been useful to the task at hand.

2. Fill in the missing value manually: In general, this approach is time consuming and may not be feasible given a large data set with many missing values.

3. Use a global constant to fill in the missing value: Replace all missing attribute values by the same constant, such as a label like “Unknown” or −∞. If missing values are replaced by, say, “Unknown,” then the mining program may mistakenly think that they form an interesting concept, since they all have a value in common (that of “Unknown”). Hence, although this method is simple, it is not foolproof.

4. Use a measure of central tendency for the attribute (e.g., the mean or median) to fill in the missing value: For normal (symmetric) data distributions, the mean can be used, while skewed data distributions should employ the median.

5. Use the attribute mean or median for all samples belonging to the same class as the given tuple: For example, if classifying customers according to credit risk, we may replace the missing value with the mean income value for customers in the same credit risk category as that of the given tuple. If the data distribution for a given class is skewed, the median value is a better choice.

6. Use the most probable value to fill in the missing value: This may be determined with regression, inference-based tools using a Bayesian formalism, or decision tree induction. For example, using the other customer attributes in your data set, you may construct a decision tree to predict the missing values for income.
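As a rough sketch of methods 4 and 5 above, class-conditional imputation might look as follows; the credit-risk classes and income values are hypothetical:

```python
from statistics import mean, median

# Hypothetical records: (credit_risk_class, income); None marks a missing value.
records = [
    ("low", 48_000), ("low", 52_000), ("low", None),
    ("high", 21_000), ("high", 95_000), ("high", None),
]

def fill_by_class(records, use_median=False):
    """Method 5: fill a missing income with the mean (or median, for skewed
    distributions) of the non-missing incomes in the same class."""
    center = median if use_median else mean
    by_class = {}
    for cls, income in records:
        if income is not None:
            by_class.setdefault(cls, []).append(income)
    fill = {cls: center(vals) for cls, vals in by_class.items()}
    return [(cls, income if income is not None else fill[cls])
            for cls, income in records]

filled = fill_by_class(records)
```

Dropping the class grouping gives method 4: the same computation over the whole column.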

Noisy Data

“What is noise?” Noise is a random error or variance in a measured variable.

Data smoothing techniques include the following:

Binning: Binning methods smooth a sorted data value by consulting its “neighborhood,” that is, the values around it. The sorted values are distributed into a number of “buckets,” or bins. In smoothing by bin means, each value in a bin is replaced by the mean value of the bin. Similarly, smoothing by bin medians can be employed, in which each bin value is replaced by the bin median. In smoothing by bin boundaries, the minimum and maximum values in a given bin are identified as the bin boundaries; each bin value is then replaced by the closest boundary value.

Regression: Data smoothing can also be done by regression, a technique that conforms data values to a function. Linear regression involves finding the “best” line to fit two attributes (or variables) so that one attribute can be used to predict the other.

Outlier analysis: Outliers may be detected by clustering, for example, where similar values are organized into groups, or “clusters.” Intuitively, values that fall outside of the set of clusters may be considered outliers.
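The binning-based smoothing described above can be sketched as follows, using a small, illustrative list of already-sorted prices and equal-frequency partitioning:

```python
def equal_frequency_bins(sorted_values, n_bins):
    """Partition already-sorted values into n_bins bins of equal size."""
    size = len(sorted_values) // n_bins
    return [sorted_values[i * size:(i + 1) * size] for i in range(n_bins)]

def smooth_by_means(bins):
    """Replace every value in a bin by the bin mean."""
    return [[sum(b) / len(b)] * len(b) for b in bins]

def smooth_by_boundaries(bins):
    """Replace every value by the nearer of the bin's min and max."""
    return [[min(b) if v - min(b) <= max(b) - v else max(b) for v in b]
            for b in bins]

prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]   # illustrative, already sorted
bins = equal_frequency_bins(prices, 3)        # [[4, 8, 15], [21, 21, 24], [25, 28, 34]]
```

Smoothing by bin medians is the same shape of computation with the median in place of the mean.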

Data Cleaning as a Process

The first step in data cleaning as a process is discrepancy detection. Discrepancies can be caused by several factors, including poorly designed data entry forms that have many optional fields, human error in data entry, deliberate errors (e.g., respondents not wanting to divulge information about themselves), and data decay (e.g., outdated addresses).

There are a number of commercial tools that can aid in the discrepancy detection step. Data scrubbing tools use simple domain knowledge (e.g., knowledge of postal addresses and spell-checking) to detect errors and make corrections in the data. These tools rely on parsing and fuzzy matching techniques when cleaning data from multiple sources. Data auditing tools find discrepancies by analyzing the data to discover rules and relationships, and by detecting data that violate such conditions. They are variants of data mining tools.
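A discrepancy detection step can be approximated by checking records against simple rules. The rules and fields below (a date format and an age range) are hypothetical illustrations, not the API of any real tool:

```python
import re

# Hypothetical rule: dates must be ISO-formatted YYYY-MM-DD.
DATE_RE = re.compile(r"^\d{4}-\d{2}-\d{2}$")

def find_discrepancies(rows):
    """Return (row_index, field, value) for each rule violation."""
    problems = []
    for i, row in enumerate(rows):
        if not DATE_RE.match(row["date"]):
            problems.append((i, "date", row["date"]))
        if not (0 <= row["age"] <= 120):       # hypothetical plausibility range
            problems.append((i, "age", row["age"]))
    return problems

rows = [
    {"date": "2004-12-25", "age": 34},
    {"date": "25/12/2004", "age": 34},   # wrong date format
    {"date": "2004-12-25", "age": -1},   # out-of-range age
]
```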

## 3 Data Integration

Data mining often requires data integration, the merging of data from multiple data stores.

Entity Identification Problem

How can equivalent real-world entities from multiple data sources be matched up? This is referred to as the entity identification problem. When matching attributes from one database to another during integration, special attention must be paid to the structure of the data. This is to ensure that any attribute functional dependencies and referential constraints in the source system match those in the target system.

Redundancy and Correlation Analysis

Redundancy is another important issue in data integration. An attribute (annual revenue, for instance) may be redundant if it can be “derived” from another attribute or set of attributes. Inconsistencies in attribute or dimension naming can also cause redundancies in the resulting data set. Some redundancies can be detected by correlation analysis.
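For numeric attributes, one such correlation analysis computes the Pearson correlation coefficient; values near +1 or −1 suggest that one attribute may be derivable from the other. A minimal sketch, with made-up revenue figures:

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two numeric attributes."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# If annual revenue is exactly 12x monthly revenue, r = 1: one attribute is redundant.
monthly = [10, 20, 30, 40]
annual = [120, 240, 360, 480]
```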

Tuple Duplication

In addition to detecting redundancies between attributes, duplication should also be detected at the tuple level (e.g., where there are two or more identical tuples for a given unique data entry case).
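Exact tuple-level duplicates can be detected in a single pass over the data; a minimal sketch with hypothetical rows:

```python
def deduplicate(tuples):
    """Keep the first occurrence of each exact tuple, preserving input order."""
    seen, unique = set(), []
    for t in tuples:
        if t not in seen:
            seen.add(t)
            unique.append(t)
    return unique

rows = [("John Doe", "Main St"), ("Jane Roe", "Oak Ave"), ("John Doe", "Main St")]
```

Near-duplicates (e.g., differing in spelling) require fuzzy matching rather than exact equality.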

Data Value Conflict Detection and Resolution

Data integration also involves the detection and resolution of data value conflicts. For example, for the same real-world entity, attribute values from different sources may differ.

## 4 Data Reduction

Dimensionality reduction is the process of reducing the number of random variables or attributes under consideration.

Numerosity reduction techniques replace the original data volume by alternative, smaller forms of data representation. These techniques may be parametric or nonparametric.

In data compression, transformations are applied so as to obtain a reduced or “compressed” representation of the original data.

Wavelet Transforms

The discrete wavelet transform (DWT) is a linear signal processing technique that, when applied to a data vector X, transforms it to a numerically different vector, X′, of wavelet coefficients. The two vectors are of the same length.

Principal Components Analysis

PCA “combines” the essence of all attributes by creating an alternative, smaller set of variables. The initial data can then be converted into this smaller set through a projection process. PCA often reveals relationships that were not previously suspected and thereby allows interpretations that would not ordinarily result.

Because the components are sorted in decreasing order of “significance,” the data size can be reduced by eliminating the weaker components, that is, those with low variance. Using the strongest principal components, it should be possible to reconstruct a good approximation of the original data.
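For intuition, here is a minimal two-dimensional sketch of finding the first principal component. Real PCA uses an eigendecomposition of the full covariance matrix (e.g., via a linear algebra library); the closed-form angle below works only for 2-D data, and the sample points are made up:

```python
from math import atan2, cos, sin

def first_principal_component(points):
    """Unit vector of the largest-variance direction for 2-D data."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    # Population covariance matrix entries.
    sxx = sum((x - mx) ** 2 for x, _ in points) / n
    syy = sum((y - my) ** 2 for _, y in points) / n
    sxy = sum((x - mx) * (y - my) for x, y in points) / n
    # Principal-axis angle of a 2x2 symmetric matrix.
    theta = 0.5 * atan2(2 * sxy, sxx - syy)
    return (cos(theta), sin(theta))

# Points spread roughly along the line y = x: the component is near (0.707, 0.707).
pts = [(0, 0), (1, 1.1), (2, 1.9), (3, 3.0)]
```

Projecting each point onto this vector gives the reduced one-dimensional representation.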

Attribute Subset Selection

Attribute subset selection reduces the data set size by removing irrelevant or redundant attributes (or dimensions). The goal of attribute subset selection is to find a minimum set of attributes such that the resulting probability distribution of the data classes is as close as possible to the original distribution obtained using all attributes.

Regression and Log-Linear Models: Parametric Data Reduction

Regression and log-linear models can be used to approximate the given data.

Histograms

Histograms use binning to approximate data distributions and are a popular form of data reduction.
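An equal-width histogram can be sketched as follows: the raw values are replaced by per-bucket counts, a much smaller representation (the price list is illustrative):

```python
def equal_width_histogram(values, n_buckets):
    """Summarize values as (bucket_low, bucket_high, count) triples."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_buckets
    counts = [0] * n_buckets
    for v in values:
        # Clamp the maximum value into the last bucket.
        i = min(int((v - lo) / width), n_buckets - 1)
        counts[i] += 1
    return [(lo + i * width, lo + (i + 1) * width, counts[i])
            for i in range(n_buckets)]

prices = [1, 1, 5, 5, 5, 8, 8, 10, 10, 12, 14, 15, 18, 20]
hist = equal_width_histogram(prices, 2)
```

Equal-frequency (equidepth) histograms, where each bucket holds roughly the same number of values, are a common alternative.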

Clustering

In data reduction, the cluster representations of the data are used to replace the actual data.

Sampling

Sampling can be used as a data reduction technique because it allows a large data set to be represented by a much smaller random data sample (or subset).
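The two most common variants, simple random sampling without replacement (SRSWOR) and with replacement (SRSWR), can be sketched directly with the standard library:

```python
import random

def srswor(data, s):
    """Simple random sample without replacement of size s."""
    return random.sample(data, s)

def srswr(data, s):
    """Simple random sample with replacement of size s."""
    return [random.choice(data) for _ in range(s)]

random.seed(0)                              # for reproducibility
sample = srswor(list(range(1000)), 50)      # 50 distinct tuples out of 1000
```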

Data Cube Aggregation

Data cubes store aggregated data for multidimensional analysis, such as annual sales per item type for each branch of a company. Each cell holds an aggregate data value, corresponding to a data point in multidimensional space. Concept hierarchies may exist for each attribute, allowing the analysis of data at multiple abstraction levels. For example, a hierarchy for branch could allow branches to be grouped into regions, based on their address. Data cubes provide fast access to precomputed, summarized data, thereby benefiting online analytical processing as well as data mining.

The cube created at the lowest abstraction level is referred to as the base cuboid. The base cuboid should correspond to an individual entity of interest such as sales or customer. In other words, the lowest level should be usable, or useful for the analysis. A cube at the highest level of abstraction is the apex cuboid.
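Building such a cuboid amounts to a group-by aggregation over the fact records. A minimal sketch, with hypothetical (branch, item type, year, amount) facts:

```python
from collections import defaultdict

# Hypothetical sales facts: (branch, item_type, year, amount).
facts = [
    ("Vancouver", "home entertainment", 2010, 568.0),
    ("Vancouver", "home entertainment", 2010, 32.0),
    ("Vancouver", "computer", 2010, 746.0),
    ("Toronto", "computer", 2010, 100.0),
]

def aggregate(facts):
    """Roll individual facts up into one cell per (branch, item_type, year)."""
    cube = defaultdict(float)
    for branch, item, year, amount in facts:
        cube[(branch, item, year)] += amount
    return dict(cube)

cube = aggregate(facts)
```

Aggregating further, e.g., over all branches or all years, produces the higher-level cuboids up to the apex.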

## 5 Data Transformation and Data Discretization

Data Transformation Strategies Overview

In data transformation, the data are transformed or consolidated into forms appropriate for mining. Strategies for data transformation include the following:

1. Smoothing, which works to remove noise from the data. Techniques include binning, regression, and clustering.

2. Attribute construction (or feature construction), where new attributes are constructed and added from the given set of attributes to help the mining process.

3. Aggregation, where summary or aggregation operations are applied to the data. For example, the daily sales data may be aggregated so as to compute monthly and annual total amounts. This step is typically used in constructing a data cube for data analysis at multiple abstraction levels.

4. Normalization, where the attribute data are scaled so as to fall within a smaller range, such as −1.0 to 1.0, or 0.0 to 1.0.

5. Discretization, where the raw values of a numeric attribute (e.g., age) are replaced by interval labels (e.g., 0–10, 11–20, etc.) or conceptual labels (e.g., youth, adult, senior). The labels, in turn, can be recursively organized into higher-level concepts, resulting in a concept hierarchy for the numeric attribute. More than one concept hierarchy can be defined for the same attribute to accommodate the needs of various users.

6. Concept hierarchy generation for nominal data, where attributes such as street can be generalized to higher-level concepts, like city or country. Many hierarchies for nominal attributes are implicit within the database schema and can be automatically defined at the schema definition level.

Data Transformation by Normalization

To help avoid dependence on the choice of measurement units, the data should be normalized or standardized. This involves transforming the data to fall within a smaller or common range such as [−1.0, 1.0] or [0.0, 1.0]. (The terms standardize and normalize are used interchangeably in data preprocessing.) Normalizing the data attempts to give all attributes an equal weight.

Data normalization methods include min-max normalization, z-score normalization, and normalization by decimal scaling.
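The three normalization methods named above can be sketched as follows (the income values are illustrative):

```python
def min_max(values, new_min=0.0, new_max=1.0):
    """Min-max normalization: map [min, max] linearly onto [new_min, new_max]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) * (new_max - new_min) + new_min for v in values]

def z_score(values):
    """Z-score normalization: subtract the mean, divide by the std deviation."""
    n = len(values)
    m = sum(values) / n
    sd = (sum((v - m) ** 2 for v in values) / n) ** 0.5
    return [(v - m) / sd for v in values]

def decimal_scaling(values):
    """Divide by 10**j for the smallest j that moves every |value| below 1."""
    j, biggest = 0, max(abs(v) for v in values)
    while biggest >= 1:
        biggest /= 10
        j += 1
    return [v / 10 ** j for v in values]

incomes = [12_000, 73_600, 98_000]   # illustrative attribute values
```

For example, with a minimum of 12,000 and a maximum of 98,000, min-max maps 73,600 to about 0.716.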

Discretization by Binning

Binning is a top-down splitting technique based on a specified number of bins. Binning methods for data smoothing are also used as discretization methods for data reduction and concept hierarchy generation.

Discretization by Histogram Analysis

Like binning, histogram analysis is an unsupervised discretization technique because it does not use class information.

Discretization by Cluster, Decision Tree, and Correlation Analyses

Clustering, decision tree analysis, and correlation analysis can be used for data discretization.

Concept Hierarchy Generation for Nominal Data

Nominal attributes have a finite (but possibly large) number of distinct values, with no ordering among the values.

Four methods for the generation of concept hierarchies for nominal data are as follows:

1. Specification of a partial ordering of attributes explicitly at the schema level by users or experts: Concept hierarchies for nominal attributes or dimensions typically involve a group of attributes. A user or expert can easily define a concept hierarchy by specifying a partial or total ordering of the attributes at the schema level.

2. Specification of a portion of a hierarchy by explicit data grouping: This is essentially the manual definition of a portion of a concept hierarchy. While it is unrealistic to define an entire hierarchy this way for a large data set, we can easily specify explicit groupings for a small portion of intermediate-level data. For example, after specifying that province and country form a hierarchy at the schema level, a user could define some intermediate levels manually, such as “{Alberta, Saskatchewan, Manitoba} ⊂ prairies Canada” and “{British Columbia, prairies Canada} ⊂ Western Canada.”

3. Specification of a set of attributes, but not of their partial ordering: A user may specify a set of attributes forming a concept hierarchy, but omit to explicitly state their partial ordering. The system can then try to automatically generate the attribute ordering so as to construct a meaningful concept hierarchy.

4. Specification of only a partial set of attributes: The user may have included only a small subset of the relevant attributes in the hierarchy specification. For example, the user may specify only street and city for a location. If data semantics are embedded in the database schema, attributes with tight semantic connections can be pinned together. In this way, the specification of one attribute may trigger a whole group of semantically tightly linked attributes to be “dragged in” to form a complete hierarchy.
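For method 3, a common heuristic is to place the attribute with the most distinct values at the lowest level of the hierarchy and the one with the fewest at the top. A minimal sketch, with a hypothetical location table:

```python
def hierarchy_by_distinct_values(table):
    """Order attributes from most distinct values (lowest hierarchy level)
    to fewest (highest level), given {attribute: column_values}."""
    counts = {attr: len(set(col)) for attr, col in table.items()}
    return sorted(counts, key=counts.get, reverse=True)

# Hypothetical location table (column-oriented).
table = {
    "country": ["Canada", "Canada", "Canada", "Canada", "Canada", "USA"],
    "province_or_state": ["BC", "BC", "AB", "ON", "ON", "NY"],
    "city": ["Vancouver", "Vancouver", "Calgary", "Toronto", "Ottawa", "New York"],
    "street": ["1 Main St", "2 Oak Ave", "3 Pine Rd", "4 Elm St", "5 King St", "6 Wall St"],
}
```

On this table the heuristic yields street < city < province_or_state < country; like any heuristic, it can fail (e.g., weekday has fewer distinct values than month yet sits below it).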


Excerpts from Data Mining: Concepts and Techniques, Third Edition, by Jiawei Han (University of Illinois at Urbana–Champaign), Micheline Kamber, and Jian Pei (Simon Fraser University).
