Data Preprocessing
1 Data Preprocessing: An Overview
Data Quality: Why Preprocess the Data?
Data have quality if they satisfy the requirements of the intended use. There are many factors that make up data quality, including accuracy, completeness, consistency, timeliness, believability, and interpretability.
Major Tasks in Data Preprocessing
In this section, we look at the major steps involved in data preprocessing, namely, data
cleaning, data integration, data reduction, and data transformation.
Data cleaning routines work to “clean” the data by filling in missing values, smoothing
noisy data, identifying or removing outliers, and resolving inconsistencies.
Suppose that you would like to include
data from multiple sources in your analysis. This would involve integrating multiple
databases, data cubes, or files (i.e., data integration).
Data reduction obtains a reduced representation of the data set that is much smaller in
volume, yet produces the same (or almost the same) analytical results. Data reduction
strategies include dimensionality reduction and numerosity reduction.
In dimensionality reduction, data encoding schemes are applied so as to obtain a
reduced or “compressed” representation of the original data. Examples include data
compression techniques (e.g., wavelet transforms and principal components analysis),
attribute subset selection (e.g., removing irrelevant attributes), and attribute construction
(e.g., where a small set of more useful attributes is derived from the original set).
In numerosity reduction, the data are replaced by alternative, smaller representations
using parametric models (e.g., regression or log-linear models) or nonparametric
models (e.g., histograms, clusters, sampling, or data aggregation).
Discretization and concept hierarchy generation are powerful tools for data mining
in that they allow data mining at multiple abstraction levels. Normalization, data
discretization, and concept hierarchy generation are forms of data transformation.
Data transformation operations are additional data preprocessing procedures that contribute toward the success of the mining process.
2 Data Cleaning
Missing Values
Several alternatives exist for filling in missing values:
1. Ignore the tuple: This is usually done when the class label is missing (assuming the
mining task involves classification). This method is not very effective, unless the tuple
contains several attributes with missing values. It is especially poor when the percentage
of missing values per attribute varies considerably. By ignoring the tuple, we do
not make use of the remaining attributes’ values in the tuple. Such data could have
been useful to the task at hand.
2. Fill in the missing value manually: In general, this approach is time consuming and
may not be feasible given a large data set with many missing values.
3. Use a global constant to fill in the missing value: Replace all missing attribute values
by the same constant such as a label like “Unknown” or −∞. If missing values are
replaced by, say, “Unknown,” then the mining program may mistakenly think that
they form an interesting concept, since they all have a value in common—that of
“Unknown.” Hence, although this method is simple, it is not foolproof.
4. Use a measure of central tendency for the attribute (e.g., the mean or median) to
fill in the missing value: For normal (symmetric) data distributions,
the mean can be used, while skewed data distributions should employ
the median.
5. Use the attribute mean or median for all samples belonging to the same class as
the given tuple: For example, if classifying customers according to credit risk, we
may replace the missing value with the mean income value for customers in the same
credit risk category as that of the given tuple. If the data distribution for a given class
is skewed, the median value is a better choice.
6. Use the most probable value to fill in the missing value: This may be determined
with regression, inference-based tools using a Bayesian formalism, or decision tree
induction. For example, using the other customer attributes in your data set, you
may construct a decision tree to predict the missing values for income. (A sketch of
strategies 4 through 6 follows this list.)
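As a rough illustration of strategies 4 through 6, the following sketch uses pandas and scikit-learn (both assumed available); the customer DataFrame, the attribute names age, credit_risk, and income, and the use of a decision tree regressor are hypothetical choices for illustration, not part of the book's text.

    import pandas as pd
    from sklearn.tree import DecisionTreeRegressor

    # Hypothetical customer data; 'income' has missing values.
    df = pd.DataFrame({
        "age":         [23, 35, 45, 52, 31, 60],
        "credit_risk": ["low", "low", "high", "high", "low", "high"],
        "income":      [28000, 41000, None, 75000, None, 68000],
    })

    # Strategy 4: fill with a global measure of central tendency
    # (the median is the safer choice for skewed distributions).
    df["income_global"] = df["income"].fillna(df["income"].median())

    # Strategy 5: fill with the mean (or median) of the same class.
    df["income_by_class"] = df["income"].fillna(
        df.groupby("credit_risk")["income"].transform("mean")
    )

    # Strategy 6: predict the most probable value from the other
    # attributes, here with a decision tree trained on complete rows.
    known = df[df["income"].notna()]
    missing = df[df["income"].isna()]
    tree = DecisionTreeRegressor().fit(known[["age"]], known["income"])
    df.loc[df["income"].isna(), "income_predicted"] = tree.predict(missing[["age"]])

    print(df)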
Noisy Data
“What is noise?” Noise is a random error or variance in a measured variable. The following data smoothing techniques can be used to remove noise:
Binning: Binning methods smooth a sorted data value by consulting its “neighborhood,”
that is, the values around it. The sorted values are distributed into a number
of “buckets,” or bins. In smoothing by bin means, each value in a bin is replaced by the
mean value of the bin. Similarly, smoothing by bin medians can be employed, in which each bin value
is replaced by the bin median. In smoothing by bin boundaries, the minimum and
maximum values in a given bin are identified as the bin boundaries, and each bin value
is then replaced by the closest boundary value. (A sketch of bin-based smoothing appears after this list.)
Regression: Data smoothing can also be done by regression, a technique that conforms
data values to a function. Linear regression involves finding the “best” line to
fit two attributes (or variables) so that one attribute can be used to predict the other.
Outlier analysis: Outliers may be detected by clustering, for example, where similar
values are organized into groups, or “clusters.” Intuitively, values that fall outside of
the set of clusters may be considered outliers.
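The following is a minimal sketch of smoothing by bin means and by bin boundaries, assuming equal-frequency bins of size 3 and an illustrative sorted price list; NumPy is assumed available.

    import numpy as np

    # Sorted data to be smoothed (illustrative values).
    prices = np.array([4, 8, 15, 21, 21, 24, 25, 28, 34])
    bin_size = 3
    bins = prices.reshape(-1, bin_size)          # equal-frequency partition

    # Smoothing by bin means: every value becomes its bin's mean.
    by_means = np.repeat(bins.mean(axis=1), bin_size)

    # Smoothing by bin boundaries: every value becomes the closer of
    # its bin's minimum or maximum.
    lo, hi = bins.min(axis=1, keepdims=True), bins.max(axis=1, keepdims=True)
    by_bounds = np.where(bins - lo <= hi - bins, lo, hi).ravel()

    print(by_means)   # [ 9.  9.  9. 22. 22. 22. 29. 29. 29.]
    print(by_bounds)  # [ 4  4 15 21 21 24 25 25 34]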
Data Cleaning as a Process
The first step in data cleaning as a process is discrepancy detection. Discrepancies can
be caused by several factors, including poorly designed data entry forms that have many
optional fields, human error in data entry, deliberate errors (e.g., respondents not wanting
to divulge information about themselves), and data decay (e.g., outdated addresses).
There are a number of different commercial tools that can aid in the discrepancy
detection step. Data scrubbing tools use simple domain knowledge (e.g., knowledge
of postal addresses and spell-checking) to detect errors and make corrections in the
data. These tools rely on parsing and fuzzy matching techniques when cleaning data
from multiple sources. Data auditing tools find discrepancies by analyzing the data to
discover rules and relationships, and detecting data that violate such conditions. They
are variants of data mining tools.
3 Data Integration
Data mining often requires data integration—the merging of data from multiple data stores.
Entity Identification Problem
How can equivalent real-world entities from multiple
data sources be matched up? This is referred to as the entity identification problem.
When matching attributes from one database to another during integration, special
attention must be paid to the structure of the data. This is to ensure that any attribute
functional dependencies and referential constraints in the source system match those in
the target system.
Redundancy and Correlation Analysis
Redundancy is another important issue in data integration. An attribute (such as annual
revenue, for instance) may be redundant if it can be “derived” from another attribute
or set of attributes. Inconsistencies in attribute or dimension naming can also cause
redundancies in the resulting data set.
Some redundancies can be detected by correlation analysis.
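As a sketch of correlation analysis, the snippet below screens two numeric attributes with a correlation coefficient and two nominal attributes with a chi-square test of their contingency table; pandas and SciPy are assumed available, and all attribute names and values are hypothetical.

    import pandas as pd
    from scipy.stats import chi2_contingency

    df = pd.DataFrame({
        "annual_revenue":  [1.2, 2.3, 3.1, 4.0, 5.2],      # illustrative figures
        "monthly_revenue": [0.1, 0.2, 0.25, 0.33, 0.45],
        "gender": ["M", "F", "F", "M", "F"],
        "preferred_reading": ["fiction", "non-fiction", "non-fiction", "fiction", "fiction"],
    })

    # Numeric attributes: a correlation coefficient near +1 or -1 suggests
    # one attribute may be derivable from the other (redundant).
    r = df["annual_revenue"].corr(df["monthly_revenue"])    # Pearson by default
    print(f"correlation coefficient: {r:.3f}")

    # Nominal attributes: a chi-square test of the contingency table checks
    # whether the two attributes are statistically independent.
    table = pd.crosstab(df["gender"], df["preferred_reading"])
    chi2, p_value, dof, expected = chi2_contingency(table)
    print(f"chi-square = {chi2:.2f}, p-value = {p_value:.3f}")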
Tuple Duplication
In addition to detecting redundancies between attributes, duplication should also be
detected at the tuple level (e.g., where there are two or more identical tuples for a given
unique data entry case).
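A minimal sketch of tuple-level duplicate detection with pandas, assuming an illustrative orders table:

    import pandas as pd

    # Illustrative purchase records containing an exact duplicate tuple.
    orders = pd.DataFrame({
        "customer": ["Ann", "Ann", "Bob"],
        "item":     ["printer", "printer", "scanner"],
        "price":    [149.99, 149.99, 89.50],
    })

    duplicates = orders[orders.duplicated(keep=False)]   # inspect before removing
    deduplicated = orders.drop_duplicates()
    print(deduplicated)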
Data Value Conflict Detection and Resolution
Data integration also involves the detection and resolution of data value conflicts. For
example, for the same real-world entity, attribute values from different sources may differ.
4 Data Reduction
Dimensionality reduction is the process of reducing the number of random variables
or attributes under consideration.
Numerosity reduction techniques replace the original data volume by alternative,
smaller forms of data representation. These techniques may be parametric or nonparametric.
In data compression, transformations are applied so as to obtain a reduced or “compressed”
representation of the original data.
Wavelet Transforms
The discrete wavelet transform (DWT) is a linear signal processing technique that,
when applied to a data vector X, transforms it to a numerically different vector, X′,
of wavelet coefficients. The two vectors are of the same length.
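A minimal sketch of a one-level Haar DWT using the PyWavelets package (assumed to be installed); the input vector and the truncation threshold are illustrative, not from the book.

    import numpy as np
    import pywt  # PyWavelets, assumed installed: pip install PyWavelets

    X = np.array([2.0, 2.0, 0.0, 2.0, 3.0, 5.0, 4.0, 4.0])

    # One-level discrete wavelet transform: approximation (cA) and
    # detail (cD) coefficients, together the same length as X.
    cA, cD = pywt.dwt(X, "haar")

    # Crude compression: zero out detail coefficients whose magnitude is
    # below a threshold, then reconstruct an approximation of X.
    cD_truncated = np.where(np.abs(cD) < 1.5, 0.0, cD)
    X_approx = pywt.idwt(cA, cD_truncated, "haar")

    print(cA, cD)
    print(X_approx)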
Principal Components Analysis
PCA “combines” the essence of all attributes by creating an alternative,
smaller set of variables. The initial data can then be projected onto this smaller
set of variables. PCA often reveals relationships that were not previously suspected and thereby
allows interpretations that would not ordinarily result.
Because the components are sorted in decreasing order of “significance,” the data size
can be reduced by eliminating the weaker components, that is, those with low variance.
Using the strongest principal components, it should be possible to reconstruct
a good approximation of the original data.
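A sketch of PCA-based reduction with scikit-learn on a synthetic matrix; the five correlated attributes, the choice of two components, and the random data are assumptions made for illustration.

    import numpy as np
    from sklearn.decomposition import PCA

    rng = np.random.default_rng(0)
    # 100 tuples with 5 correlated numeric attributes (synthetic).
    base = rng.normal(size=(100, 2))
    X = np.hstack([base, base @ rng.normal(size=(2, 3)) + 0.05 * rng.normal(size=(100, 3))])

    # Keep only the 2 strongest components; project the data onto them.
    pca = PCA(n_components=2)
    X_reduced = pca.fit_transform(X)             # 100 x 2 instead of 100 x 5
    print(pca.explained_variance_ratio_)         # variance captured per component

    # Reconstruct an approximation of the original data from the
    # strongest components.
    X_approx = pca.inverse_transform(X_reduced)
    print(np.abs(X - X_approx).max())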
Attribute Subset Selection
Attribute subset selection reduces the data set size by removing irrelevant or
redundant attributes (or dimensions). The goal of attribute subset selection is to find
a minimum set of attributes such that the resulting probability distribution of the data
classes is as close as possible to the original distribution obtained using all attributes.
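As one possible realization of greedy stepwise forward selection, the sketch below uses scikit-learn's SequentialFeatureSelector with a decision tree; the synthetic data set and the decision to keep two attributes are assumptions, not the book's prescribed procedure.

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.feature_selection import SequentialFeatureSelector
    from sklearn.tree import DecisionTreeClassifier

    # Synthetic data: 6 attributes, only a few of which are informative.
    X, y = make_classification(n_samples=300, n_features=6, n_informative=2,
                               n_redundant=2, random_state=0)

    # Greedy stepwise forward selection: start with no attributes and
    # repeatedly add the one that most improves the classifier.
    selector = SequentialFeatureSelector(
        DecisionTreeClassifier(random_state=0),
        n_features_to_select=2,
        direction="forward",
    )
    selector.fit(X, y)
    print("selected attribute indices:", np.flatnonzero(selector.get_support()))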
Regression and Log-Linear Models: Parametric Data Reduction
Regression and log-linear models can be used to approximate the given data.
Histograms
Histograms use binning to approximate data distributions and are a popular form
of data reduction.
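A minimal sketch of an equal-width histogram as a reduced representation, keeping only bucket edges and counts instead of every value; the price data are synthetic.

    import numpy as np

    rng = np.random.default_rng(1)
    prices = rng.integers(1, 31, size=1_000)     # illustrative price values

    # Equal-width histogram with 6 buckets: the full column is replaced by
    # 6 counts plus 7 bucket edges.
    counts, edges = np.histogram(prices, bins=6, range=(0, 30))
    for count, lo, hi in zip(counts, edges[:-1], edges[1:]):
        print(f"({lo:>4.0f}, {hi:>4.0f}]: {count} values")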
Clustering
In data reduction, the cluster representations of the data are used to replace the actual
data.
Sampling
Sampling can be used as a data reduction technique because it allows a large data set to
be represented by a much smaller random data sample (or subset).
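A sketch of a simple random sample without replacement drawn with pandas; the data set and the 1% sampling fraction are illustrative.

    import pandas as pd
    import numpy as np

    rng = np.random.default_rng(42)
    # A large synthetic data set of N tuples.
    data = pd.DataFrame({"value": rng.normal(loc=100, scale=15, size=100_000)})

    # Simple random sample without replacement of 1% of the tuples
    # (use replace=True for sampling with replacement).
    sample = data.sample(frac=0.01, replace=False, random_state=42)

    # The sample mean approximates the full data set's mean.
    print(data["value"].mean(), sample["value"].mean())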
Data Cube Aggregation
Example: A data cube can store sales data for multidimensional analysis, for instance,
annual sales per item type for each branch of a company. Each cell holds an aggregate
data value, corresponding to a data point in multidimensional space.
Concept hierarchies may exist for each attribute, allowing the analysis of data
at multiple abstraction levels. For example, a hierarchy for branch could allow branches
to be grouped into regions, based on their address. Data cubes provide fast access to
precomputed, summarized data, thereby benefiting online analytical processing as well
as data mining.
The cube created at the lowest abstraction level is referred to as the base cuboid. The
base cuboid should correspond to an individual entity of interest such as sales or customer.
In other words, the lowest level should be usable, or useful for the analysis. A cube
at the highest level of abstraction is the apex cuboid.
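A sketch of cube-style aggregation with pandas groupby: individual sales records are rolled up to a base-cuboid-like view (annual sales per item type per branch) and then all the way to a single apex total; the table, column names, and figures are hypothetical.

    import pandas as pd

    sales = pd.DataFrame({
        "branch":    ["B1", "B1", "B2", "B2", "B1", "B2"],
        "item_type": ["TV", "computer", "TV", "computer", "TV", "TV"],
        "year":      [2023, 2023, 2023, 2024, 2024, 2024],
        "amount":    [400.0, 900.0, 350.0, 1100.0, 420.0, 380.0],
    })

    # Base-cuboid-like view: annual sales per item type for each branch.
    base = sales.groupby(["branch", "item_type", "year"])["amount"].sum()

    # Roll up the year dimension, then everything (apex cuboid: one total).
    by_branch_item = sales.groupby(["branch", "item_type"])["amount"].sum()
    apex = sales["amount"].sum()

    print(base, by_branch_item, apex, sep="\n\n")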
5 Data Transformation and Data Discretization
Data Transformation Strategies Overview
In data transformation, the data are transformed or consolidated into forms appropriate
for mining. Strategies for data transformation include the following:
1. Smoothing, which works to remove noise from the data. Techniques include binning,
regression, and clustering.
2. Attribute construction (or feature construction), where new attributes are constructed
and added from the given set of attributes to help the mining process.
3. Aggregation, where summary or aggregation operations are applied to the data. For
example, the daily sales data may be aggregated so as to compute monthly and annual
total amounts. This step is typically used in constructing a data cube for data analysis
at multiple abstraction levels.
4. Normalization, where the attribute data are scaled so as to fall within a smaller range,
such as −1.0 to 1.0, or 0.0 to 1.0.
5. Discretization, where the raw values of a numeric attribute (e.g., age) are replaced by
interval labels (e.g., 0–10, 11–20, etc.) or conceptual labels (e.g., youth, adult, senior).
The labels, in turn, can be recursively organized into higher-level concepts, resulting
in a concept hierarchy for the numeric attribute. More than one concept hierarchy can be defined for the same
attribute to accommodate the needs of various users.
6. Concept hierarchy generation for nominal data, where attributes such as street can
be generalized to higher-level concepts, like city or country. Many hierarchies for
nominal attributes are implicit within the database schema and can be automatically
defined at the schema definition level.
Data Transformation by Normalization
To help avoid dependence on the choice of measurement units, the
data should be normalized or standardized. This involves transforming the data to fall
within a smaller or common range such as [−1, 1] or [0.0, 1.0]. (The terms standardize
and normalize are used interchangeably in data preprocessing.) Normalizing the data attempts to give all attributes an equal weight.
Data normalization methods described in the book include min-max normalization,
z-score normalization, and normalization by decimal scaling.
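A sketch of the three normalization methods just named, applied to one illustrative numeric attribute with a target range of [0.0, 1.0] for min-max normalization; NumPy is assumed available.

    import numpy as np

    v = np.array([-125.0, 300.0, 475.0, 900.0])   # illustrative attribute values

    # Min-max normalization to the new range [0.0, 1.0].
    new_min, new_max = 0.0, 1.0
    minmax = (v - v.min()) / (v.max() - v.min()) * (new_max - new_min) + new_min

    # z-score normalization: subtract the mean, divide by the standard deviation.
    zscore = (v - v.mean()) / v.std()

    # Normalization by decimal scaling: divide by 10**j, where j is the
    # smallest integer such that every normalized value has magnitude < 1.
    j = 0
    while np.abs(v / 10.0 ** j).max() >= 1:
        j += 1
    decimal_scaled = v / 10.0 ** j

    print(minmax)
    print(zscore)
    print(decimal_scaled)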
Discretization by Binning
Binning is a top-down splitting technique based on a specified number of bins.
Binning methods for data smoothing are also
used as discretization methods for data reduction and concept hierarchy generation.
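A minimal sketch of top-down discretization of an age attribute with pandas: equal-width interval labels from a specified number of bins, and conceptual labels (youth, adult, senior) from explicit cut points; the ages and cut points are illustrative.

    import pandas as pd

    ages = pd.Series([5, 13, 22, 27, 34, 45, 58, 63, 71])

    # Equal-width binning into a specified number of bins (interval labels).
    interval_labels = pd.cut(ages, bins=3)

    # Binning against explicit cut points, replaced by conceptual labels.
    conceptual = pd.cut(
        ages,
        bins=[0, 20, 60, 120],
        labels=["youth", "adult", "senior"],
    )

    print(pd.DataFrame({"age": ages, "interval": interval_labels, "concept": conceptual}))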
Discretization by Histogram Analysis
Like binning, histogram analysis is an unsupervised discretization technique because it
does not use class information.
Discretization by Cluster, Decision Tree,
and Correlation Analyses
Clustering, decision tree analysis, and correlation analysis can be used for data discretization.
Concept Hierarchy Generation for Nominal Data
Nominal attributes have a finite (but possibly
large) number of distinct values, with no ordering among the values.
Four methods for the generation of concept hierarchies for nominal data are described below:
1. Specification of a partial ordering of attributes explicitly at the schema level by
users or experts: Concept hierarchies for nominal attributes or dimensions typically
involve a group of attributes. A user or expert can easily define a concept hierarchy by
specifying a partial or total ordering of the attributes at the schema level.
2. Specification of a portion of a hierarchy by explicit data grouping: This is essentially
the manual definition of a portion of a concept hierarchy. In a large database, it is
unrealistic to define an entire concept hierarchy by explicit value enumeration. However,
we can easily specify explicit groupings for a small portion
of intermediate-level data. For example, after specifying that province and country
form a hierarchy at the schema level, a user could define some intermediate levels
manually, such as “{Alberta, Saskatchewan, Manitoba} ⊂ prairies Canada” and
“{British Columbia, prairies Canada} ⊂ Western Canada.”
3. Specification of a set of attributes, but not of their partial ordering: A user may
specify a set of attributes forming a concept hierarchy, but omit to explicitly state
their partial ordering. The system can then try to automatically generate the attribute
ordering so as to construct a meaningful concept hierarchy.
4. Specification of only a partial set of attributes: The user may have included only a small subset of the
relevant attributes in the hierarchy specification. For example, the user may specify
only street and city for a location. If data semantics are embedded in the database schema, attributes with tight semantic
connections can be pinned together. In this way, the specification of one attribute
may trigger a whole group of semantically tightly linked attributes to be “dragged in”
to form a complete hierarchy.
Next Chapter
Data Warehousing and Online Analytical Processing - Chapter of Data Mining
Data Cube Technologies for Data Mining
Mining Frequent Patterns, Associations, and Correlations: Basic Concepts and Methods
Advanced Patterns - Data Mining
Data Mining - Classification: Basic Concepts
Data Mining - Classification: Advanced Methods
Data Mining Recent Trends and Research Frontiers
Excerpts from
Data Mining Concepts and Techniques
Third Edition
Jiawei Han
University of Illinois at Urbana–Champaign
Micheline Kamber
Jian Pei
Simon Fraser University