Wednesday, April 13, 2016

Knowing and Understanding the Dataset for Data Mining


1. Data Objects and Attributes


Data objects are typically described by attributes. Data objects can also be
referred to as samples, examples, instances, data points, or objects. If the data objects are
stored in a database, they are data tuples. That is, the rows of a database correspond to
the data objects, and the columns correspond to the attributes. In this section, we define
attributes and look at the various attribute types.

What Is an Attribute?
An attribute is a data field, representing a characteristic or feature of a data object. The
nouns attribute, dimension, feature, and variable are often used interchangeably in the
literature. The term dimension is commonly used in data warehousing. Machine learning
literature tends to use the term feature, while statisticians prefer the term variable. Data
mining and database professionals commonly use the term attribute.

Nominal Attributes
Nominal means “relating to names.” The values of a nominal attribute are symbols or
names of things.

Binary Attributes
A binary attribute is a nominal attribute with only two categories or states: 0 or 1, where
0 typically means that the attribute is absent, and 1 means that it is present. Binary
attributes are referred to as Boolean if the two states correspond to true and false.


Ordinal Attributes
An ordinal attribute is an attribute with possible values that have a meaningful order or
ranking among them, but the magnitude between successive values is not known.

Numeric Attributes
A numeric attribute is quantitative; that is, it is a measurable quantity, represented in
integer or real values. Numeric attributes can be interval-scaled or ratio-scaled.

Interval-Scaled Attributes
Interval-scaled attributes are measured on a scale of equal-size units. The values of
interval-scaled attributes have order and can be positive, 0, or negative. Thus, in addition
to providing a ranking of values, such attributes allow us to compare and quantify the
difference between values.


Ratio-Scaled Attributes
A ratio-scaled attribute is a numeric attribute with an inherent zero-point. That is, if
a measurement is ratio-scaled, we can speak of a value as being a multiple (or ratio)
of another value. In addition, the values are ordered, and we can also compute the
difference between values, as well as the mean, median, and mode.
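A small sketch of the interval versus ratio distinction, assuming temperature as the example attribute. Celsius is interval-scaled (its zero is arbitrary), while Kelvin is ratio-scaled (its zero is a true zero-point). The function name and the two readings are made up for illustration.

```python
def celsius_to_kelvin(celsius: float) -> float:
    """Convert a Celsius reading to Kelvin (a scale with a true zero-point)."""
    return celsius + 273.15

# Two hypothetical temperature readings in Celsius.
t1_c, t2_c = 10.0, 20.0

# Differences are meaningful on both scales and agree with each other.
diff_c = t2_c - t1_c
diff_k = celsius_to_kelvin(t2_c) - celsius_to_kelvin(t1_c)

# Ratios are only meaningful on the ratio scale: 20 C is not "twice as warm" as 10 C.
ratio_c = t2_c / t1_c                                        # numerically 2.0, but misleading
ratio_k = celsius_to_kelvin(t2_c) / celsius_to_kelvin(t1_c)  # only slightly above 1

print(diff_c, diff_k)
print(ratio_c, ratio_k)
```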

Classification algorithms developed from the field of machine learning often talk of
attributes as being either discrete or continuous. Each type may be processed differently.
A discrete attribute has a finite or countably infinite set of values, which may or may not
be represented as integers.

If an attribute is not discrete, it is continuous.

2. Basic Statistical Descriptions of Data

(You learned these in a basic statistics course long ago.)

Measuring the Central Tendency: Mean, Median, and Mode
Measuring the Dispersion of Data: Range, Quartiles, Variance,
Standard Deviation, and Interquartile Range
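As a quick refresher, the measures listed above can be computed with Python's standard `statistics` module. This is only a sketch: the values are a hypothetical sample, and the quartiles use the module's default interpolation method, which may differ slightly from a textbook's hand calculation.

```python
import statistics

# Hypothetical sample of one numeric attribute, already sorted for readability.
values = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]

# Central tendency
mean = statistics.mean(values)
median = statistics.median(values)
modes = statistics.multimode(values)   # every most-frequent value, in order of first appearance

# Dispersion
value_range = max(values) - min(values)
q1, q2, q3 = statistics.quantiles(values, n=4)   # three cut points: Q1, Q2 (median), Q3
iqr = q3 - q1
variance = statistics.pvariance(values)          # population variance
std_dev = statistics.pstdev(values)              # population standard deviation

print("mean:", mean, "median:", median, "modes:", modes)
print("range:", value_range, "IQR:", iqr)
print("variance:", variance, "std dev:", std_dev)
```

Note that this sample has two modes, so `multimode` is used instead of `mode`.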


Graphic Displays of Basic Statistical Descriptions of Data
Histograms
Scatter Plots and Data Correlation
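
A scatter plot shows visually whether two numeric attributes move together. One common numeric counterpart is the Pearson correlation coefficient, sketched below. The function name and the sample values are assumptions made for illustration, and the sketch does not guard against an attribute with zero variance.

```python
import math

def pearson_correlation(xs, ys):
    """Pearson correlation coefficient between two equal-length numeric sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations from the means (covariance numerator).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Square roots of the summed squared deviations of each attribute.
    dev_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    dev_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (dev_x * dev_y)

# Hypothetical attribute values.
xs = [1, 2, 3, 4, 5]
rising = [2, 4, 6, 8, 10]    # increases with xs
falling = [10, 8, 6, 4, 2]   # decreases as xs increases

r_rising = pearson_correlation(xs, rising)
r_falling = pearson_correlation(xs, falling)
print(r_rising, r_falling)
```

A value near 1 suggests a positive linear relationship, a value near -1 a negative one, and a value near 0 little linear relationship.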


3. Data Visualization

How can we convey data to users effectively? Data visualization aims to communicate
data clearly and effectively through graphical representation. Data visualization has been
used extensively in many applications—for example, at work for reporting, managing
business operations, and tracking progress of tasks. More popularly, we can take advantage
of visualization techniques to discover data relationships that are otherwise not
easily observable by looking at the raw data.

Pixel-Oriented Visualization Techniques

Geometric Projection Visualization Techniques

Icon-Based Visualization Techniques

Hierarchical Visualization Techniques

Visualizing Complex Data and Relations





4. Measuring Data Similarity and Dissimilarity

Data Matrix versus Dissimilarity Matrix
Data matrix (or object-by-attribute structure): This structure stores the n data objects
in the form of a relational table, or n-by-p matrix (n objects × p attributes), in which
each row holds one object and each column holds one attribute.


Dissimilarity matrix (or object-by-object structure): This structure stores a collection
of proximities that are available for all pairs of n objects. It is often represented by an
n-by-n table of d(i, j):


where d(i, j) is the measured dissimilarity or “difference” between objects i and j. In
general, d(i, j) is a non-negative number that is close to 0 when objects i and j are
highly similar or “near” each other, and becomes larger the more they differ.
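
The two structures described above can be sketched in a few lines. This is an illustration only: the data values are hypothetical, and Euclidean distance is assumed as the dissimilarity measure d(i, j), although any measure could be plugged in.

```python
import math

# Hypothetical data matrix: n = 4 objects (rows) by p = 2 numeric attributes (columns).
data_matrix = [
    [1.0, 2.0],
    [3.0, 5.0],
    [2.0, 0.0],
    [4.0, 5.0],
]

def euclidean(x, y):
    """Euclidean distance between two objects (assumed choice of d(i, j))."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def dissimilarity_matrix(rows, dist):
    """Build the n-by-n object-by-object table of d(i, j) from an n-by-p data matrix."""
    n = len(rows)
    return [[dist(rows[i], rows[j]) for j in range(n)] for i in range(n)]

d = dissimilarity_matrix(data_matrix, euclidean)
for row in d:
    print(["%.2f" % value for value in row])
```

The diagonal is all zeros because d(i, i) = 0, and the table is symmetric because d(i, j) = d(j, i), so in practice only the lower triangle needs to be stored.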

Proximity Measures for Nominal Attributes
Proximity Measures for Binary Attributes
Dissimilarity of Numeric Data: Minkowski Distance


Distance measures are commonly used for computing
the dissimilarity of objects described by numeric attributes. These measures include the
Euclidean, Manhattan, and Minkowski distances.
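
A minimal sketch of these distances follows. Minkowski distance with order h is the h-th root of the summed h-th powers of the absolute attribute differences; h = 1 gives the Manhattan distance and h = 2 gives the Euclidean distance. The function name and the two sample objects are assumptions made for illustration.

```python
def minkowski(x, y, h):
    """Minkowski distance of order h (h >= 1) between two numeric objects."""
    return sum(abs(a - b) ** h for a, b in zip(x, y)) ** (1.0 / h)

# Two hypothetical objects described by two numeric attributes.
x = (1, 2)
y = (3, 5)

manhattan = minkowski(x, y, 1)   # |1 - 3| + |2 - 5|
euclidean = minkowski(x, y, 2)   # sqrt((1 - 3)^2 + (2 - 5)^2)

# As h grows, the distance approaches the largest single-attribute difference
# (the supremum, or Chebyshev, distance).
supremum = max(abs(a - b) for a, b in zip(x, y))

print(manhattan, euclidean, supremum)
```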

Proximity Measures for Ordinal Attributes
The values of an ordinal attribute have a meaningful order or ranking among them,
yet the magnitude between successive values is unknown (Section 2.1.4). An example
includes the sequence small, medium, large for a size attribute.
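
One common way to handle such an attribute is to replace each value by its rank r (1 to M), rescale the rank to [0, 1] with z = (r - 1) / (M - 1), and then treat z as a numeric value. A sketch under that assumption, using the size attribute above (the variable and function names are hypothetical):

```python
# Ordered states of the ordinal attribute, from lowest to highest rank.
size_states = ["small", "medium", "large"]

# Rank of each state: small -> 1, medium -> 2, large -> 3.
rank = {state: position + 1 for position, state in enumerate(size_states)}
num_states = len(size_states)   # M

def normalized_rank(state):
    """Map a state's rank r onto [0, 1] using z = (r - 1) / (M - 1)."""
    return (rank[state] - 1) / (num_states - 1)

def ordinal_dissimilarity(a, b):
    """Dissimilarity of two ordinal values, computed on their normalized ranks."""
    return abs(normalized_rank(a) - normalized_rank(b))

print(ordinal_dissimilarity("small", "medium"))
print(ordinal_dissimilarity("small", "large"))
```

Normalizing keeps an attribute with many states from outweighing one with only a few when several ordinal attributes are combined.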

Dissimilarity for Attributes of Mixed Types


Cosine Similarity
A document can be represented by thousands of attributes, each recording the frequency
of a particular word (such as a keyword) or phrase in the document. Thus, each document
is an object represented by what is called a term-frequency vector. For example, in
Table 2.5, we see that Document1 contains five instances of the word team, while hockey
occurs three times. The word coach is absent from the entire document, as indicated by
a count value of 0. Such data can be highly asymmetric.
Term-frequency vectors are typically very long and sparse (i.e., they have many 0 values).


Cosine similarity is a measure of similarity that can be used to compare documents
or, say, give a ranking of documents with respect to a given vector of query
words. Let x and y be two vectors for comparison.
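
The cosine similarity of x and y is their dot product divided by the product of their Euclidean norms, that is, the cosine of the angle between the two vectors. A minimal sketch follows. The term counts for the second document are hypothetical, and returning 0.0 for an all-zero vector is a convention chosen here, not something the measure itself defines.

```python
import math

def cosine_similarity(x, y):
    """Cosine of the angle between two term-frequency vectors of equal length."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    if norm_x == 0.0 or norm_y == 0.0:
        return 0.0   # assumed convention: an empty document matches nothing
    return dot / (norm_x * norm_y)

# Term order: team, coach, hockey.
document1 = [5, 0, 3]   # counts mentioned in the text above
document2 = [3, 0, 2]   # hypothetical counts for a second document

similarity = cosine_similarity(document1, document2)
print(round(similarity, 3))
```

A value close to 1 means the two documents use the terms in nearly the same proportions, while a value of 0 means they share no terms.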


Next Topic:  Data Preprocessing for Data Mining
Data Warehousing and Online Analytical Processing - Chapter of Data Mining
Data Cube Technologies for Data Mining
Mining Frequent Patterns, Associations, and Correlations: Basic Concepts and Methods

Advanced Patterns - Data Mining
Data Mining - Classification: Basic Concepts
Data Mining - Classification: Advanced Methods

Data Mining Recent Trends and Research Frontiers

Excerpts from

Data Mining
Concepts and Techniques
Third Edition
Jiawei Han
University of Illinois at Urbana–Champaign
Micheline Kamber
Jian Pei
Simon Fraser University
