## 1. Data Objects and Attributes

Data objects are typically described by attributes. Data objects can also be referred to as samples, examples, instances, data points, or objects. If the data objects are stored in a database, they are data tuples. That is, the rows of a database correspond to the data objects, and the columns correspond to the attributes. In this section, we define attributes and look at the various attribute types.
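The row/column correspondence can be sketched in a few lines of Python. The records and attribute names below (`customer_id`, `name`, `age`) are hypothetical and serve only as an illustration.

```python
# Hypothetical customer records: each dict is one data object (a row),
# and each key is an attribute (a column).
customers = [
    {"customer_id": 1, "name": "Ana", "age": 34},
    {"customer_id": 2, "name": "Ben", "age": 27},
    {"customer_id": 3, "name": "Chen", "age": 45},
]

# The attribute names are the keys shared by every record.
attributes = list(customers[0].keys())

# The number of data objects is the number of rows.
n_objects = len(customers)

# Reading one column gives every object's value for a single attribute.
ages = [row["age"] for row in customers]

print(attributes)
print(n_objects)
print(ages)
```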

What Is an Attribute?

An attribute is a data field, representing a characteristic or feature of a data object. The nouns attribute, dimension, feature, and variable are often used interchangeably in the literature. The term dimension is commonly used in data warehousing. Machine learning literature tends to use the term feature, while statisticians prefer the term variable. Data mining and database professionals commonly use the term attribute.

Nominal Attributes

Nominal means “relating to names.” The values of a nominal attribute are symbols or names of things.

Binary Attributes

A binary attribute is a nominal attribute with only two categories or states: 0 or 1, where 0 typically means that the attribute is absent, and 1 means that it is present. Binary attributes are referred to as Boolean if the two states correspond to true and false.

Ordinal Attributes

An ordinal attribute is an attribute with possible values that have a meaningful order or ranking among them, but the magnitude between successive values is not known.

Numeric Attributes

A numeric attribute is quantitative; that is, it is a measurable quantity, represented in integer or real values. Numeric attributes can be interval-scaled or ratio-scaled.

Interval-Scaled Attributes

Interval-scaled attributes are measured on a scale of equal-size units. The values of interval-scaled attributes have order and can be positive, 0, or negative. Thus, in addition to providing a ranking of values, such attributes allow us to compare and quantify the difference between values.

Ratio-Scaled Attributes

A ratio-scaled attribute is a numeric attribute with an inherent zero-point. That is, if a measurement is ratio-scaled, we can speak of a value as being a multiple (or ratio) of another value. In addition, the values are ordered, and we can also compute the difference between values, as well as the mean, median, and mode.
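The contrast between interval-scaled and ratio-scaled attributes can be illustrated with temperature. The sketch below assumes Celsius as the interval-scaled example (its zero is a convention) and Kelvin as the ratio-scaled one (its zero is an inherent zero-point). The readings are made up.

```python
def celsius_to_kelvin(celsius):
    """Convert a Celsius reading to Kelvin."""
    return celsius + 273.15

# Two hypothetical temperature readings in Celsius.
a_c, b_c = 10.0, 20.0
a_k, b_k = celsius_to_kelvin(a_c), celsius_to_kelvin(b_c)

# Differences are meaningful on both scales and agree with each other.
diff_c = b_c - a_c
diff_k = b_k - a_k

# The Celsius ratio is 2.0, but 20 degrees C is not "twice as hot" as
# 10 degrees C, because 0 degrees C is not a true zero-point.
ratio_c = b_c / a_c

# On the Kelvin scale the ratio is a meaningful multiple.
ratio_k = b_k / a_k

print(diff_c, diff_k)
print(ratio_c, ratio_k)
```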

Classification algorithms developed from the field of machine learning often talk of attributes as being either discrete or continuous. Each type may be processed differently. A discrete attribute has a finite or countably infinite set of values, which may or may not be represented as integers. If an attribute is not discrete, it is continuous.

## 2. Basic Statistical Descriptions of Data

(You learned these topics in a basic statistics course long ago.)

Measuring the Central Tendency: Mean, Median, and Mode
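As a quick refresher, the three measures can be computed with Python's standard `statistics` module. The salary figures (in thousands of dollars) are an illustrative sample chosen here, not data taken from this post.

```python
import statistics

# Illustrative salaries in thousands of dollars, already sorted.
salaries = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]

# Mean: the sum of the values divided by how many there are.
mean_salary = statistics.mean(salaries)

# Median: the middle value; with an even count, the average of the
# two middle values.
median_salary = statistics.median(salaries)

# Mode: the most frequent value(s). multimode returns every value tied
# for the highest frequency, so bimodal data yields two entries.
modes = statistics.multimode(salaries)

print(mean_salary, median_salary, modes)
```

Note how the single large value (110) pulls the mean above the median, which is why the median is often preferred for skewed data.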

Measuring the Dispersion of Data: Range, Quartiles, Variance, Standard Deviation, and Interquartile Range
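The dispersion measures named in this heading can be sketched with the standard `statistics` module. Two assumptions are made here: quartiles use the `"inclusive"` interpolation method (textbooks and tools differ on quartile conventions, so other methods give slightly different values), and variance is the population variance (dividing by N). The data are an illustrative sample.

```python
import statistics

# Illustrative salaries in thousands of dollars.
salaries = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]

# Range: difference between the largest and smallest values.
value_range = max(salaries) - min(salaries)

# Quartiles: the three cut points that split the sorted data into four
# equal-size parts. q2 is the median.
q1, q2, q3 = statistics.quantiles(salaries, n=4, method="inclusive")

# Interquartile range: the spread of the middle half of the data.
iqr = q3 - q1

# Population variance and standard deviation (divide by N, not N - 1).
variance = statistics.pvariance(salaries)
std_dev = statistics.pstdev(salaries)

print(value_range, (q1, q2, q3), iqr)
print(variance, std_dev)
```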

Graphic Displays of Basic Statistical Descriptions of Data

Histograms

Scatter Plots and Data Correlation
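A scatter plot shows visually what a correlation coefficient summarizes in a single number. The sketch below computes the Pearson correlation coefficient by hand. The function name and the sample points are hypothetical.

```python
import math

def pearson_correlation(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of products of deviations (the covariance before dividing by n).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Sums of squared deviations for each attribute.
    ss_x = sum((x - mean_x) ** 2 for x in xs)
    ss_y = sum((y - mean_y) ** 2 for y in ys)
    if ss_x == 0 or ss_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(ss_x * ss_y)

xs = [1, 2, 3, 4, 5]
r_up = pearson_correlation(xs, [2, 4, 6, 8, 10])    # points rise together
r_down = pearson_correlation(xs, [10, 8, 6, 4, 2])  # one falls as the other rises

print(r_up, r_down)
```

A value near +1 corresponds to points sloping upward in the scatter plot, a value near -1 to points sloping downward, and a value near 0 to no linear pattern.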

## 3. Data Visualization

How can we convey data to users effectively? Data visualization aims to communicate data clearly and effectively through graphical representation. Data visualization has been used extensively in many applications, for example at work for reporting, managing business operations, and tracking progress of tasks. More popularly, we can take advantage of visualization techniques to discover data relationships that are otherwise not easily observable by looking at the raw data.

Pixel-Oriented Visualization Techniques

Geometric Projection Visualization Techniques

Icon-Based Visualization Techniques

Hierarchical Visualization Techniques

Visualizing Complex Data and Relations

## 4. Measuring Data Similarity and Dissimilarity

Data Matrix versus Dissimilarity Matrix

Data matrix (or object-by-attribute structure): This structure stores the n data objects in the form of a relational table, or n-by-p matrix (n objects × p attributes).
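In the usual notation, such a matrix can be written as follows. The symbol $x_{if}$ is an assumption introduced here; it stands for the value of attribute $f$ for object $i$.

$$
\begin{bmatrix}
x_{11} & \cdots & x_{1f} & \cdots & x_{1p} \\
\vdots &        & \vdots &        & \vdots \\
x_{i1} & \cdots & x_{if} & \cdots & x_{ip} \\
\vdots &        & \vdots &        & \vdots \\
x_{n1} & \cdots & x_{nf} & \cdots & x_{np}
\end{bmatrix}
$$

Each row holds one object and each column holds one attribute.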

Dissimilarity matrix (or object-by-object structure): This structure stores a collection of proximities that are available for all pairs of n objects. It is often represented by an n-by-n table of d(i, j), where d(i, j) is the measured dissimilarity or “difference” between objects i and j. In general, d(i, j) is a non-negative number that is close to 0 when objects i and j are highly similar or “near” each other, and becomes larger the more they differ.
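A common way to lay out this table is sketched below. It assumes that $d(i, i) = 0$ and that $d(i, j) = d(j, i)$, so only the lower triangle needs to be written.

$$
\begin{bmatrix}
0      &        &        &        &   \\
d(2,1) & 0      &        &        &   \\
d(3,1) & d(3,2) & 0      &        &   \\
\vdots & \vdots & \vdots & \ddots &   \\
d(n,1) & d(n,2) & \cdots & \cdots & 0
\end{bmatrix}
$$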

Proximity Measures for Nominal Attributes
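One widely used measure for nominal attributes is the ratio of mismatches, d(i, j) = (p − m) / p, where p is the total number of attributes and m is the number of attributes on which the two objects match. The sketch below assumes that measure; the function name and the sample objects are hypothetical.

```python
def nominal_dissimilarity(obj_i, obj_j):
    """Ratio of mismatching attributes: (p - m) / p."""
    if len(obj_i) != len(obj_j) or len(obj_i) == 0:
        raise ValueError("objects must have the same non-zero number of attributes")
    p = len(obj_i)                                        # total attributes
    m = sum(1 for a, b in zip(obj_i, obj_j) if a == b)    # matching attributes
    return (p - m) / p

# Hypothetical objects described by (color, shape, size).
obj_a = ("red", "circle", "small")
obj_b = ("red", "square", "small")

d_ab = nominal_dissimilarity(obj_a, obj_b)  # one mismatch out of three
d_aa = nominal_dissimilarity(obj_a, obj_a)  # identical objects

print(d_ab, d_aa)
```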

Proximity Measures for Binary Attributes
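For binary attributes, dissimilarity is commonly computed from a 2×2 contingency table of counts: q (both objects are 1), r (first is 1, second is 0), s (first is 0, second is 1), and t (both are 0). The sketch below assumes the two usual variants: symmetric, (r + s) / (q + r + s + t), and asymmetric, (r + s) / (q + r + s), which ignores 0-0 matches. Function names and the sample records are hypothetical.

```python
def contingency_counts(x, y):
    """Return (q, r, s, t) for two equal-length 0/1 vectors."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same length")
    q = sum(1 for a, b in zip(x, y) if a == 1 and b == 1)
    r = sum(1 for a, b in zip(x, y) if a == 1 and b == 0)
    s = sum(1 for a, b in zip(x, y) if a == 0 and b == 1)
    t = sum(1 for a, b in zip(x, y) if a == 0 and b == 0)
    return q, r, s, t

def symmetric_binary_dissimilarity(x, y):
    """Both states count equally: (r + s) / (q + r + s + t)."""
    q, r, s, t = contingency_counts(x, y)
    return (r + s) / (q + r + s + t)

def asymmetric_binary_dissimilarity(x, y):
    """0-0 matches are ignored: (r + s) / (q + r + s)."""
    q, r, s, _ = contingency_counts(x, y)
    if q + r + s == 0:
        return 0.0  # convention chosen here: all-zero vectors are identical
    return (r + s) / (q + r + s)

# Hypothetical patient records: 1 = symptom or positive test, 0 = absent.
patient_a = (1, 0, 1, 0, 0, 0)
patient_b = (1, 0, 1, 0, 1, 0)

d_sym = symmetric_binary_dissimilarity(patient_a, patient_b)
d_asym = asymmetric_binary_dissimilarity(patient_a, patient_b)

print(d_sym, d_asym)
```

The asymmetric form is the natural choice when a 1 (for example, a positive test) is rarer and more informative than a 0.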

Dissimilarity of Numeric Data: Minkowski Distance

Distance measures are commonly used for computing the dissimilarity of objects described by numeric attributes. These measures include the Euclidean, Manhattan, and Minkowski distances.
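The Minkowski distance generalizes the other two: d(i, j) = (Σ_f |x_if − x_jf|^h)^(1/h) for h ≥ 1, which gives the Manhattan distance when h = 1 and the Euclidean distance when h = 2. The following is a minimal sketch; the function names and the sample points are assumptions made for illustration.

```python
def minkowski_distance(x, y, h):
    """Minkowski distance of order h (h >= 1) between two numeric vectors."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same length")
    if h < 1:
        raise ValueError("h must be at least 1")
    return sum(abs(a - b) ** h for a, b in zip(x, y)) ** (1.0 / h)

def dissimilarity_matrix(data, h=2):
    """Full n-by-n table of pairwise Minkowski distances."""
    n = len(data)
    return [[minkowski_distance(data[i], data[j], h) for j in range(n)]
            for i in range(n)]

# Two hypothetical objects with two numeric attributes each.
x1 = (1, 2)
x2 = (3, 5)

manhattan = minkowski_distance(x1, x2, 1)  # |1-3| + |2-5|
euclidean = minkowski_distance(x1, x2, 2)  # sqrt(2^2 + 3^2)

matrix = dissimilarity_matrix([x1, x2, (4, 1)])

print(manhattan, euclidean)
for row in matrix:
    print(row)
```

The resulting matrix has zeros on its diagonal and is symmetric, so in practice only one triangle needs to be stored.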

Proximity Measures for Ordinal Attributes

The values of an ordinal attribute have a meaningful order or ranking about them, yet the magnitude between successive values is unknown (Section 2.1.4). An example includes the sequence small, medium, large for a size attribute.
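A common treatment, assumed in the sketch below, is to replace each ordinal value by its rank r in 1..M, normalize it to z = (r − 1) / (M − 1) so that every attribute falls in [0, 1], and then apply a numeric distance to the z values. The function name is hypothetical.

```python
def ordinal_to_unit_interval(value, ordered_states):
    """Map an ordinal value to [0, 1] using z = (r - 1) / (M - 1)."""
    m_states = len(ordered_states)
    if m_states < 2:
        raise ValueError("need at least two ordered states")
    rank = ordered_states.index(value) + 1  # ranks start at 1
    return (rank - 1) / (m_states - 1)

# The states must be listed from lowest to highest.
sizes = ["small", "medium", "large"]
z = {state: ordinal_to_unit_interval(state, sizes) for state in sizes}

# Dissimilarity between two objects on this attribute, taken here as the
# absolute difference of their normalized ranks.
d_small_medium = abs(z["small"] - z["medium"])
d_small_large = abs(z["small"] - z["large"])

print(z)
print(d_small_medium, d_small_large)
```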

Dissimilarity for Attributes of Mixed Types

Cosine Similarity

A document can be represented by thousands of attributes, each recording the frequency of a particular word (such as a keyword) or phrase in the document. Thus, each document is an object represented by what is called a term-frequency vector. For example, in Table 2.5, we see that Document1 contains five instances of the word team, while hockey occurs three times. The word coach is absent from the entire document, as indicated by a count value of 0. Such data can be highly asymmetric.

Term-frequency vectors are typically very long and sparse (i.e., they have many 0 values). Cosine similarity is a measure of similarity that can be used to compare documents or, say, give a ranking of documents with respect to a given vector of query words. Let x and y be two vectors for comparison.
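With x and y as above, cosine similarity is defined as sim(x, y) = (x · y) / (‖x‖ ‖y‖), where x · y is the dot product and ‖x‖ is the Euclidean norm of x. A minimal sketch follows; the two term-frequency vectors are hypothetical word counts chosen for illustration.

```python
import math

def cosine_similarity(x, y):
    """Cosine of the angle between two equal-length numeric vectors."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    if norm_x == 0 or norm_y == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_x * norm_y)

# Hypothetical term-frequency vectors over the same ten-word vocabulary.
doc_x = (5, 0, 3, 0, 2, 0, 0, 2, 0, 0)
doc_y = (3, 0, 2, 0, 1, 1, 0, 1, 0, 1)

sim = cosine_similarity(doc_x, doc_y)
print(round(sim, 2))
```

A value of 1 means the vectors point in the same direction (the documents use words in the same proportions), and 0 means they share no terms. Because only nonzero entries contribute to the dot product, the measure suits sparse term-frequency vectors.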

Next Topic: Data Preprocessing for Data Mining

Data Warehousing and Online Analytical Processing - Chapter of Data Mining

Data Cube Technologies for Data Mining

Mining Frequent Patterns, Associations, and Correlations: Basic Concepts and Methods

Advanced Patterns - Data Mining

Data Mining - Classification: Basic Concepts

Data Mining - Classification: Advanced Methods

Data Mining Recent Trends and Research Frontiers

Excerpts from Data Mining: Concepts and Techniques, Third Edition, by Jiawei Han (University of Illinois at Urbana–Champaign), Micheline Kamber, and Jian Pei (Simon Fraser University).
