Thursday, April 14, 2016

Data Warehousing and Online Analytical Processing - Chapter of Data Mining



1 Data Warehouse: Basic Concepts


What Is a Data Warehouse?
Data warehousing provides architectures and tools for business executives to systematically
organize, understand, and use their data to make strategic decisions.

According to William H. Inmon, a leading architect in the construction of data
warehouse systems, “A data warehouse is a subject-oriented, integrated, time-variant,
and nonvolatile collection of data in support of management’s decision making process.”
This short but comprehensive definition presents the major features of
a data warehouse. The four keywords—subject-oriented, integrated, time-variant, and
nonvolatile—distinguish data warehouses from other data repository systems, such as
relational database systems, transaction processing systems, and file systems.

Differences between Operational Database Systems
and Data Warehouses


The major distinguishing features of OLTP and OLAP systems are the following:
Users and system orientation: An OLTP system is customer-oriented and is used
for transaction and query processing by clerks, clients, and information technology
professionals. An OLAP system is market-oriented and is used for data analysis by
knowledge workers, including managers, executives, and analysts.

Data contents: An OLTP system manages current data that, typically, are too detailed
to be easily used for decision making. An OLAP system manages large amounts of
historic data, provides facilities for summarization and aggregation, and stores and
manages information at different levels of granularity. These features make the data
easier to use for informed decision making.

Database design: An OLTP system usually adopts an entity-relationship (ER) data
model and an application-oriented database design. An OLAP system typically
adopts either a star or a snowflake model and a subject-oriented
database design.

View: An OLTP system focuses mainly on the current data within an enterprise or
department, without referring to historic data or data in different organizations. In
contrast, an OLAP system often spans multiple versions of a database schema, due to
the evolutionary process of an organization. OLAP systems also deal with information
that originates from different organizations, integrating information from many
data stores. Because of their huge volume, OLAP data are stored on multiple storage
media.

Access patterns: The access patterns of an OLTP system consist mainly of short,
atomic transactions. Such a system requires concurrency control and recovery mechanisms.
However, accesses to OLAP systems are mostly read-only operations (because
most data warehouses store historic rather than up-to-date information), although
many could be complex queries.

Data Warehousing: A Multitiered Architecture
Data warehouses often adopt a three-tier architecture:

1. The bottom tier is a warehouse database server that is almost always a relational
database system.
2. The middle tier is an OLAP server that is typically implemented using either (1) a
relational OLAP (ROLAP) model (i.e., an extended relational DBMS that maps operations
on multidimensional data to standard relational operations, as sketched below); or (2) a multidimensional
OLAP (MOLAP) model (i.e., a special-purpose server that directly
implements multidimensional data and operations).
3. The top tier is a front-end client layer, which contains query and reporting tools,
analysis tools, and/or data mining tools (e.g., trend analysis, prediction, and so on).
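
As a rough picture of what the tier-2 ROLAP mapping involves, the following minimal Python sketch translates a cube request into a relational GROUP BY statement; the function name, fact table, and column names are hypothetical:

    def rolap_query(measures, dims, fact_table="fact_sales"):
        # Map a multidimensional request onto a standard relational operation:
        # the requested dimensions become the GROUP BY, measures become SUMs.
        select = ", ".join(list(dims) + [f"SUM({m}) AS {m}" for m in measures])
        return f"SELECT {select} FROM {fact_table} GROUP BY {', '.join(dims)}"

    # e.g., total dollars sold per (time, item) cell of the cube
    print(rolap_query(["dollars_sold"], ["time_key", "item_key"]))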


Data Warehouse Models: Enterprise Warehouse,
Data Mart, and Virtual Warehouse
From the architecture point of view, there are three data warehouse models: the
enterprise warehouse, the data mart, and the virtual warehouse.
Enterprise warehouse: An enterprise warehouse collects all of the information about
subjects spanning the entire organization. It provides corporate-wide data integration,
usually from one or more operational systems or external information
providers, and is cross-functional in scope. It typically contains detailed data as
well as summarized data, and can range in size from a few gigabytes to hundreds
of gigabytes, terabytes, or beyond. An enterprise data warehouse may be implemented
on traditional mainframes, computer superservers, or parallel architecture
platforms. It requires extensive business modeling and may take years to design
and build.
Data mart: A data mart contains a subset of corporate-wide data that is of value to a
specific group of users. The scope is confined to specific selected subjects. For example,
a marketing data mart may confine its subjects to customer, item, and sales. The
data contained in data marts tend to be summarized.
Data marts are usually implemented on low-cost departmental servers that are
Unix/Linux or Windows based. The implementation cycle of a data mart is more
likely to be measured in weeks rather than months or years. However, it may
involve complex integration in the long run if its design and planning were not
enterprise-wide.
Depending on the source of data, data marts can be categorized as independent
or dependent. Independent data marts are sourced from data captured from one or
more operational systems or external information providers, or from data generated
locally within a particular department or geographic area. Dependent data marts are
sourced directly from enterprise data warehouses.
Virtual warehouse: A virtual warehouse is a set of views over operational databases.
For efficient query processing, only some of the possible summary views may be
materialized.
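
A virtual warehouse can be pictured concretely with Python's built-in sqlite3 module; the table, view, and column names below are hypothetical:

    import sqlite3

    # An in-memory stand-in for an operational database.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE sales (item TEXT, region TEXT, amount REAL)")
    conn.executemany("INSERT INTO sales VALUES (?, ?, ?)",
                     [("pen", "east", 3.0), ("pen", "west", 5.0), ("ink", "east", 2.0)])

    # The "warehouse" is just a set of views over the operational data;
    # in practice only some summary views would be materialized.
    conn.execute("""CREATE VIEW sales_by_region AS
                    SELECT region, SUM(amount) AS total FROM sales GROUP BY region""")
    print(conn.execute("SELECT * FROM sales_by_region").fetchall())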

Extraction, Transformation, and Loading
Data warehouse systems use back-end tools and utilities to populate and refresh their
data. These tools and utilities include the following functions:
Data extraction, which typically gathers data from multiple, heterogeneous, and
external sources.
Data cleaning, which detects errors in the data and rectifies them when possible.
Data transformation, which converts data from legacy or host format to warehouse
format.
Load, which sorts, summarizes, consolidates, computes views, checks integrity, and
builds indices and partitions.
Refresh, which propagates the updates from the data sources to the warehouse.
Besides cleaning, loading, refreshing, and metadata definition tools, data warehouse
systems usually provide a good set of data warehouse management tools.
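
To show how these functions chain together, here is a deliberately tiny ETL sketch in Python; the record layout and cleaning rules are invented for illustration:

    def extract(sources):
        # Gather records from multiple heterogeneous (here: in-memory) sources.
        return [row for src in sources for row in src]

    def clean(rows):
        # Detect errors and rectify where possible; here we just drop bad rows.
        return [r for r in rows if r.get("amount") is not None]

    def transform(rows):
        # Convert from "legacy" format to warehouse format.
        return [{**r, "region": r["region"].strip().lower()} for r in rows]

    def load(rows):
        # Sort/summarize/consolidate; a real loader would also build indices.
        summary = {}
        for r in rows:
            summary[r["region"]] = summary.get(r["region"], 0) + r["amount"]
        return summary

    sources = [[{"region": " East", "amount": 3}],
               [{"region": "west", "amount": None}]]
    print(load(transform(clean(extract(sources)))))   # {'east': 3}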

Metadata Repository
Metadata are data about data. When used in a data warehouse, metadata are the data
that define warehouse objects.

2 Data Warehouse Modeling: Data Cube and OLAP



Data warehouses and OLAP tools are based on a multidimensional data model. This
model views data in the form of a data cube.

Data Cube: A Multidimensional Data Model
“What is a data cube?” A data cube allows data to be modeled and viewed in multiple
dimensions. It is defined by dimensions and facts.
In general terms, dimensions are the perspectives or entities with respect to which
an organization wants to keep records. Each dimension may have a table associated with it, called a dimension
table, which further describes the dimension.

A multidimensional data model is typically organized around a central theme, such
as sales. This theme is represented by a fact table. Facts are numeric measures. Think of
them as the quantities by which we want to analyze relationships between dimensions.

Although we usually think of cubes as 3-D geometric structures, in data warehousing
the data cube is n-dimensional.

In the data warehousing research literature, a cuboid can be generated for each
possible subset of the given dimensions. The set of all possible cuboids forms a lattice
of cuboids, each showing the data at a different level of summarization. This lattice of
cuboids is referred to as a data cube.

The cuboid that holds the lowest level of summarization is called the base cuboid.
The 0-D cuboid, which holds the highest level of summarization, is called the apex
cuboid. The apex cuboid is typically denoted by all.
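
The lattice is straightforward to enumerate: an n-dimensional cube (ignoring concept hierarchies) has 2**n cuboids, one per subset of the dimensions. A minimal sketch with hypothetical dimension names:

    from itertools import combinations

    dimensions = ["time", "item", "location"]   # hypothetical

    # One cuboid per subset: 2**3 = 8 cuboids in this lattice.
    for k in range(len(dimensions) + 1):
        for subset in combinations(dimensions, k):
            label = ", ".join(subset) if subset else "all (apex cuboid)"
            note = "   <- base cuboid" if k == len(dimensions) else ""
            print(f"{k}-D cuboid: {label}{note}")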

Stars, Snowflakes, and Fact Constellations: Schemas for
Multidimensional Data Models
The entity-relationship data model is commonly used in the design of relational
databases, where a database schema consists of a set of entities and the relationships
between them.

A data warehouse, however, requires a concise, subject-oriented schema that facilitates
online data analysis.
The most popular data model for a data warehouse is a multidimensional model,
which can exist in the form of a star schema, a snowflake schema, or a fact constellation
schema.
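
As a hedged sketch of what a star schema looks like in practice, the following DDL (via Python's sqlite3, with invented table and column names) declares one central fact table whose foreign keys reference denormalized dimension tables; a snowflake schema would further normalize the dimension tables:

    import sqlite3

    conn = sqlite3.connect(":memory:")
    # Dimension tables, denormalized as in a star schema.
    conn.execute("""CREATE TABLE dim_time (time_key INTEGER PRIMARY KEY,
                                           day TEXT, month TEXT, year INTEGER)""")
    conn.execute("""CREATE TABLE dim_item (item_key INTEGER PRIMARY KEY,
                                           item_name TEXT, brand TEXT, type TEXT)""")
    # The fact table holds foreign keys to the dimensions plus numeric measures.
    conn.execute("""CREATE TABLE fact_sales (
                        time_key INTEGER REFERENCES dim_time,
                        item_key INTEGER REFERENCES dim_item,
                        dollars_sold REAL,
                        units_sold INTEGER)""")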

Dimensions: The Role of Concept Hierarchies
A concept hierarchy defines a sequence of mappings from a set of low-level concepts
to higher-level, more general concepts.
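
In code, a concept hierarchy is simply a mapping from lower-level to higher-level values. A minimal sketch (the location values are invented):

    # city -> province_or_state -> country, a common location hierarchy
    parent = {
        "Vancouver": "British Columbia",
        "Victoria": "British Columbia",
        "British Columbia": "Canada",
    }

    def generalize(value, levels=1):
        # Climb the hierarchy the requested number of levels.
        for _ in range(levels):
            value = parent.get(value, value)
        return value

    print(generalize("Vancouver", levels=2))   # Canada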

 Measures: Their Categorization and Computation

 A data cube measure is a numeric function that can be evaluated
at each point in the data cube space. A measure value is computed for a given point by
aggregating the data corresponding to the respective dimension–value pairs defining the
given point. Measures can be organized into three categories—distributive, algebraic, and holistic—based
on the kind of aggregate functions used.

Distributive: An aggregate function is distributive if it can be computed in a distributed
manner as follows. Suppose the data are partitioned into n sets. We apply the function
to each partition, resulting in n aggregate values. If the result derived by applying
the function to the n aggregate values is the same as that derived by applying the function
to the entire data set (without partitioning), the function can be computed in a
distributed manner. For example, count(), sum(), min(), and max() are distributive aggregate functions.

Algebraic: An aggregate function is algebraic if it can be computed by an algebraic function
with M arguments (where M is a bounded positive integer), each of which
is obtained by applying a distributive aggregate function. For example, avg() can be computed from sum() and count().

Holistic: An aggregate function is holistic if there is no constant bound on the storage
size needed to describe a subaggregate. Common examples include median(), mode(), and rank().

Most large data cube applications require efficient computation of distributive and
algebraic measures. Many efficient techniques for this exist.
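
The three categories can be demonstrated in a few lines. In this sketch, sum() is distributive (partition results combine exactly), the average is algebraic (derivable from the two distributive values sum and count), and the median is holistic (no constant-size subaggregate per partition suffices):

    import statistics

    data = [3, 1, 4, 1, 5, 9, 2, 6]
    parts = [data[:4], data[4:]]                    # partition into n = 2 sets

    # Distributive: the sum of partial sums equals the sum over the whole set.
    assert sum(sum(p) for p in parts) == sum(data)

    # Algebraic: avg is computed from M = 2 distributive arguments (sum, count).
    partials = [(sum(p), len(p)) for p in parts]
    total, count = map(sum, zip(*partials))
    assert total / count == sum(data) / len(data)

    # Holistic: the median of partition medians is not the true median in
    # general, so the whole data set (unbounded storage) is needed.
    print(statistics.median(data))                  # 3.5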

Typical OLAP Operations

Roll-up: aggregates data by climbing up a concept hierarchy for a dimension or by dimension reduction.
Drill-down: the reverse of roll-up; navigates from less detailed data to more detailed data.
Slice and dice: slice performs a selection on one dimension of the cube, resulting in a subcube; dice defines a subcube by performing a selection on two or more dimensions.
Pivot (rotate): a visualization operation that rotates the data axes to provide an alternative presentation of the data.
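
These operations correspond closely to ordinary group-by and pivot manipulations. A minimal sketch using pandas, with an invented data set:

    import pandas as pd

    df = pd.DataFrame({"year": [2015, 2015, 2016, 2016],
                       "quarter": ["Q1", "Q2", "Q1", "Q2"],
                       "item": ["pen", "pen", "ink", "pen"],
                       "dollars_sold": [100, 120, 80, 130]})

    # Roll-up: climb the time hierarchy from quarter to year by aggregating.
    rollup = df.groupby("year")["dollars_sold"].sum()

    # Slice: select on a single dimension, yielding a subcube.
    pens = df[df["item"] == "pen"]

    # Pivot: rotate the axes for an alternative presentation.
    pivot = df.pivot_table(values="dollars_sold", index="year",
                           columns="quarter", aggfunc="sum")
    print(rollup, pens, pivot, sep="\n\n")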

 A Starnet Query Model for Querying
Multidimensional Databases
The querying of multidimensional databases can be based on a starnet model, which
consists of radial lines emanating from a central point, where each line represents a
concept hierarchy for a dimension. Each abstraction level in the hierarchy is called a
footprint. These represent the granularities available for use by OLAP operations such
as drill-down and roll-up.

3 Data Warehouse Design and Usage


 A Business Analysis Framework for Data
Warehouse Design


To design an effective data warehouse we need to understand and analyze business
needs and construct a business analysis framework. The construction of a large
and complex information system can be viewed as the construction of a large and
complex building, for which the owner, architect, and builder have different views.
These views are combined to form a complex framework that represents the top-down,
business-driven, or owner’s perspective, as well as the bottom-up, builder-driven, or
implementor’s view of the information system.
Four different views regarding a data warehouse design must be considered: the top-down
view, the data source view, the data warehouse view, and the business query view.

Data Warehouse Design Process
Let’s look at various approaches to the data warehouse design process and the steps
involved.
A data warehouse can be built using a top-down approach, a bottom-up approach,
or a combination of both. The top-down approach starts with overall design and planning.
It is useful in cases where the technology is mature and well known, and where
the business problems that must be solved are clear and well understood. The bottom-up
approach starts with experiments and prototypes. This is useful in the early stage
of business modeling and technology development.

In general, the warehouse design process consists of the following steps:

1. Choose a business process to model (e.g., orders, invoices, shipments, inventory,
account administration, sales, or the general ledger). If the business process is organizational
and involves multiple complex object collections, a data warehouse model
should be followed. However, if the process is departmental and focuses on the
analysis of one kind of business process, a data mart model should be chosen.
2. Choose the business process grain, which is the fundamental, atomic level of data
to be represented in the fact table for this process (e.g., individual transactions,
individual daily snapshots, and so on).
3. Choose the dimensions that will apply to each fact table record. Typical dimensions
are time, item, customer, supplier, warehouse, transaction type, and status.
4. Choose the measures that will populate each fact table record. Typical measures are
numeric additive quantities like dollars sold and units sold.


Data Warehouse Usage for Information Processing
Data warehouses and data marts are used in a wide range of applications. Business
executives use the data in data warehouses and data marts to perform data analysis
and make strategic decisions. In many firms, data warehouses are used as an integral
part of a plan-execute-assess “closed-loop” feedback system for enterprise management.
Data warehouses are used extensively in banking and financial services, consumer goods
and retail distribution sectors, and controlled manufacturing such as demand-based
production.

There are three kinds of data warehouse applications: information processing, analytical
processing, and data mining.
Information processing supports querying, basic statistical analysis, and reporting
using crosstabs, tables, charts, or graphs. A current trend in data warehouse information
processing is to construct low-cost web-based accessing tools that are then
integrated with web browsers.
Analytical processing supports basic OLAP operations, including slice-and-dice,
drill-down, roll-up, and pivoting. It generally operates on historic data in both summarized
and detailed forms. The major strength of online analytical processing over
information processing is the multidimensional data analysis of data warehouse data.
Data mining supports knowledge discovery by finding hidden patterns and associations,
constructing analytical models, performing classification and prediction, and presenting
the mining results using visualization tools.

Data mining is not confined to the analysis of data stored in data warehouses. It may
analyze data existing at more detailed granularities than the summarized data provided
in a data warehouse. It may also analyze transactional, spatial, textual, and multimedia
data that are difficult to model with current multidimensional database technology. In
this context, data mining covers a broader spectrum than OLAP with respect to data
mining functionality and the complexity of the data handled.

From Online Analytical Processing
to Multidimensional Data Mining
The data mining field has conducted substantial research regarding mining on various
data types, including relational data, data from data warehouses, transaction data,
time-series data, spatial data, text data, and flat files. Multidimensional data mining
(also known as exploratory multidimensional data mining, online analytical mining,
or OLAM) integrates OLAP with data mining to uncover knowledge in multidimensional
databases. Among the many different paradigms and architectures of data mining
systems, multidimensional data mining is particularly important for the following
reasons:


4 Data Warehouse Implementation

Data warehouses contain huge volumes of data. OLAP servers demand that decision
support queries be answered in the order of seconds. Therefore, it is crucial for data
warehouse systems to support highly efficient cube computation techniques, access
methods, and query processing techniques.

Efficient Data Cube Computation: An Overview
At the core of multidimensional data analysis is the efficient computation of aggregations
across many sets of dimensions. In SQL terms, these aggregations are referred to
as group-by’s. Each group-by can be represented by a cuboid, where the set of group-by’s
forms a lattice of cuboids defining a data cube.

The compute cube Operator and the Curse of Dimensionality
One approach to cube computation extends SQL so as to include a compute cube operator.
The compute cube operator computes aggregates over all subsets of the dimensions
specified in the operation. This can require excessive storage space, especially for large
numbers of dimensions.
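
The blow-up is easy to quantify: if dimension i has L_i hierarchy levels, the total number of cuboids is the product over all dimensions of (L_i + 1), the extra 1 counting the virtual top level all. A quick sketch:

    from math import prod

    def total_cuboids(levels):
        # levels[i] = number of hierarchy levels of dimension i (excluding
        # "all"); add one per dimension for the virtual top level "all".
        return prod(L + 1 for L in levels)

    # Ten dimensions with five levels each already yield ~60 million cuboids.
    print(total_cuboids([5] * 10))   # 6**10 = 60,466,176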

Partial Materialization: Selected Computation of Cuboids
There are three choices for data cube materialization given a base cuboid:
1. No materialization: Do not precompute any of the “nonbase” cuboids. This leads
to computing expensive multidimensional aggregates on-the-fly, which can be extremely
slow.
2. Full materialization: Precompute all of the cuboids. The resulting lattice of computed
cuboids is referred to as the full cube. This choice typically requires huge
amounts of memory space in order to store all of the precomputed cuboids.
3. Partial materialization: Selectively compute a proper subset of the whole set of possible
cuboids. Alternatively, we may compute a subset of the cube, which contains
only those cells that satisfy some user-specified criterion, such as where the tuple
count of each cell is above some threshold. We will use the term subcube to refer to
the latter case, where only some of the cells may be precomputed for various cuboids.
Partial materialization represents an interesting trade-off between storage space and
response time.
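
A common user-specified criterion is a minimum tuple count per cell (an "iceberg" condition). The following sketch materializes only the qualifying cells; the data and threshold are invented:

    from itertools import combinations
    from collections import Counter

    rows = [("2016", "pen", "east"), ("2016", "pen", "west"),
            ("2016", "ink", "east"), ("2015", "pen", "east")]
    dims = ("year", "item", "region")
    min_count = 2                                  # user-specified threshold

    # For every cuboid, keep only cells whose tuple count meets the threshold.
    for k in range(1, len(dims) + 1):
        for idx in combinations(range(len(dims)), k):
            counts = Counter(tuple(r[i] for i in idx) for r in rows)
            for cell, n in counts.items():
                if n >= min_count:
                    print({dims[i]: v for i, v in zip(idx, cell)}, "count =", n)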

Indexing OLAP Data: Bitmap Index and Join Index

To facilitate efficient data accessing, most data warehouse systems support index structures
and materialized views (using cuboids).

The bitmap indexing method is popular in OLAP products because it allows quick searching in data cubes. In the bitmap index for a given attribute, there is a distinct bit vector for each value in the attribute's domain, so comparison, join, and aggregation operations are reduced to bit arithmetic.
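
The idea in miniature: one bit vector per distinct value of the indexed attribute, so that selections reduce to bit arithmetic. The column values are invented:

    # Build a bitmap index over a low-cardinality attribute.
    regions = ["east", "west", "east", "north", "west", "east"]
    bitmap = {}
    for row_id, value in enumerate(regions):
        bitmap[value] = bitmap.get(value, 0) | (1 << row_id)

    def rows_of(bits):
        return [i for i in range(len(regions)) if bits >> i & 1]

    # "region = east OR region = west" becomes a single bitwise OR.
    print(rows_of(bitmap["east"] | bitmap["west"]))   # [0, 1, 2, 4, 5]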

The join indexing method gained popularity from its use in relational database query
processing. The star schema model of data warehouses makes join indexing attractive for cross-table
search, because the linkage between a fact table and its corresponding dimension
tables comprises the fact table’s foreign key and the dimension table’s primary key. Join
indexing maintains relationships between attribute values of a dimension (e.g., within
a dimension table) and the corresponding rows in the fact table. Join indices may span
multiple dimensions to form composite join indices. We can use join indices to identify
subcubes that are of interest.
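
In essence, a join index maps a dimension attribute value to the fact-table rows that join with it. A minimal sketch over invented tables:

    # Fact rows: (time_key, item_key, dollars_sold)
    fact = [(1, 10, 100.0), (1, 11, 80.0), (2, 10, 130.0)]
    dim_item = {10: "pen", 11: "ink"}              # item_key -> item_name

    # Join index: dimension value -> ids of the fact rows it joins with.
    join_index = {}
    for row_id, (_, item_key, _) in enumerate(fact):
        join_index.setdefault(dim_item[item_key], []).append(row_id)

    print(join_index)   # {'pen': [0, 2], 'ink': [1]}
    # Composite join indices would key on combinations of dimension values.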

Efficient Processing of OLAP Queries

The purpose of materializing cuboids and constructing OLAP index structures is to speed up query processing in data cubes. Given materialized views, query processing should proceed as follows:

1. Determine which operations should be performed on the available cuboids: This
involves transforming any selection, projection, roll-up (group-by), and drill-down
operations specified in the query into corresponding SQL and/or OLAP operations.
For example, slicing and dicing a data cube may correspond to selection and/or
projection operations on a materialized cuboid.

2. Determine to which materialized cuboid(s) the relevant operations should be
applied: This involves identifying all of the materialized cuboids that may potentially
be used to answer the query, pruning the set using knowledge of “dominance”
relationships among the cuboids, estimating the costs of using the remaining
materialized cuboids, and selecting the cuboid with the least cost.
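
Step 2 can be sketched as follows: keep the materialized cuboids whose dimension sets cover the query's group-by, then pick the cheapest. Here cost is crudely approximated by cuboid size, and all names and numbers are invented:

    # Materialized cuboids: dimension set -> estimated size in rows.
    materialized = {
        frozenset({"time", "item", "location"}): 1_000_000,  # base cuboid
        frozenset({"time", "item"}): 120_000,
        frozenset({"item"}): 500,
    }

    def best_cuboid(query_dims):
        # A cuboid can answer the query only if it retains every queried
        # dimension; among those, choose the one with the least cost.
        candidates = [(size, dims) for dims, size in materialized.items()
                      if query_dims <= dims]
        return min(candidates, key=lambda c: c[0])[1] if candidates else None

    print(best_cuboid({"item"}))   # frozenset({'item'}), the cheapest cover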



OLAP Server Architectures: ROLAP versus MOLAP versus HOLAP
Logically, OLAP servers present business users with multidimensional data from data
warehouses or data marts, without concerns regarding how or where the data are stored.
However, the physical architecture and implementation of OLAP servers must consider
data storage issues. Implementations of a warehouse server for OLAP processing include
the following:

Relational OLAP (ROLAP) servers: These are the intermediate servers that stand in
between a relational back-end server and client front-end tools. They use a relational
or extended-relational DBMS to store and manage warehouse data, and OLAP
middleware to support missing pieces. ROLAP servers include optimization for
each DBMS back end, implementation of aggregation navigation logic, and additional
tools and services. ROLAP technology tends to have greater scalability than
MOLAP technology. The DSS server of Microstrategy, for example, adopts the
ROLAP approach.

Multidimensional OLAP (MOLAP) servers: These servers support multidimensional
data views through array-based multidimensional storage engines. They map multidimensional
views directly to data cube array structures. The advantage of using a
data cube is that it allows fast indexing to precomputed summarized data. Notice
that with multidimensional data stores, the storage utilization may be low if the data
set is sparse. In such cases, sparse matrix compression techniques should be explored.

Many MOLAP servers adopt a two-level storage representation to handle dense
and sparse data sets: Denser subcubes are identified and stored as array structures,
whereas sparse subcubes employ compression technology for efficient storage
utilization.
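
The two-level idea in miniature: dense subcubes as plain arrays (fast positional indexing into precomputed aggregates), sparse subcubes as coordinate-to-value maps (a simple compression). Shapes and values are invented:

    import numpy as np

    # Dense subcube: direct array indexing into precomputed aggregates.
    dense = np.zeros((12, 50))                   # e.g., month x item
    dense[0, 3] = 42.0

    # Sparse subcube: store only the nonempty cells.
    sparse = {(2015, "pen", "north"): 7.0}

    def lookup(cell):
        # Missing coordinates simply denote an empty cell.
        return sparse.get(cell, 0.0)

    print(dense[0, 3], lookup((2015, "pen", "north")))   # 42.0 7.0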

Hybrid OLAP (HOLAP) servers: The hybrid OLAP approach combines ROLAP and
MOLAP technology, benefiting from the greater scalability of ROLAP and the faster
computation of MOLAP. For example, a HOLAP server may allow large volumes
of detailed data to be stored in a relational database, while aggregations are kept in
a separate MOLAP store. Microsoft SQL Server 2000, for example, supports a hybrid
OLAP server.

Specialized SQL servers: To meet the growing demand of OLAP processing in relational
databases, some database system vendors implement specialized SQL servers that provide
advanced query language and query processing support for SQL queries over star and
snowflake schemas in a read-only environment.


5 Data Generalization by Attribute-Oriented Induction



 In general, data generalization summarizes data by replacing relatively low-level
values (e.g., numeric values for an attribute age) with higher-level concepts (e.g., young,
middle-aged, and senior), or by reducing the number of dimensions to summarize data
in concept space involving fewer dimensions (e.g., removing birth date and telephone
number when summarizing the behavior of a group of students). Given the large amount
of data stored in databases, it is useful to be able to describe concepts in concise and succinct
terms at generalized (rather than low) levels of abstraction. Allowing data sets to
be generalized at multiple levels of abstraction facilitates users in examining the general
behavior of the data.

An important application of data generalization is concept description.
A concept typically refers to a data collection such as frequent buyers, graduate
students, and so on. As a data mining task, concept description is not a simple
enumeration of the data. Instead, concept description generates descriptions for data
characterization and comparison. It is sometimes called class description when the concept
to be described refers to a class of objects. Characterization provides a concise and
succinct summarization of the given data collection, while concept or class comparison
(also known as discrimination) provides descriptions comparing two or more data
collections.

Attribute-Oriented Induction for Data Characterization
The attribute-oriented induction (AOI) approach to concept description was first proposed
in 1989, a few years before the introduction of the data cube approach. The data
cube approach is essentially based on materialized views of the data, which typically
have been precomputed in a data warehouse.

The general idea of attribute-oriented induction is to first collect the task-relevant
data using a database query and then perform generalization based on the examination
of the number of each attribute’s distinct values in the relevant data set. The generalization
is performed by either attribute removal or attribute generalization.
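
A toy rendering of the two operations, driven by each attribute's distinct-value count; the data, hierarchy, and threshold are invented, and real AOI would keep climbing the hierarchies until every attribute satisfies its threshold:

    rows = [{"name": "Li",  "city": "Vancouver", "age": 23},
            {"name": "Ann", "city": "Victoria",  "age": 25},
            {"name": "Bob", "city": "Vancouver", "age": 31}]

    hierarchy = {"Vancouver": "British Columbia", "Victoria": "British Columbia"}
    threshold = 1                    # max distinct values allowed per attribute

    for attr in ("name", "city", "age"):
        if len({r[attr] for r in rows}) <= threshold:
            continue                              # already general enough
        if attr == "name":                        # many values, no hierarchy:
            for r in rows: r.pop(attr)            #   attribute removal
        elif attr == "city":                      # a hierarchy exists:
            for r in rows: r[attr] = hierarchy[r[attr]]   # attribute generalization
        else:                                     # numeric: generalize to ranges
            for r in rows: r[attr] = f"{r[attr] // 10 * 10}s"

    print(rows)   # identical generalized tuples would then be merged, with counts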

Efficient Implementation of Attribute-Oriented Induction


 Attribute-Oriented Induction for Class Comparisons

“How is class comparison performed?” In general, the procedure is as follows:
1. Data collection: The set of relevant data in the database is collected by query processing
and is partitioned respectively into a target class and one or a set of contrasting
classes.
2. Dimension relevance analysis: If there are many dimensions, then dimension relevance
analysis should be performed on these classes to select only the highly relevant
dimensions for further analysis. Correlation or entropy-based measures can be used
for this step (Chapter 3).
3. Synchronous generalization: Generalization is performed on the target class to the
level controlled by a user- or expert-specified dimension threshold, which results in
a prime target class relation. The concepts in the contrasting class(es) are generalized
to the same level as those in the prime target class relation, forming the prime
contrasting class(es) relation.
4. Presentation of the derived comparison: The resulting class comparison description
can be visualized in the form of tables, graphs, and rules. This presentation usually
includes a “contrasting” measure such as count% (percentage count) that reflects the
comparison between the target and contrasting classes. The user can adjust the comparison
description by applying drill-down, roll-up, and other OLAP operations to
the target and contrasting classes, as desired.
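
The count% measure in step 4 is simply each generalized cell's count as a percentage of its class total, computed separately for the target and contrasting classes. A minimal sketch with invented classes and cells:

    from collections import Counter

    def count_percent(cells):
        n = len(cells)
        return {cell: round(100 * c / n, 1) for cell, c in Counter(cells).items()}

    # Generalized tuples (e.g., a gpa attribute) for each class.
    target   = ["excellent", "excellent", "good", "excellent"]  # e.g., grad students
    contrast = ["good", "good", "excellent", "good"]            # contrasting class

    print("target:  ", count_percent(target))    # {'excellent': 75.0, 'good': 25.0}
    print("contrast:", count_percent(contrast))  # {'good': 75.0, 'excellent': 25.0}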


Next Chapter: Data Cube Technologies for Data Mining
Mining Frequent Patterns, Associations, and Correlations: Basic Concepts and Methods
Advanced Patterns - Data Mining
Data Mining - Classification: Basic Concepts
Data Mining - Classification: Advanced Methods


Data Mining Recent Trends and Research Frontiers

Excerpts from the Book

Data Mining Concepts and Techniques

Third Edition
Jiawei Han
University of Illinois at Urbana–Champaign
Micheline Kamber, Jian Pei
Simon Fraser University


