Thursday, April 14, 2016

Data Cube Technology for Data Mining

1 Data Cube Computation: Preliminary Concepts

Data cubes facilitate the online analytical processing of multidimensional data. “But how
can we compute data cubes in advance, so that they are handy and readily available for
query processing?” This section contrasts full cube materialization (i.e., precomputation)
versus various strategies for partial cube materialization.

In general, there are two basic data structures used for storing cuboids. The implementation of relational OLAP (ROLAP) uses relational tables, whereas multidimensional arrays are used in multidimensional OLAP (MOLAP). Although ROLAP and MOLAP may each explore different cube
computation techniques, some optimization “tricks” can be shared among the different
data representations. The following are general optimization techniques for efficient
computation of data cubes.
Optimization Technique 1: Sorting, hashing, and grouping. Sorting, hashing, and
grouping operations should be applied to the dimension attributes to reorder and
cluster related tuples.

Optimization Technique 2: Simultaneous aggregation and caching of intermediate
results. In cube computation, it is efficient to compute higher-level aggregates from
previously computed lower-level aggregates, rather than from the base fact table

Optimization Technique 3: Aggregation from the smallest child when there exist multiple
child cuboids. When there exist multiple child cuboids, it is usually more
efficient to compute the desired parent (i.e., more generalized) cuboid from the
smallest, previously computed child cuboid

Optimization Technique 4: The Apriori pruning method can be explored to
compute iceberg cubes efficiently. The Apriori property, in the context of data
cubes, states as follows: If a given cell does not satisfy minimum support, then no descendant
of the cell (i.e., more specialized cell) will satisfy minimum support either.

2 Data Cube Computation Methods

Data cube computation is an essential task in data warehouse implementation. The precomputation
of all or part of a data cube can greatly reduce the response time and enhance the performance of online analytical processing. However, such computation is challenging because it may require substantial computational time and storage space.

Multiway Array Aggregation for Full Cube Computation

The multiway array aggregation (or simply MultiWay) method computes a full data
cube by using a multidimensional array as its basic data structure. It is a typical MOLAP
approach that uses direct array addressing, where dimension values are accessed via the
position or index of their corresponding array locations.

BUC: Computing Iceberg Cubes from the Apex Cuboid Downward

BUC is an algorithm for the computation of sparse and iceberg cubes. Unlike MultiWay,
BUC constructs the cube from the apex cuboid toward the base cuboid. This allows BUC
to share data partitioning costs. This processing order also allows BUC to prune during
construction, using the Apriori property.

Star-Cubing: Computing Iceberg Cubes Using a Dynamic Star-Tree Structure

Star-Cubing algorithm for computing iceberg cubes.
Star-Cubing combines the strengths of the other methods we have studied up to this
point. It integrates top-down and bottom-up cube computation and explores both
multidimensional aggregation (similar to MultiWay) and Apriori-like pruning (similar
to BUC). It operates from a data structure called a star-tree, which performs lossless
data compression, thereby reducing the computation time and memory requirements

Precomputing Shell Fragments for Fast High-Dimensional OLAP

Recall the reason that we are interested in precomputing data cubes: Data cubes facilitate
fast OLAP in a multidimensional data space. However, a full data cube of high
dimensionality needs massive storage space and unrealistic computation time. Iceberg
cubes provide a more feasible alternative, as we have seen, wherein the iceberg condition
is used to specify the computation of only a subset of the full cube’s cells.
However, although an iceberg cube is smaller and requires less computation time than
its corresponding full cube, it is not an ultimate solution.

3 Processing Advanced Kinds of Queries by Exploring Cube Technology

Methods for effective processing of advanced kinds of queries.

Sampling Cubes: OLAP-Based Mining
on Sampling Data
When collecting data, we often collect only a subset of the data we would ideally like
to gather. In statistics, this is known as collecting a sample of the data population.

Sampling Cube Framework
The sampling cube is a data cube structure that stores the sample data and their multidimensional
aggregates. It supports OLAP on sample data. It calculates confidence intervals
as a quality measure for any multidimensional query. Given a sample data relation
(i.e., base cuboid) R, the sampling cube CR typically co

The best way to solve the small sample size problem is to get more data. Fortunately,
there is usually an abundance of additional data available in the cube. The data do not
match the query cell exactly; however, we can consider data from cells that are “close
by.” There are two ways to incorporate such data to enhance the reliability of the query
answer: (1) intracuboid query expansion, where we consider nearby cells within the same
cuboid, and (2) intercuboid query expansion, where we consider more general versions
(from parent cuboids) of the query cell.

Ranking Cubes: Efficient Computation of Top-k Queries

Ranking Cube contributes to the efficient processing of top-k queries. Instead of returning a large set of indiscriminative answers to a query, a top-k
query (or ranking query) returns only the best k results according to a user-specified
The results are returned in ranked order so that the best is at the top. The userspecified
preference generally consists of two components: a selection condition and
a ranking function. Top-k queries are common in many applications like searching
web databases, k-nearest-neighbor searches with approximate matches, and similarity
queries in multimedia databases.

4 Multidimensional Data Analysis in Cube Space

Prediction Cubes: Prediction Mining in Cube Space

Recently, researchers have turned their attention toward multidimensional data mining to uncover knowledge at varying dimensional combinations and granularities. Such mining is also known as exploratory multidimensional data mining and online analytical data mining (OLAM).

There are at least four ways in which OLAP-style analysis can be fused with data
mining techniques:
1. Use cube space to define the data space for mining. Each region in cube space represents
a subset of data over which we wish to find interesting patterns. Cube space
is defined by a set of expert-designed, informative dimension hierarchies, not just
arbitrary subsets of data. Therefore, the use of cube space makes the data space both
meaningful and tractable.
2. Use OLAP queries to generate features and targets for mining. The features and even
the targets (that we wish to learn to predict) can sometimes be naturally defined as
OLAP aggregate queries over regions in cube space.
3. Use data mining models as building blocks in a multistep mining process. Multidimensional
data mining in cube space may consist of multiple steps, where data mining
models can be viewed as building blocks that are used to describe the behavior of
interesting data sets, rather than the end results.
4. Use data cube computation techniques to speed up repeated model construction. Multidimensional
data mining in cube space may require building a model for each
candidate data space, which is usually too expensive to be feasible. However, by carefully
sharing computation across model construction for different candidates based
on data cube computation techniques, efficient mining is achievable.

Multifeature cubes
enable more in-depth analysis. They can compute more complex queries of which the
measures depend on groupings of multiple aggregates at varying granularity levels. The
queries posed can be much more elaborate and task-specific than traditional queries,
as we shall illustrate in the next examples. Many complex data mining queries can be
answered by multifeature cubes without significant increase in computational cost, in
comparison to cube computation for simple queries with traditional data cubes.

Exception-Based, Discovery-Driven Cube Space Exploration

A discovery-driven approach to exploring cube space.

Precomputed measures indicating data exceptions are used to guide the user in the data
analysis process, at all aggregation levels. We hereafter refer to these measures as exception
indicators. Intuitively, an exception is a data cube cell value that is significantly
different from the value anticipated, based on a statistical model. The model considers
variations and patterns in the measure value across all the dimensions to which a cell
belongs. For example, if the analysis of item-sales data reveals an increase in sales in
December in comparison to all other months, this may seem like an exception in the
time dimension. However, it is not an exception if the item dimension is considered,
since there is a similar increase in sales for other items during December.
The model considers exceptions hidden at all aggregated group-by’s of a data cube.
Visual cues, such as background color, are used to reflect each cell’s degree of exception,
based on the precomputed exception indicators. Efficient algorithms have been proposed
for cube construction

Next Chapter: Mining Frequent Patterns, Associations, and Correlations: Basic Concepts and Methods
Advanced Patterns - Data Mining
Data Mining - Classification: Basic Concepts
Data Mining - Classification: Advanced Methods

Excerpts from the Book

Data Mining Concepts and Techniques

Third Edition
Jiawei Han
University of Illinois at Urbana–Champaign
Micheline Kamber, Jian Pei
Simon Fraser University

No comments:

Post a Comment