## 1 Data Cube Computation: Preliminary Concepts

Data cubes facilitate the online analytical processing of multidimensional data. “But how can we compute data cubes in advance, so that they are handy and readily available for query processing?” This section contrasts full cube materialization (i.e., precomputation) with various strategies for partial cube materialization.

In general, two basic data structures are used for storing cuboids. Relational OLAP (ROLAP) implementations use relational tables, whereas multidimensional OLAP (MOLAP) uses multidimensional arrays. Although ROLAP and MOLAP may each explore different cube computation techniques, some optimization “tricks” can be shared among the different data representations. The following are general optimization techniques for the efficient computation of data cubes.

Optimization Technique 1: Sorting, hashing, and grouping. Sorting, hashing, and grouping operations should be applied to the dimension attributes to reorder and cluster related tuples.

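As a minimal sketch of this technique (with a hypothetical base table of (city, item, sales) tuples), sorting on the dimension attributes makes tuples belonging to the same group-by cell adjacent, so each aggregate can be computed in a single scan:

```python
from itertools import groupby

# Hypothetical base-table tuples: (city, item, sales).
tuples = [
    ("Vancouver", "TV", 300),
    ("Toronto", "TV", 250),
    ("Vancouver", "phone", 150),
    ("Toronto", "phone", 100),
    ("Vancouver", "TV", 200),
]

# Sorting on the dimension attributes clusters tuples that share the
# same (city, item) group-by values, so each aggregate is one run.
tuples.sort(key=lambda t: (t[0], t[1]))
cuboid = {key: sum(t[2] for t in grp)
          for key, grp in groupby(tuples, key=lambda t: (t[0], t[1]))}
# cuboid now maps each (city, item) cell to its total sales.
```

Hashing on the same attributes achieves a similar clustering effect without a total order.
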
Optimization Technique 2: Simultaneous aggregation and caching of intermediate results. In cube computation, it is efficient to compute higher-level aggregates from previously computed lower-level aggregates rather than from the base fact table.

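For a distributive measure such as sum(), this can be sketched as follows, assuming a hypothetical (city, item) cuboid has already been computed; the coarser (city) cuboid is then derived from it rather than from the base fact table:

```python
from collections import defaultdict

# Hypothetical, previously computed (city, item) cuboid for sum(sales).
city_item = {("Vancouver", "TV"): 500, ("Vancouver", "phone"): 150,
             ("Toronto", "TV"): 250, ("Toronto", "phone"): 100}

# Aggregate the coarser (city) cuboid from the cached (city, item)
# cuboid instead of rescanning the much larger base fact table.
city = defaultdict(int)
for (c, _item), total in city_item.items():
    city[c] += total
```
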
Optimization Technique 3: Aggregation from the smallest child when there exist multiple child cuboids. When multiple child cuboids exist, it is usually more efficient to compute the desired parent (i.e., more generalized) cuboid from the smallest previously computed child cuboid.

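The child-selection step can be sketched with hypothetical cuboid sizes, measured in number of cells:

```python
# Hypothetical child cuboids of the parent (item) cuboid, with sizes
# in number of cells; aggregating from the smallest child reads the
# fewest cells and is therefore usually cheapest.
children = {("item", "city"): 10_000, ("item", "year"): 500}
smallest = min(children, key=children.get)
```
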
Optimization Technique 4: The Apriori pruning method can be explored to compute iceberg cubes efficiently. The Apriori property, in the context of data cubes, states the following: If a given cell does not satisfy minimum support, then no descendant of the cell (i.e., no more specialized cell) will satisfy minimum support either.

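The pruning can be sketched on a toy two-level cube with hypothetical cell counts; children of a parent that fails minimum support are never examined:

```python
# Hypothetical cell counts at two levels of specialization.
min_sup = 3
parent_counts = {("Toronto",): 5, ("Vancouver",): 2}
child_counts = {("Toronto", "TV"): 3, ("Toronto", "phone"): 2,
                ("Vancouver", "TV"): 1, ("Vancouver", "phone"): 1}

# Apriori pruning: descendants of a cell that fails minimum support
# are skipped without ever being examined.
iceberg = {c: n for c, n in parent_counts.items() if n >= min_sup}
for cell, n in child_counts.items():
    if (cell[0],) in iceberg and n >= min_sup:
        iceberg[cell] = n
```
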
## 2 Data Cube Computation Methods

Data cube computation is an essential task in data warehouse implementation. The precomputation of all or part of a data cube can greatly reduce the response time and enhance the performance of online analytical processing. However, such computation is challenging because it may require substantial computational time and storage space.

Multiway Array Aggregation for Full Cube Computation

The multiway array aggregation (or simply MultiWay) method computes a full data cube by using a multidimensional array as its basic data structure. It is a typical MOLAP approach that uses direct array addressing, where dimension values are accessed via the position or index of their corresponding array locations.

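A minimal sketch of the core idea, using a toy 2 × 2 × 2 array: one pass over the base cells simultaneously produces all three 2-D aggregates. (The actual MultiWay algorithm additionally chunks the array and orders chunk scans to minimize memory; that part is omitted here.)

```python
# Toy 2 x 2 x 2 array for dimensions A, B, C holding a count measure;
# MOLAP addresses cells by array position rather than by value.
A, B, C = 2, 2, 2
cube = [[[1, 2], [3, 4]], [[5, 6], [7, 8]]]

# One pass over the base cells simultaneously aggregates the three
# 2-D cuboids AB, AC, and BC -- the "multiway" in the method's name.
AB = [[0] * B for _ in range(A)]
AC = [[0] * C for _ in range(A)]
BC = [[0] * C for _ in range(B)]
for a in range(A):
    for b in range(B):
        for c in range(C):
            v = cube[a][b][c]
            AB[a][b] += v
            AC[a][c] += v
            BC[b][c] += v
```
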
BUC: Computing Iceberg Cubes from the Apex Cuboid Downward

BUC is an algorithm for the computation of sparse and iceberg cubes. Unlike MultiWay, BUC constructs the cube from the apex cuboid toward the base cuboid, which allows BUC to share data partitioning costs. This processing order also allows BUC to prune during construction, using the Apriori property.

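A simplified sketch of BUC with count() as the measure. The partition-and-recurse structure and the Apriori pruning test are the essentials; the real algorithm partitions by sorting and applies further optimizations:

```python
def buc(tuples, start_dim, cell, min_sup, out):
    # Apriori pruning: if this partition fails minimum support, every
    # more specialized descendant cell must fail too, so stop here.
    if len(tuples) < min_sup:
        return
    out[cell] = len(tuples)  # emit the aggregate for this cell
    n_dims = len(tuples[0])
    for d in range(start_dim, n_dims):
        # Partition on dimension d (the real BUC partitions by
        # sorting); each distinct value spawns a child partition.
        parts = {}
        for t in tuples:
            parts.setdefault(t[d], []).append(t)
        for value, part in parts.items():
            buc(part, d + 1, cell + ((d, value),), min_sup, out)

# Hypothetical base relation over two dimensions.
data = [("a1", "b1"), ("a1", "b1"), ("a1", "b2"), ("a2", "b1")]
cube = {}
buc(data, 0, (), 2, cube)  # iceberg condition: count >= 2
```
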
Star-Cubing: Computing Iceberg Cubes Using a Dynamic Star-Tree Structure

Star-Cubing combines the strengths of the other methods we have studied up to this point. It integrates top-down and bottom-up cube computation and explores both multidimensional aggregation (similar to MultiWay) and Apriori-like pruning (similar to BUC). It operates on a data structure called a star-tree, which performs lossless data compression, thereby reducing the computation time and memory requirements.

Precomputing Shell Fragments for Fast High-Dimensional OLAP

Recall the reason that we are interested in precomputing data cubes: data cubes facilitate fast OLAP in a multidimensional data space. However, a full data cube of high dimensionality needs massive storage space and unrealistic computation time. Iceberg cubes provide a more feasible alternative, as we have seen, wherein the iceberg condition is used to specify the computation of only a subset of the full cube’s cells. However, although an iceberg cube is smaller and requires less computation time than its corresponding full cube, it is not an ultimate solution.

## 3 Processing Advanced Kinds of Queries by Exploring Cube Technology

This section presents methods for the effective processing of advanced kinds of queries.

Sampling Cubes: OLAP-Based Mining on Sampling Data

When collecting data, we often collect only a subset of the data we would ideally like to gather. In statistics, this is known as collecting a sample of the data population.

Sampling Cube Framework

The sampling cube is a data cube structure that stores the sample data and their multidimensional aggregates. It supports OLAP on sample data, calculating confidence intervals as a quality measure for any multidimensional query. Given a sample data relation (i.e., base cuboid) R, the sampling cube CR typically computes a confidence interval for each of its cells.

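As an illustration, a confidence interval for the mean of a query cell's sample can be computed as below (hypothetical sample values; the normal-approximation critical value 1.96 gives a 95% interval):

```python
from math import sqrt
from statistics import mean, stdev

# Hypothetical sample values falling into one query cell.
sample = [23.0, 25.0, 30.0, 28.0, 24.0]

n = len(sample)
x_bar = mean(sample)
# 95% interval using the normal critical value 1.96; a t quantile
# would be more appropriate for a sample this small.
half_width = 1.96 * stdev(sample) / sqrt(n)
interval = (x_bar - half_width, x_bar + half_width)
```

A wide interval signals a low-quality answer, typically caused by a small sample size in the cell.
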
The best way to solve the small sample size problem is to get more data. Fortunately, there is usually an abundance of additional data available in the cube. The data do not match the query cell exactly; however, we can consider data from cells that are “close by.” There are two ways to incorporate such data to enhance the reliability of the query answer: (1) intracuboid query expansion, where we consider nearby cells within the same cuboid, and (2) intercuboid query expansion, where we consider more general versions (from parent cuboids) of the query cell.

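Intracuboid expansion might be sketched as follows, with hypothetical cells and an assumed neighbor list defining which cells count as “close by”:

```python
from statistics import mean

# Hypothetical cells of one cuboid: education level -> sampled incomes
# (in thousands). The query cell has too few samples on its own.
cells = {
    ("Bachelor",): [50.0, 52.0],
    ("Master",):   [60.0, 58.0, 62.0],
    ("PhD",):      [70.0, 72.0],
}
# Assumed closeness relation: which cells count as "close by".
close_to = {("Bachelor",): [("Master",)]}

query = ("Bachelor",)
pooled = list(cells[query])
if len(pooled) < 5:  # assumed small-sample threshold
    for neighbor in close_to.get(query, []):
        pooled.extend(cells[neighbor])

estimate = mean(pooled)  # more reliable than the 2-sample cell mean
```

In practice the choice of neighbors would be guided by dimension semantics and by how well the measure correlates across cells, not a fixed list.
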
Ranking Cubes: Efficient Computation of Top-k Queries

The ranking cube contributes to the efficient processing of top-k queries. Instead of returning a large set of indiscriminative answers to a query, a top-k query (or ranking query) returns only the best k results according to a user-specified preference.

The results are returned in ranked order so that the best is at the top. The user-specified preference generally consists of two components: a selection condition and a ranking function. Top-k queries are common in many applications, such as searching web databases, k-nearest-neighbor searches with approximate matches, and similarity queries in multimedia databases.

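A top-k query can be sketched directly from these two components, using hypothetical items, a selection condition on price, and a quality-per-dollar ranking function:

```python
import heapq

# Hypothetical items: (name, price, quality).
items = [("A", 100, 0.90), ("B", 80, 0.70), ("C", 120, 0.95), ("D", 90, 0.80)]

def rank(item):
    # Assumed ranking function: quality per dollar, higher is better.
    _name, price, quality = item
    return quality / price

# Selection condition (price <= 100) followed by the ranking function;
# only the best k = 2 results are returned, best first.
selected = [it for it in items if it[1] <= 100]
top_k = heapq.nlargest(2, selected, key=rank)
```

The ranking cube's contribution is to answer such queries without scanning and ranking all qualifying tuples, as this naive sketch does.
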
## 4 Multidimensional Data Analysis in Cube Space

Prediction Cubes: Prediction Mining in Cube Space

Recently, researchers have turned their attention toward multidimensional data mining to uncover knowledge at varying dimensional combinations and granularities. Such mining is also known as exploratory multidimensional data mining and online analytical data mining (OLAM).

There are at least four ways in which OLAP-style analysis can be fused with data mining techniques:

1. Use cube space to define the data space for mining. Each region in cube space represents a subset of data over which we wish to find interesting patterns. Cube space is defined by a set of expert-designed, informative dimension hierarchies, not just arbitrary subsets of data. Therefore, the use of cube space makes the data space both meaningful and tractable.

2. Use OLAP queries to generate features and targets for mining. The features and even the targets (that we wish to learn to predict) can sometimes be naturally defined as OLAP aggregate queries over regions in cube space.

3. Use data mining models as building blocks in a multistep mining process. Multidimensional data mining in cube space may consist of multiple steps, where data mining models can be viewed as building blocks that are used to describe the behavior of interesting data sets, rather than the end results.

4. Use data cube computation techniques to speed up repeated model construction. Multidimensional data mining in cube space may require building a model for each candidate data space, which is usually too expensive to be feasible. However, by carefully sharing computation across model construction for different candidates based on data cube computation techniques, efficient mining is achievable.

Multifeature Cubes

Multifeature cubes enable more in-depth analysis. They can compute complex queries whose measures depend on groupings of multiple aggregates at varying granularity levels. The queries posed can be much more elaborate and task-specific than traditional queries. Many complex data mining queries can be answered by multifeature cubes without a significant increase in computational cost, in comparison to cube computation for simple queries with traditional data cubes.

Exception-Based, Discovery-Driven Cube Space Exploration

Discovery-driven exploration is an approach to exploring cube space in which precomputed measures indicating data exceptions are used to guide the user in the data analysis process, at all aggregation levels. We hereafter refer to these measures as exception indicators. Intuitively, an exception is a data cube cell value that is significantly different from the value anticipated based on a statistical model. The model considers variations and patterns in the measure value across all the dimensions to which a cell belongs. For example, if the analysis of item-sales data reveals an increase in sales in December in comparison to all other months, this may seem like an exception in the time dimension. However, it is not an exception if the item dimension is considered, since there is a similar increase in sales for other items during December.

The model considers exceptions hidden at all aggregated group-by’s of a data cube. Visual cues, such as background color, are used to reflect each cell’s degree of exception, based on the precomputed exception indicators. Efficient algorithms have been proposed for cube construction.

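One simple instance of such a statistical model (an additive row/column model on a toy item-by-month sales table; the book's actual model is more elaborate) computes each cell's expected value from its row mean, column mean, and the overall mean, and uses the residual as an exception indicator:

```python
from statistics import mean

# Toy sales table: rows are items, columns are months.
sales = [[10, 12, 30],   # item 1: a spike in the third month
         [11, 13, 14],   # item 2
         [ 9, 11, 13]]   # item 3

overall = mean(v for row in sales for v in row)
row_mean = [mean(row) for row in sales]
col_mean = [mean(col) for col in zip(*sales)]

# Expected cell value under the additive model; the residual (actual
# minus expected) serves as a simple exception indicator.
residual = [[sales[i][j] - (row_mean[i] + col_mean[j] - overall)
             for j in range(3)] for i in range(3)]
```

Here the spike at item 1, month 3 stands out because no other item shares it; a December increase common to all items would instead be absorbed by the column mean and yield a small residual.
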

Excerpts from the Book

## Data Mining: Concepts and Techniques

Third Edition. Jiawei Han, University of Illinois at Urbana–Champaign; Micheline Kamber and Jian Pei, Simon Fraser University.
