A variety of Data Mining methods and algorithms exist. Some are well founded in mathematics and statistics, whereas others are used simply because they produce useful results. Some of them are:
Method - Foundation of the Method
Bayesian Network - Statistics
Decision Trees - Machine learning
Fuzzy Sets - Logic and Statistics
Inductive Logic Programming - Logic, Machine learning
Neural Networks - Black Box method
Rough Sets - Logic, Statistics, and Algebra
Bayesian, or belief networks, is a method to represent knowledge using directed acyclic graphs (DAG's). The vertices are events which describe the state of some part of the desired universe in some time interval. The edges represent relations between events, and the absence of edges represents independence between events.
A variable takes on values that correspond to events. Variables may be discrete or continuous. A database is denoted D in the following, the set of all variables U, and a Bayesian network-structure is denoted B. If D consists of all variables in U, then D is complete. Two structures are said to be equivalent if they have the same set of probability distributions.
The existing knowledge of an expert or set of experts is encoded into a Bayesian network, then a database is used to update the knowledge, and thus creates one or several new networks.
In probability theory, P(A) denotes the probability that event A will occur. Belief measures in Bayesian formalism obeys the three basic axioms of probability theory:
The Bayes rule is the basis for Bayesian Networks:
Probability can be seen as a measure of belief in how probable outcomes are, given an event (subjective interpretation). It can also be interpreted as a frequency; how often it has happened in the past (objective interpretation).
Bayesian Networks Applied in Data Mining
When Bayesian Networks are applied in Data Mining, the user may want to apply techniques to learn the structure of a Bayesian network with one or more variables, or to reason on a structure that is already known.
Learning Structure and Variables
To learn a structure, we want to generate candidate Bayesian Networks, and choose the structure that is best according to the database we have.
The probability distribution in equation gif gives the probability distribution for the next observation after looking in the database:
To find a Bayesian Network B for a database D, we can use metrics together with a search algorithm. Some of the metrics are:
Bayes Factor (BF)
Bayesian Dirichlet (BD)
Maximum a Posteriori (MAP)
A Information Criterion (AIC)
Bayes Information Criterion (BIC)
Minimum Description Length (MDL)
Some search methods that exist are:
Local Search - a heuristic search algorithm on the graph.
Iterated Hill-Climb - uses Local Search, but avoids getting stuck in a local maxima.
If data are missing in the database, it can either be filled in using the posterior distribution. Or we can use Gibbs sampling, which approximates the expectation of any function with respect to it's probability distribution. Another alternative is to use the expectation-maximization (EM)-algorithm which can be used to compute a score from the BIC-metric.
Missing data can be filled in using Gibbs sampling or the posterior distribution. The method is relatively resistant to noise.
Bayesian Networks are able to reason with uncertainty. Uncertainty in the data will be included in the model as probabilities of different outcomes.
In order to make the Bayesian network structure, it has to be learned from the database. It can also exist as prior knowledge.
The output from Bayesian Network algorithms is a graph, which is easy to interpret for humans.
There exist classifiers that perform in linear time, but further iterations give better results.
The method is statistically founded, and is always possible to examine why the bayesian method produced certain results.
Inductive Logic Programming
Inductive Logic Programming (ILP) is a method for learning first-order definite clauses, or Horn clauses, from data.
ILP Applied in Data Mining
ILP systems can be divided in three groups :
Empirical systems, which are single-predicate learning systems that are able to analyze large example sets with little or no human interaction. Interactive systems incrementally build complex domain theories consisting of multiple predicates where the user controls and initiates the model construction and refinement process. Programming assistants learn from small example sets without interaction with humans. The systems can use a multitude of techniques when inducing knowledge. The techniques are discussed briefly below.
Inverse Resolution is motivated by the fact that in deduction, results found by resolution are clauses. We invert this procedure to find clauses from results.
Least General Generalization
The method uses the relative least general generalization (described in section gif) to compute the most general clause given the background knowledge tex2html_wrap_inline1615 .
Search Refinement Graph
The top-down learner starts from the most general clause and repeatedly refine until it no longer covers negative examples. During the search the learner ensures that the clause considered covers at least one positive example.
All possible instantiations of the predicate variables are tried systematically for each rule model, and the resulting clauses are tested against the examples and background knowledge to produce new clauses that can be learned.
An ILP problem is transformed from relational to attribute-value form and solved by an attribute-value learner. This approach is only feasible for a small number of ILP problems where the clauses are typed, constrained and non-recursive. The induced hypothesis is transformed back into relational form.
Early ILP systems did not handle noise, but recent systems such as FOIL and LINUS are able to cope with noise.
General rules can not be generated from inconsistent data. Inconsistency must be removed by preprocessing the data.
The ILP methods require background knowledge on the form of definite clauses.
The output is given as definite clauses, often in Prolog syntax.
Inductive logic programming has a mathematical basis, and it is always possible to analyze why results are produced.
Overfitting can be a problem, if rules are generated that has low support.
The Rough Sets method was designed as a mathematical tool to deal with uncertainty in AI applications. Rough sets have also proven useful in Data Mining. The method was first introduced by Zdzislaw Pawlak in 1982.
The starting point of Rough Sets theory is an information system, an ordered pair P = [U, A]. U is a non empty finite set called universe, A is a non empty finite set of attributes. The elements of the universe are called objects, and for each attribute a that belongs to A, a(x) denotes the value of attribute a for object x that belongs to U. It is often useful to illustrate an information system in a table, called an information system table. One of the most important concepts in Rough Sets is indiscernibility, denoted IND(). Indiscernability is a relation.
The indiscernible objects can be classified in equivalence classes:
Intuitively the objects of an equivalence class are indiscernible from all other objects in the class, with respect to the attributes in the set B. In most cases we therefore consider the equivalence classes and not the individual objects.
Some of the objects attributes may be less important than others when doing the classification. An attribute a is said to be dispersible or superfluous. Otherwise the attribute is indispensable in B. A minimal set of all the attributes that preserves the partitioning of the universe is called a reduct, RED. Thus a reduct is a set of attributes such that all attributes are dispensible and IND(B)=IND(A). There are also object-related reducts that gives the set of attributes needed to separate one particular object/class from all the other.
They are lower and upper approximation, Rough definability of sets and the Rough membership function are concepts of rough sets.. The lower and upper approximation are used to classify a set of objects, X, from the universe that may belong to different equivalence classes.
A natural extension of the information system is the decision system. A decision system is an information system with an extra set of attributes, called decision attributes,
A Decision table is a table of the indiscernability classes, the attributes and the decision attribute.
Rough Sets Applied in Data Mining
When using rough sets as a algorithm for data mining, we generate decision rules that map the value of an objects attribute to a decision value. The rules are generated from a test set. There are two types of decision rules, definite and default decision rules. Definite decision rules are rules that map the attribute values of an object into exactly one decision class. A deterministic decision system can be completely expressed by a set of definite rules, but an indeterministic decision system can not.
The subsections below compare the Rough Set method to the criterias outlined in the above methods. This comparison will give us a framework that can be used to compare the various data mining techniques.
The rough set method is capable of handling most types of noise. If the input data is missing an attribute that is dispencible then the classification process is not affected. If the input data are missing important attributes, then the Rough Sets can not classify the object and the object must be thrown away. There are algorithms implemented that can 'guess' the value of a missing attribute, by doing statistical insertion among others, and thereby keeping the object.
As described before, the default rules in Rough Sets can handle inconsistency in the information system quite nicely. In fact, as pointed out, it may be a advantage to introduce inconsistency, by removing attributes, to make the decision rules more general.
Rough set method does not use prior knowledge when generating default rules. As mentioned before the default rules generated by Rough Set algorithms are generated only from the objects in the input data.
The output from the Rough Set algorithms are rules, preferably default rules. These rules can then be uses to predict decisions based on the input attributes. By making variations in the threshold value, one can tune the number of default rules to get the most optional result. Rules are easy for humans to interpret and use.
The complexity of partitioning is O(nlog(n)), a sort algorithm. Computing the reducts is NP-complete, but there are approximation algorithms (genetic, Johnson, heuristical search) that have proven very efficient.
It is easy for the user to find out how the results were generated. This is because Rough Sets method can display the indiscernability classes it created and the distribution of the objects in these classes. The user can from this confirm the rules generated, and argument, in basis of the classes, why the rules are correct.