Computer Science and Engineering Knowledge Center: The Back Propagation

Saturday, April 28, 2012

The Back Propagation - Algorithm

Author: Marek Libra

A brief note before reading this Knol: I suggest first reading

Introduction to Artificial Neural Networks
and Feed-Forward Neural Networks
and Adaptation of Feed-Forward Neural Networks
and The Perceptron Rule

Back Propagation (called BP below) is an algorithm used for the adaptation of a multi-layered feed-forward neural network. It is based on the perceptron rule, but it additionally propagates the effect of the overall network error into the inner parts of the network.

The network error on the training set is a complicated nonlinear function of the weights. Minimizing its value is a classically hard optimization problem, and BP tries to deal with it.

BP uses the gradient descent method, which requires the network error function to be differentiable. The differentiability of the network error function in turn relies on the differentiability of the activation function (due to the feed-forward computation mechanism described in the preceding Adaptation of Feed-Forward Neural Networks Knol).

This Knol assumes the standard sigmoid as the activation function in the equations that follow. This function is commonly used in practical experiments.
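As a small illustration, the standard sigmoid and its derivative can be sketched in Python. The function names `sigmoid` and `sigmoid_derivative` are illustrative and do not come from the original Knol. The sketch assumes the usual definition σ(ξ) = 1 / (1 + exp(−ξ)), whose derivative can be written through the output y = σ(ξ) as y(1 − y).

```python
import math

def sigmoid(xi):
    """Standard sigmoid: maps a real inner potential xi into the interval (0, 1)."""
    # Note: math.exp may overflow for very large negative xi; this sketch ignores that.
    return 1.0 / (1.0 + math.exp(-xi))

def sigmoid_derivative(y):
    """Derivative of the sigmoid, expressed through its output y = sigmoid(xi)."""
    return y * (1.0 - y)

y = sigmoid(0.0)
print(y, sigmoid_derivative(y))  # prints 0.5 0.25
```

Expressing the derivative through the output is convenient for BP, because the outputs are already known from the feed-forward pass.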

Initially, the weight w_ij⁰ from neuron i to neuron j is set randomly for each pair of interconnected neurons i and j at time t = 0.
The adaptation is then computed in a loop of discrete steps (beginning at time t = 1 and incrementing t in each subsequent step).
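The random initialization at t = 0 might be sketched as follows. The text only says the weights are set randomly, so the uniform distribution, the `spread` parameter, and the function name `init_weights` are assumptions made for illustration.

```python
import random

def init_weights(n_from, n_to, spread=0.5, seed=None):
    """Return an n_from x n_to matrix w, where w[i][j] is the weight from
    neuron i to neuron j, drawn uniformly from [-spread, spread] (assumed range)."""
    rng = random.Random(seed)
    return [[rng.uniform(-spread, spread) for _ in range(n_to)]
            for _ in range(n_from)]

# Weights at time t = 0 between a layer of 3 neurons and a layer of 2 neurons.
w0 = init_weights(3, 2, seed=0)
print(w0)
```

Passing a fixed `seed` makes a run reproducible, which helps when comparing different settings of the learning speed.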

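The weights w^(t) at time t are obtained from the weights w^(t−1) at time t−1 by adding the negative gradient of the network error E at the point w^(t−1), multiplied by a learning speed parameter e ∈ R:

Eq. 1
\[ w_{ij}^{(t)} = w_{ij}^{(t-1)} - e \, \frac{\partial E}{\partial w_{ij}}\left(w^{(t-1)}\right) \]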

The parameter e is chosen from the interval 0 < e < 1 by the user or by a supporting algorithm. Its value can be changed during the learning process to tune the adaptation behavior (the learning speed, or the ability to skip over local extremes).
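The update step of Eq. 1 can be tried out on a toy problem. In the sketch below, `gradient_step` applies one step with learning speed `e`; the quadratic `error` function and its gradient are hypothetical stand-ins for a real network error, chosen only because their minimum at (3, −1) is known.

```python
def gradient_step(w, grad, e):
    """One step of Eq. 1: new weights = old weights - e * gradient at the old weights."""
    return [w_i - e * g_i for w_i, g_i in zip(w, grad)]

def error(w):
    """Hypothetical stand-in for the network error, minimal at w = (3, -1)."""
    return (w[0] - 3.0) ** 2 + (w[1] + 1.0) ** 2

def error_gradient(w):
    """Gradient of the stand-in error above."""
    return [2.0 * (w[0] - 3.0), 2.0 * (w[1] + 1.0)]

w = [0.0, 0.0]                      # weights at t = 0
for t in range(1, 51):              # discrete steps t = 1, 2, ..., 50
    w = gradient_step(w, error_gradient(w), e=0.1)
print(w, error(w))
```

With e = 0.1 each step shrinks the distance to the minimum by a constant factor; a much larger e would make the steps overshoot, which is one reason the parameter is tuned during learning.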

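The gradient

\[ \frac{\partial E}{\partial w_{ij}} \]

in Eq. 1 is computed from

Eq. 4
\[ \frac{\partial E}{\partial w_{ij}} = \sum_{k} \frac{\partial E_k}{\partial w_{ij}} \]

where E_k is the partial network error on the k-th training pattern, so that E = Σ_k E_k.

The function of the network can be considered a composite function of its neurons. Writing y_j for the output of neuron j and ξ_j for its inner potential, we can expand the derivative in the previous equation using the rule for differentiating a composite function:

Eq. 2
\[ \frac{\partial E_k}{\partial w_{ij}} = \frac{\partial E_k}{\partial y_j} \, \frac{\partial y_j}{\partial \xi_j} \, \frac{\partial \xi_j}{\partial w_{ij}} \]

Using the standard sigmoid and the inner potential definition (described in Feed-Forward Neural Networks), we have ∂y_j/∂ξ_j = y_j(1 − y_j) and ∂ξ_j/∂w_ij = y_i, so that:

Eq. 3
\[ \frac{\partial E_k}{\partial w_{ij}} = \frac{\partial E_k}{\partial y_j} \, y_j (1 - y_j) \, y_i \]

Next, the back propagation strategy is used to compute the partial derivative

\[ \frac{\partial E_k}{\partial y_j} \]

for a neuron j in the output layer (assuming the usual sum-of-squares error E_k = ½ Σ_j (y_j − d_kj)², where d_kj is the desired output of neuron j for pattern k):

\[ \frac{\partial E_k}{\partial y_j} = y_j - d_{kj} \]

for a neuron j in a hidden layer:

\[ \frac{\partial E_k}{\partial y_j} = \sum_{r \in j^{\rightarrow}} \frac{\partial E_k}{\partial y_r} \, y_r (1 - y_r) \, w_{jr} \]

where j^→ is the set of neurons that have j as an input.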
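The derivatives for the output and hidden layers can be checked numerically on a tiny network. The sketch below is a minimal illustration under several assumptions that are not stated in the text: one hidden layer, no bias terms, the standard sigmoid, and a sum-of-squares error for a single training pattern. All function and variable names are illustrative. The analytic gradient from back propagation is compared with a central finite difference.

```python
import math

def sigmoid(xi):
    return 1.0 / (1.0 + math.exp(-xi))

def forward(x, w_hidden, w_out):
    """Feed-forward pass; returns hidden outputs h and output-layer outputs y."""
    h = [sigmoid(sum(x[i] * w_hidden[i][j] for i in range(len(x))))
         for j in range(len(w_hidden[0]))]
    y = [sigmoid(sum(h[j] * w_out[j][r] for j in range(len(h))))
         for r in range(len(w_out[0]))]
    return h, y

def pattern_error(x, d, w_hidden, w_out):
    """Assumed error for one pattern: 1/2 * sum over outputs of (y - d)^2."""
    _, y = forward(x, w_hidden, w_out)
    return 0.5 * sum((yr - dr) ** 2 for yr, dr in zip(y, d))

def backprop_gradient(x, d, w_hidden, w_out):
    """Gradient of the pattern error with respect to both weight matrices."""
    h, y = forward(x, w_hidden, w_out)
    # Output layer: derivative of the error by the output is (y_r - d_r).
    dE_dy = [yr - dr for yr, dr in zip(y, d)]
    # Hidden layer: sum over the neurons r that take j as input.
    dE_dh = [sum(dE_dy[r] * y[r] * (1.0 - y[r]) * w_out[j][r]
                 for r in range(len(y)))
             for j in range(len(h))]
    # Weight derivative: (dE/dy_j) * y_j (1 - y_j) * y_i for a weight from i to j.
    g_out = [[dE_dy[r] * y[r] * (1.0 - y[r]) * h[j] for r in range(len(y))]
             for j in range(len(h))]
    g_hidden = [[dE_dh[j] * h[j] * (1.0 - h[j]) * x[i] for j in range(len(h))]
                for i in range(len(x))]
    return g_hidden, g_out

x, d = [0.5, -1.0], [1.0]
w_hidden = [[0.1, -0.2], [0.4, 0.3]]   # w_hidden[i][j]: input i -> hidden j
w_out = [[0.7], [-0.5]]                # w_out[j][r]: hidden j -> output r
g_hidden, g_out = backprop_gradient(x, d, w_hidden, w_out)

# Central finite difference for one hidden-layer weight.
eps = 1e-6
w_hidden[0][1] += eps
e_plus = pattern_error(x, d, w_hidden, w_out)
w_hidden[0][1] -= 2 * eps
e_minus = pattern_error(x, d, w_hidden, w_out)
w_hidden[0][1] += eps                  # restore the original weight
numeric = (e_plus - e_minus) / (2 * eps)
print(g_hidden[0][1], numeric)
```

The two printed values should agree to several decimal places; a mismatch usually points to an index or sign error in the hidden-layer sum.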

As with other gradient descent methods, the main disadvantage of BP is that it may end up in a local minimum of the network error. It is very difficult to reach the global minimum with such approaches. Some heuristics exist, based on a stochastic extension of BP or on tuning of the learning speed parameter, but they are not optimal in general.
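The local minimum problem can be seen even in one dimension. The sketch below runs plain gradient descent on a hypothetical error function with two minima (it is not a real network error, just a polynomial picked for illustration) from two different starting points.

```python
def error(w):
    """Hypothetical one-dimensional error with a local and a global minimum."""
    return w ** 4 - 3.0 * w ** 2 + w

def error_gradient(w):
    """Derivative of the error above."""
    return 4.0 * w ** 3 - 6.0 * w + 1.0

def descend(w, e=0.01, steps=2000):
    """Plain gradient descent with a fixed learning speed e."""
    for _ in range(steps):
        w -= e * error_gradient(w)
    return w

w_right = descend(2.0)    # ends in the local minimum near w = 1.13
w_left = descend(-2.0)    # ends in the global minimum near w = -1.30
print(w_right, error(w_right))
print(w_left, error(w_left))
```

Both runs stop where the gradient vanishes, but only the second one reaches the lower error value, so the result depends on the initial weights.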

In a simplified formulation of the BP algorithm, ∆E is a matrix used for step-by-step computation of the network error gradient.
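One way such a simplified algorithm might look in Python is sketched below. It is not the author's original listing: it assumes a single hidden layer, the standard sigmoid, a sum-of-squares error, a constant bias input of 1.0 in each layer, and uniform random initial weights. The matrices `dE1` and `dE2` play the role of ∆E, one per weight layer, and all names are illustrative.

```python
import math
import random

def sigmoid(xi):
    return 1.0 / (1.0 + math.exp(-xi))

def forward(x, w1, w2):
    """Feed-forward pass; a constant 1.0 is appended to each layer as a bias input."""
    xb = x + [1.0]
    h = [sigmoid(sum(xb[i] * w1[i][j] for i in range(len(xb))))
         for j in range(len(w1[0]))] + [1.0]
    y = [sigmoid(sum(h[j] * w2[j][r] for j in range(len(h))))
         for r in range(len(w2[0]))]
    return xb, h, y

def network_error(patterns, w1, w2):
    """Sum of the per-pattern errors 1/2 * sum (y - d)^2 over the training set."""
    return sum(0.5 * sum((yr - dr) ** 2
                         for yr, dr in zip(forward(x, w1, w2)[2], d))
               for x, d in patterns)

def train(patterns, n_hidden, e=0.5, steps=2000, seed=0):
    rng = random.Random(seed)
    n_in, n_out = len(patterns[0][0]), len(patterns[0][1])
    # t = 0: random initial weights (the range is an assumption).
    w1 = [[rng.uniform(-0.5, 0.5) for _ in range(n_hidden)]
          for _ in range(n_in + 1)]
    w2 = [[rng.uniform(-0.5, 0.5) for _ in range(n_out)]
          for _ in range(n_hidden + 1)]
    for t in range(1, steps + 1):
        # Gradient accumulators, reset at the start of every step.
        dE1 = [[0.0] * n_hidden for _ in range(n_in + 1)]
        dE2 = [[0.0] * n_out for _ in range(n_hidden + 1)]
        for x, d in patterns:
            xb, h, y = forward(x, w1, w2)
            # Output layer: (y_r - d_r) * y_r (1 - y_r).
            delta_out = [(y[r] - d[r]) * y[r] * (1.0 - y[r])
                         for r in range(n_out)]
            # Hidden layer: error propagated back through w2 (bias unit excluded).
            delta_hid = [sum(delta_out[r] * w2[j][r] for r in range(n_out))
                         * h[j] * (1.0 - h[j]) for j in range(n_hidden)]
            for j in range(n_hidden + 1):
                for r in range(n_out):
                    dE2[j][r] += delta_out[r] * h[j]
            for i in range(n_in + 1):
                for j in range(n_hidden):
                    dE1[i][j] += delta_hid[j] * xb[i]
        # Weight update: subtract e times the accumulated gradient.
        for j in range(n_hidden + 1):
            for r in range(n_out):
                w2[j][r] -= e * dE2[j][r]
        for i in range(n_in + 1):
            for j in range(n_hidden):
                w1[i][j] -= e * dE1[i][j]
    return w1, w2

# Logical OR as a tiny training set of (input, desired output) pairs.
patterns = [([0.0, 0.0], [0.0]), ([0.0, 1.0], [1.0]),
            ([1.0, 0.0], [1.0]), ([1.0, 1.0], [1.0])]
e_before = network_error(patterns, *train(patterns, 2, steps=0))
e_after = network_error(patterns, *train(patterns, 2, steps=2000))
print(e_before, e_after)
```

Calling `train` with `steps=0` returns the initial weights for the same seed, so the two printed values show the network error before and after adaptation.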

The claim to the discovery of BP is controversial. Initially, it was accepted to have been discovered independently by Rumelhart, Hinton and Williams (1986), Le Cun (1985), and Parker (1985). But it was mainly the work of Rumelhart et al. that made the model popular; they published their original article as [1].
