## Saturday, April 28, 2012

### The Back Propagation - Algorithm

Author: Marek Libra

A brief note prior reading of this Knol: I suggest to read

The Back Propagation (called BP later) is an algorithm used for adaptation of a multi-layered feed-forward neural network. It is based on the perceptron rule, but propagating the effect of the overall network error into inner parts of the network.

The network error computation on the training set is based on the complicated nonlinear function. It is a classical hard optimization problem to minimize the network error value and the BP tries to deal with it.

The BP uses gradient descent method which demands the differentiability of the network error function. Obviously, the differentiability of the network error function relies on the differentiability of the activation function (due to the feed-forward computation mechanism described in the preceeding Adaptation of Feed-Forward Neural Networks Knol).

The Knol you are reading aims for the standard sigmoid as the activation function in the description of the following equations. This function is commonly used in practical experiments.

Initially, the weight wij0 from the neuron i to the neuron j is set randomly for each interconnected neurons i and j at the time t=0.
The adaptation is computed in a loop of discrete steps (begin at time t = 1, increment t in each subsequent step).

The weights wt of time t ensue from the weights wt−1 of time t−1 by incrementing a negative gradient of the network error at the point wt−1 multiplied by a learning speed parameter e ∈ R:
The parameter e is set accordingly in interval 0 < e < 1 by the user or by some supporting algorithm. Its value can be changed during the learning process to tune the adaptation behavior (learning speed or skipping over local extremes).

The gradient in Eq. 1 is computed from
The function of the network can be considered as a composite function of neurons. We can differentiate the gradient in previous equotation using the rule for a composite function:

By usage of standard sigmoid and inner potential definition (described in Feed-Forward Neural Networks):

Next, the back propagation strategy is used for computation of the partial derivation :
where j is a set of neurons with j as its input.

Same as other gradient descent methods, the main disadvantage of the BP is in locating of a local
minimum
. It is very difficult to reach the global minimum with such approaches. Some heuristics
based a stochastic extension of BP or on tuning of the learning speed parameter exist but they
are not optimal in general.

Following simplified BP algorithm is provided for better description. The ∆E is a matrix used for step-by-step computation
The claim to discovery of the BP is controversial. Initially, it was accepted to have been discovered independently by Rumelhart, Hinton and Williams (1986), Le Cun (1985), and Parker (1985). But it was mainly the work by Rumelhart et. al. who made the model popular. He published his original article as .