Back-propagation

Feed-forward neural networks are inspired by the information processing of one or more biological neural cells (called neurons). A neuron accepts input signals via its dendrites, which pass the electrical signal down to the cell body. The axon carries the signal out to synapses, which are the connections of a cell's axon to other cells' dendrites. In a synapse, the electrical activity is converted into molecular activity (neurotransmitter molecules crossing the synaptic cleft and binding with receptors). The molecular binding develops an electrical signal that is passed on to the connected cell's dendrites. The Back-propagation algorithm is a training regime for multi-layer feed-forward neural networks and is not directly inspired by the learning processes of the biological system.

The information processing objective of the technique is to model a given function by modifying internal weightings of input signals to produce an expected output signal. The system is trained using a supervised learning method, where the error between the system's output and a known expected output is presented to the system and used to modify its internal state. State is maintained in a set of weightings on the input signals. The weights are used to represent an abstraction of the mapping of input vectors to the output signal for the examples the system was exposed to during training. Each layer of the network provides an abstraction of the information processing of the previous layer, allowing the combination of sub-functions and higher-order modeling.

The Back-propagation algorithm is a method for training the weights in a multi-layer feed-forward neural network. As such, it requires a network structure to be defined of one or more layers where one layer is fully connected to the next layer. A standard network structure is one input layer, one hidden layer, and one output layer. The method is primarily concerned with adapting the weights to the calculated error in the presence of input patterns, and the method is applied backward from the network output layer through to the input layer.

The following provides a high-level description of preparing a network using the Back-propagation training method. A weight is initialized for each input, plus an additional weight for a fixed bias constant input that is almost always set to 1.0. The activation of a single neuron for a given input pattern is calculated as follows:

activation = (sum_{k=1}^{n} w_k * x_ki) + (w_bias * 1.0)

where n is the number of weights and inputs, x_ki is the k-th attribute of the i-th input pattern, and w_bias is the bias weight. A logistic transfer function (sigmoid) is used to calculate the output of a neuron in [0, 1] and to provide nonlinearities between the input and output signals: output = 1/(1+exp(-a)), where a represents the neuron's activation.
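As a concrete illustration, the following is a minimal Python sketch of this activation and transfer for a single neuron, assuming the bias weight is stored as the last element of the weight vector:

    import math

    def activate(weights, inputs):
        # Weighted sum of the inputs, plus the bias weight driven by a
        # constant input of 1.0 (the bias weight is assumed to be stored last).
        activation = weights[-1] * 1.0
        for w, x in zip(weights[:-1], inputs):
            activation += w * x
        return activation

    def transfer(activation):
        # Logistic (sigmoid) transfer function, squashing the activation into [0, 1].
        return 1.0 / (1.0 + math.exp(-activation))

    # Example: the output of a neuron with two inputs plus a bias weight.
    output = transfer(activate([0.1, -0.4, 0.2], [1.0, 0.5]))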

The weight updates use the delta rule, specifically a modified delta rule where the error is propagated backward through the network, starting at the output layer and weighted back through the previous layers. The following describes the back-propagation of error and the weight updates for a single pattern.

An error signal is calculated for each node and propagated back through the network. For an output node, the error signal is the difference between the node's output and the expected output, scaled by the derivative of the node's output:

es_i = (c_i - o_i) * td_i

where es_i is the error signal for the i-th node, c_i is the expected output, and o_i is the actual output for the i-th node. The td_i term is the derivative of the output of the i-th node; if the sigmoid transfer function is used, td_i is o_i * (1 - o_i). For a hidden node, the error signal is the sum of the weighted error signals from the nodes in the next layer, again scaled by the derivative of the node's output:

es_i = (sum_{k=1}^{n} w_ik * es_k) * td_i

where es_i is the error signal for the i-th node, w_ik is the weight between the i-th node and the k-th node in the next layer, es_k is the error signal of the k-th node, and td_i is the derivative of the i-th node's output as before.
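A sketch of both error-signal calculations in Python, assuming the sigmoid transfer function (so the derivative term is o * (1 - o)); downstream_weights and downstream_signals are the weights and error signals of the nodes in the next layer that the hidden node feeds into:

    def output_error_signal(expected, output):
        # Error signal for an output node: (expected - actual) scaled by the
        # derivative of the sigmoid output.
        return (expected - output) * output * (1.0 - output)

    def hidden_error_signal(output, downstream_weights, downstream_signals):
        # Error signal for a hidden node: weighted sum of the error signals
        # from the next layer, scaled by the derivative of this node's output.
        propagated = sum(w * es for w, es in zip(downstream_weights, downstream_signals))
        return propagated * output * (1.0 - output)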

The error derivative for each weight is calculated by combining the input to the node with the node's error signal:

ed_ik = es_i * x_k

where ed_ik is the error derivative for the weight connecting the k-th node in the previous layer to the i-th node, es_i is the error signal for the i-th node, and x_k is the input from the k-th node in the previous layer. This process includes the bias weight, whose input has a constant value of 1.0.
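A sketch of the per-weight error derivatives for a single node, with the bias input appended as a constant 1.0 (matching the weight layout assumed in the earlier sketch):

    def error_derivatives(error_signal, inputs):
        # One derivative per incoming weight: the node's error signal times the
        # corresponding input, with the final entry covering the constant bias input.
        return [error_signal * x for x in inputs] + [error_signal * 1.0]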

Weights are updated in a direction that reduces the error assigned to each weight (the error derivative), metered by a learning coefficient:

w_ik(t+1) = w_ik(t) + (ed_ik * learn_rate)

where w_ik(t+1) is the updated weight, ed_ik is the error derivative for that weight, and learn_rate is an update coefficient parameter.
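The corresponding weight update, sketched in its online form (applied after each pattern):

    def update_weights(weights, derivatives, learn_rate):
        # Nudge each weight in the direction that reduces the error,
        # metered by the learning rate.
        return [w + learn_rate * ed for w, ed in zip(weights, derivatives)]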

The Back-propagation algorithm can be used to train a multi-layer network to approximate arbitrary non-linear functions and can be used for regression or classification problems.

Input and output values should be normalized such that x is in [0, 1]. The initial weights are typically small random values in [0, 0.5]. The weights can be updated in an online manner (after exposure to each input pattern) or in batch (after a fixed number of patterns have been observed). Batch updates are expected to be more stable than online updates for some complex problems.
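For example, these preparation steps could be sketched as follows, using min-max normalization for the data and small random values for the initial weights (minimum and maximum are the bounds observed in, or assumed for, the training data):

    import random

    def normalize(value, minimum, maximum):
        # Min-max scale a raw value into [0, 1] using the known or observed bounds.
        return (value - minimum) / (maximum - minimum)

    def initialize_weights(num_inputs):
        # Small random weights in [0, 0.5], plus one extra weight for the bias input.
        return [random.uniform(0.0, 0.5) for _ in range(num_inputs + 1)]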

A logistic (sigmoid) transfer function is commonly used to map the activation to an output value in [0, 1], although other transfer functions can be used such as the hyperbolic tangent (tanh), Gaussian, and softmax. It is good practice to expose the system to the input patterns in a different random order on each pass (epoch) through the input set.
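A small sketch of the random presentation order, together with the hyperbolic tangent as one of the alternative transfer functions mentioned above:

    import math
    import random

    def tanh_transfer(activation):
        # Hyperbolic tangent transfer function, mapping the activation into [-1, 1].
        return math.tanh(activation)

    def shuffled(patterns):
        # Present the training patterns in a different random order each epoch.
        order = list(patterns)
        random.shuffle(order)
        return order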

Typically a small number of layers is used, such as 2-4, given that an increase in the number of layers results in an increase in the complexity of the system and in the time required to train the weights. The learning rate can be varied during training, and it is common to introduce a momentum term to limit the rate of change. The weights of a given network can be initialized with a global optimization method before being refined using the Back-propagation algorithm.
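A sketch of a single weight update extended with a momentum term, where a fraction of the previous weight change is carried forward (momentum is a hypothetical coefficient, typically set below 1.0):

    def update_with_momentum(weight, derivative, prev_delta, learn_rate, momentum):
        # Combine the current gradient step with a fraction of the previous step
        # to damp oscillations in the update direction.
        delta = learn_rate * derivative + momentum * prev_delta
        return weight + delta, delta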

One output node is common for regression problems, whereas one output node per class is common for classification problems.
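For instance, the expected output vector differs by problem type: a single (normalized) target value for regression, or a one-hot vector with one entry per class for classification. A sketch with hypothetical values:

    # Regression: one output node, the target is the normalized value itself.
    regression_target = [0.42]

    # Classification: one output node per class, with a one-hot expected vector.
    num_classes = 3
    class_index = 1  # hypothetical class label
    classification_target = [1.0 if i == class_index else 0.0 for i in range(num_classes)]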