Introduction

Backpropagation of errors (“backpropagation,” or even just “backprop”) is a method for updating the parameters of every unit in a neural network based on the error observed in the output layer. Rumelhart, Hinton, and Williams (1986) introduced (or at least popularized) this concept.

This propagation is made possible by the fact that the weighted sum of inputs to a given node $j$, $\mathrm{net}_j = \sum_i w_{ij}\, o_i$, is a linear combination of two quantities: the outputs $o_i$ of its immediate predecessors and its own weights $w_{ij}$. Additionally, as long as the node has a differentiable activation function, we can use the chain rule to relate the error at the output to quantities upstream.
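
As a concrete, purely illustrative sketch of that weighted sum in Python (names and values are mine, not from the original): because $\mathrm{net}_j$ is linear in both the upstream outputs and the weights, perturbing a single weight changes $\mathrm{net}_j$ at a constant rate equal to the corresponding upstream output.

    def weighted_sum(w, o):
        # net_j = sum_i w_ij * o_i: linear in the weights w and the upstream outputs o
        return sum(w_i * o_i for w_i, o_i in zip(w, o))

    w = [0.5, -0.3, 0.8]   # illustrative weights into node j
    o = [1.0, 0.2, 0.7]    # illustrative outputs of the upstream nodes
    net_j = weighted_sum(w, o)                        # 0.5*1.0 + (-0.3)*0.2 + 0.8*0.7 = 1.0
    bumped = weighted_sum([w[0] + 0.01] + w[1:], o)   # nudge only the first weight
    # (bumped - net_j) / 0.01 equals o[0] (up to floating-point rounding):
    # the rate of change of net_j with respect to w[0] is the upstream output o[0]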

Rumelhart et al. present the argument for backpropagation in terms of a sigmoid activation function, but the same argument works for any differentiable activation function.

Rationale

Consider a neuron $j$ in the output layer of some network, with differentiable activation function $\varphi$. Its output is $o_j = \varphi(\mathrm{net}_j)$, where $\mathrm{net}_j = \sum_i w_{ij}\, o_i$ is the weighted sum of its inputs.

Suppose we also have some error function $E$. (Rumelhart, Hinton, and Williams (1986) use a sum-of-squares error, but there are others.)
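
For instance, a sum-of-squares error could be written as follows in Python; the factor of 1/2 is a common convention that simplifies the derivative, and the exact form here is an illustrative assumption rather than a quote from the paper.

    def sum_squared_error(targets, outputs):
        # E = 1/2 * sum_k (t_k - o_k)^2 over the output nodes
        return 0.5 * sum((t - o) ** 2 for t, o in zip(targets, outputs))

    sum_squared_error([1.0, 0.0], [0.8, 0.1])   # 0.5 * (0.04 + 0.01) = 0.025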

Given that we know the effect of a change in a weight $w_{ij}$ on the total error of the system $E$, we can update that weight to reduce the error through gradient descent.

That is, if we have a way to obtain $\frac{\partial E}{\partial w_{ij}}$, then we can update $w_{ij}$ by some definite amount, for example $w_{ij} \leftarrow w_{ij} - \eta\,\frac{\partial E}{\partial w_{ij}}$ for a small learning rate $\eta$. Doing this repeatedly, for every weight in the network, drives the error toward a minimum.
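
A minimal sketch of that update in Python, assuming a fixed learning rate eta (a hyperparameter chosen by the practitioner, not something the derivation determines):

    def gradient_descent_step(weights, gradients, eta=0.1):
        # w_ij <- w_ij - eta * dE/dw_ij: step each weight a small amount against its gradient
        return [w_ij - eta * g_ij for w_ij, g_ij in zip(weights, gradients)]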

Determining $\frac{\partial E}{\partial w_{ij}}$

We would like to know $\frac{\partial E}{\partial w_{ij}}$ for any weight $w_{ij}$ of any node $j$, where $j$ is the focal node and $i$ is an upstream neighbor.

To start, observe by the chain rule that

$$\frac{\partial E}{\partial w_{ij}} = \frac{\partial E}{\partial \mathrm{net}_j} \cdot \frac{\partial \mathrm{net}_j}{\partial w_{ij}}.$$

So all we need to do is obtain $\frac{\partial E}{\partial \mathrm{net}_j}$ and $\frac{\partial \mathrm{net}_j}{\partial w_{ij}}$ and we’re in business.

We can obtain $\frac{\partial E}{\partial \mathrm{net}_j}$ by using the chain rule to decompose it into two quantities that can be easily obtained:

$$\frac{\partial E}{\partial \mathrm{net}_j} = \frac{\partial E}{\partial o_j} \cdot \frac{\partial o_j}{\partial \mathrm{net}_j}.$$

If we choose a differentiable error function $E$ and activation function $\varphi$, then we can obtain both of these quantities analytically.
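
For example, assuming a sum-of-squares error of the form $E = \tfrac{1}{2}\sum_k (t_k - o_k)^2$ (where $t_j$ denotes the target value for output node $j$, a symbol not used above) and the sigmoid activation used by Rumelhart et al., both factors take simple closed forms:

$$\frac{\partial E}{\partial o_j} = o_j - t_j, \qquad \frac{\partial o_j}{\partial \mathrm{net}_j} = \varphi'(\mathrm{net}_j) = o_j\,(1 - o_j).$$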

Meanwhile, since $\mathrm{net}_j = \sum_i w_{ij}\, o_i$ is a linear combination of $j$’s inputs $o_i$ and $j$’s weights $w_{ij}$, we also know that

$$\frac{\partial \mathrm{net}_j}{\partial w_{ij}} = o_i.$$

Therefore, we have

$$\frac{\partial E}{\partial w_{ij}} = \frac{\partial E}{\partial o_j} \cdot \varphi'(\mathrm{net}_j) \cdot o_i,$$

where

  • $\frac{\partial E}{\partial o_j}$ represents the gradient of the error with respect to the output of node $j$;
  • $\varphi'(\mathrm{net}_j)$ represents the gradient of the activation function with respect to the weighted input $\mathrm{net}_j$; and
  • $o_i$ is the output of the upstream node $i$.

We’re done! We have everything we need to train every weight in our model. (For a hidden node $j$, $\frac{\partial E}{\partial o_j}$ is not given directly by the error function, but it can be computed recursively from the same quantities at $j$’s immediate successors; propagating these terms backward through the network is what gives backpropagation its name.)
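
To tie the pieces together, here is a small end-to-end sketch in Python for a single output neuron, assuming the sigmoid activation and the sum-of-squares error discussed above; all names and numbers are illustrative, not taken from the original.

    import math

    def sigmoid(net):
        # a differentiable activation function, as in Rumelhart et al.
        return 1.0 / (1.0 + math.exp(-net))

    # Illustrative values: upstream outputs o_i, weights w_ij into node j, target t_j
    o_prev = [1.0, 0.2, 0.7]
    w = [0.5, -0.3, 0.8]
    t_j = 1.0
    eta = 0.1   # learning rate (a hyperparameter, not fixed by the derivation)

    # Forward pass for output node j
    net_j = sum(w_ij * o_i for w_ij, o_i in zip(w, o_prev))
    o_j = sigmoid(net_j)

    # The three factors from the formula above (sum-of-squares error assumed)
    dE_do_j = o_j - t_j              # gradient of the error w.r.t. the node's output
    do_j_dnet_j = o_j * (1.0 - o_j)  # gradient of the activation w.r.t. the weighted input
    # dnet_j/dw_ij is simply o_i, the output of the upstream node

    # dE/dw_ij = (dE/do_j) * (do_j/dnet_j) * o_i, followed by one gradient-descent step
    grads = [dE_do_j * do_j_dnet_j * o_i for o_i in o_prev]
    w = [w_ij - eta * g for w_ij, g in zip(w, grads)]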