Backpropagation algorithm

The backpropagation algorithm is essential for training large neural networks quickly. This article explains how the algorithm works.

Simple neural network

Consider a simple neural network with one input node, one output node, and two hidden layers of two nodes each (see the network diagram at the end of this article).

Nodes in neighboring layers are connected with weights $w_{ij}$, which are the network parameters.

Activation function

Each node has a total input $x$, an activation function $f(x)$ and an output $y = f(x)$. $f(x)$ has to be a non-linear function, otherwise the neural network will only be able to learn linear models.

A commonly used activation function is the Sigmoid function: $f(x) = \frac{1}{1 + e^{-x}}$.
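
As a minimal sketch, the Sigmoid can be written as a one-line Python function (the name `sigmoid` is our own choice):

```python
import math

def sigmoid(x: float) -> float:
    # Sigmoid activation: squashes any real-valued input into the range (0, 1).
    return 1.0 / (1.0 + math.exp(-x))

print(sigmoid(0.0))  # 0.5
print(sigmoid(2.0))  # ~0.881
```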

Error function

The goal is to learn the weights of the network automatically from data such that the predicted output $y_{output}$ is close to the target $y_{target}$ for all inputs $x_{input}$.

To measure how far we are from the goal, we use an error function $E$. A commonly used error function is $E(y_{output}, y_{target}) = \frac{1}{2}(y_{output} - y_{target})^2$.
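
A sketch of the same error function in Python (the name `squared_error` is our own choice):

```python
def squared_error(y_output: float, y_target: float) -> float:
    # E = 1/2 * (y_output - y_target)^2
    return 0.5 * (y_output - y_target) ** 2

print(squared_error(0.8, 1.0))  # 0.02
```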

Forward propagation

We begin by taking an input example $(x_{input}, y_{target})$ and updating the input layer of the network.

For consistency, we consider the input to be like any other node, but without an activation function, so its output is equal to its input, i.e. $y_1 = x_{input}$.

Now, we update the first hidden layer. We take the output $y$ of the nodes in the previous layer and use the weights to compute the input $x$ of the nodes in the next layer:
$$x_j = \sum_{i \in \mathrm{in}(j)} w_{ij}\, y_i + b_j$$

Then we update the output of the nodes in the first hidden layer. For this we use the activation function, $f(x)$:
$$y = f(x)$$

Using these two formulas we propagate through the rest of the network and get the final output of the network:
$$x_j = \sum_{i \in \mathrm{in}(j)} w_{ij}\, y_i + b_j$$
$$y = f(x)$$
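
The sketch below applies these two formulas layer by layer for a hypothetical 1-2-2-1 network matching the article's layout; the weight, bias, and input values are made up purely for illustration:

```python
import math

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

def forward_layer(y_prev, weights, biases):
    # weights[i][j] is w_ij from node i in the previous layer to node j in this layer.
    outputs = []
    for j, b_j in enumerate(biases):
        x_j = sum(weights[i][j] * y_prev[i] for i in range(len(y_prev))) + b_j
        outputs.append(sigmoid(x_j))  # y_j = f(x_j)
    return outputs

y = [0.5]                                                    # input node: y1 = x_input
y = forward_layer(y, [[0.1, -0.2]], [0.0, 0.0])              # first hidden layer
y = forward_layer(y, [[0.3, 0.4], [-0.5, 0.6]], [0.0, 0.0])  # second hidden layer
y = forward_layer(y, [[0.7], [-0.8]], [0.0])                 # output layer
print(y[0])  # the network's prediction, y_output
```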

Error derivative

The backpropagation algorithm decides how much to update each weight of the network after comparing the predicted output with the desired output for a particular example. For this, we need to compute how the error changes with respect to each weight, $\frac{dE}{dw_{ij}}$.
Once we have the error derivatives, we can update the weights using a simple update rule:
$$w_{ij} = w_{ij} - \alpha \frac{dE}{dw_{ij}}$$
where $\alpha$ is a positive constant, referred to as the learning rate, which we need to fine-tune empirically.

[Note] The update rule is very simple: if the error goes down when the weight increases ($\frac{dE}{dw_{ij}} < 0$), then increase the weight; otherwise, if the error goes up when the weight increases ($\frac{dE}{dw_{ij}} > 0$), then decrease the weight.
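
A minimal sketch of this update rule in Python (the function name and the example gradient values are our own choices):

```python
def update_weight(w_ij: float, dE_dw_ij: float, alpha: float = 0.1) -> float:
    # Gradient descent: step against the direction of the error gradient.
    return w_ij - alpha * dE_dw_ij

print(update_weight(0.5, dE_dw_ij=-0.2))  # gradient < 0, so the weight grows to 0.52
print(update_weight(0.5, dE_dw_ij=0.2))   # gradient > 0, so the weight shrinks to 0.48
```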

Additional derivatives

To help compute $\frac{dE}{dw_{ij}}$, we additionally store two more derivatives for each node, describing how the error changes with
  • the total input of the node, $\frac{dE}{dx}$, and
  • the output of the node, $\frac{dE}{dy}$.

Back propagation

Let's begin backpropagating the error derivatives. Since we have the predicted output of this particular input example, we can compute how the error changes with that output. Given our error function $E = \frac{1}{2}(y_{output} - y_{target})^2$, we have:
$$\frac{dE}{dy_{output}} = y_{output} - y_{target}$$
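
In code, this first derivative is simply the difference between prediction and target (a sketch, names our own):

```python
def dE_dy_output(y_output: float, y_target: float) -> float:
    # Derivative of E = 1/2 * (y_output - y_target)^2 with respect to y_output.
    return y_output - y_target

print(dE_dy_output(0.8, 1.0))  # -0.2
```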

Now that we have $\frac{dE}{dy}$, we can get $\frac{dE}{dx}$ using the chain rule:
$$\frac{dE}{dx} = \frac{dy}{dx} \frac{dE}{dy} = \frac{d}{dx}f(x)\, \frac{dE}{dy}$$
where $\frac{d}{dx}f(x) = f(x)\,(1 - f(x))$ when $f(x)$ is the Sigmoid activation function.
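
A sketch of this step, assuming the Sigmoid activation from earlier (names and example values are our own):

```python
import math

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

def dE_dx(x: float, dE_dy: float) -> float:
    # Chain rule: dE/dx = f'(x) * dE/dy, with f'(x) = f(x) * (1 - f(x)) for the Sigmoid.
    fx = sigmoid(x)
    return fx * (1.0 - fx) * dE_dy

print(dE_dx(0.3, dE_dy=-0.2))
```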

As soon as we have the error derivative with respect to the total input of a node, we can get the error derivative with respect to the weights coming into that node:
$$\frac{dE}{dw_{ij}} = \frac{dx_j}{dw_{ij}} \frac{dE}{dx_j} = y_i\, \frac{dE}{dx_j}$$
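
As a one-line sketch (names and example values are our own):

```python
def dE_dw(y_i: float, dE_dx_j: float) -> float:
    # dE/dw_ij = y_i * dE/dx_j, because x_j depends on w_ij only through the term w_ij * y_i.
    return y_i * dE_dx_j

print(dE_dw(0.6, dE_dx_j=-0.05))  # -0.03
```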

And using the chain rule, we can also get $\frac{dE}{dy_i}$ for the nodes in the previous layer. We have come full circle:
$$\frac{dE}{dy_i} = \sum_{j \in \mathrm{out}(i)} \frac{dx_j}{dy_i} \frac{dE}{dx_j} = \sum_{j \in \mathrm{out}(i)} w_{ij}\, \frac{dE}{dx_j}$$
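
A sketch of this step for a single node $i$, given its outgoing weights and the next layer's $\frac{dE}{dx}$ values (names and example values are our own):

```python
def dE_dy_prev(w_out: list, dE_dx_next: list) -> float:
    # dE/dy_i = sum over outgoing connections j of w_ij * dE/dx_j.
    return sum(w_ij * g_j for w_ij, g_j in zip(w_out, dE_dx_next))

print(dE_dy_prev([0.7, -0.8], [-0.05, 0.02]))  # -0.051
```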

All that is left to do is repeat the previous three formulas until we have computed all the error derivatives.
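
Putting the pieces together, the sketch below runs one forward pass and one backward pass over a small hypothetical 1-2-1 network (smaller than the article's 1-2-2-1 layout, to stay short); all weight, input, and target values are made up for illustration, and biases are kept fixed for brevity:

```python
import math

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

# Hypothetical 1-2-1 network: weights[layer][i][j] is w_ij, biases[layer][j] is b_j.
weights = [[[0.1, -0.2]],        # input -> hidden
           [[0.3], [0.4]]]       # hidden -> output
biases = [[0.0, 0.0], [0.0]]
x_input, y_target, alpha = 0.5, 1.0, 0.1

# Forward pass: store the output y of every layer and the total input x of every node.
ys, xs = [[x_input]], []
for W, b in zip(weights, biases):
    x_layer = [sum(W[i][j] * ys[-1][i] for i in range(len(ys[-1]))) + b[j]
               for j in range(len(b))]
    xs.append(x_layer)
    ys.append([sigmoid(x) for x in x_layer])

print("prediction:", ys[-1][0], "error:", 0.5 * (ys[-1][0] - y_target) ** 2)

# Backward pass: dE/dy at the output, then repeat the three formulas layer by layer.
dE_dy = [ys[-1][0] - y_target]
for layer in reversed(range(len(weights))):
    y_prev, x_layer = ys[layer], xs[layer]
    # dE/dx_j = f'(x_j) * dE/dy_j, with f'(x) = f(x) * (1 - f(x)) for the Sigmoid.
    dE_dx = [sigmoid(x) * (1 - sigmoid(x)) * g for x, g in zip(x_layer, dE_dy)]
    # dE/dy_i for the previous layer, using the pre-update weights: sum_j w_ij * dE/dx_j.
    dE_dy = [sum(weights[layer][i][j] * dE_dx[j] for j in range(len(dE_dx)))
             for i in range(len(y_prev))]
    # dE/dw_ij = y_i * dE/dx_j, followed immediately by the gradient-descent update.
    for i in range(len(y_prev)):
        for j in range(len(dE_dx)):
            weights[layer][i][j] -= alpha * y_prev[i] * dE_dx[j]
```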

The end.

[Network diagram: a 1-2-2-1 network with input $x_{input}$, node outputs $y_i$, total inputs $x_i$, weights $w_{ij}$, the stored derivatives $dE/dy_i$, $dE/dx_i$, and $dE/dw_{ij}$ at each node and weight, and the error $E$ computed from $y_{output}$ and $y_{target}$.]