Backpropagation algorithm
The backpropagation algorithm is essential for training large neural networks
quickly. This article explains how the algorithm works.
Simple neural network
Consider a simple neural network with one input node, one output node,
and two hidden layers of two nodes each.
Nodes in neighboring layers are connected with weights $w_{ij}$, which are the network
parameters.
Activation function
Each node has a total input $x$, an activation function $f(x)$,
and an output $y = f(x)$.
$f(x)$ has to be a non-linear function, otherwise the neural network will only
be able to learn linear models.
A commonly used activation function is
the Sigmoid function:
$$f(x) = \frac{1}{1 + e^{-x}}.$$
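As a minimal illustration, here is the Sigmoid in Python (the function name is ours, not part of the article):

```python
import math

def sigmoid(x):
    # Sigmoid activation: squashes any real-valued input into the range (0, 1).
    return 1.0 / (1.0 + math.exp(-x))
```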
Error function
The goal is to learn the weights of the network automatically from data such that the predicted output $y_{output}$
is close to the target $y_{target}$ for all inputs $x_{input}$.
To measure how far we are from the goal, we use an error function $E$.
A commonly used error function is
$$E(y_{output}, y_{target}) = \tfrac{1}{2}\,(y_{output} - y_{target})^2.$$
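A one-line Python sketch of this error function (names are illustrative):

```python
def squared_error(y_output, y_target):
    # E(y_output, y_target) = 1/2 * (y_output - y_target)^2
    return 0.5 * (y_output - y_target) ** 2
```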
Forward propagation
We begin by taking an input example $(x_{input}, y_{target})$ and updating the input layer of the network.
For consistency, we consider the input to be like any other node, but without an activation function, so its output is equal to its input, i.e. $y_1 = x_{input}$.
Forward propagation
Now, we update the first hidden layer. We take the outputs $y_i$ of the nodes in the previous layer
and use the weights to compute the total inputs of the nodes in the next layer:
$$x_j = \sum_i w_{ij} \, y_i.$$
Forward propagation
Then we update the output of the nodes in the first hidden layer.
For this we use the activation function, $f(x)$:
$$y_j = f(x_j).$$
Forward propagation
Using these two formulas, we propagate through the rest of the network and get the final output of the network.
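To make the forward pass concrete, here is a small Python sketch. It assumes the weights are stored as nested lists (`weights[l][i][j]` connects node $i$ of layer $l$ to node $j$ of layer $l+1$) and omits bias terms, since the article does not use them; the layout and names are ours.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def forward(weights, x_input):
    # Returns the outputs y of every layer, starting with the input node,
    # whose output equals its input.
    ys = [[x_input]]
    for layer in weights:
        prev_y = ys[-1]
        n_next = len(layer[0])
        # Total input of each node in the next layer: weighted sum of the
        # previous layer's outputs, x_j = sum_i w_ij * y_i.
        xs = [sum(prev_y[i] * layer[i][j] for i in range(len(prev_y)))
              for j in range(n_next)]
        # Output of each node: y_j = f(x_j).
        ys.append([sigmoid(x) for x in xs])
    return ys
```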
Error derivative
The backpropagation algorithm decides how much to
update each weight of the network after comparing the predicted output with the desired output for a particular example.
For this, we need to compute how the error changes
with respect to each weight, $\frac{dE}{dw_{ij}}$.
Once we have the error derivatives, we can update the weights using a simple update rule:
$$w_{ij} \rightarrow w_{ij} - \alpha \frac{dE}{dw_{ij}},$$
where $\alpha$ is a positive constant, referred to as the learning rate, which we need to fine-tune empirically.
[Note] The update rule is very simple: if the error goes down when the weight increases ($\frac{dE}{dw_{ij}} < 0$),
then increase the weight; otherwise, if the error goes up when the weight increases ($\frac{dE}{dw_{ij}} > 0$),
then decrease the weight.
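A minimal sketch of the update rule in Python, assuming the error derivative for a weight has already been computed (the learning-rate value is only an example):

```python
def update_weight(w, dE_dw, alpha=0.1):
    # w -> w - alpha * dE/dw: if dE/dw < 0 the weight increases,
    # if dE/dw > 0 the weight decreases.
    return w - alpha * dE_dw
```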
Additional derivatives
To help compute $\frac{dE}{dw_{ij}}$, we additionally store two more derivatives for each node:
how the error changes with
- the total input of the node, $\frac{dE}{dx}$, and
- the output of the node, $\frac{dE}{dy}$.
Backpropagation
Let's begin backpropagating the error derivatives.
Since we have the predicted output of this particular input example, we can compute how the error changes with that output.
Given our error function $E = \tfrac{1}{2}\,(y_{output} - y_{target})^2$, we have:
$$\frac{dE}{dy_{output}} = y_{output} - y_{target}.$$
Backpropagation
Now that we have $\frac{dE}{dy}$, we can get $\frac{dE}{dx}$ using the chain rule:
$$\frac{dE}{dx} = \frac{dy}{dx}\,\frac{dE}{dy} = \frac{d}{dx}f(x)\,\frac{dE}{dy},$$
where $\frac{d}{dx}f(x) = f(x)\,\big(1 - f(x)\big)$ when
$f(x)$ is the Sigmoid activation function.
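In code, this step only needs the node's stored output $y = f(x)$ from the forward pass (a sketch with illustrative names):

```python
def dE_dx_from_dE_dy(y, dE_dy):
    # For the Sigmoid, dy/dx = f(x) * (1 - f(x)) = y * (1 - y),
    # so dE/dx = dy/dx * dE/dy.
    return y * (1.0 - y) * dE_dy
```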
Backpropagation
As soon as we have the error derivative with respect to the total input of a node,
we can get the error derivative with respect to the weights coming into that node:
$$\frac{dE}{dw_{ij}} = \frac{dx_j}{dw_{ij}}\,\frac{dE}{dx_j} = y_i \, \frac{dE}{dx_j}.$$
Backpropagation
And using the chain rule, we can also get $\frac{dE}{dy}$ for the previous layer:
$$\frac{dE}{dy_i} = \sum_j \frac{dx_j}{dy_i}\,\frac{dE}{dx_j} = \sum_j w_{ij} \, \frac{dE}{dx_j}.$$
We have come full circle.
Backpropagation
All that is left to do is repeat the previous three formulas until we have computed all the error derivatives.
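The following Python sketch ties the three formulas together for the layered, fully connected case, reusing the nested-list layout and the `forward` outputs from the earlier sketch; all names are ours and bias terms are again omitted.

```python
def backward(weights, ys, y_target):
    # Error derivatives dE/dw_ij for every weight, layer by layer,
    # moving from the output back towards the input.
    grads = [[[0.0] * len(row) for row in layer] for layer in weights]

    # Output node: dE/dy = y_output - y_target.
    dE_dy = [ys[-1][0] - y_target]

    for l in range(len(weights) - 1, -1, -1):
        y_prev, y_curr = ys[l], ys[l + 1]
        # dE/dx = f'(x) * dE/dy, with f'(x) = y * (1 - y) for the Sigmoid.
        dE_dx = [y_curr[j] * (1.0 - y_curr[j]) * dE_dy[j]
                 for j in range(len(y_curr))]
        # dE/dw_ij = y_i * dE/dx_j.
        for i in range(len(y_prev)):
            for j in range(len(y_curr)):
                grads[l][i][j] = y_prev[i] * dE_dx[j]
        # dE/dy_i for the previous layer: sum_j w_ij * dE/dx_j.
        dE_dy = [sum(weights[l][i][j] * dE_dx[j] for j in range(len(y_curr)))
                 for i in range(len(y_prev))]
    return grads
```

For example, calling `ys = forward(weights, x_input)` followed by `grads = backward(weights, ys, y_target)` and then applying the update rule to every weight performs one training step on a single example.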