Of the many, many different ideas put forward to explain, model, and experiment with artificial cognition, one of the most practically successful is the Artificial Neural Network. Not because we’ve managed to actually create cognition with one, but because of how useful it is for Machine Learning and Pattern Recognition. If you’ve ever heard of the latest buzz about “Deep Learning,” it has to do with that.
The inspiration, as you might have guessed, comes from biological neural networks, such as brains. An ANN can be seen as a collection of “neurons,” basic units of calculation, organised in many interconnected layers; it takes a vector of inputs and combines them many times, applying transformations to each such combination, until it reaches the output layer with the output of the computation.
In the above example, we have the inputs $x_1$, $x_2$, and $x_3$. Then, each hidden node $z_j$ combines them (each with its own weight) and applies a (generally nonlinear) transformation on them. After that, each output node $y_k$ does the same, with its own transformation applied, and giving us its result in the end.
The networks don’t have to be fully connected like the one above; we could remove some arrows. They also don’t have to have only two layers (not counting the inputs): they can have as many layers as we like (though it can be easily shown that if all the transformations are linear then the entire network is equivalent to a single-layer network). And while that might look all interesting and such, at first glance it may not look particularly useful – at least, not without many, many layers and thousands upon thousands of neurons, so we could approximate some part of a brain, for instance.
But there are some very interesting results about ANNs. One of the most astonishing is that, if the transformations applied by the nodes fall within a certain class of functions, then we need only two layers in order to be able to approximate any continuous function to arbitrary precision (though the theorem puts no bound on how many nodes we’d need for that), and only three layers to be able to approximate essentially any function at all (with the same caveat). And, like I said, ANNs are very useful for Machine Learning, which implies that they can learn these functions.
Let’s dig deeper.
More formally, an Artificial Neural Network is defined by three types of parametres:
- The interconnection pattern between the different layers of neurons;
- The learning process for updating the weights of the interconnections;
- The activation function that converts a neuron’s weighted input to its output activation.
In full generality, let $z_j^{(l)}$ be the output of the $j^{\text{th}}$ node of the $l^{\text{th}}$ layer. Then it is given by the following recursive relations:

$$z_j^{(l)} = h_j^{(l)}\!\left(a_j^{(l)}\right), \qquad a_j^{(l)} = \sum_i w_{ji}^{(l)} z_i^{(l-1)}, \qquad z_i^{(0)} = x_i$$

where:

- $x_i$ is the $i^{\text{th}}$ input node;
- $h_j^{(l)}$ is called an activation function for the $j^{\text{th}}$ node of the $l^{\text{th}}$ layer of the network;
- $a_j^{(l)}$ is a linear combination of all the outputs of the $(l-1)^{\text{th}}$ layer, and is the input of the $j^{\text{th}}$ node of the $l^{\text{th}}$ layer;
- $w_{ji}^{(l)}$ is the weight of the connection between the $i^{\text{th}}$ node of the $(l-1)^{\text{th}}$ layer and the $j^{\text{th}}$ node of the $l^{\text{th}}$ layer, except for $w_{j0}^{(l)}$, which is the bias of that node, and is multiplied by the constant $z_0^{(l-1)} \equiv 1$.
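To make the recursion concrete, here’s a minimal sketch of the forward computation in Python/NumPy; the function name, the shapes, and the choice of $\tanh$ hidden units are my own illustrative assumptions, not part of the definition above.

```python
import numpy as np

def forward(x, weights, activations):
    """Forward pass: weights[l] holds the w_ji of layer l+1 (bias in column 0),
    and activations[l] is that layer's activation function h."""
    z = np.asarray(x, dtype=float)
    for W, h in zip(weights, activations):
        z_with_bias = np.concatenate(([1.0], z))  # z_0 = 1 multiplies the bias w_j0
        a = W @ z_with_bias                       # a_j = sum_i w_ji z_i
        z = h(a)                                  # z_j = h(a_j)
    return z

# A 3-input, 4-hidden-node, 2-output network with tanh hidden units and
# identity outputs; all the numbers here are arbitrary illustrations.
rng = np.random.default_rng(0)
weights = [rng.standard_normal((4, 4)),   # 4 hidden nodes, 3 inputs + bias
           rng.standard_normal((2, 5))]   # 2 outputs, 4 hidden nodes + bias
activations = [np.tanh, lambda a: a]
print(forward([0.5, -1.0, 2.0], weights, activations))
```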
We will further define:

- $\textbf x_n$ as the $n^{\text{th}}$ input vector, so that it contains the values of each input node $x_i$ at that vector;
- $\textbf X$ as the matrix containing a set of inputs to the network (so the $n^{\text{th}}$ column is $\textbf x_n$);
- $\textbf w$ as the vector with all the weights $w_{ji}^{(l)}$ for all possible combinations of $i$, $j$, and $l$;
- $y_k(\textbf x_n, \textbf w)$ as the value of the $k^{\text{th}}$ output of the network when evaluated on the input vector $\textbf x_n$ with weight vector $\textbf w$;
- $\textbf y(\textbf x_n, \textbf w)$ as the output vector that contains the values of each output node of the network when evaluated on the input vector $\textbf x_n$ with weight vector $\textbf w$;
- $\textbf Y(\textbf X, \textbf w)$ as the output matrix that contains the values of each output vector, the $n^{\text{th}}$ column given by $\textbf y(\textbf x_n, \textbf w)$ and the $i^{\text{th}}$ line given by $\textbf y_i(\textbf X, \textbf w)$, which is an $N$-dimensional vector with the values of the $i^{\text{th}}$ output when each of the $\textbf x_n$ are given as input.
From the above definitions, it follows that, if our ANN has $L$ layers, then $y_k(\textbf x_n, \textbf w) = z_k^{(L)}[\textbf x_n]$ (where the brackets indicate that this is the value of that node of the network when its input vector is $\textbf x_n$).
The functions $h_j^{(l)}$ define the type-3 parametres of the network, its activation functions. Some of the $w_{ji}^{(l)}$ are constrained to be $0$, and those constraints define which connections are absent from the network, so that all others define connections that do exist; these, together with the activation functions (which can sometimes be the identity and serve as a “skip-layer” function), determine the interconnection pattern between the different layers. So we only need to define the learning process that will turn a fun abstract mathematical object into a powerful regression and classification algorithm.
We will use the convention that each of the activation functions of all nodes of the $l^{\text{th}}$ layer is either equal to the same function $h^{(l)}$ or equal to the identity, so as to serve as a “skip-layer” function.
In a future post I might explain in more depth what exactly Machine Learning algorithms are concerned with, but until I do, here’s a quick rundown.
Machine Learning is typically concerned with general algorithms that can look at a corpus of data, learn from it, and then be able to make predictions. It can be divided into a few categories, and two of them are what we call “classification” and “regression.”
- In a classification problem, a given datum can belong to one of a few classes, and after observing many data labelled with the correct classes, it’s the algorithm’s task to predict to which class new data will belong;
- Regression problems are similar, except instead of a datum belonging to one of a discrete set of classes, it’s an input to a function. Then, we have many data with the appropriate input-output information, and we want to approximate the function that would generate those pairs.
Neural Nets are awesome at this.
Let’s look at a regression problem, first. In addition to the definitions discussed above, let $\textbf t_n$ be the target vector for input $\textbf x_n$. This target vector is a value that only exists when we’re training our network, and we’re using input-output pairs to train it, so that we know what outputs we “want” a given input to give.
Furthermore, we’ll assume that this output is noisy; in other words, there is some function $\textbf f(\textbf x)$ that we’re trying to approximate with our $\textbf y(\textbf x, \textbf w)$ such that

$$\textbf t_n = \textbf f(\textbf x_n) + \boldsymbol\epsilon$$

where $\boldsymbol\epsilon$ is some form of noise that we’re trying to filter out.
As a concrete example, suppose our input vectors consist in a single real number $x_n$, and the outputs associated with them (i.e. the “target values” $t_n$) are also single real numbers, distributed according to the little squares in the picture:
Then we can talk about a probability $p(t_n \mid x_n)$. Since $y(x_n, \textbf w)$ is our approximation to the true function, one possible model for the noisy parametre is:

$$p(t_n \mid x_n, \textbf w, \beta) = \mathcal N\!\left(t_n \mid y(x_n, \textbf w), \beta^{-1}\right)$$

where $\beta$ is the precision of the noise. So, since we know that there’s noise in the data, we’re not really trying to predict the values $t_n$ exactly, but rather somehow to minimise the error we’d be making by using our function $y(x_n, \textbf w)$. In the more general, multivariate case, we could have

$$p(\textbf t_n \mid \textbf x_n, \textbf w, \beta) = \mathcal N\!\left(\textbf t_n \mid \textbf y(\textbf x_n, \textbf w), \beta^{-1}\textbf I\right)$$

where we’re assuming the noise of each target value for our input vector is independent of the noises of the other values.
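As a tiny sketch of what “noisy targets” means in practice, here’s one way to generate $(x_n, t_n)$ pairs as $t_n = f(x_n) + \epsilon$ with Gaussian noise of precision $\beta$; the particular $f$ and all the numbers are just illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
f = np.sin                      # some "true" function the network never sees directly
beta = 25.0                     # noise precision; the noise variance is 1/beta
x = rng.uniform(0.0, 2 * np.pi, size=50)
t = f(x) + rng.normal(0.0, 1.0 / np.sqrt(beta), size=x.shape)
# The network only ever sees (x, t); its job is to recover something close to f.
```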
We want our network to approximate the function that associates our $\textbf x_n$ with their desired outputs $\textbf t_n$, but we can’t touch the outputs and the inputs themselves; we can only change our weights $\textbf w$. One possible way, then, to minimise the error our network will be making is to take the likelihood function,

$$p(\textbf T \mid \textbf X, \textbf w, \beta) = \prod_{n=1}^N p(\textbf t_n \mid \textbf x_n, \textbf w, \beta)$$

where $\textbf T$ is the matrix with the target values for each input in the training set, and try to maximise it. That happens to be equivalent to minimising the inverse of the above, or, even better, minimising the negative of the logarithm of the above. Then we have an error function:

$$E(\textbf w, \beta) = \frac{\beta}{2}\sum_{n=1}^N \left\lVert \textbf y(\textbf x_n, \textbf w) - \textbf t_n \right\rVert^2 - \frac{NK}{2}\ln\beta + \frac{NK}{2}\ln(2\pi)$$

where $NK$ is the total number of target values ($N$ input vectors times $K$ outputs). Since, like I said, the only part under our control are the weights $\textbf w$, what I really need to minimise is the sum-of-squares function:

$$E(\textbf w) = \frac 1 2 \sum_{n=1}^N \left\lVert \textbf y(\textbf x_n, \textbf w) - \textbf t_n \right\rVert^2$$

If we call $\textbf w_{\text{ML}}$ the solution to the minimisation of the above (since this is a Maximum Likelihood solution), we can find the value of $\beta$ for the distribution:

$$\frac{1}{\beta_{\text{ML}}} = \frac{1}{NK}\sum_{n=1}^N \left\lVert \textbf y(\textbf x_n, \textbf w_{\text{ML}}) - \textbf t_n \right\rVert^2$$
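In code, the sum-of-squares error and the maximum-likelihood precision look like this; a rough sketch assuming (my convention) that `Y` holds the network outputs and `T` the targets, both as arrays of shape `(N, K)`.

```python
import numpy as np

def sum_of_squares_error(Y, T):
    """E(w) = 1/2 * sum_n ||y(x_n, w) - t_n||^2, with Y and T of shape (N, K)."""
    return 0.5 * np.sum((Y - T) ** 2)

def beta_ml(Y, T):
    """1/beta_ML = (1/(N*K)) * sum_n ||y(x_n, w_ML) - t_n||^2."""
    N, K = T.shape
    return N * K / np.sum((Y - T) ** 2)
```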
Let’s look at the classification problem now. First of all, suppose I have only two classes, $\mathcal C_1$ and $\mathcal C_2$, to which a given input can belong. My training set consists of input-class pairs, so the target values are $t_n \in \{0, 1\}$, with $t_n = 1$ meaning that $\textbf x_n$ belongs to class $\mathcal C_1$. Then we can express the probability that a given input is in class $\mathcal C_1$ by:

$$p(\mathcal C_1 \mid \textbf x) = \sigma(a)$$

where $a = \ln \dfrac{p(\textbf x \mid \mathcal C_1)\, p(\mathcal C_1)}{p(\textbf x \mid \mathcal C_2)\, p(\mathcal C_2)}$ (which is the total evidence for class $\mathcal C_1$) and $\sigma$ is known as the sigmoid function:

$$\sigma(a) = \frac{1}{1 + e^{-a}}$$

Then if the activation function of the output node (singular, since a two-class classifier only needs a single output) is a sigmoid, $h^{(L)} = \sigma$, we can interpret the output of the network itself, $y(\textbf x, \textbf w)$, as the conditional probability that input vector $\textbf x$ is in class $\mathcal C_1$ (with the probability that it’s in class $\mathcal C_2$ given by $1 - y(\textbf x, \textbf w)$), and:

$$p(t \mid \textbf x, \textbf w) = y(\textbf x, \textbf w)^{t}\left(1 - y(\textbf x, \textbf w)\right)^{1-t}$$
The negative log likelihood for that, which we can use as our error function, is:

$$E(\textbf w) = -\sum_{n=1}^N \left\{ t_n \ln y_n + (1 - t_n)\ln(1 - y_n) \right\}$$

where $y_n \equiv y(\textbf x_n, \textbf w)$.
In the case of $K$ independent binary classifications, the error function becomes:

$$E(\textbf w) = -\sum_{n=1}^N \sum_{k=1}^K \left\{ t_{nk} \ln y_{nk} + (1 - t_{nk})\ln(1 - y_{nk}) \right\}$$
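Here’s a minimal sketch of those two pieces in Python/NumPy (the clipping is just a numerical guard I’m adding, not part of the maths); for $K$ independent binary classifications the same function works unchanged when `y` and `t` are arrays of shape `(N, K)`, since the sum then ranges over both $n$ and $k$.

```python
import numpy as np

def sigmoid(a):
    """sigma(a) = 1 / (1 + exp(-a))."""
    return 1.0 / (1.0 + np.exp(-a))

def binary_cross_entropy(y, t):
    """E(w) = -sum_n { t_n ln y_n + (1 - t_n) ln(1 - y_n) }."""
    y = np.clip(y, 1e-12, 1 - 1e-12)  # avoid log(0)
    return -np.sum(t * np.log(y) + (1 - t) * np.log(1 - y))
```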
Now, in the case of $K$ different classes the input can belong to, we can use something we call a 1-of-$K$ coding scheme, in which we have $K$ target variables $t_k$, with $t_k = 1$ if $\textbf x$ is in class $\mathcal C_k$ and $t_k = 0$ otherwise. In that case, the probability that $\textbf x$ is in class $\mathcal C_k$ is:

$$p(\mathcal C_k \mid \textbf x) = \frac{\exp(a_k)}{\sum_j \exp(a_j)}$$

Where $a_k = \ln\left(p(\textbf x \mid \mathcal C_k)\, p(\mathcal C_k)\right)$ and the function above is known as the softmax function. If we follow the same procedure as before, setting the activation function of the $K$ output nodes to the softmax, our error function becomes

$$E(\textbf w) = -\sum_{n=1}^N \sum_{k=1}^K t_{nk} \ln y_{nk}$$

where $y_{nk} \equiv y_k(\textbf x_n, \textbf w)$.
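And a rough sketch of the softmax and its cross-entropy error (again, the names are my own, and the max-shift inside the softmax is just a standard numerical trick that doesn’t change the result):

```python
import numpy as np

def softmax(a):
    """p(C_k | x) = exp(a_k) / sum_j exp(a_j), computed along the last axis."""
    a = a - np.max(a, axis=-1, keepdims=True)   # shift for numerical stability
    e = np.exp(a)
    return e / np.sum(e, axis=-1, keepdims=True)

def cross_entropy(Y, T):
    """E(w) = -sum_n sum_k t_nk ln y_nk, with T in 1-of-K coding, shape (N, K)."""
    return -np.sum(T * np.log(np.clip(Y, 1e-12, None)))
```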
So now we have a few possible ways to measure the error we’re making when we perform regression or try to classify an input. How do we then train the network parametres to minimise that error? In other words, how do we learn?
Well, thinkin’ about the maths of the thing, it’s a matter of optimisation: we need to find a minimum of the error function, a point where

$$\nabla E(\textbf w) = \textbf 0$$

and the Hessian $\textbf H = \nabla\nabla E(\textbf w)$ is positive-definite. Except that only finds a local minimum for the function, and in ANNs, there are lots of those.
Suppose, for instance, that the activation functions of a given layer are all even. In that case, if we invert the sign of all of that layer’s weights, we find a network that’s literally the exact same network as before. And in full generality, there are many such symmetries in an arbitrary ANN, which means that for any point in our search space there are many other points that are equivalent to it, and in particular there are many equivalent local and global minima. Any learning algorithm for an ANN has to deal with this.
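A quick numerical check of that symmetry, using $\cos$ (an even function) as the hidden activation; the network shapes and numbers are arbitrary illustrations.

```python
import numpy as np

rng = np.random.default_rng(2)
W1 = rng.standard_normal((4, 4))   # hidden-layer weights, biases in column 0
W2 = rng.standard_normal((2, 5))   # output-layer weights, biases in column 0
x = rng.standard_normal(3)

def net(W1, W2):
    z1 = np.cos(W1 @ np.concatenate(([1.0], x)))   # cos is even: cos(-a) == cos(a)
    return W2 @ np.concatenate(([1.0], z1))        # linear output layer

# Flipping the sign of every weight (and bias) feeding the hidden layer
# leaves the network's outputs untouched:
print(np.allclose(net(W1, W2), net(-W1, W2)))      # True
```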
But for now, we’ll suppose it’s been dealt with. Nonlinear optimisation is a fascinating field unto itself; let’s not steal their cake. However, there’s still more our ANN can do for us, to aid in our nonlinear optimisation.
In general, nonlinear optimisation needs the derivatives of the function it’s trying to optimise, or at least its gradient, and to find the gradient of the error function $E(\textbf w)$, we need to find $\dfrac{\partial E}{\partial w_{ji}^{(l)}}$ for all weights of the weight vector $\textbf w$ (since it’s the weights we’ll be tweaking). Applying the chain rule for partial derivatives gives us:

$$\frac{\partial E}{\partial w_{ji}^{(l)}} = \frac{\partial E}{\partial a_j^{(l)}} \frac{\partial a_j^{(l)}}{\partial w_{ji}^{(l)}}$$
Let’s go back to our definition of the input of the $j^{\text{th}}$ node of the $l^{\text{th}}$ layer of a network:

$$a_j^{(l)} = \sum_i w_{ji}^{(l)} z_i^{(l-1)}$$

Where the $z_i^{(l-1)}$ are the outputs of the previous layer. Then:

$$\frac{\partial a_j^{(l)}}{\partial w_{ji}^{(l)}} = z_i^{(l-1)}$$

As for the other term, let’s call it:

$$\delta_j^{(l)} \equiv \frac{\partial E}{\partial a_j^{(l)}}$$

Whence:

$$\frac{\partial E}{\partial w_{ji}^{(l)}} = \delta_j^{(l)} z_i^{(l-1)}$$

We can easily calculate the $z_i^{(l-1)}$ just by evaluating an input on the network. Now how do we calculate the $\delta_j^{(l)}$?
At the last layer of the network, $L$, the output of the $k^{\text{th}}$ node is $\textbf y_k(\textbf X, \textbf w)$ (remember that $\textbf X$ is the matrix with all the training inputs given and $\textbf y_k(\textbf X, \textbf w)$ is the $k^{\text{th}}$ line of the output matrix $\textbf Y(\textbf X, \textbf w)$), and any good error function will have the form $E(\textbf Y(\textbf X, \textbf w), \textbf T)$, i.e. it will be a function of the outputs on a given input (or set of inputs) and the respective target values. Furthermore, from the way the ANN is built, we have that:

$$y_k = z_k^{(L)} = h_k^{(L)}\!\left(a_k^{(L)}\right)$$

So, in general, in a network with $K$ outputs, we’ll have:

$$\delta_k^{(L)} = \frac{\partial E}{\partial a_k^{(L)}} = \frac{\partial E}{\partial y_k}\, h_k^{(L)\prime}\!\left(a_k^{(L)}\right)$$

Where $\dfrac{\partial E}{\partial y_k}$ is the partial derivative of $E$ with respect to its $k^{\text{th}}$ component, whereas $h_k^{(L)\prime}$ is the derivative of $h_k^{(L)}$.
What about the $\delta_j^{(l)}$ in layers that are not the output layer? Well, we can use the chain rule again:

$$\delta_j^{(l)} = \frac{\partial E}{\partial a_j^{(l)}} = \sum_k \frac{\partial E}{\partial a_k^{(l+1)}} \frac{\partial a_k^{(l+1)}}{\partial a_j^{(l)}}$$

where the sum ranges over all nodes $k$ of the $(l+1)^{\text{th}}$ layer of the network to which node $j$ of the $l^{\text{th}}$ layer sends connections. The first term in the sum of the right-hand side is just $\delta_k^{(l+1)}$. The second, though, will use our definition above:

$$a_k^{(l+1)} = \sum_i w_{ki}^{(l+1)} z_i^{(l)} = \sum_i w_{ki}^{(l+1)} h_i^{(l)}\!\left(a_i^{(l)}\right) \implies \frac{\partial a_k^{(l+1)}}{\partial a_j^{(l)}} = w_{kj}^{(l+1)} h_j^{(l)\prime}\!\left(a_j^{(l)}\right)$$

Whence:

$$\delta_j^{(l)} = h_j^{(l)\prime}\!\left(a_j^{(l)}\right) \sum_k w_{kj}^{(l+1)} \delta_k^{(l+1)}$$
This suggests a very nifty way of calculating the derivatives of the error function with respect to the weights, so that we can calculate its gradient and learn, and is the basis of the error backpropagation algorithm. A very appropriate name, too. Intuitively, we calculate the $\delta_k^{(L)}$ of the output nodes, which can be calculated directly and easily; then, we use those to calculate the $\delta_j^{(L-1)}$ of the previous layer; and so on, until we have calculated all of them for the entire network and we have the value of whichever derivatives of the error function we might like.
Furthermore, in the cases of the error functions we’re interested in, they’re actually sums of the errors over all the inputs, i.e.

$$E(\textbf w) = \sum_{n=1}^N E_n(\textbf w)$$

where $E_n$ is the error evaluated only on input $\textbf x_n$. This suggests then our algorithm:
Error Backpropagation Algorithm
- Apply an input vector $\textbf x_n$ to the network and “forward propagate” through the network (i.e. calculate the outputs of all nodes of the network on that vector);
- Evaluate the $\delta_k^{(L)}$ of the output nodes;
- “Backpropagate” the $\delta_j^{(l)}$ through the network;
- Use them to calculate the derivatives of the error function.
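Here’s a compact sketch of that whole procedure for a two-layer regression network with tanh hidden units, identity outputs, and the sum-of-squares error (so the output-layer delta is simply $y_k - t_k$); the shapes and names are my own choices, and the finite-difference check at the end is just there to confirm the formulas agree.

```python
import numpy as np

def forward(x, W1, W2):
    """Forward propagate one input; biases handled via a constant z_0 = 1."""
    z0 = np.concatenate(([1.0], x))
    a1 = W1 @ z0                               # inputs to the hidden layer
    z1 = np.concatenate(([1.0], np.tanh(a1)))  # hidden outputs, plus the bias unit
    a2 = W2 @ z1                               # inputs to the (identity) output layer
    return z0, a1, z1, a2                      # a2 is also y(x, w)

def backprop(x, t, W1, W2):
    """Return dE_n/dW1 and dE_n/dW2 for E_n = 1/2 ||y - t||^2."""
    z0, a1, z1, y = forward(x, W1, W2)
    delta2 = y - t                                        # deltas at the output layer
    # Backpropagate: delta_j = h'(a_j) * sum_k w_kj delta_k (bias column excluded)
    delta1 = (1.0 - np.tanh(a1) ** 2) * (W2[:, 1:].T @ delta2)
    return np.outer(delta1, z0), np.outer(delta2, z1)     # dE/dw_ji = delta_j z_i

# Finite-difference check on one weight, just to see that the formulas agree.
rng = np.random.default_rng(3)
W1, W2 = rng.standard_normal((4, 4)), rng.standard_normal((2, 5))
x, t = rng.standard_normal(3), rng.standard_normal(2)

g1, g2 = backprop(x, t, W1, W2)
E = lambda W1_, W2_: 0.5 * np.sum((forward(x, W1_, W2_)[3] - t) ** 2)
eps = 1e-6
W1p = W1.copy(); W1p[2, 1] += eps
print(g1[2, 1], (E(W1p, W2) - E(W1, W2)) / eps)  # should be nearly equal
```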
What you do after you’ve calculated this value for all inputs is more general: there are several different algorithms invented to deal with that, and, like I said, we fall into the field of nonlinear optimisation, which is definitely not what this post is about. Still, it is quite interesting to note the way networks help with learning and optimisation and stuff.
When I understand a proof of the universal approximation theorem I’ll write a post explaining it, but until then, here’s a link.