Gaussian Process, not quite for dummies
Before diving in
I have procrastinated reading up about Gaussian process for many many moons. However, as always, I’d like to think that this is not just due to my procrastination superpowers. Whenever I look up “Gaussian Process” on Google, I find these well-written tutorials with vivid plots that explain everything up until non-linear regression in detail, but shy away at the very first glimpse of any sort of information theory. The key takeaway is always,
A Gaussian process is a probability distribution over possible functions that fit a set of points.
While memorising this sentence does help if some random stranger comes up to you on the street and asks for a definition of Gaussian Process – which I’m sure happens all the time – it doesn’t get you much further beyond that. In what range does the algorithm search for “possible functions”? What gives it the capacity to model things on a continuous, infinite space?
Confused, I turned to the “the Book” in this area, Gaussian Processes for Machine Learning by Carl Edward Rasmussen and Christopher K. I. Williams. I have friends working in more statistical areas (excuse the ambiguous term, don’t want to get into a cat fight here) who swear by this book, but after spending half an hour just to read 2 pages about linear regression I went straight into an existential crisis. I’m sure it’s a great book, but the math is quite out of my league.
So what more is there? Thankfully I found this lecture by Dr. Richard Turner on YouTube, which was a great introduction to GP and some of its state-of-the-art approaches. After watching this video, reading the Gaussian Processes for Machine Learning book became a lot easier. So I decided to compile some notes for the lecture, and hopefully, if one day I do decide to put it on my website, it will help some other people – those who are eager to do more than just scratch the surface of GP by reading some “machine learning for dummies” tutorial, but don’t quite have the claws to take on a textbook.
Acknowledgement: the figures in this blog are from Dr. Richard Turner’s talk “Gaussian Processes: From the Basic to the State-of-the-Art”, which I highly recommend! Have a lookie here: Portal to lecture.
Motivation: non-linear regression
Of course, like almost everything in machine learning, we have to start from regression. Let’s revisit the problem: somebody comes to you with some data points (red points in image below), and we would like to make some prediction of the value of y at some new value of x.
In non-linear regression, we fit some nonlinear curves to observations. The higher the degree of the polynomial you choose, the better it will fit the observations.
This sort of traditional non-linear regression, however, typically gives you one function that it considers to fit these observations the best. But what about the other ones that are also pretty good? What if we observe one more point, and one of those other curves ends up being a much better fit than the “best” solution?
To solve this problem, we turn to the good Ol’ Gaussians.
The world of Gaussians
Recap
Here we cover the basics of multivariate Gaussian distribution. If you’re already familiar with this, skip to the next section 2D Gaussian Examples.
Definition
Multivariate Gaussian distribution is also known as joint normal distribution. It is the generalisation of the univariate Gaussian to higher-dimensional spaces. Formally, the definition is:
A random variable is said to be k-variate normally distributed if every linear combination of its k components has a univariate normal distribution.
Mathematically, X = (X1, …, Xk)^T is k-variate normally distributed if every linear combination Y = a1·X1 + … + ak·Xk is univariate normally distributed.
Note: if all k components are independent Gaussian random variables, then X is k-variate normally distributed – a sum of independent Gaussian random variables is again Gaussian.
Another note: the sum of random variables is different from the sum of distributions – summing two Gaussian distributions gives you a Gaussian mixture, which is not Gaussian except in special cases. Not sure why but I temporarily forgot about this and was extremely confused by everything I read about Gaussians for a very long time, so just putting this here as a note to self.
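If it helps, here is a tiny numpy sketch (my own illustration, not from the lecture) of the difference: summing samples of two independent Gaussian random variables gives another Gaussian, while pooling samples from two Gaussian distributions gives a bimodal mixture.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

a = rng.normal(loc=-2.0, scale=1.0, size=n)   # samples of X ~ N(-2, 1)
b = rng.normal(loc=3.0, scale=0.5, size=n)    # samples of Y ~ N(3, 0.25)

# Sum of the random variables: X + Y is Gaussian, N(1, 1.25)
s = a + b
print(s.mean(), s.var())                      # ~1.0 and ~1.25, unimodal

# "Sum" of the distributions (a 50/50 mixture): NOT Gaussian, its histogram has two bumps
mix = np.where(rng.random(n) < 0.5, a, b)
print(mix.mean(), mix.var())
```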
Independence conditions
If all components of X are jointly Gaussian, then they are independent if and only if they are uncorrelated, i.e. if the off-diagonal entries of the covariance matrix are all 0. This equivalence is special to Gaussians – for general random variables, uncorrelated does not imply independent.
2D Gaussian Examples
Covariance matrix
Here is an example of a 2D Gaussian distribution with mean 0, with the oval contours denoting points of constant probability.
The covariance matrix, denoted as Σ, determines the shape of these contours: the diagonal entries are the variances of y1 and y2, while the off-diagonal entries describe how strongly the two are correlated.
Note that when y1 and y2 are correlated, the contours are tilted ellipses – knowing the value of one variable tells you something about the likely value of the other.
Conditioning
With multivariate Gaussian, another fun thing we can do is conditioning. In 2D, we can demonstrate this graphically:
We fix the value of y1 to the observed value and take a slice through the joint distribution at that value. The resulting conditional distribution over y2 is again Gaussian, with a shifted mean and a reduced variance.
We can also visualise how this conditional Gaussian changes as the correlation drops – when the correlation is 0, observing y1 tells us nothing about y2, so the conditional distribution is just the marginal distribution of y2.
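For reference, here is the standard bivariate result behind these pictures (my addition, not from the lecture): if y1 and y2 are zero-mean with unit variance and correlation ρ, then

p(y2 | y1) = N(ρ·y1, 1 − ρ²)

so when ρ = 0 the conditional is just the marginal N(0, 1), and as ρ approaches 1 it collapses onto y2 = y1 with vanishing variance.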
High dimensional gaussian: a new interpretation
2D Gaussian
The oval contour graph of Gaussian, while providing information on the mean and covariance of our multivariate Gaussian distribution, does not really give us much intuition on how the random variables correlate with each other during the sampling process.
Therefore, consider this new interpretation that can be plotted as such:
Take the oval contour graph of the 2D Gaussian (left-top in below image) and choose a random point on the graph. Then, plot the values of y1 and y2 of that point against their indices 1 and 2 on a separate graph, with the index on the x-axis and the value on the y-axis.
Under this setting, we can now visualise the sampling operation in a new way by taking multiple “random points” and plotting y1 and y2 against their indices – each sample from the 2D Gaussian becomes a pair of connected points on the index graph.
For conditioning, we can simply fix one of the endpoints on the index graph (in the below plots, y1 is fixed to its observed value) and sample the other endpoint from the conditional distribution.
Higher dimensional Gaussian
5D Gaussian
Now we can consider a higher-dimensional Gaussian, starting from 5D. The covariance matrix is now 5×5.
Take a second to have a good look at the covariance matrix, and notice:
- All variances (diagonal entries) are equal to 1;
- The further away the indices of two points are, the less correlated they are. For instance, the correlation between y1 and y2 is quite high, y1 and y3 lower, and y1 and y4 the lowest (the short sketch below reproduces this pattern).
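As a rough sketch (my own, using an RBF kernel over the integer indices 1 to 5 with a made-up lengthscale), you can reproduce this covariance structure yourself:

```python
import numpy as np

# RBF kernel evaluated at the integer indices 1..5; the lengthscale is a made-up choice
idx = np.arange(1, 6, dtype=float)
l = 2.0
cov = np.exp(-0.5 * (idx[:, None] - idx[None, :]) ** 2 / l ** 2)

print(np.round(cov, 2))
# Diagonal entries are all 1; off-diagonal entries decay as |i - j| grows,
# e.g. cov[0, 1] > cov[0, 2] > cov[0, 3] > cov[0, 4]
```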
We can again condition on y1 and sample the remaining four points from the conditional distribution; plotted against their indices, the samples already start to look like wiggly functions pinned down at y1.
20D Gaussian
To make things more intuitive, for the 20D Gaussian we replace the numerical covariance matrix with a colour map, with warmer colours indicating higher correlation:
Now look at what happens to the 20D Gaussian when we condition on a couple of its points:
Hopefully some of you are now like: “Ah, this is looking exactly like the nonlinear regression problem we started with!” and yes, indeed, this is exactly like a nonlinear regression problem, where the conditioned points are our observations and all the remaining points are locations at which we want to make predictions.
We can see from the above image that, because of how the covariance matrix is structured (i.e. closer points have higher correlation), the points close to the observations have very low uncertainty and non-zero means, whereas the ones further away have high uncertainty and means pulled back towards zero. (Note that in reality we don’t actually have to do Monte Carlo sampling to compute the mean and uncertainty – they are completely analytical, as we will see below.)
Here we also offer a slightly more exciting example where we condition on 4 points of the 20D Gaussian (and you wonder why everybody hates statisticians):
Getting “real”
The problem with this approach for nonlinear regression seems obvious – it feels like the x values have to be integers because they are indices, while in reality we want to model observations at arbitrary real-valued inputs. One obvious workaround is to keep increasing the dimensionality of the Gaussian and calculate many, many points close to the observations, but that is clumsy.
The solution lies in how the covariance matrix is generated. Conventionally, Σ is built by evaluating a covariance (kernel) function K at pairs of input locations, with observation noise added on the diagonal:
Σ(x1, x2) = K(x1, x2) + I·σ_y²
The covariance matrices in all of the above examples are computed using the Radial Basis Function (RBF) kernel, K(x1, x2) = σ²·exp(−(x1 − x2)² / (2·l²)).
This means that, in principle, we can evaluate this covariance function for any real-valued x1 and x2 – the inputs no longer have to be integer indices.
Now that, is a Gaussian process (mic drop).
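To make this concrete, here is a minimal sketch (my own, assuming an RBF kernel with made-up hyperparameters) of building the covariance matrix at arbitrary real-valued inputs and sampling a few functions from the resulting multivariate Gaussian:

```python
import numpy as np

def rbf_kernel(x1, x2, sigma=1.0, l=1.0):
    """RBF covariance between two sets of 1D inputs (any real values)."""
    sqdist = (x1[:, None] - x2[None, :]) ** 2
    return sigma ** 2 * np.exp(-0.5 * sqdist / l ** 2)

# Any real-valued inputs we like -- no longer restricted to integer indices
x = np.linspace(-5, 5, 200)
K = rbf_kernel(x, x)

# Draw three "functions" from the prior N(0, K); the jitter keeps K numerically positive definite
rng = np.random.default_rng(1)
samples = rng.multivariate_normal(np.zeros(len(x)), K + 1e-8 * np.eye(len(x)), size=3)
```

Plotting each row of samples against x gives exactly the kind of smooth random functions shown in the figures above.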
Gaussian Process
Textbook definition
From the above derivation, you can view Gaussian process as a generalisation of multivariate Gaussian distribution to infinitely many variables. Here we also provide the textbook definition of GP, in case you had to testify under oath:
A Gaussian process is a collection of random variables, any finite number of which have consistent Gaussian distributions.
Just like a Gaussian distribution is specified by its mean and variance, a Gaussian process is completely defined by (1) a mean function m(x) and (2) a covariance function K(x, x′).
Parametric vs. non-parametric
Note that our Gaussian processes are non-parametric, as opposed to nonlinear regression models, which are parametric. And here’s a secret:
non-parametric model == model with infinite number of parameters
In a parametric model, we define the function explicitly with some parameters:
y(x) = f(x) + ε·σ_y, where ε ~ N(0, 1), σ_y is the scale of the observation noise, and f(x) is some explicit parametric form (e.g. a polynomial whose coefficients are the parameters we fit). In the non-parametric GP view, we instead place a prior directly over the function itself: f(x) ~ GP(m(x), K(x, x′)).
This GP will now generate lots of smooth/wiggly functions, and if you think your parametric function falls into this family of functions that GP generates, this is now a sensible way to perform non-linear regression.
We can also add Gaussian observation noise σ_y to the sampled functions, just like in the parametric model above.
In summary, GP is exactly the same as regression with parametric models, except you put a prior on the set of functions you’d like to consider for this dataset. The characteristics of this “set of functions” are determined by the kernel of choice, K(x, x′).
Hyperparameters
There are 2 hyperparameters here:
- Vertical scale σ: describes how much span the function has vertically;
- Horizontal scale l: describes how quickly the correlation between two points drops as the distance between them increases – a high l gives you a smooth function, while lower l results in a wiggly function.
Luckily, because p(y∣θ) is Gaussian, we can compute its likelihood in closed form. That means we can simply maximise p(y∣θ) with respect to these hyperparameters using a gradient-based optimiser:
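As an illustration, a minimal sketch of this fit (my own code, assuming the RBF kernel above, a zero-mean GP, scipy’s general-purpose minimiser standing in for the gradient optimiser, and made-up toy data) might look like this:

```python
import numpy as np
from scipy.optimize import minimize

def neg_log_marginal_likelihood(log_theta, x, y):
    """Negative log p(y | theta) for a zero-mean GP with an RBF kernel plus observation noise."""
    sigma, l, sigma_y = np.exp(log_theta)                # log space keeps all three positive
    sqdist = (x[:, None] - x[None, :]) ** 2
    K = sigma ** 2 * np.exp(-0.5 * sqdist / l ** 2) + sigma_y ** 2 * np.eye(len(x))
    L = np.linalg.cholesky(K + 1e-8 * np.eye(len(x)))    # O(n^3): the GP bottleneck
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))  # K^{-1} y
    return (0.5 * y @ alpha
            + np.sum(np.log(np.diag(L)))                 # 0.5 * log|K|
            + 0.5 * len(x) * np.log(2 * np.pi))

# Made-up toy observations; maximising the likelihood = minimising its negative
x_obs = np.array([-4.0, -1.5, 0.0, 2.0, 3.5])
y_obs = np.sin(x_obs)
res = minimize(neg_log_marginal_likelihood, x0=np.zeros(3), args=(x_obs, y_obs))
sigma_opt, l_opt, noise_opt = np.exp(res.x)
```

Working in log space avoids a constrained optimiser, since σ, l and σ_y all have to stay positive.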
Details for implementation
Before we start: here we are going to stay quite high level – apart from a few small sketches, no full implementation will be shown, but you can easily find many implementations of GP on GitHub (personally I like this repo, it’s a Jupyter Notebook walkthrough with step-by-step explanations). However, I would say this part is important for understanding how GP actually works, so try not to skip it.
Computation
Hopefully at this point you are wondering: this smooth function with infinite-dimensional covariance matrix thing all sounds well and good, but how do we actually do computation with an infinite by infinite matrix?
Marginalisation, baby! Imagine you have a multivariate Gaussian over two vector variables y1 and y2, with joint distribution p(y1, y2) = N([a, b], [[A, B], [Bᵀ, C]]), where:
Here, we partition the mean into a, the mean of y1, and b, the mean of y2; similarly, for the covariance matrix, A is the covariance of y1, B the cross-covariance between y1 and y2, Bᵀ that between y2 and y1, and C the covariance of y2. So now, we can easily compute the probability of y1 using the marginalisation property: p(y1) = ∫ p(y1, y2) dy2 = N(a, A).
This means we can just put all the points we are interested in into one partition (y1), compute the mean and covariance for that partition only, and shove the rest of the infinite stuff into y2 without ever computing it. This nice little property allows us to work with a finite-dimensional projection of the underlying infinite object on our computer, and forget about the infinite stuff happening under the hood.
Predictions
Take the above y1, y2 example, but this time imagine that all the observations are in partition y2 and all the points we want to make predictions about are in y1 (again, the infinite points are still in the background – let’s imagine we’ve shoved them into some y3 that is omitted here).
To make predictions about y1 given observations of y2, we can then use Bayes’ rule to calculate p(y1∣y2) = p(y1, y2) / p(y2).
Because p(y1), p(y2) and p(y1, y2) are all Gaussian, p(y1∣y2) is also Gaussian. We can therefore compute p(y1∣y2) analytically: p(y1∣y2) = N(a + B·C⁻¹·(y2 − b), A − B·C⁻¹·Bᵀ).
Note: here we catch a glimpse of the bottleneck of GP: this analytical solution involves computing the inverse of the covariance matrix of our observations, C⁻¹, which, given n observations, is an O(n³) operation. This is why we use the Cholesky decomposition – more on this later.
To gain some more intuition on the method, we can write out the predictive mean and predictive covariance as such:
So the mean of p(y1∣y2) is linearly related to y2, and the predictive covariance is the prior uncertainty minus the reduction in uncertainty that comes from seeing the observations. Therefore, the more data we see, the more certain we are.
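Putting this together, here is a rough sketch (my own, assuming a zero prior mean and the RBF kernel from earlier; the names are made up) of the predictive equations, using a Cholesky factorisation of C instead of an explicit inverse:

```python
import numpy as np

def rbf(x1, x2, sigma=1.0, l=1.0):
    return sigma ** 2 * np.exp(-0.5 * (x1[:, None] - x2[None, :]) ** 2 / l ** 2)

def gp_predict(x_obs, y_obs, x_new, sigma=1.0, l=1.0, sigma_y=0.1):
    """Predictive mean and covariance of y1 = f(x_new) given observations y2 = y_obs."""
    C = rbf(x_obs, x_obs, sigma, l) + sigma_y ** 2 * np.eye(len(x_obs))  # covariance of observations
    B = rbf(x_new, x_obs, sigma, l)                                      # cross-covariance
    A = rbf(x_new, x_new, sigma, l)                                      # prior covariance at x_new
    L = np.linalg.cholesky(C)                                            # avoids forming C^{-1} explicitly
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y_obs))              # C^{-1} y2
    V = np.linalg.solve(L, B.T)                                          # L^{-1} B^T
    mean = B @ alpha                                                     # B C^{-1} y2: linear in the observations
    cov = A - V.T @ V                                                    # A - B C^{-1} B^T: prior minus reduction
    return mean, cov

x_obs = np.array([-4.0, -1.5, 0.0, 2.0, 3.5])
y_obs = np.sin(x_obs)
x_new = np.linspace(-5, 5, 100)
mean, cov = gp_predict(x_obs, y_obs, x_new)   # prediction and uncertainty at the new points
```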
Higher dimensional data
You can also do this for higher-dimensional data (bear in mind the computational cost). Here we extend the covariance function to an RBF kernel over 2D inputs:
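As an illustration (a minimal sketch of my own, not tied to any particular library), the RBF form carries over by replacing the squared difference with a squared Euclidean distance between input points:

```python
import numpy as np

def rbf_nd(X1, X2, sigma=1.0, l=1.0):
    """RBF kernel for inputs of shape (n, d) -- here d = 2, but any input dimension works."""
    sqdist = np.sum((X1[:, None, :] - X2[None, :, :]) ** 2, axis=-1)
    return sigma ** 2 * np.exp(-0.5 * sqdist / l ** 2)

X = np.random.default_rng(2).uniform(-3, 3, size=(50, 2))   # 50 random 2D input points
K = rbf_nd(X, X)                                            # 50 x 50 covariance matrix
```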
Covariance matrix selection
As one last detail, let’s talk about the different forms of covariance function that are commonly used for GPs. Again, any function that produces a positive semi-definite covariance matrix qualifies, but here are some frequently seen ones:
Laplacian Function
This function is continuous but non-differentiable. It looks like this:
If you average over all samples, you get straight lines joining your datapoints, which are called Brownian bridges.
Rational quadratic
Average over all samples looks like this:
Periodic functions
Average over all samples looks like this:
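For reference, here is a rough sketch of the three covariance functions mentioned above (standard textbook parameterisations; the exact forms used in the lecture may differ):

```python
import numpy as np

def laplacian_kernel(x1, x2, sigma=1.0, l=1.0):
    """Exponential / Laplacian kernel: samples are continuous but non-differentiable."""
    return sigma ** 2 * np.exp(-np.abs(x1[:, None] - x2[None, :]) / l)

def rational_quadratic_kernel(x1, x2, sigma=1.0, l=1.0, alpha=1.0):
    """Rational quadratic: a scale mixture of RBF kernels with different lengthscales."""
    sqdist = (x1[:, None] - x2[None, :]) ** 2
    return sigma ** 2 * (1.0 + sqdist / (2.0 * alpha * l ** 2)) ** (-alpha)

def periodic_kernel(x1, x2, sigma=1.0, l=1.0, p=1.0):
    """Periodic kernel: samples repeat with period p."""
    d = np.abs(x1[:, None] - x2[None, :])
    return sigma ** 2 * np.exp(-2.0 * np.sin(np.pi * d / p) ** 2 / l ** 2)
```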
Summary
There are books you can consult for appropriate kernels for your particular problem, and rules you can follow to build more complicated covariance functions (for example, the product of two valid covariance functions is itself a valid covariance function). Different choices can give you very different results.
It is tricky to find the appropriate covariance function, but there are also methods in place for model selection. One of those methods is Bayesian model comparison, defined as follows: p(M ∣ y) ∝ p(y ∣ M)·p(M), where the model evidence p(y ∣ M) is obtained by summing (or integrating) p(y ∣ θ, M)·p(θ ∣ M) over all hyperparameter settings θ.
However, it does involve a very difficult integral (or sum in the discrete case, as showcased above) over the hyperparameters of your GP, which makes it impractical. It’s also very sensitive to the prior you put on your hyperparameters.
The end
Hopefully this has been a helpful guide to Gaussian processes for you. I want to keep things relatively short and simple here, so I made no mention of the war – in reality GP suffers from not being able to scale to large datasets, and the choice of kernel can be very tricky. There are some state-of-the-art approaches that tackle these issues (see deep GPs and sparse GPs), but since I am by no means an expert in this area, I will leave you to explore them on your own.
Thank you for reading! Remember to take your canvas bag to the supermarket, baby whales are dying.