Poisson Regression and Generalised Linear Models
A theoretical introduction into Poisson Regression and Generalised Linear Models
Note: Throughout this article I erroneously refer to E[Y] as the target output. When I am mentioning E[Y], I am implicitly meaning E[Y|X] as thats the correct notation! Linked here is a stats exchange thread that explains this difference.
Linear Regression is the first algorithm most Data Scientists begin their journey with. It is an intiuative and easily implemented and visualised model for continous data. The second most learned algorithm by beginner Data Scientists is Logistic Regression, where the model has a binary output. Most people see these two algorithms as completely separate when in actual fact they are part of the same family of models named Generalised Linear Models (GLMs). In this article we will gain an intuition about GLMs through an example scenario using the Poisson distribution.
Linear Regression Basics and Limitations
Linear Regression is a model used to fit a line or hyperplane to a dataset where the output is continuous and has residuals which are normally distributed. This is mathematical written as:
Where E(Y) is the mean response of the target variable, X is a matrix of the predictor variables and β are the unknown linear coefficients which are adjusted and trained to produce the best model.
Linear Regression is used in many industries such as Epidemiology, Finance and Economics. However, despite being used in all these areas it does have some flaws that makes it’s predictions redundant in certain applications. Imagine you are a phone operator and want to predict how many calls you will receive in a day. Do you think Linear Regression would be a suitable model? The answer is NO for the following reasons:
- The number of calls have to be greater or equal to 0, whereas in Linear Regression the output can be negative as well as positive.
- The number of calls only take integer values but Linear Regression can output fractional values.
These flaws, and many others, require us to use another regression algorithm to model the expected number of calls.