Intuition behind Log-loss score
In Machine Learning, a classification problem refers to predictive modeling where a class label needs to be predicted for a given observation (record). While the input data (features) may comprise continuous or categorical variables, the output is always a categorical variable. For example, based on input features such as weather information (humidity, temperature, cloudy/sunny, wind speed, etc.) and time of year, predict whether it is going to “rain” or “not rain” (output variable) today in your city. As another example, based on an email’s content and sender information, predict whether it is “spam” or “not spam” (aka “ham”).
Log-loss is one of the major metrics to assess the performance of a classification problem. But what does it conceptually mean? When you google the term, you easily get good articles and blogs that directly dig into the mathematics involved. That said, I plan to take a different approach here — talk about the intuition behind the metric and then provide the formula used to calculate the metric.
Remember that there is another important metric heavily used to evaluate the performance of a classification algorithm: the ROC-AUC score. Once you have a firm understanding of the log-loss score, you might want to go through my other blog, Intuition behind ROC-AUC score, specifically the contrast between the two metrics.
This blog strives to answer the following questions.
1. What is prediction probability?
2. What does log-loss conceptually mean?
3. How is log-loss value calculated?
4. How is log-loss score of a model calculated?
5. How to interpret log-loss score?
What is prediction probability?
A binary classification algorithm first predicts the probability of a record belonging to class 1 and then assigns the record to one of the two classes (1 or 0) based on whether that probability crosses a threshold value, which is usually set at 0.5 by default.
So, before predicting the class of the record, the model has to predict the probability of the record belonging to class 1. Remember that it is this prediction probability of a data record that the log-loss value depends on.
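The thresholding step can be sketched in a few lines of pure Python. The probabilities below are made up for illustration; a real model (e.g. scikit-learn’s predict_proba) would produce them from the input features.

```python
# Sketch: turning prediction probabilities into class labels.

def classify(prob_class_1, threshold=0.5):
    """Assign class 1 if the predicted probability crosses the threshold."""
    return 1 if prob_class_1 >= threshold else 0

probs = [0.93, 0.48, 0.07, 0.61]   # hypothetical predicted probabilities
labels = [classify(p) for p in probs]
print(labels)  # [1, 0, 0, 1]
```

Note that the log-loss metric is computed from the probabilities themselves, before this thresholding step is ever applied.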
What does log-loss conceptually mean?
Log-loss is indicative of how close the prediction probability is to the corresponding actual/true value (0 or 1 in case of binary classification). The more the predicted probability diverges from the actual value, the higher is the log-loss value.
Consider the classification problem of spam vs. ham for emails. Let’s represent the spam class as 1 and the ham class as 0. Let’s consider a spam email (actual value = 1) and a statistical model that predicts the email as spam with a probability of 1. Since the prediction probability was not at all off the actual value of 1, the log-loss value associated with the prediction of the observation is 0, indicating no divergence/error at all. (In practice, implementations clip predicted probabilities slightly away from 0 and 1, so the computed value is minuscule rather than exactly zero.) We will discuss the calculation later once we have established the conceptual understanding of the term.
Consider another spam email that is predicted as spam with a probability of 0.9. The model’s prediction probability is 0.1 off the actual value of 1, and hence the log-loss value associated with the prediction is more than zero (precisely, 0.105).
Now, let’s look at a ham email. The model predicts it as spam with a probability of 0.2, which is another way of saying that the model is going to classify it as ham (assuming the default probability threshold of 0.5). The absolute difference between the prediction probability (0.2) and the actual value (0, since it is ham) is 0.2, larger than what we witnessed in the prior two observations. The log-loss value associated with the prediction is 0.223.
Notice how the log-loss value of a poorer prediction (farther from the actual value) is higher than that of a better prediction (closer to the actual value).
Now, let’s say there is a set of five different spam emails predicted with a wide range of probabilities (of being spam): 1.0, 0.7, 0.3, 0.009 and 0.0001. You may now be wondering how a spam email could be predicted as spam with a probability as low as 0.0001. Let’s play along and assume that the trained statistical model is not a perfect one, and hence is doing a (really) bad job on the last three observations (and is likely to classify them as ham, since their prediction probabilities are nearer to 0 than to 1). Notice how the log-loss value rises steeply, not linearly, as the prediction moves farther from the actual value of 1.
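A quick sketch of this rise, using the fact that for a true 1 observation the log-loss reduces to −ln(p) (the full formula is given in a later section):

```python
import math

# Log-loss of a true-spam email (y = 1) as the predicted probability drops.
# For y = 1 the log-loss of an observation is simply -ln(p).
for p in [1.0, 0.7, 0.3, 0.009, 0.0001]:
    print(f"p={p}: log-loss = {-math.log(p):.3f}")
# p=1.0 → 0.000, p=0.7 → 0.357, p=0.3 → 1.204,
# p=0.009 → 4.711, p=0.0001 → 9.210
```

The jump from 0.357 at p = 0.7 to 9.210 at p = 0.0001 is what makes the penalty feel exponential rather than linear.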
In fact, if we predicted spam emails with all the possible prediction probabilities between 0 and 1 and plotted the resulting log-loss values, we would get the curve of −ln(p): the lower the prediction probability of a true 1 observation, the higher its log-loss value.
Similarly, for ham emails predicted over the same range of probabilities, the curve of −ln(1 − p) is a mirror image of the above: the higher the prediction probability of a true 0 observation, the higher its log-loss value.
To sum up, the farther the prediction probability is from the actual value, the higher the log-loss value.
While training a classification model, we would want each observation to be predicted with a probability as close to the actual value (of 0 or 1) as possible. Hence, log-loss turns out to be a good choice for a loss function during training and optimizing classification models: the farther the prediction probability is from its true value, the more heavily the prediction is penalized.
How is log-loss value calculated?
Now that you understand the intuition behind log-loss, we can discuss the formula and how to calculate it. For a single observation, the log-loss value is:
log-loss(i) = −[ yᵢ · ln(pᵢ) + (1 − yᵢ) · ln(1 − pᵢ) ]
where i is the given observation/record, y is the actual/true value, p is the prediction probability, and ln refers to the natural logarithm (logarithm to base e) of a number.
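A direct translation into Python, with probabilities clipped so that ln is never taken at exactly 0 (the clipping threshold of 1e-15 is a common convention, not part of the formula itself):

```python
import math

def log_loss_value(y, p, eps=1e-15):
    """Log-loss of one observation: -[y*ln(p) + (1-y)*ln(1-p)].
    p is clipped into (eps, 1-eps) so ln is never evaluated at 0."""
    p = min(max(p, eps), 1 - eps)
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

# The three predictions from the earlier spam/ham examples:
print(round(log_loss_value(1, 1.0), 3))  # 0.0 (minuscule after clipping)
print(round(log_loss_value(1, 0.9), 3))  # 0.105
print(round(log_loss_value(0, 0.2), 3))  # 0.223
```

Notice that for y = 1 only the first term survives (−ln p), and for y = 0 only the second (−ln(1 − p)), matching the two mirror-image curves discussed above.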
How is log-loss score of a model calculated?
As shown above, a log-loss value is calculated for each observation based on the observation’s actual value (y) and prediction probability (p). In order to evaluate a model and summarize its skill, the log-loss score of the classification model is reported as the average of the log-losses of all the observations/predictions:
log-loss score = −(1/N) · Σᵢ [ yᵢ · ln(pᵢ) + (1 − yᵢ) · ln(1 − pᵢ) ]
where N is the number of observations. For the three predictions discussed earlier (log-loss values of 0, 0.105 and 0.223), the average works out to 0.110.
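A self-contained sketch that averages the per-observation values and reproduces the three-prediction example:

```python
import math

def log_loss_value(y, p, eps=1e-15):
    """Log-loss of one observation, with probability clipping."""
    p = min(max(p, eps), 1 - eps)
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

def log_loss_score(y_true, p_pred):
    """Model-level log-loss score: the mean of the per-observation values."""
    n = len(y_true)
    return sum(log_loss_value(y, p) for y, p in zip(y_true, p_pred)) / n

# Two spam emails (predicted at 1.0 and 0.9) and one ham (predicted at 0.2)
print(f"{log_loss_score([1, 1, 0], [1.0, 0.9, 0.2]):.3f}")  # 0.110
```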
A model with perfect skill has a log-loss score of 0. In other words, the model predicts each observation’s probability as the actual value.
What log-loss score is to a classification problem, mean squared error (MSE) is to a regression problem. Both the metrics indicate how good or bad the prediction results are by denoting how far the predictions are from the actual values.
A model with a lower log-loss score is better than one with a higher log-loss score, provided both models are applied to the same dataset (or datasets drawn from the same distribution). We cannot compare log-loss scores of two models applied to two different datasets.
How to interpret log-loss score?
Consider a sample of 10 emails, 9 of which are ham. Since only 1 email out of 10 is spam, we could build a naïve classification model that simply predicts the probability of each email being spam as 0.1. The log-loss score of this naïve model works out to 0.325.
If we instead set the prediction probability of each email to 0.08 (slightly less than 0.1), the log-loss score turns out to be 0.328. Similarly, setting the prediction probability to 0.12 (slightly greater than 0.1) gives a log-loss score of 0.327. In short, setting the prediction probability of the emails to anything other than 0.1 results in a higher log-loss score.
Sweeping the constant probability over its full range reaffirms this discovery: setting the probability of the emails to 0.1 yields the lowest log-loss score for the dataset, which is regarded as the baseline score for the given sample dataset.
The baseline log-loss score for a dataset is determined from the naïve classification model, which simply pegs all the observations at a constant probability equal to the fraction of class 1 observations in the data. For a balanced dataset with a 51:49 ratio of class 0 to class 1, a naïve model with a constant probability of 0.49 will yield a log-loss score of 0.693, which is regarded as the baseline score for that dataset.
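Both baselines, along with the nearby probes from the 10-email example, can be reproduced with a small helper. `constant_model_log_loss` is a hypothetical name introduced here for a naïve model that predicts the same probability c for every observation:

```python
import math

def constant_model_log_loss(q, c):
    """Log-loss score on a dataset whose class-1 fraction is q, for a naive
    model that predicts the same probability c for every observation."""
    return -(q * math.log(c) + (1 - q) * math.log(1 - c))

# 10 emails, 1 spam (q = 0.1): c = q gives the lowest (baseline) score
print(f"{constant_model_log_loss(0.10, 0.10):.3f}")  # 0.325
print(f"{constant_model_log_loss(0.10, 0.08):.3f}")  # 0.328
print(f"{constant_model_log_loss(0.10, 0.12):.3f}")  # 0.327
# Near-balanced dataset (q = 0.49)
print(f"{constant_model_log_loss(0.49, 0.49):.3f}")  # 0.693
```

Setting c = q minimizes this expression, which is why the class-1 fraction itself defines the baseline.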
The higher the imbalance in a dataset, the lower its baseline log-loss score, because the observations that incur the large individual log-loss values (here, the class 1 minority) make up a smaller share of the average.
Since predicting a low constant probability for an imbalanced dataset already results in a very low log-loss score, a model’s skill evaluated using log-loss should be interpreted carefully in such cases. In fact, log-loss values should always be interpreted in the context of the baseline score provided by the naïve model.
When we build a statistical model on a given dataset, the model must beat the baseline log-loss score, thereby proving itself to be more skillful than the naïve model. If that does not turn out to be the case, it implies that the trained statistical model is not helpful at all, and it would be better to go with the naïve model instead.
When you plan to embark on your pursuit of advanced (inferential) statistics, feel free to check out my other article on Central Limit Theorem as well. The concept underpins almost every application of advanced statistics.
Should you have any question or feedback, feel free to leave a comment here. You can also reach me through my LinkedIn profile.