I program and do test-driven development. After I make a change in my code, I run my tests. Sometimes they succeed and sometimes they fail. Before I run a test, I write down a number between 0.01 and 0.99 as my credence that the test will succeed.

I want to know whether I'm improving at predicting whether my tests will succeed or fail. It would also be nice if I could track whether I'm better at predicting test success on Mondays or on Fridays. If my ability to predict test success correlates with other metrics I track, I want to know.

That leaves me with the task of choosing the right metric. In Superforecasting, Philip Tetlock proposes using the Brier score to measure how well experts are calibrated. Another metric that has been proposed in the literature is the logarithmic scoring rule. There are also other possible candidates.
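For concreteness, both scores are straightforward to compute from a record of (credence, outcome) pairs. A minimal sketch in Python with made-up numbers (the arrays are hypothetical, not my real data):

```python
# Hypothetical record of predictions: credence that the test will pass,
# and the actual outcome (1 = passed, 0 = failed).
import numpy as np

credences = np.array([0.90, 0.60, 0.75, 0.20])
outcomes = np.array([1, 1, 0, 0])

# Brier score: mean squared difference between credence and outcome.
# 0 is perfect, lower is better.
brier = np.mean((credences - outcomes) ** 2)

# Logarithmic score: mean log-probability assigned to what actually happened.
# 0 is perfect, more negative is worse; it punishes confident misses heavily.
log_score = np.mean(np.log(np.where(outcomes == 1, credences, 1 - credences)))

print(f"Brier: {brier:.3f}, log score: {log_score:.3f}")
```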

How do I decide which metric to use? Is there an argument for favoring one scoring rule over the others?

A potential source of difficulty in measuring changes in your forecasting skill is that the underlying difficulty of the forecasting problem can change. Changes in your skill may be indistinguishable from changes in problem difficulty. – Matthew Gunn Dec 28 '16 at 4:56

I assume that you are writing unit tests for your code.

One idea I can think of, which may not do exactly what you want, is to use a linear model.

The benefit of doing that is that you can create a bunch of other variables to include in the analysis.

Let's say that you have a vector $\mathbf{Y}$ which includes the outcome of your tests, and another vector $\mathbf{x}$ that includes your predictions of the outcome.

Now you can simply fit the linear model

$$ y_i = a + bx_i +\epsilon $$

and look at the estimated value of $b$; a higher value of $b$ would indicate that your predictions are becoming better.
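A minimal sketch of fitting this model, assuming the outcomes and predictions are stored as NumPy arrays (the simulated data here is only a placeholder):

```python
# Sketch: fit y_i = a + b * x_i with ordinary least squares, assuming the
# outcomes y (0/1) and credences x are NumPy arrays.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
x = rng.uniform(0.01, 0.99, size=100)           # credences (placeholder data)
y = (rng.uniform(size=100) < x).astype(float)   # test outcomes (placeholder data)

X = sm.add_constant(x)                          # adds the intercept a
fit = sm.OLS(y, X).fit()
print(fit.params)                               # [a, b]; b is the slope of interest
```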

The thing that makes this approach nice is that you can now start to add other variables to see whether they produce a better model and help make better predictions. For example, you could add an indicator for the day of the week: for Monday it would always be 1, and zero for all other days. If you include that variable in the model, you get:

$$ y_i = a + a_{\text{Monday}}\, m_i + bx_i +\epsilon $$

where $m_i$ is the Monday indicator.

If the coefficient $a_{\text{Monday}}$ is significant and positive, it could mean that you are too conservative in your predictions on Mondays.
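As a sketch, assuming the log is kept in a pandas DataFrame with hypothetical column names, the dummy-variable model could be fit with the formula interface:

```python
# Sketch: the same model with a Monday indicator, assuming a pandas DataFrame
# with hypothetical columns 'outcome', 'credence', 'weekday'.
import pandas as pd
import statsmodels.formula.api as smf

df = pd.DataFrame({
    "outcome":  [1, 0, 1, 1, 0, 1],
    "credence": [0.8, 0.4, 0.7, 0.9, 0.3, 0.6],
    "weekday":  ["Mon", "Tue", "Mon", "Fri", "Fri", "Wed"],
})
df["monday"] = (df["weekday"] == "Mon").astype(int)   # m_i in the formula above

fit = smf.ols("outcome ~ credence + monday", data=df).fit()
print(fit.params)   # Intercept (a), credence (b), monday (a_Monday)
```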

You could also create a new variable that scores the difficulty of the task you performed. If you have version control, you could for example use the number of lines of code as the difficulty, i.e. the more code you write, the more likely something will break.
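If the project lives in git, one rough way to extract such a difficulty score is to count the lines touched by the most recent commit; a hypothetical sketch (the parsing is deliberately crude):

```python
# Hypothetical sketch: use the number of lines touched by the latest commit as
# a crude difficulty score (assumes the script runs inside a git repository).
import subprocess

stat = subprocess.run(
    ["git", "diff", "--shortstat", "HEAD~1", "HEAD"],
    capture_output=True, text=True, check=True,
).stdout
# Typical output: " 3 files changed, 42 insertions(+), 7 deletions(-)"
numbers = [int(tok) for tok in stat.split() if tok.isdigit()]
lines_changed = sum(numbers[1:]) if len(numbers) > 1 else 0
print(lines_changed)
```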

Other variables could be the number of coffee cups that day, or an indicator for upcoming deadlines (meaning there is more stress to finish things), etc.

You can also include a time variable to see whether your predictions are getting better over time, along with how long you spent on the task, how many sessions you spent on it, or whether you were doing a quick fix that might be sloppy.

In the end you have a prediction model with which you can try to predict the likelihood of success. If you manage to create this, then maybe you do not even have to make your own predictions: you can just use all the variables and get a pretty good guess at whether things will work.

The thing is that you only wanted a single number. In that case you can use the simple model I presented at the beginning and just use the slope: redo the calculation for each period, then look for a trend in that score over time.
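A sketch of that per-period slope, again assuming a DataFrame with hypothetical `date`, `outcome`, and `credence` columns:

```python
# Sketch: refit the simple model per period (here per month) and track the
# slope over time.
import pandas as pd
import statsmodels.formula.api as smf

def slope_per_period(df: pd.DataFrame, freq: str = "M") -> pd.Series:
    slopes = {}
    for period, chunk in df.groupby(df["date"].dt.to_period(freq)):
        if len(chunk) > 2:   # need at least a few observations per period
            fit = smf.ols("outcome ~ credence", data=chunk).fit()
            slopes[str(period)] = fit.params["credence"]
    return pd.Series(slopes)  # plot this to look for a trend
```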

Hope this helps.

I would argue that a higher slope ($b$ in your simple model) does not always correspond to a better prediction: assuming a logistic regression, if $Y$ is the 'true'/observed probability of the outcome, while $x$ is the predicted probability, then $a$ should be 0 and $b$ should be 1. Any higher $b$ would suggest overprediction of the outcome, while a $b$ lower than 1 suggests underprediction. This method is actually described in the reference I point to in my answer. In short, this slope method is fine to use, but slopes near 1 are the best (when $a$ = 0). – IWS Jan 9 at 10:05
    
@IWS Thanks for the input. I agree with you: to the extent that you want a single value to estimate your performance, omitting the intercept is a good idea. If you want to interpret the data any further (and you have enough of it), then it might be a good idea to add the intercept and compare the models. – Gumeo Jan 10 at 0:20

I have built prediction models on sparse data, and it is a big challenge to get a model calibrated in those cases. I will tell you what I did; perhaps you can get some help from that.

I made 20 bins of predicted probability and plotted the average predicted probability against the actual probability of success. For the average predicted probability, I took the midpoint of the bin range. For the actual probability, I counted successes and failures in each bin, which gave me the actual (median) probability of success in the bin. To reduce the impact of outliers, I removed the top and bottom 5% of the data before taking the actual median probability in each bin.

Once I got these I could easily plot the data.
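A simplified sketch of such a calibration plot (it uses the plain observed success fraction per bin instead of the trimmed median described above, and assumes `credences` and `outcomes` arrays like in the question's example):

```python
# Simplified calibration-plot sketch: 20 equal-width bins, average predicted
# probability taken as the bin midpoint, observed success fraction per bin.
import numpy as np
import matplotlib.pyplot as plt

bins = np.linspace(0.0, 1.0, 21)            # 21 edges -> 20 bins
bin_idx = np.digitize(credences, bins) - 1
bin_mid = (bins[:-1] + bins[1:]) / 2

pred_avg, obs_frac = [], []
for b in range(20):
    mask = bin_idx == b
    if mask.any():
        pred_avg.append(bin_mid[b])             # average predicted probability
        obs_frac.append(outcomes[mask].mean())  # observed fraction of passes

plt.plot(pred_avg, obs_frac, "o-", label="observed")
plt.plot([0, 1], [0, 1], "--", label="perfect calibration")
plt.xlabel("predicted probability of success")
plt.ylabel("observed frequency of success")
plt.legend()
plt.show()
```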

    
It would be good to point out that this is the first step in computing the Hosmer-Lemeshow goodness of fit test. – jwimberley Jan 6 at 14:13
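For reference, a hedged sketch of the Hosmer-Lemeshow statistic built on the same binning (assuming the `bin_idx`, `credences`, and `outcomes` arrays from the snippet above; the degrees of freedom follow the usual convention):

```python
# Hedged sketch of the Hosmer-Lemeshow statistic on the same binning.
import numpy as np
from scipy.stats import chi2

H, g = 0.0, 0
for b in range(20):
    mask = bin_idx == b
    n = mask.sum()
    if n == 0:
        continue
    obs = outcomes[mask].sum()    # observed successes in the bin
    exp = credences[mask].sum()   # expected successes in the bin
    H += (obs - exp) ** 2 / (exp * (1 - exp / n))
    g += 1

p_value = chi2.sf(H, df=g - 2)    # conventional df for g non-empty bins
print(H, p_value)
```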

Although this is far from an answer and more of a reference, it might be a good idea to check Steyerberg E - Epidemiology 2012.

In this article Steyerberg and colleagues explain different ways to check prediction model performance for models with binary outcomes (success or failure). Calibration is just one of these measures. Depending on whether you want an accurate probability, accurate classification, or accurate reclassification, you might want to use different measures of model performance. Even though this manuscript concerns models to be used in biomedical research, I feel the measures could be applicable to other situations (such as yours) as well.

More specific to your situation, calibration metrics are really difficult to interpret because they summarize (i.e. average) the calibration over the entire range of possible predictions. Consequently, you might have a good calibration summary score while your predictions are off in an important range of predicted probabilities (e.g. you might have a low (= good) Brier score while the prediction of success is off above or below a certain predicted probability), or vice versa (a poor summary score while predictions are well-calibrated in the critical area). I would therefore suggest you think about whether such a critical range of predicted success probability exists in your case. If so, use the appropriate measures (e.g. reclassification indices). If not (meaning you are interested in overall calibration), use the Brier score, or check the intercept and slope of your calibration plot (see the Steyerberg article).
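As a sketch of the calibration intercept and slope mentioned above (a logistic recalibration of the outcomes on the log-odds of the predictions; a slope near 1 and an intercept near 0 suggest good calibration), assuming `credences` and `outcomes` arrays as before:

```python
# Sketch: calibration intercept and slope via logistic recalibration of the
# observed outcomes on the log-odds of the predicted probabilities.
import numpy as np
import statsmodels.api as sm

logit_pred = np.log(credences / (1 - credences))
X = sm.add_constant(logit_pred)
fit = sm.Logit(outcomes, X).fit(disp=0)
intercept, slope = fit.params
print(f"calibration intercept: {intercept:.2f}, slope: {slope:.2f}")
```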

To conclude, whichever calibration summary measure you choose, the first step is to plot your predicted probabilities versus the observed probabilities (see Outlier's answer for an example of how to do this). Next, the summary measure can be calculated, but the choice of measure should reflect the goal of predicting success or failure in the first place.

