Co-authored with Gwern Branwen, who did most of the work. Data & code available here.
Updates
- 11/09 3:30pm Pacific: Updated Brier scores, added Simon Jackman, added ‘2008 Repeat’ baseline.
- 11/09 9pm Pacific: Updated scores, added Wang & Ferguson.
- 11/11 2:30am Pacific: Added appendix, updated scores with final batch of data from Wang & Ferguson.
- 11/26 8:30pm Pacific: Updated scores with latest data on e.g. the popular vote.
Obama may have won the presidency on election night, but pundit Nate Silver won the internet by correctly predicting presidential race outcomes in every state plus the District of Columbia — a perfect 51/51 score.
Now the interwebs are abuzz with Nate Silver praise. Gawker proclaims him “America’s Chief Wizard.” Gizmodo humorously offers 25 Nate Silver Facts (sample: “Nate Silver’s computer has no ‘backspace’ button; Nate Silver doesn’t make mistakes”). IsNateSilverAWitch.com concludes: “Probably.”
Was Silver simply lucky? Probably not. In the 2008 elections he scored 50/51, missing only Indiana, which went to Obama by a mere 1%.
How does he do it? In his CFAR-recommended book The Signal and the Noise: Why So Many Predictions Fail, but Some Don’t, Silver reveals that his “secret” is bothering to obey the laws of probability theory rather than predicting things from his gut.
An understanding of probability can help us see what Silver’s critics got wrong. For example, Brandon Gaylord wrote:
Silver… confuses his polling averages with an opaque weighting process… and the inclusion of all polls no matter how insignificant – or wrong – they may be. For example, the poll that recently helped put Obama ahead in Virginia was an Old Dominion poll that showed Obama up by seven points. The only problem is that the poll was in the field for 28 days – half of which were before the first Presidential debate. Granted, Silver gave it his weakest weighting, but its inclusion in his model is baffling.
Actually, what Silver did is exactly right according to probability theory. Each state poll provided some evidence about who would win that state, but some polls — for example those which had been accurate in the past — provided more evidence than others. Even the Old Dominion poll provided some evidence, just not very much — which is why Silver gave it “his weakest weighting.” Silver’s “opaque weighting process” was really just a straightforward application of probability theory. (In particular, it was an application of Bayes’ Theorem.)
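To make the idea concrete, here is a toy sketch in Python (emphatically not Silver’s actual model): under a simple Gaussian assumption, combining independent poll estimates via Bayes’ Theorem reduces to weighting each poll by its precision, so a stale or noisy poll still counts, just not for much.

```python
import numpy as np

# Toy illustration only, emphatically NOT Silver's actual model: under a simple
# Gaussian model, Bayes' Theorem combines independent poll estimates by
# precision (1/variance) weighting, so every poll counts in proportion to how
# informative it is.
def combine_polls(margins, std_errors):
    """Posterior mean and sd of the true margin, given independent Gaussian polls."""
    margins = np.asarray(margins, dtype=float)
    precisions = 1.0 / np.asarray(std_errors, dtype=float) ** 2
    post_var = 1.0 / precisions.sum()
    post_mean = post_var * (precisions * margins).sum()
    return post_mean, post_var ** 0.5

# Hypothetical Virginia polls: a stale Obama +7 poll with a huge error bar,
# plus two fresher, tighter polls. The stale poll shifts the estimate only a little.
print(combine_polls([7.0, 1.0, 2.0], [6.0, 2.0, 2.5]))
```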
Comparing Pundits
But Silver wasn’t the only one to get the electoral college exactly right. In fact, the polls themselves were remarkably accurate. John Sides & Margin of Error plotted the national polls along with each poll’s margin of error:
We can see that any bias in the polls was pro-Romney, and also that the true margin of victory fell within the claimed margin of error of almost all the polls! (NPR, Rasmussen, and Gallup were the only exceptions.) This is a sobering graph for anyone who believes statistics is baloney. A larger table compiled by Nate Silver reaches the same conclusion – most polls favored Romney.
So who did best, and how much better did the top pundits do, compared to the rest? To find out, we can’t simply compare the absolute number of correct predictions each pundit made.
The reason for this is that pundits (the better ones, anyway) don’t just predict election outcomes, they also tell you how confident they are in each of their predictions. For example, Silver gave Obama a 50.3% chance of winning in Florida. That’s pretty damn close to 50/50 or “even odds.” So if Romney had won Florida, Silver would have been wrong, but only a little wrong. In contrast, Silver’s forecast was 92% confident that Rick Berg would win a Senate seat in North Dakota, but Berg lost. For that prediction, Silver was a lot wrong. Still, predictions with 92% confidence should be wrong 8% of the time (otherwise, that 92% confidence is underconfident), and Silver made a lot of correct predictions.
So how can we tell who did best? We need a method that accounts not just for the predicted outcomes, but also for the confidence of each prediction.
There are many ways you can score predictions to reward accuracy and punish arrogance, but most of them are gameable, for example by overstating one’s true beliefs. The methods which aren’t cheatable — where you score best if you are honest — are all called “proper scoring rules.”
One of the most common proper scoring rules is the Brier score. A Brier score is simply a number between 0 and 1, and as with golf, a lower score is better.
So, to measure exactly how well different pundits did, we can compare their Brier scores.
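Concretely, for binary outcomes the Brier score is the mean squared difference between the stated probabilities and what actually happened. A toy sketch in Python (illustrative only, not the analysis code linked above):

```python
# Toy sketch of a Brier score (illustrative only, not the analysis code linked
# above): forecasts are stated probabilities that an event happens; outcomes
# are 1 if it happened and 0 if it didn't.
def brier(forecasts, outcomes):
    return sum((f - o) ** 2 for f, o in zip(forecasts, outcomes)) / len(forecasts)

# A coin-flipper who says 50% everywhere scores 0.25; a confident, mostly
# correct forecaster scores close to 0.
print(brier([0.5, 0.5, 0.5], [1, 0, 1]))    # 0.25
print(brier([0.9, 0.2, 0.95], [1, 0, 1]))   # ~0.018
```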
Predictions about the Presidential Race
First, who made the most accurate predictions about state outcomes in the presidential race? We must limit ourselves to those who stated their confidence in their predictions on all 50 state races (plus the District of Columbia). Here are the Brier scores:
Predictor | Brier score
---|---
Reality | 0.00000
Drew Linzer | 0.00384
Wang & Ferguson | 0.00761
Nate Silver | 0.00911
Simon Jackman | 0.00971
DeSart & Holbrook | 0.01605
Intrade | 0.02812
2008 Repeat | 0.03922
Margin of Error | 0.05075
Coin Flip | 0.25000
A random coin flip scores badly, with a score nearly five times that of the next-worst predictor. This isn’t surprising: just about anybody should be able to out-predict a coin toss for the outcomes in California and many other states.
Next worst is the Margin of Error forecast: a very simple model based almost solely on approval ratings and economics. But this alone is enough to jump way ahead of a random guesser.
After that is ‘2008 Repeat’: this is a simple predictor which gives 100% for states that went Obama in 2008, and 0% for the rest. This is a more intelligent predictor than the random coin flip, but still simplistic, so a good forecaster should easily outperform it. Since Indiana and North Carolina did not go Obama in 2012, a better forecaster would have been less than 100% optimistic about them.
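In fact, the 2008 Repeat entry in the table above is just this arithmetic: the maximum penalty on the two flipped states, zero everywhere else.

```python
# The '2008 Repeat' baseline is 100% confident everywhere, so it takes the
# maximum Brier penalty of 1.0 on the two contests that flipped (Indiana and
# North Carolina) and 0 on the remaining 49 contests (48 states plus DC):
print((2 * 1.0 + 49 * 0.0) / 51)   # ~0.0392, matching the table above
```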
Intrade isn’t in the lead as fans of prediction markets might have hoped. The individual state markets are lightly traded, with many never traded at all, and an accurate prediction market needs many well-informed traders. But Intrade is apparently still picking up on some polls, systematically or unsystematically, because it beat the Margin of Error model, which largely ignored them. (Some also suggest that Intrade was being manipulated.)
What of more complex models making full use of polls, and run by academics or private individuals skilled at forecasting? We see Drew Linzer, Wang & Ferguson, Nate Silver, Simon Jackman, and DeSart & Holbrook within about 0.01 of each other, with Linzer in the lead. This is interesting since Drew Linzer has not been lionized like Nate Silver.
Also note that Wang & Ferguson got a better Brier score than Silver despite getting Florida (barely) wrong while Silver got Florida right. The Atlantic Wire gave Wang & Ferguson only a “Silver Star” for this reason, but our more detailed analysis shows that Wang & Ferguson probably should have gotten a “Gold Star.”
Looking at the Senate Races
Only Silver, Wang, and Intrade made probabilistic predictions (predictions with confidences stated) on the contested Senate races. Looking at the ~30 Senate races, we get the following Brier scores:
Predictor | Brier score
---|---
Wang & Ferguson | 0.0124
Nate Silver | 0.0448
Intrade | 0.0488
Wang & Ferguson crushed both Silver and Intrade, by correctly predicting that Montana and North Dakota had 69% and 75% odds of going to the Democrats while Silver gave them both <50%. Further, Silver has only a tiny lead on Intrade. Part of what happened is that Intrade, at least, managed to put 16% on that surprising Rick Berg Senate loss instead of Silver’s 8%, and is accordingly punished less by the Brier score (though it still scores nowhere near Wang & Ferguson).
What if we pit Wang against Silver against Intrade on the state victory odds, the Presidency victory odds, and the Senate victory odds? This is almost all of Silver’s binary predictions and should make it clear whether Silver outperformed Intrade and Wang & Ferguson on his binary predictions:
Predictor | Brier score
---|---
Wang & Ferguson | 0.01120
Nate Silver | 0.02674
Intrade | 0.03897
Silver’s score was about a third lower than Intrade’s, but Wang & Ferguson’s was less than half of Silver’s!
Predictions about State Victory Margins
What about the pundits’ predictions for the share of the electoral vote, the share of the popular vote, and state victory margins? Who was the most accurate pundit on those points?
We can’t use a Brier score, since a Brier score is for binary true/false values. There is no ‘true’ or ‘false’ response to “How much of the popular vote did Obama win?” We need to switch to a different scoring rule: in this case, a classic of statistics and machine learning, the RMSE (root mean squared error). The RMSE adds up how much each predicted amount differed from the true amount, and penalizes larger differences more than smaller ones. (Unlike a Brier score, RMSE scores don’t fall between 0 and 1, but smaller numbers are still better.)
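A toy sketch in Python (again illustrative, not the actual analysis code):

```python
import math

# Toy RMSE sketch (illustrative, not the actual analysis code): predicted vs.
# actual victory margins in percentage points. Squaring means a 4-point miss
# hurts four times as much as a 2-point miss.
def rmse(predicted, actual):
    return math.sqrt(sum((p - a) ** 2 for p, a in zip(predicted, actual)) / len(predicted))

print(rmse([3.0, -1.0, 7.5], [2.0, 0.5, 6.0]))   # ~1.35
```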
Lots of pollsters reported their predicted margins of victory for states. Below are their RMSE scores:
Predictor | RMSE
---|---
Reality | 0.00000
Nate Silver | 1.863676
Josh Putnam | 2.033683
Simon Jackman | 2.25422
DeSart & Holbrook | 2.414322
Margin of Error | 2.426244
Drew Linzer | 2.5285
2008 Repeat | 3.206457
Unskewed Polls | 7.245104
Here, Nate Silver regains a clear lead over fellow polling/statistics buffs Putnam, Jackman, and DeSart, with DeSart closely pursued by Margin of Error. As before, we included a benchmark from 2008: simply predicting that the vote-share will be exactly like in 2008, which performs reasonably well but not as well as the good forecasters.
What happened with ‘Unskewed Polls’? Unskewed Polls was a right-wing website which believed that the standard polls were substantially underestimating the willingness of Romney supporters to go out and vote, so any Romney figure should be kicked up a few percent. This led to a hugely inaccurate forecast of a Romney landslide. (Note: the Unskewed Polls guy later apologized and changed his mind, an admirable thing that humans rarely do.)
Predictions about State Victory Margins, the Popular Vote, and the Electoral Vote
What happens if we combine the state victory margin predictions with these pundits’ predictions for the nationwide popular vote and electoral vote, and calculate an overall RMSE? (This is not quite statistically orthodox but may be interesting anyway.)
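Concretely, this just means pooling the 51 state-margin errors with the two national errors into one 53-number vector before taking the square root; a toy sketch with made-up numbers:

```python
import math

def rmse(predicted, actual):
    return math.sqrt(sum((p - a) ** 2 for p, a in zip(predicted, actual)) / len(predicted))

# Toy sketch with made-up numbers: pool the per-state margin errors with the
# popular-vote and electoral-vote errors, then take one RMSE over the lot.
state_pred = [3.0, -1.0, 7.5]               # in reality, all 51 state margins
state_true = [2.0, 0.5, 6.0]
popular_pred, popular_true = 2.5, 3.9       # hypothetical popular-vote margins
electoral_pred, electoral_true = 310, 330   # hypothetical electoral-vote counts

print(rmse(state_pred + [popular_pred, electoral_pred],
           state_true + [popular_true, electoral_true]))
```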
Predictor | RMSE
---|---
Reality | 0.0000
Josh Putnam | 2.002633
Simon Jackman | 2.206758
Drew Linzer | 2.503588
Nate Silver | 3.186463
DeSart & Holbrook | 4.635004
Margin of Error | 4.641332
Wang & Ferguson | 4.83369
2008 Repeat | 5.525641
Unskewed Polls | 11.84946
Here, Putnam, Jackman, and Linzer take the lead, and the Margin of Error forecast performs almost as well as DeSart despite its simplicity. Why did 2008 Repeat perform so poorly? It ‘predicted’ a total electoral & popular vote blowout (which happened in 2008 but not 2012); this is another low bar we hope forecasters can clear: noticing that Obama was not quite as popular in 2012 as he was in 2008.
Probability Theory for the Win
So, was Nate Silver the most accurate 2012 election pundit? It depends which set of predictions you’re talking about. But the general lesson is this: the statistical methods used by the best-scoring pundits are not very advanced. Mostly, the top pundits outperformed everyone else by bothering to use statistical models in the first place, and by not making elementary mistakes of probability theory like the one committed by Brandon Gaylord.
Whether you’re making decisions about health, business, investing, or other things, a basic understanding of probability theory can improve your outcomes quite a bit. And that’s what CFAR is all about. That’s what we teach — for example in this month’s Rationality for Entrepreneurs workshop.
Appendix: Summary tables
RMSEs
Predictor | Presidential electoral | Presidential popular | State margins | S+Pp+Sm | Senate margins
---|---|---|---|---|---
Silver | 19 | 0.01 | 1.81659 | 20.82659 | 3.272197
Linzer | 0 | | 2.5285 | |
Wang | 29 | 0.31 | 2.79083 | 32.10083 |
Jackman | 0 | 0.01 | 2.25422 | 2.26422 |
DeSart | 29 | 0.58 | 2.414322 | 31.99432 |
Intrade | 41 | 0.04 | | |
2008 | 33 | 2.21 | 3.206457 | 38.41646 |
Margin | 29 | 0.71 | 2.426244 | 32.13624 |
Putnam | 0 | | 2.033683 | |
Unskewed | 69 | 1.91 | 7.245104 | 78.1551 |
Brier scores
(0 is a perfect Brier score or RMSE.)
Predictor | Presidential win | State win | Senate win | St+Sn+P
---|---|---|---|---
Silver | 0.008281 | 0.00911372 | 0.04484545 | 0.02297625
Linzer | 0.0001 | 0.00384326 | |
Wang | 0 | 0.00761569 | 0.01246376 | 0.009408282
Jackman | 0.007396 | 0.00971369 | |
DeSart | 0.012950 | 0.01605542 | |
Intrade | 0.116964 | 0.02811906 | 0.04882958 | 0.03720485
2008 | 0 | 0.03921569 | |
Margin | 0.1024 | 0.05075311 | |
Random | 0.2500 | 0.25000000 | 0.25000000 | 0.25000000
Log scores
We mentioned there were other proper scoring rules besides the Brier score; another binary-outcome rule, less used by political forecasters, is the “logarithmic scoring rule” (see Wikipedia or Eliezer Yudkowsky’s “Technical Explanation”); it has some deep connections to areas like information theory, data compression, and Bayesian inference, which makes it invaluable in some contexts. But because a log score ranges between 0 and negative infinity (bigger is better, smaller is worse) rather than 0 and 1 (smaller is better), and has some different behaviors, it’s a bit harder to understand than a Brier score.
(One way in which the log score differs from the Brier score is treatment of 100/0% predictions: the log score of a 100% prediction which is wrong is negative Infinity, while in Brier it’d simply be 1 and one can recover; hence if you say 100% twice and are wrong once, your Brier score would recover to 0.5 but your log score will still be negative Infinity! This is what happens with the “2008” benchmark.)
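A toy sketch showing both behaviors:

```python
import math

# Toy log-score sketch: each prediction contributes the log of the probability
# it assigned to what actually happened. A wrong 100% prediction contributes
# log(0) = -infinity, and no later success can undo it.
def log_score(forecasts, outcomes):
    total = 0.0
    for f, o in zip(forecasts, outcomes):
        p = f if o == 1 else 1.0 - f
        total += math.log(p) if p > 0 else float("-inf")
    return total

print(log_score([0.9, 0.8], [1, 1]))   # ~-0.33
print(log_score([1.0, 1.0], [1, 0]))   # -inf: one wrong 100% call is fatal
```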
Forecaster | State win probabilities |
---|---|
Reality | 0 |
Linzer | -0.9327548 |
Wang & Ferguson | -1.750359 |
Silver | -2.057887 |
Jackman | -2.254638 |
DeSart | -3.30201 |
Intrade | -5.719922 |
Margin of Error | -10.20808 |
2008 | -Infinity |
Forecaster | Presidential win probability |
---|---|
Reality | 0 |
2008 | 0 |
Wang & Ferguson | 0 |
Jackman | -0.08992471 |
Linzer | -0.01005034 |
Silver | -0.09541018 |
DeSart | -0.1208126 |
Intrade | -0.4185503 |
Margin of Error | -0.3856625 |
Note that the 2008 benchmark and Wang & Ferguson took a risk here by giving an outright 100% chance of victory, which the log score rewarded with a 0: if somehow Obama had lost, then the log score of any set of their predictions which included the presidential win probability would automatically be -Infinity, rendering them officially The Worst Predictors In The World. This is why one should allow for the unthinkable by including some fraction of a percent; of course, I’m sure Wang & Ferguson don’t mean 100% literally but more like “it’s so close to 100% that we can’t be bothered to report the tiny remaining possibility”.
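The cost of such a hedge is tiny, as a quick check shows:

```python
import math

# Saying 99.9% instead of 100% costs almost nothing when you're right...
print(math.log(0.999))   # ~-0.001
# ...but keeps the score finite if the unthinkable happens:
print(math.log(0.001))   # ~-6.9, bad but recoverable
```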
Forecaster | Senate win probabilities |
---|---|
Reality | 0 |
Wang | -2.89789 |
Silver | -4.911792 |
Intrade | -5.813129 |
Discuss this post on Hacker News and Reddit:
http://news.ycombinator.com/item?id=4760649
http://www.reddit.com/r/politics/comments/12vxle/was_nate_silver_the_most_accurate_2012_election/
Why didn’t you include Sam Wang from the Princeton Election Consortium in your results?
Do you know where to find the actual numbers for Sam Wang’s model?
All I can see is the map, but it only gives probability ranges (0-2.5, 2.5-20, 20-40, 40-60, 60-80, 80-97.5, 97.5-100); using the median values for those ranges would be unfair, I think. State Victory Margins also seem to be absent from the data.
Simon Jackman’s model exhibits a similar problem here – it is also probabilistic, but Jackman did not publicize all the numbers needed to make a Brier score comparison. However, it’s notable that Jackman has self-reported a Brier score of 0.0091; if this is accurate I’d say that Jackman deserves the prize overall.
Weird. Did they really wipe the page? It used to be at election.princeton.edu
OK, Wang talks about probability here (discounting probability as the wrong number to focus on): http://election.princeton.edu/faq/ He says, “For those who still insist upon getting a probability from the Meta-Analysis, it can be computed by pasting current histogram data into a spreadsheet and summing rows 270 through 538.” The data is available as a link on the FAQ.
If all you need are the state probabilities for the Brier Score, they are in this file
http://election.princeton.edu/code/matlab/stateprobs.csv
The fields are, in order: probability of Obama win, median margin, probability assuming +2% for the Democratic candidate, probability assuming +2% for the Republican candidate, state.
Probabilities for the ten competitive senate races are listed in the table at the bottom of this post
http://election.princeton.edu/2012/11/06/senate-prediction-final-election-eve/
His popular vote prediction was Obama +2.2 +/- 1.0%
(see http://election.princeton.edu/2012/11/06/presidential-final-prediction-2012-election-eve-final/ under “Final predicted popular-vote margin”)
Let us know what scores you find.
I already incorporated Wang’s state and presidential numbers, thanks. I saw the Senate races, but I want *all* the numbers before I give a Brier or RMSE to compare him with Intrade & Silver on the Senate races.
This article conspicuously fails to mention just how good a job that “Reality” model did at predicting the election results.
That said, “Reality” will have no choice but to eat crow if Romney ends up winning Florida.
Indeed. But updating the results shouldn’t be too hard if I have to flip all the Reality entries and redo all the numbers – that’s the nice thing about taking the time to write code for everything and not doing it by hand.
Where is this blog’s RSS feed?
Very cool. This BBC story, which is relevant in and of itself, suggests that Votamatic might be worth analyzing.
http://www.bbc.co.uk/news/magazine-20246741
Sorry, since I was looking for “Votamatic” I missed that you already have that site’s predictions under Linzer’s name. Hope you still find the article useful.
The electoral map has also become increasingly predictable over the past couple decades.
I.e., the accuracy with which you can predict the current electoral map based only on the last election’s electoral map has increased.
Let me share with you my personal “arithmetic”, which was published, i.e. documented, in a tweet: eugene goom
http://twitter.com/russianForGump/status/265545034139578368
kudos are welcome!)))
You conclude this post by saying that whether “Nate Silver [was] the most accurate 2012 election pundit […] depends [on] which set of predictions you’re talking about.” But in your “recommended readings” sections you write that “Nate Silver has reliably predicted electoral results better than any other pundit.” Which is it?
Please update the state-by-state Brier scores and state-by-state RMSE as the actual results are updated.
Margin of Error also did a calculation of the RMSE if you average the predictions from the poll aggregators. The result was an RMSE of 1.67, lower than Silver’s RMSE of 1.816. However, they only used Silver, Linzer, Margin of Error, DeSart/Holbrook, and Jackman. Since you’re using more aggregators, it would be interesting to see your calculation of the RMSE of the average.
We’re going to update the state margins in 2 weeks or so (rather than multiple times, right?).
This is a minor point, but it’s a pet peeve of mine[*] about election predictions, which I would have hoped would not be a problem any more after 2008: there are 56 races, not just 51 (and Nate Silver got 56/56 correct).
[*] Probably because I live in Nebraska.
If you think that the main problem with the logarithmic scoring rule is that the numbers run from 0 to -∞ instead of from 0 to 1, then just exponentiate the score to get a number from 1 to 0. (I say from 1 to 0 instead of from 0 to 1 because now 1 is best, although you could even subtract this from 1 if you really want 0 to be best.) The greater difficulty of calculating with these (you multiply them instead of adding them, and raise them to powers instead of multiplying them by coefficients) is small, and you can always take logs again to do calculations; but for reporting results, they’re perfectly nice.
And the “exponentiated logarithmic scoring rule” keeps the really important benefit: the ruthless, unforgivable punishment of those fools who dare to ever make a 100% wrong (or 0% correct) prediction. (In practice, people should be given the opportunity to say that they’re rounding off, but of course only before the results are in.)
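For illustration, the transformation is a one-liner (with a hypothetical helper name):

```python
import math

# Sketch of the suggested transformation (hypothetical helper name):
# exponentiating a summed log score maps (0, -infinity) onto (1, 0], with 1
# best, while keeping the fatal penalty for a wrong 100% prediction.
def exp_log_score(log_score_value):
    return math.exp(log_score_value)

print(exp_log_score(-0.9327548))      # Linzer's state-win log score -> ~0.39
print(exp_log_score(float("-inf")))   # the '2008' benchmark -> 0.0
```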
Thanks for doing this analysis! Where did you get Intrade’s state-by-state probabilities? Ditto for Wang? Do you know whether any of the historical probabilities – e.g. the probability forecasts issued in August, September, and October – are still available for any of these forecasters? Thanks!
1. extracted it by hand from the expired contracts’ advanced info pages, combined with a wget/xclip/grep script
2. Wang’s CSV is listed in the sources/notes page
3. Linzer sent me a copy of his forecasts for many days over the last year; I didn’t use it because this is focused on election-day predictions and it’d be even more work to extract similar predictions from the few sources that frequently updated like Linzer or Intrade
As a right-brainer, I was more than delighted to find Nate’s blog, and the rational scientific election analysis he provided. In appreciation for sanity, I painted “Silver, lining” suzannebortgray.blogspot.com
Another trackback: http://election.princeton.edu/2012/11/13/an-open-source-thank-you/
I thought of a possible problem with the analysis.
Nate Silver missed two Senate races that polling suggested would be won by Democrats, but that his “state fundamentals” suggested should lean Republican. In the only races in which his model disagreed with a simple polling average, he was thus incorrect.
So why did he do so well in the Brier? I would guess it’s because his use of state fundamentals allows him to call very tight CI’s in sparsely polled states. Any polling-alone method would have huge CI’s in AK, SC, MA, etc. because those states never get polled. Silver’s “secret sauce” of adding in state fundamentals allows him to tighten up those CI’s. I suggest that’s where his advantage in your analysis comes from.
But the trade-off is that it caused him to incorrectly call two critical Senate races. And of course no one would trade incorrect calls and possible over-fitting in important races for a tight CI in Alaska.
Perhaps you could re-do the analysis, but only use close Senate races and Presidential swing states.
I know this a very late, but for any future readers thinking of applying the same methodology:
There are proper scoring rules even for non-binary outcomes, which I would use instead of RMSE. The best known one is probably the Continuous Ranked Probability Score:
http://www.stat.washington.edu/research/reports/2004/tr463R.pdf