Thursday, March 15, 2012

Accuracy and Skill in Predicting Sporting Outcomes

[UPDATE 3/16: Dan Johnson replies to this post in the comments here.]

With March Madness about to tip off and people paying attention to their bracket predictions, it seems like a good time to revisit the issue of predictive accuracy, using sporting events as a laboratory for understanding.

Earlier this week I saw an article in the WSJ about some interesting research done by Daniel Johnson, an economist at Colorado College, who has developed a methodology to predict Olympic medal counts by nation. There are plenty of details here, including data and a link to a peer-reviewed paper (Johnson and Ali 2004).

Here is an excerpt from that WSJ article:
In what has become a biennial pre-Olympic event, Daniel Johnson, an economist at Colorado College, released his projections Monday with his usual caveat—his projections, which averaged a 93% success rate for overall medals from 2000-2010, pay exactly zero attention to the actual athletes who are participating in the upcoming Summer Games.
I contacted Johnson, and he was really great in answering questions and providing data (Thanks, Dan!). I was intrigued by the claims of predictive success of his methodology, such as this from Canada's National Post: "For a mathematical model compiled from freely available data, its predictive power is striking."

The metric used by Johnson to characterize predictive accuracy is a correlation between the prediction and the outcome. As I have written about on many occasions, predictions are best measured by skill, not accuracy and certainly not by correlation. Consider the useful anecdote that if you were to predict no tornadoes for, say, central Oklahoma every day, you'd have about a 99% accuracy rate and an eye-popping correlation. But you'd also provide no value added whatsoever.

To provide value added in forecasting any methodology must outperform some naive baseline expectation -- that is, a simple prediction. For Olympic medals there are a number of different metrics that one might use as a naive baseline. In this exercise using Johnson's predictions, I have used medal results from the prior Olympic Games as the basis for a naive forecast of results for the subsequent games. That is, a simple expectation for the 2004 Olympic Games medal counts would just be the results from the 2000 games, for the 2008 games the naive expectation would be the results from 2004 and so on. Any methodology purporting to have predictive skill ought to be able to improve upon this very simple naive baseline.

To evaluate a prediction against the naive baseline one can compute what is called a root mean squared error (RMSE), which is calculated by squaring the difference between the prediction (or the naive baseline) and the actual result for each country, averaging those squared differences, and then taking the square root to return to the original units.

For instance, the US had 97 total medals in the 2000 Sydney Olympics. Johnson's methodology predicted 103 for the 2004 Athens games. The actual number of medals won was 103. So the RMSE for the naive forecast was 6 and for Johnson was 0, meaning that in that instance, Johnson's methodology outperformed the naive baseline.
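The comparison described above can be sketched in a few lines of code. The US figures come from the example in the text; the other two countries' counts are hypothetical, chosen only to illustrate the calculation:

```python
import math

def rmse(predictions, actuals):
    """Root mean squared error: square each country's error, average
    the squares, then take the square root to return to medal units."""
    return math.sqrt(
        sum((p - a) ** 2 for p, a in zip(predictions, actuals)) / len(actuals)
    )

# US figures are from the example above (97 medals in 2000, 103 actual
# in 2004, model predicted 103); the other two countries are hypothetical.
actual_2004 = [103, 60, 30]
naive_2000  = [97, 70, 24]   # prior Games' counts serve as the naive forecast
model_pred  = [103, 64, 35]  # a hypothetical model's predictions

naive_error = rmse(naive_2000, actual_2004)
model_error = rmse(model_pred, actual_2004)

# A method demonstrates skill only if its RMSE beats the naive baseline.
print(round(naive_error, 2), round(model_error, 2))
```

On this toy data the model happens to beat the baseline; the point of the post is that, on the real top-20 data, Johnson's predictions did not.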

I have repeated this calculation for the top 20 medal-winning countries for 2004 and 2008, and compared the results of the naive forecast to the contemporaneous predictions made using Johnson's methods. Here are those results in the following graph:
What this graph shows is that, unfortunately, Johnson's predictions were not able to outperform a very simple naive baseline. The black bars are smaller than the red bars for both 2004 and 2008, indicating that the predictive error of Johnson's forecasts was larger than that of the naive baseline expectation. In other words, if you went to Las Vegas to bet on Olympic outcomes, you'd fare better using the naive baseline as a predictor than the methodology proposed by Johnson.

While I haven't comprehensively evaluated Johnson's predictions (he also predicts participation and Gold medals for all participating countries), my tentative conclusion is that Johnson's work may tell us something about relationships between different characteristics of nations and Olympic outcomes, but it does not appear to provide us with a skillful way to forecast Olympic medals. To make that case, he would have to utilize conventional metrics of forecast verification to demonstrate skill versus a naive baseline. As they say, prediction is hard, especially about the future!


  1. Roger, can you explain to a layman the essential difference between the methods of measuring skill in a 'contest' like this?

    When I look at the data, say the table of top 20 medals for 2004, I see the top 5 make up just over 50% of the medals. Being 5% out for the USA (no. 1, 101 medals) is 5 medals, which equates to being 40% out for Canada (no. 20, 12 medals).

    In reverse, being 5% out for the USA (error 5) counts for 100 times as much under RMSE as being 5% out for Canada (error 1/2).

    So to 'win' on a RMSE basis you just have to get the top few closer than the other guy, and devil take the rest?

Or, in reverse, you can get everybody else bang on, make a mess of the USA and/or China (top two in 2008), and you're toast.

Given the very wide range of medal hauls within the top 20, exactly 10:1 top to bottom in 2008, shouldn't one think about taking, for example, a log scale of outcomes if using the RMSE method? As you know I'm at the LSE at the moment struggling with econometrics among other things, and if I saw an 'exponential' data set like this one I'd think about logging it; my intuition would be that the percentage error of each individual forecast matters rather more than the absolute?

  2. Thanks Roddy ... a few replies:

    1. I am very much of the school of thought that predictions should be evaluated using the same units that they were issued in -- so if those units are medals by countries, so be it.

    2. That said, there is nothing technically wrong with performing a transformation and then conducting the evaluation. You just have to explain why you have done so (and making the evaluation metrics look better probably won't fly;-)

    Like anything involving stats, there is a lot of flexibility in an analysis, so caution is advised all around.
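Roddy's log-scale suggestion can be sketched as follows. The numbers are hypothetical and the `log_rmse` helper is just an illustrative name, not a standard metric:

```python
import math

def rmse(preds, actuals):
    return math.sqrt(sum((p - a) ** 2 for p, a in zip(preds, actuals)) / len(preds))

def log_rmse(preds, actuals):
    # Transform to log space first, so proportional misses count
    # equally regardless of the size of a country's medal haul.
    return rmse([math.log(p) for p in preds], [math.log(a) for a in actuals])

# A 5-medal miss on a 100-medal country vs. a 5-medal miss on a 10-medal one:
big_haul_miss   = log_rmse([105], [100])  # roughly a 5% proportional error
small_haul_miss = log_rmse([15], [10])    # a 50% proportional error
print(big_haul_miss < small_haul_miss)
```

In raw medal units both misses score identically; in log space the small country's miss dominates, which is exactly the reweighting Roddy is asking about.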


    1. thanks -

      Having spent my life in the City, dealing in stuff like currencies and stocks, I'm compelled to be of the school of thought that only percentages matter, a $5 error on Apple is a rounding mistake these days, whereas it's life and death on Citicorp.

      So even though someone predicting those two share prices would express the prediction in $, I'd automatically measure their error in percentages. If a tipster recommended 5 stocks for 2012, with price predictions, I'd assume an equally weighted portfolio on Jan 1 to evaluate how he did versus a rival, and percentages would do that better?

      Thinking about how to test Johnson's predictions, I always default to profitable trading - I'd try and work out if I could make money out of it, how I'd construct a portfolio of bets around his predictions - if I did that would I risk ten times as much on his USA prediction as his Canada prediction or would I weight equally? I think the latter.

      You probably don't have access to, but I expect they will start running bets on medals per country, for some countries at least. I'll try and remember to record their prices prior to the Olympics starting.

  3. Thanks Roddy, but percentages have their problems also ... if the method predicts 2 medals for Uzbekistan and they actually get 6, then that is a 300% error.

In a paper I did on flood forecast evaluation I used both the absolutes and the percentages.

    Ultimately, as you suggest, any such evaluation will have to be contextual and multi-dimensional. Olympic medals make for a nice case to explore the topic.

    The prediction/betting markets would also provide a useful naive baseline, which I'd guess would be even harder to beat.
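The Uzbekistan example shows why small denominators make percentage metrics touchy: depending on convention, the same miss can be quoted as a 200% error relative to the prediction, or by saying the actual was 300% of the forecast. A minimal sketch of the arithmetic:

```python
predicted, actual = 2, 6  # the Uzbekistan example above

absolute_error   = abs(actual - predicted)              # 4 medals: tiny in absolute terms
relative_to_pred = abs(actual - predicted) / predicted  # 2.0: a 200% miss relative to the prediction
ratio            = actual / predicted                   # 3.0: the actual is 300% of the forecast

print(absolute_error, relative_to_pred, ratio)
```

Either way, a miss that would be negligible for a 100-medal country swamps a percentage metric for a 2-medal country, which is the problem with percentages noted above.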


  4. Roger, nice article. You're exactly right of course--- if the goal is simply prediction, then the single best predictor of "future success" is probably "past success". And unfortunately, the media seem much more interested in predictive accuracy than they are in the actual point of the model exercise--- explanation.

    In my opinion, to say that the US will win roughly 100 medals this year because they won roughly 100 medals in 2008 is rather pointless. To say instead that they will win roughly 100 medals this year due to a specific combination of income and population (and in previous models, political structure and climate) is more important. The fact that the same pattern repeats itself over time indicates that the same factors are still important.

    In fact, the most interesting thing I found in this year's recalibration is that the pattern is changing over time. In fact, income is becoming slightly less important as a factor, while population is becoming more important. Clearly, that's due to the changing role of China in the Olympic medal counts.

    So while I *completely* agree with you on the predictive accuracy of the model, my goal has never been perfect prediction. It's been a greater understanding of "why" the counts look this way. The answer is pretty clear: income and population, plus a strong home nation advantage. If you want to add "inertia" to that list, and predict medals using last year's counts, I won't argue with you at all.

    ---Dan Johnson, Colorado College

  5. Roger - I decided to test your conclusions that the naive prediction beat Johnson (mainly as a device to avoid real work), so went to which has his 2008 predictions and the actual 2008 results usefully listed on the back page, and Wikipedia gave the 2004 results as the naive forecast.

    I couldn't agree your conclusions, using the 2008 top 20 medal-winning countries he beat the naive forecast handsomely. So at that point either:

    a) I'm smarter than RPJr - unlikely
    b) I can't do simple maths - I refuse to admit that
    c) We're using different data - the only possible answer

    I went back to his website and looked at, headed Performance of Olympics economic model, 2000-2010:

    Beijing 2008
    · Predicted 89 medals for host China (actual was 100)
    · Predicted 95 medals for Russia (actual was 72)
    · Predicted 28 medals for future host Britain (actual was 47)
    · Predicted 26 medals for Australia (actual was 46)

    In the March 2012 release he has as his 2008 predictions different numbers:

    China: 79
    Russia: 84
    Britain: 36
    Australia: 42

    Which explains why my results were different (and better). While the later 'prediction' for China is worse by 10 medals, the others are better by 11, 8, and 16, so on this sample of four significant countries his predictions have improved substantially since the Games took place!

    I can't find a complete list of forecasts for 2010 made prior to the 2010 Games, but I suppose that's what you had, and why your results were worse than mine.

    I understand that he is constantly improving his model to get it to fit past data better, in the hope/expectation of predicting better, but I find it strange that his 'predictions' for 2008 have mutated closer to historic reality in the latest press release. I have no problem with him thinking his model does a better predictive job as each Games passes, but I do have a problem with later (better) numbers being presented as if they were the predictions made at the time!

    Makes me feel better about my City career, even we weren't allowed to present our past performance in a later and more flattering light.

  6. Dan - I just saw your comment, I quite take your point that you're trying to back out reasons why, rather than be a character from Luck, interesting.

    I hope you see that the March 2012 pdf might suggest to an average reader that those were your predictions, rather than the predictions you would make now with a recalibrated model? And apologies if I have maligned you!