Monday, August 13, 2012

The Olympics Medals Prediction Olympics

The FT has set up a nice dataset of Olympic medals predictions, and now that the 2012 games are in the books, we are in a position to give out awards for the top performers. To recap, there were 5 predictions tracked by the FT: two by academics, Williams and Johnson, and two by companies, PricewaterhouseCoopers and Goldman Sachs. The fifth was the average of the four as calculated by the FT, called the Consensus.

 Here is a summary of the models from the FT:

The methodology used to evaluate the Olympic medal predictions is a conventional skill score based on the root mean squared error – that is, the differences between the forecasts and the actual results are squared, averaged over the predictions, and the square root is taken at the end to get us back to the original units. Skill is defined as the improvement made by a particular forecast over some naïve expectation.
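The calculation above can be sketched in a few lines of Python. The medal counts below are made-up numbers for illustration, not the FT data; the `skill` function expresses skill as the fractional improvement in RMSE over the naïve baseline (one reasonable convention, assumed here).

```python
import math

def rmse(forecast, actual):
    """Root mean squared error between two equal-length lists of medal counts."""
    return math.sqrt(sum((f - a) ** 2 for f, a in zip(forecast, actual)) / len(actual))

def skill(forecast, naive, actual):
    """Fractional improvement of a forecast over a naive baseline (positive = skill)."""
    return 1.0 - rmse(forecast, actual) / rmse(naive, actual)

# Hypothetical medal counts for three countries (illustrative only)
actual   = [46, 38, 29]
forecast = [44, 40, 27]
naive    = [36, 51, 19]

print(round(rmse(forecast, actual), 2))        # 2.0
print(round(skill(forecast, naive, actual), 2))  # 0.82
```

A skill of 0 means the forecast is no better than the naïve baseline; negative values mean it is worse.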

In the evaluation summarized in the graph at the top of this post I present results for two naïve baselines. One is simply the unadjusted results of the 2008 Olympic games (2008 Actual) and the second is a simple adjustment to the 2008 games (Fancy Naive) described below.

Here are the results:
  • Gold Medal – Consensus Forecast 
  • Silver Medal – Goldman Sachs 
  • Bronze Medal – PricewaterhouseCoopers 
  • Did not place: Johnson and Williams 
Here are the details:

In this evaluation I looked only at the top 25 medal-winning countries, which for 2012 takes us down to Azerbaijan with 10 medals. The rankings might shift if one looked all the way down the list. Further, not all predictions covered all countries, so of the top 25 medal winners, only 18 appeared across all predictions.

In addition, to explore the robustness of the evaluation, I looked at the top 25 medal winners in both 2008 and 2012, and I also computed skill with respect to the top 18 minus China and the UK, and with respect to China and the UK alone. The reason for singling out these two countries is that, as hosts of the 2008 and 2012 Olympics, they experienced a big increase in medals won as the home nation. The results were largely robust to these various permutations.

Just for fun, I introduced a slightly more rigorous baseline for the naïve prediction (Fancy Naive), which assumed that the 2012 host country experiences a 15% boost in medals over its 2008 total and the 2008 host experiences a 15% decrease. All other elements of the naïve prediction remained the same. As you might expect, this slight variation created a much more rigorous baseline of skill, coming in very close to (but not beating) the GS and PWC forecasts. The difference is slim enough to conclude that there is exceedingly little added value in either forecast. It would not be difficult to produce a slightly more sophisticated version of the naïve forecast that would beat all of the individual models.
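The Fancy Naive adjustment is simple enough to state as code. This is a minimal sketch of the rule as described above; the 2008 medal totals below are placeholders, not the actual table, and the 15% figure is the one assumed in the text.

```python
def fancy_naive(medals_2008, new_host, old_host, boost=0.15):
    """Naive 2012 forecast: carry 2008 totals forward unchanged, except
    boost the incoming host by 15% and dock the outgoing host by 15%."""
    forecast = dict(medals_2008)
    forecast[new_host] = round(forecast[new_host] * (1 + boost))
    forecast[old_host] = round(forecast[old_host] * (1 - boost))
    return forecast

# Hypothetical 2008 totals (illustrative only)
medals_2008 = {"UK": 47, "China": 100, "USA": 110}
print(fancy_naive(medals_2008, new_host="UK", old_host="China"))
# {'UK': 54, 'China': 85, 'USA': 110}
```

Every country other than the two hosts keeps its 2008 total, which is exactly the plain 2008 Actual baseline.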

However, the consensus forecast handily beat the more rigorous baseline. This deserves some additional consideration -- how is it that 4 models with no or marginal skill aggregate to produce a forecast with skill? My initial hypothesis is that the distribution of errors is highly skewed -- each method tends to badly miss a few country predictions. Such errors are reduced by averaging across the four methods, resulting in a higher skill score. I would also speculate that tossing out the two academic models, which showed no skill, would lead to a further improvement in the skill score. Of course, the skill in the Consensus forecast might also just be luck. If I have some time I'll explore these various hypotheses.
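The skewed-error hypothesis is easy to illustrate with a toy example. Suppose each of four models is exactly right on every country except one, and each badly misses a different country (the numbers below are invented for the illustration, not drawn from the FT data):

```python
import math

def rmse(forecast, actual):
    return math.sqrt(sum((f - a) ** 2 for f, a in zip(forecast, actual)) / len(actual))

actual = [30, 30, 30, 30]
# Each toy model badly misses a different country (by 20 medals)
models = [
    [30, 30, 30, 50],
    [30, 30, 50, 30],
    [30, 50, 30, 30],
    [50, 30, 30, 30],
]
# Consensus: average the four forecasts country by country
consensus = [sum(col) / len(models) for col in zip(*models)]

print([round(rmse(m, actual), 1) for m in models])  # each model: 10.0
print(round(rmse(consensus, actual), 1))            # consensus: 5.0
```

Because RMSE penalizes large errors quadratically, averaging forecasts whose big misses fall on different countries cuts the score substantially, even though no individual model improved.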

The skillful performance of GS and PWC does have commentators speculating as to whether there may have been doping involved. That analysis awaits another day.

Here are some additional results:
  • The models did worst on Russia, Japan, China and the UK
  • The models did best on Brazil, Canada, Kenya and Italy
  • Every model had several predictions right on target (error of 0 or 1)
  • Every model had big errors for some countries
Bottom line: Do the various models provide value (beyond the pub-debate variety) over a naive forecast?

The verdict is mixed. Individually, very little. Together, some. Teasing out whether the ensemble demonstrates skill due to luck or design requires further work.

