[Note: An earlier version of this post appeared last summer during the World Cup, at my other blog.]
Last summer, I organized a competition for predictions of the outcome of the World Cup, using ESPN's Bracket Predictor. The competition, fun on its own terms, also provides a useful window into the art and science of prediction and how such efforts might be evaluated. This post summarizes some of the relevant lessons.
A key concept in the evaluation of prediction is skill. A prediction is said to have skill if it improves upon a naive baseline. A naive baseline is the predictive performance that you could achieve without really having any expertise in the subject. For instance, in weather forecasting a naive baseline might just be the climatological weather for a particular day. In the mutual fund industry, it might be the performance of the S&P 500 index (or some other index).
For the World Cup predictions, one naive baseline is the set of expected outcomes based on the FIFA World Rankings. FIFA publishes a widely available ranking of teams, with the expectation that higher ranked teams are better than lower ranked teams. So even if you know nothing about world football, you could just predict outcomes based on this simple information.
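To make the ranking-based baseline concrete, here is a minimal sketch of how one might fill out a knockout bracket from nothing but a ranking table. The team names and rank numbers are hypothetical placeholders, not actual FIFA rankings, and the function name `predict_bracket` is my own. It assumes a pure knockout format with a power-of-two number of teams, so it ignores group stages and draws.

```python
def predict_bracket(teams, rank):
    """Pick the better-ranked team in every match of a knockout bracket.

    teams: list in bracket order (teams[0] plays teams[1], and so on).
    rank:  dict mapping team name to rank, where 1 is the best.
    Returns (champion, rounds), where rounds lists the winners of each round.
    """
    rounds = []
    current = list(teams)
    while len(current) > 1:
        # Pair up neighbours and advance whichever has the lower rank number.
        winners = [
            min(a, b, key=lambda team: rank[team])
            for a, b in zip(current[::2], current[1::2])
        ]
        rounds.append(winners)
        current = winners
    return current[0], rounds


# Hypothetical four-team bracket with made-up ranks.
hypothetical_rank = {"Team A": 1, "Team B": 12, "Team C": 5, "Team D": 3}
bracket_order = ["Team A", "Team B", "Team C", "Team D"]

champion, rounds = predict_bracket(bracket_order, hypothetical_rank)
print(champion, rounds)
```

The point is that the naive forecaster makes no judgment calls at all: every pick follows mechanically from the published ranking.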
The naive FIFA World Ranking prediction outperformed 64.8% of the more than one million entries across the ESPN competition. Only 33 of the 84 entries in RogersBlogGroup, the group I set up for this competition, outperformed this naive baseline. The majority of predictions thus can be said to "lack skill."
I also created entries for "expert" predictions from financial powerhouses UBS, JP Morgan and Goldman Sachs, each of which applied its sophisticated analytical techniques to predicting the World Cup. As it turns out, none of the sophisticated approaches demonstrated skill, achieving results in the 35th, 61st and 54th percentiles respectively -- a pretty poor showing given the obvious time and expense put into developing their forecasts. Imagine that these "experts" were instead predicting stock market behavior, hurricane landfalls or some other variable of interest, and you learn that you could have done better with 10 minutes and Google -- then you probably would not think that you received good value for money!
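The skill test applied here is just a comparison against the baseline's standing in the field. A minimal sketch, using only the percentiles quoted in this post (the function name `has_skill` is my own, and I am treating "outperformed 64.8% of entries" as a percentile of 64.8):

```python
# Percentile achieved by the naive FIFA World Ranking entry (from the post).
BASELINE_PERCENTILE = 64.8

# Percentiles reported in the post for the three bank forecasts.
bank_percentiles = {"UBS": 35, "JP Morgan": 61, "Goldman Sachs": 54}


def has_skill(percentile, baseline=BASELINE_PERCENTILE):
    """A forecast shows skill only if it beats the naive baseline."""
    return percentile > baseline


skillful = {name: has_skill(p) for name, p in bank_percentiles.items()}
print(skillful)
```

Every bank comes out below the baseline, which is all that "lacking skill" means in this context.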
It gets tricky when naive strategies become a bit more sophisticated. I also created a second naive forecast, the TeamWorth approach, based on the estimated transfer market value of each team and assuming that higher valued teams will beat lower valued teams. This approach outperformed 90% of the million ESPN entries and all but 11 of the 84 entries in RogersBlogGroup.
It would be fair to say that the TeamWorth approach is not really a naive forecast as it requires some knowledge of the worth of each team and some effort to collect that data. On the other hand, data shows that the market value of players is correlated with their performance on the pitch, and it is pretty simple to fill out a bracket based on football economics. This exact debate has taken place in the context of El Niño predictions, where evidence suggests that simple methods can outperform far more sophisticated approaches. Similar debates take place in the financial services industry, with respect to active management versus market indices.
One dynamic of forecast evaluation is that the notion of the naive forecast can get "defined up" over time as we learn more. If I were to run another world football prediction competition now, there would be no excuse for any participant to underperform a TeamWorth Index -- except perhaps in an effort to outperform the TeamWorth Index. Obviously, matching the index adds no value beyond the index itself. In one of my own predictions I tried explicitly to out-predict the TeamWorth Index by tweaking just a few selections, and I fell short. Adding value to sophisticated naive strategies is extremely difficult.
There are obviously incentives at play in forecast evaluation. If you are a forecaster, you might prefer a lower rather than higher threshold for skill. Who wants to be told that their efforts add no value, or even subtract it?
The situation gets even more complex when many, many predictions are issued for the same events: by chance alone, some proportion of those predictions will appear to demonstrate skill. How does one evaluate skill over time while avoiding being fooled by randomness?
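A rough sketch of the arithmetic behind this worry: suppose every entry were a pure coin-flipper on a set of win/lose matches. The numbers below are hypothetical choices of mine (15 matches, as in the knockout rounds of a 16-team bracket, and a bar of 12 correct picks), not the actual ESPN scoring; only the field size of roughly a million comes from the competition described above.

```python
from math import comb


def prob_at_least(n_matches, k_correct, p=0.5):
    """Probability of getting at least k_correct of n_matches right
    when each pick is independently correct with probability p."""
    return sum(
        comb(n_matches, k) * p**k * (1 - p) ** (n_matches - k)
        for k in range(k_correct, n_matches + 1)
    )


N_MATCHES = 15          # hypothetical: knockout matches in a 16-team bracket
THRESHOLD = 12          # hypothetical bar for "looking skillful"
N_ENTRIES = 1_000_000   # rough size of the ESPN field

p_lucky = prob_at_least(N_MATCHES, THRESHOLD)
expected_lucky = p_lucky * N_ENTRIES
print(f"share of pure guessers clearing the bar: {p_lucky:.4f}")
print(f"expected number in the field: {expected_lucky:.0f}")
```

Under these assumptions fewer than 2% of pure guessers clear the bar, yet in a field of a million that is still well over ten thousand entries that look skillful without being so. Separating that kind of luck from genuine skill requires evaluating forecasts over repeated events.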
That is the topic that I'll take up in Part II, where I'll discuss Paul the Octopus.