Tuesday, July 17, 2018

A Deeper Dive into the Scientific Basis for IAAF Testosterone Regulations

This post represents some notes, references and quotes that I'd like to have accessible. Perhaps they are useful to others. It is technical and terse, but if you are following along with Bermon and Garnier (2017, BG17), it'll probably make sense. As always here, caveat lector.

The IAAF testosterone regulations focus on what is called "circulating testosterone" or "serum testosterone." These concepts as well as "free testosterone" are defined as follows by Goldman et al. 2017:
Total testosterone refers to the sum of the concentrations of protein-bound and unbound testosterone in circulation. The fraction of circulating testosterone that is unbound to any plasma protein is referred to as the free testosterone fraction.
Total testosterone (T) and free testosterone (fT) are important variables in BG17. For reasons I do not fully understand (I welcome explanations), BG17 chose not to analyze athletes according to T values, but instead divided them into tertiles based on fT levels.
In order to test the influence of serum androgen levels on athletic performance, in each of the 21 female athletic events and 22 male athletic events, athletes were classified in tertiles according to their fT concentration.
There are two methodological issues/questions here:
  1. Why use tertiles rather than correlate all of the data in continuous fashion (discussed here)?
  2. Why use fT at all, given that T is measured in the study and fT is only calculated (a point raised by Sönksen et al. 2018)?
There is no good answer to #1.

I welcome being educated on #2. 
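On #1, the general statistical objection is that binning a continuous variable into tertiles discards information that a continuous analysis retains. A minimal sketch of the contrast, using simulated data (not the BG17 athlete data; every number here is invented for illustration):

```python
import numpy as np

# Synthetic illustration only -- not the BG17 athlete data.
# Compare a continuous correlation analysis with tertile binning
# of the same simulated fT/performance sample.
rng = np.random.default_rng(0)
n = 300
ft = rng.lognormal(mean=0.0, sigma=0.5, size=n)       # simulated fT levels
perf = 100 - 2.0 * ft + rng.normal(0, 3, size=n)      # simulated performance

# Continuous approach: one correlation over all data points.
r = np.corrcoef(ft, perf)[0, 1]

# Tertile approach (as in BG17): split by fT, compare group means.
t1, t2 = np.quantile(ft, [1 / 3, 2 / 3])
groups = np.digitize(ft, [t1, t2])                    # 0, 1, 2 = tertiles
means = [perf[groups == g].mean() for g in range(3)]

print(f"continuous r = {r:.3f}")
print("tertile mean performance:", [f"{m:.1f}" for m in means])
```

The tertile comparison reduces 300 data points to three group means, which is exactly the loss of resolution the continuous approach avoids.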

However, there is an interesting twist here. In my reading and researching this topic, I came across a recent paper by David Handelsman (2017) on the use of "free testosterone" in clinical research. For those scoring at home, Handelsman is the lead author of the other new peer-reviewed paper (in addition to BG17) that is cited in the IAAF T regulations.

Handelsman (2017) says this about "free testosterone":
Despite being extant for decades, the use of FT measurement has barely been stated in a testable form and virtually never directly tested as a refutable hypothesis. Rather by inference and repetition as if it is self-evident, it has become entrenched as an enthusiastically wielded yet largely untested concept that goes from one paper to the next without ever seeming to pass through a questioning mind.
Hmm ... more:
A valid scientific concept requires a sound foundational theory and evidence and being open to testing and refutation not just an unshakeable belief. Introducing the FT into clinical guidelines is particularly hard to understand as using that unproven criterion can only provide false reassurance by merely shifting the uncertainty to an even shakier footing in a subtle bait-and-switch.
OK, so Handelsman is not a fan of "free testosterone" as a meaningful metric to be used in clinical guidelines. Got it.

Yet, fT is the basis for the binning of athletes that forms the very basis for the analysis and conclusions of BG17. Even if we were to ignore the clearly fatal data problems that we have identified, it would seem that one expert that the IAAF relies on has eviscerated the very basis for the study performed by the other experts that IAAF relies on. Even with perfect data, BG17 is problematic.

This provides yet another reason why BG17 doesn't form a legitimate basis for any regulatory action. At CAS, lawyers will surely have a great time asking Prof. Handelsman about his views on fT and its role in the methodologies of BG17.

But there is more. I have discussed BHKE18 as a do-over, as it re-did the BG17 analysis after dropping some 220 data points. An important side note here, for comparison: BHKE18 says (emphasis added):
We have excluded 230 observations, corrected some data capture errors and performed the modified analysis on a population of 1102 female athletes.
Below is my tabulation of data points in the new study (BHKE18) and the original (BG17), with the difference between the two representing what I've called here bad data. It appears that BHKE18 has miscounted how many bad data points it identified (220 vs. 230) between the two studies. Sure, it may be a typo, but it is exceedingly sloppy to get wrong the number of bad data points you are reporting:

Event           new n   original n   bad data
100 m              96          112         16
100 m H            59           73         14
200 m              59           71         12
400 m              62           67          5
400 m H            52           67         15
800 m              56           64          8
1500 m             55           66         11
3000 m SC          49           56          7
5000 m             36           40          4
10 000 m           29           33          4
Marathon           86           92          6
Discus             36           48         12
Hammer Throw       42           54         12
Shot Put           42           54         12
Javelin            42           55         13
Long Jump          50           62         12
Triple Jump        41           54         13
High Jump          44           56         12
Pole Vault         39           48          9
20 km RW           80           97         17
Heptathlon         47           53          6
SUM              1102         1322        220
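The discrepancy can be checked directly by recomputing the column sums (counts transcribed from my tabulation above):

```python
# Recompute the column sums from the tabulation above to check the
# 220-vs-230 discrepancy. Counts are transcribed from the table.
new_n = [96, 59, 59, 62, 52, 56, 55, 49, 36, 29, 86,
         36, 42, 42, 42, 50, 41, 44, 39, 80, 47]
orig_n = [112, 73, 71, 67, 67, 64, 66, 56, 40, 33, 92,
          48, 54, 54, 55, 62, 54, 56, 48, 97, 53]

dropped = sum(orig_n) - sum(new_n)
print(sum(new_n))   # 1102 -- matches BHKE18's stated analysis population
print(sum(orig_n))  # 1322
print(dropped)      # 220  -- not the 230 exclusions that BHKE18 reports
```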

In addition to dropping the bad data, BHKE18 re-does the analysis focused on T rather than fT, in response to Sönksen et al. 2018. In what sure looks like some kind of p-hacking, BHKE18 adopts several significant changes to its methodology (quotes from BHKE18 below, followed by my comments in italics):
  • BHKE18: "we used running events from 400 m up to 1 mile, on the basis that that is where T produces its greatest performance-enhancing effects"
    • This of course was a conclusion of BG17 and subsequently the focus of the IAAF regulations. With the bad data of BG17, there is no longer any evidence that these events are where T has the greatest effects. There is no basis here. This is affirming the consequent. 
  • BHKE18: "we have aggregated result from the long sprints (400 m events), then middle-distance runs (800 m and 1500 m), and finally long sprints and middle-distance runs (400 m, 800 m and 1500 m), into one event group for further statistical analysis."
    • No such grouping or aggregation was used in the original study. In fact, BG17 said this: "These different athletic events were considered as distinct independent analyses and adjustment for multiple comparisons was not required." That is a complete reversal, so which is it?
  • BHKE18: "The time results were transformed into an index, that is, percentage of the best performance achieved by each event"
    • The original analysis focused on absolute times, not a percentage index.
  • BHKE18: "We have used the Spearman rankorder correlation coefficient to explore the correlation between competition results and testosterone levels, using a two-sided test at the 0.05 significance level."
    • Here we see a new statistical test (a common one at that), based on ranked ordering of values rather than the actual values themselves.
  • BHKE18: "Finally, we used a serum T threshold concentration of 2 nmol/L to identify a group of female athletes with ‘high T’ levels, for comparison against the results of athletes with T levels of less than 2 nmol/L (‘normal T’ levels).
    • Where does 2 nmol/L come from? The IAAF regulations identify 5 nmol/L. 
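To make two of these new methodological choices concrete, here is a minimal sketch of a percentage-of-best index and a Spearman rank-order test, on simulated data only (not the athlete results; the event, sample size, and distributions are invented, and scipy's spearmanr is used for the two-sided test):

```python
import numpy as np
from scipy.stats import spearmanr

# Synthetic sketch of two BHKE18 choices: the percentage-of-best
# index and the Spearman rank-order correlation. Data are simulated,
# not the athlete results.
rng = np.random.default_rng(1)
times = rng.normal(52.0, 1.5, size=60)        # simulated 400 m times (s)
testosterone = rng.lognormal(-0.5, 0.6, 60)   # simulated serum T (nmol/L)

# "Percentage of the best performance": best (lowest) time = 100%.
index = 100 * times.min() / times

# Spearman uses only the rank ordering of each variable,
# not the actual values; two-sided test at the 0.05 level.
rho, p = spearmanr(testosterone, index)
print(f"rho = {rho:.3f}, p = {p:.3f}")
```

The point of the sketch is that both steps change what is being tested: the index rescales performances within each event, and the rank-order test throws away the magnitudes that BG17's analysis of absolute values used.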
If BHKE18 could simply have applied the statistical methods of BG17 to the new dataset (minus the bad data) and obtained the same or similar results, it seems logical that they would have done so. The presence of so many methodological modifications is a big red flag.

BHKE18 clearly represents a do-over from a flawed study. But the authors double down and characterize it as somehow being an independent verification of the flawed study:
In conclusion, our complementary statistical analysis and sensitivity analysis using a modified analysis population shows consistent and robust results and has strengthened the evidence from this study, where we have shown exploratory evidence that female athletes with the highest T concentration have a significant competitive advantage over those with lower T concentration, in 400 m, 400 m hurdles, 800 m and hammer throw, and that there is a very strong correlation between testosterone levels and best results obtained in the World Championships in those events. A similar trend is also observed for 1500 m and pole vault events.
The results of BHKE18 are not the same as those of BG17, which were the basis for the IAAF regulations. From where I sit, the situation looks like this:
  • BG17 relied on flawed data and questionable methods and arrived at a set of results that formed the basis for IAAF regulations;
  • When the data were challenged and errors identified, it appears that the methods of BG17 could not reproduce the results of BG17 using the new dataset (with the flawed data removed);
  • But the IAAF regulations had already been released, focused on four specific events identified based on BG17;
  • Altering the regulations based on errors in BG17 would of course mean admitting that BG17 was flawed in important respects, undercutting the basis for the regulations and IAAF;
  • So it seems that a considerable variety of new methods were introduced in BHKE18 that allowed the reduced-form dataset to plausibly approximate the results of BG17;
  • The regulations thus stand as written;
  • The new conclusions of BHKE18 are characterized as reinforcing BG17, giving at least a surface impression of a greater scientific basis for the regulations. In fact, the opposite has occurred.
This episode helps to illustrate why it is not a good idea for an organization responsible for implementing desired regulations to also be in charge of performing the science that produces the evidence on which those regulations are based. This would seem obvious, but has not really taken hold in the world of sports governance.

There is of course a need for BJSM to require the authors to release all data and code for both papers. One question that independent researchers will want to ask is how the results look when the methods of BG17 are applied to the data of BHKE18. Why were the methodological innovations introduced and what were their quantitative effects? This is the basic sort of independent check that makes science strong.

It is a fascinating case and no doubt has a few more twists and turns to come. It'll make for a great case study when all is said and done.

