Thursday, July 12, 2018

A Call for Bermon and Garnier (2017) to be Retracted

The New York Times has a story just out on an analysis we've done on a recent IAAF study. Take a seat, this is a bombshell and these are my individual views on it.

Earlier this year, the IAAF announced new regulations governing natural testosterone levels in female athletes. One of the few academic studies that the regulations refer to is Bermon and Garnier (2017, hereafter BG17), conducted by two IAAF researchers and published in the British Journal of Sports Medicine.

Earlier this year several of us (me, Ross Tucker and Erik Boye) formally asked Drs. Bermon and Garnier to release their data (the part not involving private medical data) for purposes of independent replication. Dr. Bermon shared with us a subset of that data last week.

What the shared data shows is absolutely remarkable and has led to the three of us submitting a "Discussion" (here in PDF, as submitted except for a few typos fixed and page numbers added) to BJSM calling for BG17 to be formally retracted.

Here is what we wrote in that submission:
Due to the pervasiveness of problematic data we are calling for Bermon and Garnier (2017) to be retracted immediately by the authors and by BJSM. If a new analysis is subsequently completed and submitted for publication, we request that it be done so only with a full, independent audit of the underlying data and results by a team committed to keeping private the associated medical data. Further, upon publication, any such analysis should also in parallel publish performance data (i.e. not the medical data with privacy concerns) such that replication of this part of the analysis is possible by any independent scholar.

This case illustrates the importance of data sharing in science as well as the role of independent checks on data with policy or regulatory significance. We encourage BJSM to adopt immediately a more rigorous policy on data availability consistent with best practices among scientific publishers. Mistakes happen. Science is robust because they can be corrected.
We identified 3 types of errors in their data:
  •  Duplicated athletes: more than one time is included for an individual. In each of these instances, more than one time from the 2011 and 2013 World Championships is included for the same athlete, contrary to the paper’s stated methods;
  • Duplicated times: the same time is repeated once or more for an individual athlete, which is clearly a data error;
  • Phantom times: no athlete could be found with the reported time for the event.
We also identified the inclusion of times from Russian athletes who had been disqualified due to doping. The Table below shows a summary of the number of problematic data points we found for four events in the BG17 analysis (400m, 400mH, 800m, 1500m).


We found between 17% and 33% problematic data in the four women's events and suggested that such errors may be present throughout other women's and men's data. This is unacceptable in a peer-reviewed scientific paper. Thus, we have called for retraction, as a matter of basic scientific integrity. It's not a difficult call.

Much to our surprise we subsequently learned that Dr. Bermon and three colleagues had published a new letter at BJSM just days before our submission (which I'll call BHKE18 hereafter). From all indications, BHKE18 represents a "do over" after they realized that they had serious data problems in the original work. 

BHKE18 also unambiguously confirms our identification of bad data. Just compare the number of data points included in BHKE18 versus BG17 shown in the graph below.

There are fully 220 data points eliminated from one analysis to the next, representing ~17% of the total. The elimination of data (which BHKE18 alludes to in passing as some double counting in BG17) clearly supports our critique.

And yet, the elimination of problematic data points still does not reconcile with our re-creation of the BG17 dataset for the four events that we looked at closely.

Data points for four women's events
Event      BG17   BHKE18   PTB18
400 m        67       62      45
400 m H      67       52      48
800 m        64       56      53
1500 m       66       55      51
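As a rough check on the counts above, here is a minimal Python sketch that computes, for each event, the share of BG17 data points that do not appear in the later datasets. It assumes that PTB18 denotes our re-creation and that the gap between two counts is a fair proxy for the number of problematic points; the function name `removed_share` is made up for illustration.

```python
# Data-point counts per event, copied from the table above.
counts = {
    "400 m":   {"BG17": 67, "BHKE18": 62, "PTB18": 45},
    "400 m H": {"BG17": 67, "BHKE18": 52, "PTB18": 48},
    "800 m":   {"BG17": 64, "BHKE18": 56, "PTB18": 53},
    "1500 m":  {"BG17": 66, "BHKE18": 55, "PTB18": 51},
}


def removed_share(event, later):
    """Fraction of BG17 data points for `event` that are absent from `later`."""
    row = counts[event]
    return (row["BG17"] - row[later]) / row["BG17"]


# Percentage of BG17 points missing from each later dataset, by event.
shares = {
    event: {later: round(100 * removed_share(event, later), 1)
            for later in ("BHKE18", "PTB18")}
    for event in counts
}

for event, pct in shares.items():
    print(f"{event}: {pct['BHKE18']}% dropped in BHKE18, {pct['PTB18']}% absent from PTB18")
```

Under that assumption, the PTB18 shares fall in the 17% to 33% range quoted earlier, while the BHKE18 shares are smaller in every event, which is the mismatch described above.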

It appears that problematic data remain. Further, the new letter is not peer reviewed, nor are its data publicly available for replication. By not being candid about their data errors in BG17, Bermon and colleagues have added confusion on top of confusion. This is not how science is supposed to work.

Mistakes are made; that is inevitable. What matters is what happens after that.

Here is what my colleague Erik Boye, Oslo University Hospital, says about this episode:
A set of data normally follows publications like BG17. The conclusions are linked to the data and their interpretation and the data must be made available to the general public. That is basic in science. If now the authors have received some help to understand that their data are fraught with errors they should call for a retraction and resubmit a new paper with new data if they so wish. We have pointed out this to the IAAF and to the publisher. None of them appear to handle this well. It is unacceptable that the paper stands and that a few people are informed that there were serious errors attached to the data and that unseen changes have been made to the data set. Furthermore, there is no sign that the new set of data has been subjected to any more of a critical review or that it will be released for external scrutiny.

For these reasons we should insist that scientific standards and rules are followed. In my practice at editorial boards (the EMBO and FEBS publications) I am certain that such a faulty data set would have released a demand for a retraction, with the possibility of a resubmittal.
I agree 100%. There is only one acceptable outcome here. BG17 must be retracted by BJSM. This could be done by a request from the authors or by BJSM itself. You do not get a "do over" in research when such pervasive errors are made. If Bermon et al. wish to submit another analysis for peer review, I'd expect that the data should be provided and a full audit done prior to publication.

By all indications neither the BG17 authors nor IAAF intend to retract the paper. This says something about conflicts of interest in research, I would think. Thus, the ball is in BJSM's court. This will be a test of scientific integrity standards at BJSM. I hope they pass, for the sake of BJSM and of research integrity.

The IAAF analysis is far too important to be treated in such a sloppy fashion. I'll be following up on the significance of the flawed data, IAAF's refusal to retract and what it might mean for the fate of the IAAF T regulations in days to come.

Tuesday, June 5, 2018

IAAF Opens Up on Testosterone: Some Reactions

My experience is that sports organizations rarely like to engage in public. However, this norm seems to be evolving, perhaps motivated both by necessity and by a newer commitment to engagement among forward-thinking sports administrators.

The IAAF, via one of its lawyers, Jonathan Taylor of Bird & Bird, has written a lengthy response to a Sports Integrity Initiative article on proposed new testosterone regulations. That Sports Integrity Initiative commentary can be found here. I am less interested in the back-and-forth than I am in what the IAAF response says about their proposed approach to testosterone regulation.

In this post I offer a few thoughts on the new IAAF arguments and applaud their commitment to public engagement. In that spirit, if Mr. Taylor or IAAF wish to comment here, I'm happy to host their views. Sport is better through such engagement, even (especially) when there is disagreement that can be clearly articulated.

Can IAAF Regulate Sport According to Athlete Biological Characteristics?

The answer here is clearly "yes."

Sports organizations routinely segregate athletes by biological characteristics, most obviously by weight classes in boxing and wrestling, and of course systematically in the Paralympics.

There are two logical fallacies here that are worth discarding up front, one typically advanced by opponents to T regulations and one advanced by IAAF in support of T regulations. They are:
  • Fallacy #1: Governing bodies do not (generally) regulate other "natural advantages" so IAAF cannot regulate T. 
  • Fallacy #2: Governing bodies do (sometimes) regulate other "natural advantages" so IAAF should regulate T.
The issue here is not going to be settled by invocation of general principles, but rather by the specific question of whether it is appropriate for IAAF to regulate women's athletics based on endogenous levels of T across four events.

What a big-picture view can tell us, however, is that biological regulation of athletes in the disciplines of athletics is highly unusual, and T would be the only biological characteristic that is regulated in all of the Olympic sport of Athletics. This fact does not determine an outcome, but it should set a high bar for approving any such regulatory action.

Are the Male/Female Classifications in Athletics Regulated According to Biology?

This is tricky. The answer however is clearly "no."

The T regulations are an effort to regulate the male/female classification according to a biological characteristic. But at the moment, and for much of recent years, there has been no such regulation in place. Thus, the male/female classification is not regulated at present according to biological characteristics.
  • Do men and women have different biological characteristics? Of course.
  • Are men typically faster and stronger? Of course.
Male and female are genders with a strong, but not perfect, correlation with the biology of sex. One important reason for this imperfect correlation is that male and female are discrete categorizations in sport competition, whereas the biology of sex is not discrete. 

Taylor observes that the parties to the Chand case (2015, here in PDF) all agreed that it is appropriate to distinguish male and female classifications because males enjoy such a performance advantage that virtually all females would be excluded from elite competition. This issue need not be debated.

But none of this helps in resolving questions about T regulation. The challenge at hand is to determine eligibility for participation in male and female classifications. To state that males and females compete in different classifications is simply to set the stage. We should beware circular reasoning. 

Right now, society outside of sport does all the work of determining who is female and who is male for purposes of elite sports competition. This work embodies complex social processes that integrate considerations of biology, culture, politics, law and more into determining who gets classified as female and who as male. IAAF is not satisfied with how society is doing this work and is seeking to create its own regulations. (As a comparison, not long ago the sport of gymnastics decided that it was unhappy with how society was accounting for the ages of its athletes and so internalized age certification as a regulatory process.)

But make no mistake, at present neither males nor females are classified according to biological characteristics by the IAAF. That is what the proposed regulations are about. 

Does Science Distinguish Between Males and Females?

The short answer is "no." There is simply no single biological characteristic -- chromosomes, hormones, whatever -- that uniquely and unambiguously distinguishes the biological sexes. This point is not particularly controversial, even for IAAF.

However, IAAF appears to be somewhat conflicted on this topic. The issues here are not male vs female, but female vs. female. This is clearly explained in the CAS Chand decision which (at 51) noted: "the Regulations do not police the male/female divide but establish a female/female divide within the female category."

The key issue here is not whether some females have biological characteristics more typically found among males, but whether those specific biological characteristics are clearly associated with a performance difference between females of a magnitude similar to those typically observed between males and females. Please read that sentence again.

In his response, Taylor introduces a biological characteristic into the debate that is not mentioned in the IAAF T regulations: testes. He writes:
the physical advantages enjoyed by male athletes are due to the fact that they have testes that produce testosterone in amounts that circulate in serum in the range 7.7 to 29.4 nmol/L, whereas female athletes have ovaries that produce much lower levels of testosterone, in the range 0.12 to 1.79 nmol/L. (RP: Based on a forthcoming, but not yet available paper by Handelsman et al.)
Males have testes, females have ovaries. OK, got it. Taylor then writes:
Due to conditions referred to as ‘differences in sex development’ (most often, 5-α reductase deficiency, or partial androgen insensitivity), an XY baby’s testes may not descend from the abdomen, so that it presents on birth with female or ambiguous genitalia, and so may be assigned the female sex. At puberty, however, the testes start producing the much larger levels of testosterone mentioned above, which (unless the XY female is completely androgen-insensitive) will have an androgenising effect on her body and will increase her circulating haemoglobin, in the same way as happens to an XY male at puberty.
So we have an individual with a "condition" called DSD ("differences in sex development") who has testes but "may be assigned the female sex." At puberty her body responds the same way as that of an XY male. So is Taylor implying that she is actually a male (i.e., with testes)?

Taylor further writes:
the ‘natural physiology’ of most DSD athletes includes male gonads (testes) that produce levels of circulating testosterone not in the normal female range (0.12 to 1.79 nmol/L in serum) but in the normal male range (7.7 to 29.4 nmol/L), producing (if the athlete is not androgen-insensitive) lean body mass and levels of circulating haemoglobin well above the normal female range and rather in the normal male range.
The language here is important (and confused). "Male gonads" -- can individual body parts have their own genders? Can a woman have male body parts? Can a man have female body parts? If IAAF wants a gonad policy they should call it a gonad policy. The presence of gonads/testes is completely irrelevant in the proposed regulations, as they are focused on testosterone levels.

This issue is important because the proposed IAAF regulations stress that they are not seeking to classify athletes as male or female:
These Regulations exist solely to ensure fair and meaningful competition within the female classification, for the benefit of the broad class of female athletes. In no way are they intended as any kind of judgement on or questioning of the sex or the gender identity of any athlete.
Taylor's introduction of testes would seem to betray this claim. He writes:
If it is not fair and meaningful for a female athlete to have to compete with a male athlete whose gonads produce 10-30 times more testosterone than she does, so too it is not fair and meaningful for that female athlete to have to compete with a DSD athlete whose gonads also produce 10-30 times more testosterone than she does.
The IAAF regulations explain that a female athlete who does not meet the regulatory standard "will not be eligible to compete in the female classification in a Restricted Event at an International Competition" but would be eligible to compete in the male classification.

If this is not sex testing and classification according to physical characteristics, I don't know what is.

What about Performance?

Taylor's response emphasizes biological differences and says very little about performance or how it is related to testosterone (presumably because he was writing a response, so fair enough). However, performance is absolutely essential to the IAAF case.

The CAS ruled against IAAF in the Chand case because the evidence available did not support the claim that high testosterone levels in certain female athletes were associated with a difference in performance between these women and other women that was similar to the difference between male and female.

CAS explained (527):
The Panel considers the lack of evidence regarding the quantitative relationship between enhanced levels of endogenous testosterone and enhanced athletic performance to be an important issue. While a 10% difference in athletic performance certainly justifies having separate male and female categories, a 1% difference may not justify a separation between athletes in the female category, given the many other relevant variables that also legitimately affect athletic performance. The numbers therefore matter. 
Because the performance numbers matter, levels of testosterone (or, unmentioned in the regulations, the presence of testes) are by themselves irrelevant. CAS judged that it is only if high levels of testosterone can be associated with a performance advantage of the order enjoyed by men over women that regulation might make sense.

CAS further explained (528):
However, in order to justify excluding an individual from competing in a particular category on the basis of a naturally occurring characteristic such as endogenous testosterone, it is not enough simply to establish that the characteristic has some performance enhancing effect. Instead, the IAAF needs to establish that the characteristic in question confers such a significant performance advantage over other members of the category that allowing individuals with that characteristic to compete would subvert the very basis for having the separate category and thereby prevent a level playing field. The degree or magnitude of the advantage is therefore critical.
This is where things get a bit sticky.

Upon receiving this judgment IAAF sought to commission research on the relationship of testosterone and performance. Rather than invite independent researchers to conduct such research, IAAF conducted it internally. This approach is clearly problematic because IAAF, as an interested party in the outcome, can hardly be called independent. Thus, IAAF handicapped itself from the outset.

The resulting research (much discussed on this blog) is Bermon and Garnier (2017). Not surprisingly, IAAF claims that its results support further regulation of testosterone. A close look doesn't really support this claim.

The most striking conclusion of this paper -- taking it at face value -- is that the resulting statistics come nowhere close to the 10% difference in athletic performance cited by CAS as an appropriate basis for regulation. In fact, the paper found no performance difference worth regulating in 19 of 23 athletic events in which women compete.

Think about that. After all of the talk of the overwhelming importance of testosterone to athletic performance, an internal IAAF study designed to look for such differences could not justify testosterone regulations for almost all women's events. Clearly, testosterone is not the magical athletic elixir claimed by some.

Of the four events that IAAF decided to regulate (400m, 400mH, 800m and 1500m), the Bermon and Garnier study found performance differences between the highest and lowest tertiles to be, respectively for each event: 1.5%, 3.1%, 1.6% and 0.3% (from Table 6). Only the first 3 were claimed to be statistically significant differences. These differences are similar to those that led CAS to suspend the original IAAF regulations at dispute in the Chand case, and far removed from 10%.
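To make the gap concrete, here is a small Python sketch that sets the four tertile differences quoted above against the 10% figure that CAS cited. The numbers are only the ones given in this post; the variable names are illustrative.

```python
# Tertile performance differences (%) quoted above from Table 6 of BG17.
differences = {"400m": 1.5, "400mH": 3.1, "800m": 1.6, "1500m": 0.3}

# The male/female performance gap (%) that CAS cited as justifying separate categories.
CAS_BENCHMARK = 10.0

for event, diff in differences.items():
    ratio = diff / CAS_BENCHMARK
    print(f"{event}: {diff}% difference, {ratio:.0%} of the {CAS_BENCHMARK:.0f}% benchmark")

largest = max(differences.values())
print(f"Largest reported difference: {largest}%")
```

Even the largest of the four differences is less than a third of the benchmark.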

Given these numbers, it is surprising that IAAF has sought to again implement regulations that were previously unsuccessful at CAS. There seems to be no case here. Perhaps IAAF has some additional science in its back pocket.

Finally, on performance data, a last note. Along with Ross Tucker and Erik Boye, I have requested the underlying performance data of Bermon and Garnier. This is a normal request in research and should be expected of anyone who publishes peer reviewed research. Thus far IAAF has not released the data. This is deeply troublesome. We have engaged the journal's editor and will push this as far as it takes. As CAS explains, the numbers matter.

Bottom Line

It is very good to see IAAF (or its representatives) engaging in public. This is good for sport governance, for athlete rights and for the effective role of evidence in decision making. In this instance, I applaud Jonathan Taylor for his lengthy defense of the newly proposed IAAF regulations. He provides a further window into their basis and justification. His arguments also raise some important issues worthy of further debate and discussion.

Monday, May 14, 2018

Wisdom on College Hoops from Carlon Brown

This series of comments below from Carlon Brown (@carlonautentico) on Twitter offers a fantastic perspective on what college basketball prepares an athlete for and what it does not. Brown played at the University of Colorado and professionally in the US and overseas.

It'd be great to get him to my class next year. The perspective below is smart; have a read.


Saturday, May 12, 2018

Reverse Engineering Bermon & Garnier (2017)

Last week, Ross Tucker, Erik Boye and I wrote a letter to the British Journal of Sports Medicine calling for the authors of Bermon and Garnier (2017, BG17) and their sponsor IAAF to release the performance data used in their study. You can read our letter here.

Professor Joe Guinness, a statistician and visiting assistant professor at Cornell (@joeguinness) has attempted to reproduce the reported performance results in BG17 for the women's 800m, which we discuss in our letter.
BG17 report an average time of 121.80 seconds with a standard deviation of 5.42 seconds, for 64 times included in the analysis. Prof. Guinness sought to reproduce these numbers by brute force (his code is linked in the Tweet above).

He has found that he can only come close to reproducing the times by removing Caster Semenya's 2011 time plus that of one other athlete. See his results above. He notes in a Tweet: "There are some caveats here, especially how rounding is dealt with, so this shouldn’t be taken as definitive."
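To illustrate the general idea of such a brute-force search (this is not Prof. Guinness's code, which I have not reproduced here), here is a minimal Python sketch. It tries every way of dropping a fixed number of times from a candidate list and keeps the combinations whose mean and standard deviation match the published summary within a tolerance. The times are invented for illustration, the function name `find_removals` is hypothetical, and the use of the sample standard deviation is an assumption.

```python
from itertools import combinations
from statistics import mean, stdev


def find_removals(times, target_mean, target_sd, n_remove, tol=0.01):
    """Return index tuples whose removal matches the target mean and SD within tol.

    Uses the sample standard deviation (an assumption about how the
    published figure was computed).
    """
    matches = []
    for drop in combinations(range(len(times)), n_remove):
        kept = [t for i, t in enumerate(times) if i not in drop]
        if (abs(mean(kept) - target_mean) <= tol
                and abs(stdev(kept) - target_sd) <= tol):
            matches.append(drop)
    return matches


# Hypothetical 800m times in seconds (invented, NOT the BG17 data).
candidate_times = [115.9, 118.4, 119.2, 120.1, 120.7, 121.3, 122.0, 123.5, 125.8, 131.6]

# Pretend the "published" summary was computed after dropping entries 0 and 7,
# then rounded to two decimals as a paper would report it.
published = [t for i, t in enumerate(candidate_times) if i not in (0, 7)]
target_mean = round(mean(published), 2)
target_sd = round(stdev(published), 2)

matches = find_removals(candidate_times, target_mean, target_sd, n_remove=2)
print(matches)
```

The tolerance stands in for the rounding caveat: a looser tolerance admits more candidate combinations, so a search like this can suggest which times were excluded but cannot prove it.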

If these numbers are correct then it would mean that Caster Semenya's time was removed while 2 times from Mariya Savinova in 2011 and 2013 would remain. Savinova's times have officially been removed from the IAAF database after she was suspended for doping at both the 2011 and 2013 World Championships.

The inclusion of Savinova alone would call into question the meaningfulness of BG17, and the deletion of Semenya's time would be curious. Of course, we cannot be sure about any of this until IAAF and BG17 release their data.

The longer the stonewall, the more questions will be raised as to why they just don't release the data. Are there some things in their work that they are afraid to show?

Wednesday, April 25, 2018

Some Resources on Testosterone Regulation in Elite Athletics

It appears that the IAAF is on the verge of announcing another set of regulations governing allowable natural testosterone levels in women athletes. This is a bad idea. In anticipation of the new regulations I thought I'd post up some resources for those who are interested in the issue.

The regulation of testosterone is only the latest effort by sports administrators to police how women should look. There are countless biological characteristics of humans that in some way contribute to elite athletics performance -- testosterone in women (but not in men) is the only naturally occurring biological characteristic that is regulated.

I take on this issue in some depth in this paper
Pielke Jr, R. (2017). Sugar, spice and everything nice: how to end ‘sex testing’ in international athletics. International Journal of Sport Policy and Politics, 9:649-665. (PDF, free to read)
Remarkably, in 2011 the IAAF listed a set of criteria for how women should appear, lest they be reported to officials for investigation of their testosterone level. These criteria are listed in the slide below (from a talk I give on this subject). Two of the nine criteria have to do with breast size and shape.
More generally, let's say that you accept the argument that testosterone should be regulated. I don't, but let's play along. Even here, the science relied on by the IAAF does not support the case that they are making.

The IAAF bases its case on this paper:
Bermon, S., & Garnier, P. Y. (2017). Serum androgen levels and their relation to performance in track and field: mass spectrometry results from 2127 observations in male and female elite athletes. British Journal of Sports Medicine
That paper purports to show that women in certain events gain a benefit from testosterone levels in the higher end of the range found in female athletes. 

That paper has received a range of criticism as being flawed. Notably:
Franklin S, Ospina Betancurt J, Camporesi S What statistical data of observational performance can tell us and what they cannot: the case of Dutee Chand v. AFI & IAAF Br J Sports Med Published Online First: 23 February 2018. doi: 10.1136/bjsports-2017-098513
That paper concludes:
we believe that it is scientifically incorrect to draw the conclusions in the Bermon and Garnier paper from the statistical results presented. Their paper claims that certain athletes have an advantage in precisely the five events where a significant effect was found: we calculate that a high share of those five significant effects are likely to be false positives.
The statistician Andrew Gelman also weighed in:
the statistical analysis data processing in this paper is such a mess that I can’t really figure out what data they are working with, what exactly they are doing, or the connection between some of their analyses and their scientific goals.
Gelman was motivated by Simon Franklin, a post-doc at LSE, who emailed him that:
There are more than a few problems with the paper, not least the fact that it makes causal claims from correlations in a highly selective sample, and the bizarre choice of comparing averages within the highest and lowest tertiles of fT levels using a student t-test (without any other statistical tests presented).

But most problematic is the multiple hypothesis testing. The authors test for a correlation between T-levels and performance across a total of over 40 events (men and women) and find a significant correlation in 5 events, at the 5% level. They then conclude:
Female athletes with high fT levels have a significant competitive advantage over those with low fT in 400 m, 400 m hurdles, 800 m, hammer throw, and pole vault.
These are 5 events for which they found significant correlations! And we are led to believe that there is no such advantage for any of the other events.
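The multiple-testing concern can be illustrated with a short Python sketch. It asks how many "significant" results pure chance would be expected to produce, under two simplifying assumptions that are mine and not the paper's: exactly 40 independent tests, and no real effect in any of them.

```python
from math import comb

n_tests = 40   # "over 40 events", rounded down here for simplicity (an assumption)
alpha = 0.05   # the 5% significance level quoted above

# Expected number of false positives if no event has a real effect: about 2.
expected_false_positives = n_tests * alpha

# Chance of at least one false positive across all tests.
p_at_least_one = 1 - (1 - alpha) ** n_tests


def prob_at_least(k):
    """Binomial probability of k or more false positives among n_tests tests."""
    return sum(
        comb(n_tests, j) * alpha**j * (1 - alpha) ** (n_tests - j)
        for j in range(k, n_tests + 1)
    )


print(f"Expected false positives: {expected_false_positives:.1f}")
print(f"P(at least one false positive): {p_at_least_one:.2f}")
print(f"P(at least five false positives): {prob_at_least(5):.3f}")
```

With about two false positives expected from chance alone, a handful of significant events out of 40-plus tests is weak evidence unless the analysis corrects for the number of comparisons.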
I also have written two critiques. First, a post-publication peer review:
My bottom line: The paper has some significant methodological issues, most notably the inclusion of female athletes who doped with those with naturally high levels of T. There is some double counting of athletes in 2011 and 2013. There is also speculation that the male findings are contaminated by doping. Methodological issues notwithstanding, the paper nonetheless strongly reinforces the 2015 CAS Chand decision. 

The IAAF data of Bermon and Garnier (2017) don't support the proposed regulations of testosterone in women at distances of 400m to one mile. Consider the figure below:
Let's accept the analysis as valid (maybe not, but let's play along). These IAAF data (pink bar) indicate that over distances of 400m, 800m and 1500m high testosterone women are on average 1.1% faster than their low testosterone counterparts. Unfair, IAAF might scream.

But look at the data that IAAF collected for men at 400m and 1500m (blue bar). These data indicate that high testosterone men are on average 1.1% faster than their low testosterone counterparts. Surely if high T in women in selected events where performance differs is to be regulated, then high T in men in selected events where performance differs is also to be regulated?

If IAAF responds that the T standard applies only to women but not men based on performance data, then this is the very hallmark of sex discrimination. This only scratches the surface of flawed T regulation.

We shall see what IAAF actually presents tomorrow. However, based on the evidence and arguments that IAAF have presented thus far, its T regulations are focused on one athlete (initials CS), discriminatory, sexist and (for those who think analysis of T levels in athlete performance is relevant) resting on a flawed evidence base.

There can be little doubt that this new policy will be challenged at CAS.

Monday, April 23, 2018

A Talk on College Sports: How University Faculty Can Help Fix College Athletics

Over the weekend I gave a talk to the Coalition on Intercollegiate Athletics (COIA). Here it is:
Comments welcomed!

Monday, April 16, 2018

Six-Figure Salaries in the US NGBs Reported in 2016 IRS 990s

The figure above was motivated by a column by Sally Jenkins in the Washington Post a few weeks ago in which she reported that the USOC pays 129 staff members more than $100k per year. I was curious how that statistic looks for the 47 Olympic National Governing Bodies.

One way to take a look at that question is to dive into the 2016 (most recently reported) IRS 990 forms required for non-profits and sum up all the highly-paid employees reported on those forms. We identified 184 individuals on the 990s with compensation levels above $100,000. This number is surely an underestimate as not all such salaries are reported on the 990s. In addition, there are many subcontracts and transfers reported on the 990s to other non-profits or businesses for which it is impossible to identify salaries. US Soccer, for instance, transferred some $60+ million and awarded USSF employees unspecified bonuses. Even so, the reported numbers tell us something.

Summary stats:

  • Number of salaries between $100k and $150k = 56
  • $150k - $200k = 46
  • $200k - $250k = 33
  • $250k - $500k = 39
  • >$500k = 10
  • >$1M = 3 (a subset of the >$500k group)
All told, these salaries for the 184 employees total $44.7 million and represent just about 4% of the total NGB budgets.
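As a quick consistency check on these summary stats, here is a short Python sketch. It assumes the three salaries above $1M are already counted inside the >$500k bracket (which is what makes the brackets sum to 184) and treats "just about 4%" as exactly 4%, so the implied budget figure is only a rough back-of-envelope estimate.

```python
# Salary brackets from the list above; the >$1M count is assumed to be
# a subset of the >$500k bracket, so it is not added separately.
brackets = {
    "$100k-$150k": 56,
    "$150k-$200k": 46,
    "$200k-$250k": 33,
    "$250k-$500k": 39,
    ">$500k": 10,
}

total_employees = sum(brackets.values())
total_pay = 44.7e6          # reported total, in dollars
budget_share = 0.04         # "just about 4%", treated as exact (an assumption)

average_pay = total_pay / total_employees
implied_total_budget = total_pay / budget_share

print(f"Employees counted: {total_employees}")
print(f"Average compensation: ${average_pay:,.0f}")
print(f"Implied combined NGB budgets: ${implied_total_budget / 1e9:.2f} billion")
```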

Please send along comments, corrections and data requests via Twitter @rogerpielkejr.