Friday, July 27, 2018

BJSM Lets Stand a Deeply Flawed Paper. Why?

A few weeks ago the New York Times wrote about a paper we had submitted to the British Journal of Sports Medicine calling for Bermon and Garnier (2017, BG17) to be retracted. You can get the back story at the links in the previous sentence, but two things to understand up front:
  • BG17 is not just any old scientific paper -- it is the only scientific basis for regulations to be implemented by the International Association of Athletics Federations (IAAF) governing naturally occurring testosterone in female athletes.
  • Calling for a retraction of a scientific paper is not something to be done lightly. BG17 is the first paper that I have called on publicly to be retracted in 25+ years of publishing, reviewing, serving on editorial boards and studying science in policy. Yes, it is that bad.
Today the editor of BJSM emailed with the following information (quoted in full from his message):
1. The BJSM editorial team has considered the various points raised to us about retracting BG17 (including yours) and stand by our decision that retraction would be inappropriate. 
2. We respect the authors’ decision not to open these data even though we support the general principle of data sharing.
No retraction, no sharing of data.

When should a paper be retracted? Fortunately, the publisher of BJSM has a policy on retraction which states:
Retractions are considered by journal editors in cases of evidence of unreliable data or findings, plagiarism, duplicate publication, and unethical research.
This retraction policy is similar to the recommendation of the Committee on Publication Ethics (COPE), whose guidelines are followed by most scientific publishers (PDF):
Retraction is a mechanism for correcting the literature and alerting readers to publications that contain such seriously flawed or erroneous data that their findings and conclusions cannot be relied upon.
Why have we called for BJSM to retract BG17? Because of seriously flawed and erroneous data such that the paper's conclusions cannot be relied on. This is such a clear case that it is baffling why BJSM has chosen not only to let the paper stand, but to not require the paper's flawed data to be shared openly.

Why is the case so clear?
An editorial board should be so lucky as to have such a clear-cut case. It's a no-brainer. The message to the authors of BG17 should be: sorry, guys, but this effort is so flawed that we are going to pull it. End of story.

So why did the BJSM editorial board act as they did? I have no insight on their internal deliberations, but given the retraction policy of the publisher of BJSM and the ethical guidelines suggested by COPE, there logically can be only three possibilities.
  • The BJSM editorial board disagrees with our analysis and the statement of the lead author of BG17 that there are pervasive errors underlying the original analysis. This would be a very odd position to take, as it is contrary to both evidence and the admission of the researchers who wrote BG17 and BHKE18.
  • The BJSM editorial board accepts that there are pervasive errors in BG17 and has decided to let the paper stand regardless. This too would be an odd position to take, as it is unethical and unscientific (according to COPE) and contrary to the retraction policy that BJSM is expected to follow. No scientific publisher worthy of the title would let flawed science stand. 
  • The BJSM editorial board is uncertain about the presence of pervasive errors in BG17 and in the face of this uncertainty has decided to let the paper stand. This would be an exceptionally odd position to take in light of the fact that BJSM has concluded (emphasis added), "We respect the authors’ decision not to open these data even though we support the general principle of data sharing." A really good way to understand the true depth of data errors would be for BJSM to require the authors of BG17 to release fully 100% of their data that has no privacy concerns.
So which is it?

The bottom line here is that BJSM has failed in its core scientific obligations. By all appearances, BJSM is acting in the interests of the IAAF and protecting IAAF research from normal scientific scrutiny. I have no idea why this is so, but it is a subject that I'll continue to pursue.

Ross Tucker, Erik Boye and I will be revising our submission to BJSM and will ask to have it reviewed, published and linked to BG17. Obviously more to come; stay tuned.

(Note: This post represents my views only, though everyone is welcome to share them.)

Tuesday, July 17, 2018

A Deeper Dive into the Scientific Basis for IAAF Testosterone Regulations

This post represents some notes, references and quotes that I'd like to have accessible. Perhaps they are useful to others. It is technical and terse, but if you are following along on Bermon and Garnier (2017, BG17), it'll probably make sense. As always here, caveat lector.

The IAAF testosterone regulations focus on what is called "circulating testosterone" or "serum testosterone." These concepts as well as "free testosterone" are defined as follows by Goldman et al. 2017:
Total testosterone refers to the sum of the concentrations of protein-bound and unbound testosterone in circulation. The fraction of circulating testosterone that is unbound to any plasma protein is referred to as the free testosterone fraction.
T and fT are important variables in BG17. For reasons I do not fully understand (I welcome explanations), BG17 chose not to analyze athletes according to their T values, but instead divided them into tertiles based on fT levels.
In order to test the influence of serum androgen levels on athletic performance, in each of the 21 female athletic events and 22 male athletic events, athletes were classified in tertiles according to their fT concentration.
There are two methodological issues/questions here:
  1. Why use tertiles rather than correlate all of the data in continuous fashion (discussed here)?
  2. Why use fT at all, given that T is measured in the study and fT is merely calculated from it (a point raised by Sönksen et al. 2018)?
There is no good answer to #1.
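To make #1 concrete, here is a minimal sketch, on entirely synthetic data, of what the continuous alternative looks like next to a tertile-style comparison. The top-vs-bottom-tertile t-test is a stand-in of my own, not a reconstruction of BG17's exact procedure:

```python
# A minimal sketch, on synthetic data, contrasting a tertile-style
# comparison with a continuous correlation. The top-vs-bottom-tertile
# t-test is my stand-in, not BG17's exact procedure.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
ft = rng.lognormal(mean=0.0, sigma=0.5, size=60)   # hypothetical fT values
times = 52.0 + rng.normal(0.0, 1.0, size=60)       # hypothetical 400 m times (s)

# Binned approach: classify athletes into tertiles by fT, then compare
# performance in the highest tertile against the lowest.
cuts = np.quantile(ft, [1 / 3, 2 / 3])
low, high = times[ft <= cuts[0]], times[ft >= cuts[1]]
t_stat, p_binned = stats.ttest_ind(high, low, equal_var=False)

# Continuous approach: use every observation, with no arbitrary cut points.
r, p_cont = stats.pearsonr(ft, times)

print(f"tertile comparison:     p = {p_binned:.3f}")
print(f"continuous correlation: r = {r:.3f}, p = {p_cont:.3f}")
```

Binning throws away within-tertile variation and statistical power; the continuous version uses all of the information in the data.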

I welcome being educated on #2. 
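The relevant background on #2 is that fT is not measured directly; it is typically derived from measured total T, SHBG and albumin, most commonly via the Vermeulen et al. (1999) mass-action equation. Whether BG17 used that particular formula is an assumption on my part; the sketch below simply illustrates that fT is a calculated quantity:

```python
# A sketch of the standard Vermeulen et al. (1999) calculation of free
# testosterone from total testosterone (TT), SHBG and albumin. Whether
# BG17 used this exact formula is an assumption on my part.
import math

def free_testosterone(tt_nmol_l, shbg_nmol_l, albumin_g_l=43.0):
    """Free T (nmol/L) via the Vermeulen mass-action equation."""
    KA = 3.6e4   # albumin-T binding constant (L/mol)
    KS = 1.0e9   # SHBG-T binding constant (L/mol)
    tt = tt_nmol_l * 1e-9          # convert to mol/L
    shbg = shbg_nmol_l * 1e-9
    alb = albumin_g_l / 69_000.0   # mol/L, albumin ~69 kDa
    n = 1.0 + KA * alb
    # Free T solves: KS*n*FT^2 + (n + KS*(SHBG - TT))*FT - TT = 0
    b = n + KS * (shbg - tt)
    ft = (-b + math.sqrt(b * b + 4.0 * KS * n * tt)) / (2.0 * KS * n)
    return ft * 1e9                # back to nmol/L

print(free_testosterone(tt_nmol_l=1.5, shbg_nmol_l=60.0))  # illustrative values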

However, there is an interesting twist here. In my reading and researching this topic, I came across a recent paper by David Handelsman (2017) on the use of "free testosterone" in clinical research. For those scoring at home, Handelsman is the lead author of the other new peer-reviewed paper (in addition to BG17) that is cited in the IAAF T regulations.

Handelsman (2017) says this about "free testosterone":
Despite being extant for decades, the use of FT measurement has barely been stated in a testable form and virtually never directly tested as a refutable hypothesis. Rather by inference and repetition as if it is self-evident, it has become entrenched as an enthusiastically wielded yet largely untested concept that goes from one paper to the next without ever seeming to pass through a questioning mind.
Hmm ... more:
A valid scientific concept requires a sound foundational theory and evidence and being open to testing and refutation not just an unshakeable belief. Introducing the FT into clinical guidelines is particularly hard to understand as using that unproven criterion can only provide false reassurance by merely shifting the uncertainty to an even shakier footing in a subtle bait-and-switch.
OK, so Handelsman is not a fan of "free testosterone" as a meaningful metric to be used in clinical guidelines. Got it.

Yet fT is the basis for the binning of athletes on which the entire analysis and conclusions of BG17 rest. Even if we were to ignore the clearly fatal data problems that we have identified, one expert the IAAF relies on has eviscerated the very basis of the study performed by the other experts the IAAF relies on. Even with perfect data, BG17 is problematic.

This provides yet another reason why BG17 does not form a legitimate basis for any regulatory action. At CAS, lawyers will surely have a great time asking Prof. Handelsman about his views on fT and its role in the methodology of BG17.

But there is more. I discussed BHKE18 as a do-over: it re-did the BG17 analysis after dropping some 220 data points. An important side note for comparison: BHKE18 itself says (emphasis added):
We have excluded 230 observations, corrected some data capture errors and performed the modified analysis on a population of 1102 female athletes.
Below is my tabulation of data points in the new analysis (BHKE18) and the original (BG17), with the difference between the two representing what I've called here bad data. It appears that BHKE18 miscounted the number of bad data points it identified (220 vs. 230). Sure, it may be a typo, but it is exceedingly sloppy to get wrong the very number of bad data points you are reporting:

Event          New (n)   Original (n)   Bad data
100 m             96        112            16
100 m H           59         73            14
200 m             59         71            12
400 m             62         67             5
400 m H           52         67            15
800 m             56         64             8
1500 m            55         66            11
3000 m SC         49         56             7
5000 m            36         40             4
10 000 m          29         33             4
Marathon          86         92             6
Discus            36         48            12
Hammer Throw      42         54            12
Shot Put          42         54            12
Javelin           42         55            13
Long Jump         50         62            12
Triple Jump       41         54            13
High Jump         44         56            12
Pole Vault        39         48             9
20 km RW          80         97            17
Heptathlon        47         53             6
SUM             1102       1322           220
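As a quick arithmetic check, summing the per-event differences in the table reproduces 220, not the 230 that BHKE18 reports (sample sizes transcribed from the two papers):

```python
# Sum the per-event sample sizes transcribed from BHKE18 (new) and BG17
# (orig) and check the count of dropped data points against BHKE18's 230.
new = [96, 59, 59, 62, 52, 56, 55, 49, 36, 29, 86,
       36, 42, 42, 42, 50, 41, 44, 39, 80, 47]
orig = [112, 73, 71, 67, 67, 64, 66, 56, 40, 33, 92,
        48, 54, 54, 55, 62, 54, 56, 48, 97, 53]

dropped = sum(o - n for o, n in zip(orig, new))
print(sum(new), sum(orig), dropped)   # -> 1102 1322 220, not 230
```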

In addition to dropping the bad data, BHKE18 re-does the analysis focused on T rather than fT, in response to Sönksen et al. 2018. In what sure looks like some kind of p-hacking, BHKE18 adopts several significant changes to its methodology (quotes from BHKE18 below, followed by my comments in italics):
  • BHKE18: "we used running events from 400 m up to 1 mile, on the basis that that is where T produces its greatest performance-enhancing effects"
    • This, of course, was a conclusion of BG17 and subsequently the focus of the IAAF regulations. Given the bad data in BG17, there is no longer any evidence that these are the events where T has its greatest effects. There is no independent basis here; this is affirming the consequent.
  • BHKE18: "we have aggregated result from the long sprints (400 m events), then middle-distance runs (800 m and 1500 m), and finally long sprints and middle-distance runs (400 m, 800 m and 1500 m), into one event group for further statistical analysis."
    • No such grouping or aggregation was used in the original study. In fact, BG17 said this: "These different athletic events were considered as distinct independent analyses and adjustment for multiple comparisons was not required." A complete reversal; which is it?
  • BHKE18: "The time results were transformed into an index, that is, percentage of the best performance achieved by each event"
    • The original analysis focused on absolute times, not a percentage index.
  • BHKE18: "We have used the Spearman rankorder correlation coefficient to explore the correlation between competition results and testosterone levels, using a two-sided test at the 0.05 significance level."
    • Here we see a new statistical test (a common one at that), based on ranked ordering of values rather than the actual values themselves.
  • BHKE18: "Finally, we used a serum T threshold concentration of 2 nmol/L to identify a group of female athletes with ‘high T’ levels, for comparison against the results of athletes with T levels of less than 2 nmol/L (‘normal T’ levels).
    • Where does 2 nmol/L come from? The IAAF regulations identify 5 nmol/L. 
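For concreteness, here is a minimal sketch of the transform-and-test pipeline those quotes describe, on entirely synthetic data. Reading "percentage of the best performance" as best time divided by each time, times 100, is my interpretation, not a quote:

```python
# A sketch of the pipeline the BHKE18 quotes describe, on synthetic data:
# transform times into a percentage-of-best index, then apply a two-sided
# Spearman rank-order test. The index formula is my interpretation.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
times = 110.0 + rng.normal(0.0, 2.0, size=56)   # hypothetical 800 m times (s)
t_level = rng.lognormal(0.0, 0.6, size=56)      # hypothetical serum T (nmol/L)

index = 100.0 * times.min() / times             # best performance scores 100

rho, p = stats.spearmanr(t_level, index, alternative="two-sided")
print(f"Spearman rho = {rho:.3f}, p = {p:.3f}")
```

Note that a rank-based test discards the magnitudes of performance differences entirely, which is exactly the kind of quiet methodological shift that ought to be justified, not merely adopted.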
It seems logical that if BHKE18 could have simply applied the same statistical methods of BG17 to the new dataset (minus the bad data) and obtained the same or similar results, they would have. The presence of so many methodological modifications is a big red flag.

BHKE18 clearly represents a do-over from a flawed study. But the authors double down and characterize it as somehow being an independent verification of the flawed study:
In conclusion, our complementary statistical analysis and sensitivity analysis using a modified analysis population shows consistent and robust results and has strengthened the evidence from this study, where we have shown exploratory evidence that female athletes with the highest T concentration have a significant competitive advantage over those with lower T concentration, in 400 m, 400 m hurdles, 800 m and hammer throw, and that there is a very strong correlation between testosterone levels and best results obtained in the World Championships in those events. A similar trend is also observed for 1500 m and pole vault events.
The results of BHKE18 are not the same as those of BG17, which were the basis for the IAAF regulations. From where I sit, the situation looks like this:
  • BG17 relied on flawed data and questionable methods and arrived at a set of results that formed the basis for IAAF regulations;
  • When the data were challenged and errors identified, it seems that the methods of BG17 could no longer reproduce the results of BG17 on the new dataset (without the flawed data);
  • But the IAAF regulations had already been released, focused on four specific events identified based on BG17;
  • Altering the regulations based on errors in BG17 would of course mean admitting that BG17 was flawed in important respects, undercutting the basis for the regulations and IAAF;
  • So it seems that a considerable variety of new methods were introduced in BHKE18 that allowed the reduced-form dataset to plausibly approximate the results of BG17;
  • The regulations thus stand as written;
  • The new conclusions of BHKE18 are characterized as reinforcing BG17, giving at least a surface impression of a greater scientific basis for the regulations. In fact, the opposite has occurred.
This episode helps to illustrate why it is not a good idea for the organization responsible for implementing desired regulations to also be in charge of performing the science that produces the evidence on which those regulations are based. This would seem obvious, but it has not really taken hold in the world of sports governance.

There is of course a need for BJSM to require the authors to release all data and code for both papers. One question that independent researchers will want to ask is how the results look when the methods of BG17 are applied to the data of BHKE18. Why were the methodological innovations introduced and what were their quantitative effects? This is the basic sort of independent check that makes science strong.

It is a fascinating case and no doubt has a few more twists and turns to come. It'll make for a great case study when all is said and done.

Thursday, July 12, 2018

A Call for Bermon and Garnier (2017) to be Retracted

The New York Times has a story just out on an analysis we've done of a recent IAAF study. Take a seat: this is a bombshell, and these are my individual views on it.

Earlier this year, the IAAF announced new regulations governing natural testosterone levels in female athletes. One of the few academic studies that the regulations refer to is Bermon and Garnier (2017, hereafter BG17), conducted by two IAAF researchers and published in the British Journal of Sports Medicine.

Earlier this year, several of us (Ross Tucker, Erik Boye and I) formally asked Drs. Bermon and Garnier to release their data (the part not involving private medical data) for purposes of independent replication. Dr. Bermon shared a subset of that data with us last week.

What the shared data shows is absolutely remarkable and has led to the three of us submitting a "Discussion" (here in PDF, as submitted except for a few typos fixed and page numbers added) to BJSM calling for BG17 to be formally retracted.

Here is what we wrote in that submission:
Due to the pervasiveness of problematic data we are calling for Bermon and Garnier (2017) to be retracted immediately by the authors and by BJSM. If a new analysis is subsequently completed and submitted for publication, we request that it be done so only with a full, independent audit of the underlying data and results by a team committed to keeping private the associated medical data. Further, upon publication, any such analysis should also in parallel publish performance data (i.e. not the medical data with privacy concerns) such that replication of this part of the analysis is possible by any independent scholar.

This case illustrates the importance of data sharing in science as well as the role of independent checks on data with policy or regulatory significance. We encourage BJSM to adopt immediately a more rigorous policy on data availability consistent with best practices among scientific publishers. Mistakes happen. Science is robust because they can be corrected.
We identified three types of errors in their data:
  • Duplicated athletes: more than one time is included for the same athlete, drawn from both the 2011 and 2013 World Championships, contrary to the paper's stated methods;
  • Duplicated times: the same time is repeated once or more for an individual athlete, which is clearly a data error;
  • Phantom times: no athlete could be found with the reported time for the event.
We also identified the inclusion of times from Russian athletes who had been disqualified due to doping. The Table below shows a summary of the number of problematic data points we found for four events in the BG17 analysis (400m, 400mH, 800m, 1500m).


We found between 17% and 33% problematic data in the four women's events and suggested that such errors may be present throughout other women's and men's data. This is unacceptable in a peer-reviewed scientific paper. Thus, we have called for retraction, as a matter of basic scientific integrity. It's not a difficult call.
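For illustration, here is a minimal sketch of how the first two error types could be flagged mechanically. The DataFrame and its column names are hypothetical, not the BG17 data; phantom times would additionally require cross-checking against official results, which is omitted here:

```python
# A sketch of flagging the first two error types. The DataFrame and its
# column names are hypothetical, not the BG17 dataset.
import pandas as pd

df = pd.DataFrame({
    "athlete": ["A", "A", "B", "C", "C"],
    "event":   ["400 m"] * 5,
    "time":    [50.1, 50.9, 51.2, 52.0, 52.0],
})

# Duplicated athletes: the same athlete contributes more than one time to
# the same event, contrary to the paper's stated one-time-per-athlete method.
dup_athletes = df[df.duplicated(subset=["athlete", "event"], keep=False)]

# Duplicated times: identical (athlete, event, time) rows repeated verbatim.
dup_times = df[df.duplicated(subset=["athlete", "event", "time"], keep=False)]

print(dup_athletes)
print(dup_times)
```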

Much to our surprise, we subsequently learned that Dr. Bermon and three colleagues had published a new letter at BJSM (which I'll call BHKE18 hereafter) just days before our submission. From all indications, BHKE18 represents a "do-over" after they realized that they had serious data problems in the original work.

BHKE18 also unambiguously confirms our identification of bad data. Just compare the number of data points included in BHKE18 versus BG17, shown in the graph below.

Fully 220 data points were eliminated from one analysis to the next, representing roughly 17% of the total (220 of 1,322). The elimination of data (which BHKE18 alludes to in passing as some double counting in BG17) clearly supports our critique.

And yet, the elimination of problematic data points still does not reconcile with our re-creation of the BG17 dataset for the four events that we looked at closely.

Data points for four women's events

Event      BG17   BHKE18   PTB18
400 m       67      62       45
400 m H     67      52       48
800 m       64      56       53
1500 m      66      55       51

It appears that problematic data remain. Further, the new letter is not peer reviewed, nor are its data publicly available for replication. By not being candid about their data errors in BG17, Bermon and colleagues have added confusion on top of confusion. This is not how science is supposed to work.

Mistakes are made; it is inevitable. What matters is what happens after that.

Here is what my colleague Erik Boye, Oslo University Hospital, says about this episode:
A set of data normally follows publications like BG17. The conclusions are linked to the data and their interpretation and the data must be made available to the general public. That is basic in science. If now the authors have received some help to understand that their data are fraught with errors they should call for a retraction and resubmit a new paper with new data if they so wish. We have pointed out this to the IAAF and to the publisher. None of them appear to handle this well. It is unacceptable that the paper stands and that a few people are informed that there were serious errors attached to the data and that unseen changes have been made to the data set. Furthermore, there is no sign that the new set of data has been subjected to any more of a critical review or that it will be released for external scrutiny.

For these reasons we should insist that scientific standards and rules are followed. In my practice at editorial boards (the EMBO and FEBS publications) I am certain that such a faulty data set would have released a demand for a retraction, with the possibility of a resubmittal.
I agree 100%. There is only one acceptable outcome here: BG17 must be retracted by BJSM. This could be done by a request from the authors or by BJSM itself. You do not get a "do-over" in research when such pervasive errors are made. If Bermon et al. wish to submit another analysis for peer review, I'd expect the data to be provided and a full audit done prior to publication.

By all indications, neither the authors of BG17 nor the IAAF intend to retract the paper. This says something about conflicts of interest in research, I would think. Thus, the ball is in BJSM's court. This will be a test of scientific integrity standards at BJSM. I hope they pass, for BJSM's sake and for the sake of research integrity.

The IAAF analysis is far too important to be treated in such sloppy fashion. I'll be following up on the significance of the flawed data, the IAAF's refusal to retract, and what it might mean for the fate of the IAAF T regulations in the days to come.