
The next Lancet retraction? [“Subcortical brain volume differences in participants with attention deficit hyperactivity disorder in children and adults”]

Someone who prefers to remain anonymous asks for my thoughts on this post by Michael Corrigan and Robert Whitaker, “Lancet Psychiatry Needs to Retract the ADHD-Enigma Study: Authors’ conclusion that individuals with ADHD have smaller brains is belied by their own data,” which begins:

Lancet Psychiatry, a UK-based medical journal, recently published a study titled Subcortical brain volume differences in participants with attention deficit hyperactivity disorder in children and adults: A cross-sectional mega-analysis. According to the paper’s 82 authors, the study provides definitive evidence that individuals with ADHD have altered, smaller brains. But as the following detailed review reveals, the study does not come close to supporting such claims.

Below are tons of detail, so let me lead with my conclusion, which is that the criticisms coming from Corrigan and Whitaker seem reasonable to me. That is, based on my quick read, the 82 authors of that published paper seem to have made a big mistake in what they wrote.

I’d be interested to see if the authors have offered any reply to these criticisms. The article has just recently come out—the journal publication is dated April 2017—and I’d like to see what the authors have to say.

OK, on to the details. Here are Corrigan and Whitaker:

The study is beset by serious methodological shortcomings, missing data issues, and statistical reporting errors and omissions. The conclusion that individuals with ADHD have smaller brains is contradicted by the “effect-size” calculations that show individual brain volumes in the ADHD and control cohorts largely overlapped. . . .

Their results, the authors concluded, contained important messages for clinicians: “The data from our highly powered analysis confirm that patients with ADHD do have altered brains and therefore that ADHD is a disorder of the brain.” . . .

The press releases sent to the media reflected the conclusions in the paper, and the headlines reported by the media, in turn, accurately summed up the press releases. Here is a sampling of headlines:

Given the implications of this study’s claims, it deserves to be closely analyzed. Does the study support the conclusion that children and adults with ADHD have “altered brains,” as evidenced by smaller volumes in different regions of the brain? . . .

Alternative Headline: Large Study Finds Children with ADHD Have Higher IQs!

To discover this finding, you need to spend $31.50 to purchase the article, and then make a special request to Lancet Psychiatry to send you the appendix. Then you will discover, on pages 7 to 9 in the appendix, a “Table 2” that provides IQ scores for both the ADHD cohort and the controls.

Although there were 23 clinical sites in the study, only 20 reported comparative IQ data. In 16 of the 20, the ADHD cohort had higher IQs on average than the control group. In the other four clinics, the ADHD and control groups had the same average IQ (with the mean IQ scores for both groups within two points of each other.) Thus, at all 20 sites, the ADHD group had a mean IQ score that was equal to, or higher than, the mean IQ score for the control group. . . .

And why didn’t the authors discuss the IQ data in their paper, or utilize it in their analyses? . . . Indeed, if the IQ data had been promoted in the study’s abstract and to the media, the public would now be having a new discussion: Is it possible that children diagnosed with ADHD are more intelligent than average? . . .

They Did Not Find That Children Diagnosed with ADHD Have Smaller Brain Volumes . . .

For instance, the authors reported a Cohen’s d effect size of .19 for differences in the mean volume of the accumbens in children under 15. . . in this study, for youth under 15, it was the largest effect size of all the brain volume comparisons that were made. . . . Approximately 58% of the ADHD youth in this convenience sample had an accumbens volume below the average in the control group, while 42% of the ADHD youth had an accumbens volume above the average in the control group. Also, if you knew the accumbens volume of a child picked at random, you would have a 54% chance that you could correctly guess which of the two cohorts—ADHD or healthy control—the child belonged to. . . . The diagnostic value of an MRI brain scan, based on the findings in this study, would be of little more predictive value than the toss of a coin. . . .

The authors reported that the “volumes of the accumbens, amygdala, caudate, hippocampus, putamen, and intracranial volume were smaller in individuals with ADHD compared with controls in the mega-analysis” (p. 1). If this is true, then smaller brain volumes should show up in the data from most, if not all, of the 21 sites that had a control group. But that was not the case. . . . The problem here is obvious. If authors are claiming that smaller brain regions are a defining “abnormality” of ADHD, then such differences should be consistently found in mean volumes of ADHD cohorts at all sites. The fact that there was such variation in mean volume data is one more reason to see the authors’ conclusions—that smaller brain volumes are a defining characteristic of ADHD—as unsupported by the data. . . .
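
As a quick sanity check on the overlap arithmetic in that passage: assuming two normal distributions with equal variance (my assumption, not something verified against the paper's data), a Cohen's d of 0.19 does translate into roughly the quoted figures. A minimal R sketch:

# Overlap implied by Cohen's d = 0.19, assuming two equal-variance normals
d <- 0.19

# Proportion of the ADHD group falling below the control-group mean
pnorm(d)            # ~0.575, i.e., roughly 58%

# Probability that a randomly chosen control exceeds a randomly chosen
# ADHD participant (the "common language" effect size)
pnorm(d / sqrt(2))  # ~0.55, i.e., little better than a coin flip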

And now here’s what the original paper said:

We aimed to investigate whether there are structural differences in children and adults with ADHD compared with those without this diagnosis. In this cross-sectional mega-analysis [sic; see P.P.S. below], we used the data from the international ENIGMA Working Group collaboration, which in the present analysis was frozen at Feb 8, 2015. Individual sites analysed structural T1-weighted MRI brain scans with harmonised protocols of individuals with ADHD compared with those who do not have this diagnosis. . . .

Our sample comprised 1713 participants with ADHD and 1529 controls from 23 sites . . . The volumes of the accumbens (Cohen’s d=–0·15), amygdala (d=–0·19), caudate (d=–0·11), hippocampus (d=–0·11), putamen (d=–0·14), and intracranial volume (d=–0·10) were smaller in individuals with ADHD compared with controls in the mega-analysis. There was no difference in volume size in the pallidum (p=0·95) and thalamus (p=0·39) between people with ADHD and controls.

The above demonstrates some forking paths, and there are a bunch more in the published paper, for example:

Exploratory lifespan modelling suggested a delay of maturation and a delay of degeneration, as effect sizes were highest in most subgroups of children (<15 years) versus adults (>21 years): in the accumbens (Cohen’s d=–0·19 vs –0·10), amygdala (d=–0·18 vs –0·14), caudate (d=–0·13 vs –0·07), hippocampus (d=–0·12 vs –0·06), putamen (d=–0·18 vs –0·08), and intracranial volume (d=–0·14 vs 0·01). There was no difference between children and adults for the pallidum (p=0·79) or thalamus (p=0·89). Case-control differences in adults were non-significant (all p>0·03). Psychostimulant medication use (all p>0·15) or symptom scores (all p>0·02) did not influence results, nor did the presence of comorbid psychiatric disorders (all p>0·5). . . .

Outliers were identified at above and below one and a half times the interquartile range per cohort and group (case and control) and were excluded . . . excluding collinearity of age, sex, and intracranial volume (variance inflation factor <1·2) . . . The model included diagnosis (case=1 and control=0) as a factor of interest, age, sex, and intracranial volume as fixed factors, and site as a random factor. In the analysis of intracranial volume, this variable was omitted as a covariate from the model. Handedness was added to the model to correct for possible effects of lateralisation, but was excluded from the model when there was no significant contribution of this factor. . . . stratified by age: in children aged 14 years or younger, adolescents aged 15–21 years, and adults aged 22 years and older. We removed samples that were left with ten patients or fewer because of the stratification. . . .
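
For concreteness, the model described in that excerpt corresponds roughly to a mixed model along the following lines. This is only a sketch with simulated data and hypothetical variable names (volume, dx, age, sex, icv, site), not the authors' actual code:

library(lme4)

# Hypothetical stand-in for the pooled ENIGMA data; variable names are assumptions
set.seed(1)
enigma <- data.frame(
  volume = rnorm(200),
  dx     = rbinom(200, 1, 0.5),  # diagnosis: case = 1, control = 0
  age    = runif(200, 8, 60),
  sex    = rbinom(200, 1, 0.5),
  icv    = rnorm(200),           # intracranial volume
  site   = factor(sample(1:10, 200, replace = TRUE))
)

# Diagnosis, age, sex, and intracranial volume as fixed effects; site as a
# random intercept. (When intracranial volume is the outcome, icv is dropped.)
fit <- lmer(volume ~ dx + age + sex + icv + (1 | site), data = enigma)
summary(fit)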

Forking paths are fine; I have forking paths in every analysis I’ve ever done. But forking paths render published p-values close to meaningless; in particular I have no reason to take seriously a statement such as, “p values were significant at the false discovery rate corrected threshold of p=0·0156,” from the summary of the paper.

So let’s forget about p-values and just look at the data graphs, which appear in the published paper:



Unfortunately these are not raw data or even raw averages for each age; instead they are “moving averages, corrected for age, sex, intracranial volume, and site for the subcortical volumes.” But we’ll take what we’ve got.

From the above graphs, it doesn’t seem like much of anything is going on: the blue and red lines cross all over the place! So now I don’t understand this summary graph from the paper:

I mean, sure, I see it for Accumbens, I guess, if you ignore the older people. But, for the others, the lines in the displayed age curves cross all over the place.

The article in question has the following list of authors: Martine Hoogman, Janita Bralten, Derrek P Hibar, Maarten Mennes, Marcel P Zwiers, Lizanne S J Schweren, Kimm J E van Hulzen, Sarah E Medland, Elena Shumskaya, Neda Jahanshad, Patrick de Zeeuw, Eszter Szekely, Gustavo Sudre, Thomas Wolfers, Alberdingk M H Onnink, Janneke T Dammers, Jeanette C Mostert, Yolanda Vives-Gilabert, Gregor Kohls, Eileen Oberwelland, Jochen Seitz, Martin Schulte-Rüther, Sara Ambrosino, Alysa E Doyle, Marie F Høvik, Margaretha Dramsdahl, Leanne Tamm, Theo G M van Erp, Anders Dale, Andrew Schork, Annette Conzelmann, Kathrin Zierhut, Ramona Baur, Hazel McCarthy, Yuliya N Yoncheva, Ana Cubillo, Kaylita Chantiluke, Mitul A Mehta, Yannis Paloyelis, Sarah Hohmann, Sarah Baumeister, Ivanei Bramati, Paulo Mattos, Fernanda Tovar-Moll, Pamela Douglas, Tobias Banaschewski, Daniel Brandeis, Jonna Kuntsi, Philip Asherson, Katya Rubia, Clare Kelly, Adriana Di Martino, Michael P Milham, Francisco X Castellanos, Thomas Frodl, Mariam Zentis, Klaus-Peter Lesch, Andreas Reif, Paul Pauli, Terry L Jernigan, Jan Haavik, Kerstin J Plessen, Astri J Lundervold, Kenneth Hugdahl, Larry J Seidman, Joseph Biederman, Nanda Rommelse, Dirk J Heslenfeld, Catharina A Hartman, Pieter J Hoekstra, Jaap Oosterlaan, Georg von Polier, Kerstin Konrad, Oscar Vilarroya, Josep Antoni Ramos-Quiroga, Joan Carles Soliva, Sarah Durston, Jan K Buitelaar, Stephen V Faraone, Philip Shaw, Paul M Thompson, Barbara Franke.

I also found a webpage for their research group, featuring this wonderful map:

The number of sites looks particularly impressive when you include each continent twice like that. But they should really do some studies in Antarctica, given how huge it appears to be!

P.S. Following the links, I see that Corrigan and Whitaker come into this with a particular view:

Mad in America’s mission is to serve as a catalyst for rethinking psychiatric care in the United States (and abroad). We believe that the current drug-based paradigm of care has failed our society, and that scientific research, as well as the lived experience of those who have been diagnosed with a psychiatric disorder, calls for profound change.

This does not mean that the critics are wrong—presumably the authors of the original paper came into their research with their own strong views—it can just be helpful to know where they’re coming from.

P.P.S. The paper discussed above uses the term “mega-analysis.” At first I thought this might be some sort of typo, but apparently the expression does exist and has been around for a while. From my quick search, it appears that the term was first used by James Dillon in a 1982 article, “Superanalysis,” in Evaluation News, where he defined mega-analysis as “a method for synthesizing the results of a series of meta-analyses.”

But in the current literature, “mega-analysis” seems to simply refer to a meta-analysis that uses the raw data from the original studies.

If so, I’m unhappy with the term “mega-analysis” because: (a) The “mega” seems a bit hypey, (b) What if the original studies are small? Then even all the data combined might not be so “mega”, and (c) I don’t like the implication that plain old “meta-analysis” doesn’t use the raw data. I’m pretty sure that the vast majority of meta-analyses use only published summaries, but I’ve always thought of using the original data as the preferred version of meta-analysis.

I bring up this mega-analysis thing not as a criticism of the Hoogman et al. paper—they’re just using what appears to be a standard term in their field—but just as an interesting side-note.

P.P.P.S. The above post represents my current impression. As I wrote, I’d be interested to see the original authors’ reply to the criticism. Lancet does have a pretty bad reputation—it’s known for publishing flawed, sensationalist work—but I’m sure they run the occasional good article too. So I wouldn’t want to make any strong judgments in this case before hearing more.

P.P.P.P.S. Regarding the title of this post: No, I don’t think Lancet would ever retract this paper, even if all the above criticisms are correct. It seems that retraction is used only in response to scientific misconduct, not in response to mere error. So when I say “retraction,” I mean what one might call “conceptual retraction.” The real question is: Will this new paper join the list of past Lancet papers which we would not want to take seriously, and which we regret were ever published?

Stan in St. Louis this Friday

This Friday afternoon I (Jonah) will be speaking about Stan at Washington University in St. Louis. The talk is open to the public, so anyone in the St. Louis area who is interested in Stan is welcome to attend. Here are the details:

Title: Stan: A Software Ecosystem for Modern Bayesian Inference
Jonah Sol Gabry, Columbia University

Neuroimaging Informatics and Analysis Center (NIAC) Seminar Series
Friday April 28, 2017, 1:30-2:30pm
NIL Large Conference Room
#2311, 2nd Floor, East Imaging Bldg.
4525 Scott Avenue, St. Louis, MO

medicine.wustl.edu (NIAC)

Stan without frontiers, Bayes without tears

This recent comment thread reminds me of a question that comes up from time to time, which is how to teach Bayesian statistics to students who aren’t comfortable with calculus. For continuous models, probabilities are integrals. And in just about every example except the one at 47:16 of this video, there are multiple parameters, so probabilities are multiple integrals.

So how to teach this to the vast majority of statistics users who can’t easily do multivariate calculus?

I dunno, but I don’t know that this has anything in particular to do with Bayes. Think about classical statistics, at least the sort that gets used in political science. Linear regression requires multivariate calculus too (or some pretty slick algebra or geometry) to get that least-squares solution. Not to mention the interpretation of the standard error. And then there’s logistic regression. Going further we move to popular machine learning methods which are really gonna seem like nothing more than black boxes. Kidz today all wanna do deep learning or random forests or whatever. And that’s fine. But no way are most of them learning the math behind it.

Teach people to drive. Then later, if they want or need, they can learn how the internal combustion engine works.

So, in keeping with this attitude, teach Stan. Students set up the model, they push the button, they get the answers. No integrals required. Yes, you have to work with posterior simulations so there is integration implicitly—the conceptual load is not zero—but I think (hope?) that this approach of using simulations to manage uncertainty is easier and more direct than expressing everything in terms of integrals.
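
To make that concrete, here's a minimal sketch using rstanarm with simulated data (just for illustration, not any particular course example): the student specifies the model, pushes the button, and then answers questions by counting posterior simulations rather than doing integrals.

library(rstanarm)

# Simulated data, just for illustration
set.seed(123)
x <- rnorm(100)
y <- rbinom(100, 1, plogis(-0.5 + 0.8 * x))
dat <- data.frame(x, y)

# Set up the model, push the button
fit <- stan_glm(y ~ x, family = binomial(link = "logit"), data = dat)

# Manage uncertainty with posterior simulations, not integrals
draws <- as.matrix(fit)
mean(draws[, "x"] > 0)                   # Pr(slope > 0 | data), by counting draws
quantile(draws[, "x"], c(0.25, 0.75))    # 50% posterior interval for the slope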

But it’s not just model fitting, it’s also model building and model checking. Cross validation, graphics, etc. You need less mathematical sophistication to evaluate a method than to construct it.

About ten years ago I wrote an article, “Teaching Bayesian applied statistics to graduate students in political science, sociology, public health, education, economics, . . .” After briefly talking about a course that uses the BDA book and assumes that students know calculus, I continued:

My applied regression and multilevel modeling class has no derivatives and no integrals—it actually has less math than a standard regression class, since I also avoid matrix algebra as much as possible! What it does have is programming, and this is an area where many of the students need lots of practice. The course is Bayesian in that all inference is implicitly about the posterior distribution. There are no null hypotheses and alternative hypotheses, no Type 1 and Type 2 errors, no rejection regions and confidence coverage.

It’s my impression that most applied statistics classes don’t get into confidence coverage etc., but they can still mislead students by giving the impression that those classical principles are somehow fundamental. My class is different because I don’t pretend in that way. Instead I consider a Bayesian approach as foundational, and I teach students how to work with simulations.

My article continues:

Instead, the course is all about models, understanding the models, estimating parameters in the models, and making predictions. . . . Beyond programming and simulation, probably the Number 1 message I send in my applied statistics class is to focus on the deterministic part of the model rather than the error term. . . .

Even a simple model such as y = a + b*x + error is not so simple if x is not centered near zero. And then there are interaction models—these are incredibly important and so hard to understand until you’ve drawn some lines on paper. We draw lots of these lines, by hand and on the computer. I think of this as Bayesian as well: Bayesian inference is conditional on the model, so you have to understand what the model is saying.
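
For example, here's the kind of line-drawing I have in mind for an interaction model, with made-up coefficients (not from any real analysis):

# y = a + b1*x + b2*z + b3*x*z + error, with z a binary indicator
a <- 1.0; b1 <- 0.5; b2 <- 2.0; b3 <- -0.8

curve(a + b1 * x, from = 0, to = 10, ylim = c(0, 8),
      xlab = "x", ylab = "predicted y")              # line for z = 0
curve(a + b2 + (b1 + b3) * x, add = TRUE, lty = 2)   # line for z = 1
legend("topleft", lty = c(1, 2), legend = c("z = 0", "z = 1"))

The two lines have different slopes and cross within the plotted range, which is exactly the sort of thing that is hard to grasp until you have drawn it.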

The meta-hype algorithm


Kevin Lewis pointed me to this article:

There are several methods for building hype. The wealth of currently available public relations techniques usually forces the promoter to judge, a priori, what will likely be the best method. Meta-hype is a methodology that facilitates this decision by combining all identified hype algorithms pertinent for a particular promotion problem. Meta-hype generates a final press release that is at least as good as any of the other models considered for hyping the claim. The overarching aim of this work is to introduce meta-hype to analysts and practitioners. This work compares the performance of journal publication, preprints, blogs, twitter, Ted talks, NPR, and meta-hype to predict successful promotion. A nationwide database including 89,013 articles, tweets, and news stories. All algorithms were evaluated using the total publicity value (TPV) in a test sample that was not included in the training sample used to fit the prediction models. TPV for the models ranged between 0.693 and 0.720. Meta-hype was superior to all but one of the algorithms compared. An explanation of meta-hype steps is provided. Meta-hype is the first step in targeted hype, an analytic framework that yields double hyped promotion with fewer assumptions than the usual publicity methods. Different aspects of meta-hype depending on the context, its function within the targeted promotion framework, and the benefits of this methodology in the addiction to extreme claims are discussed.

I can’t seem to find the link right now, but you get the idea.

Would you prefer three N=300 studies or one N=900 study?

Stephen Martin started off with a question:

I’ve been thinking about this thought experiment:


Imagine you’re given two papers.
Both papers explore the same topic and use the same methodology. Both were preregistered.
Paper A has a novel study (n1=300) with confirmed hypotheses, followed by two successful direct replications (n2=300, n3=300).
Paper B has a novel study with confirmed hypotheses (n=900).
*Intuitively*, which paper would you think has the most evidence? (Be honest, what is your gut reaction?)

I’m reasonably certain the answer is that both papers provide the same amount of evidence, by essentially the likelihood principle, and if anything, one should trust the estimates of paper B more (unless you meta-analyzed paper A, which should give you the same answer as paper B, more or less).

However, my intuition was correct that most people in this group would choose paper A (See https://www.facebook.com/groups/853552931365745/permalink/1343285629059137/ for poll results).

My reasoning is that if you are observing data from the same DGP, then where you cut the data off is arbitrary; why would flipping a coin 10x, 10x, 10x, 10x, 10x provide more evidence than flipping the coin 50x? The method in paper A essentially just collected 300, drew a line, collected 300, drew a line, then collected 300 more, and called them three studies; this has no more information in sum (in a fisherian sense, the information would just add together) than if you didn’t arbitrarily cut the data into sections.

If you read the comments of this group (which has researchers predominantly of the NHST world), one sees this fallacy that merely passing a threshold more times means you have more evidence. They use p*p*p to justify it (even though that doesn’t make sense, because one could partition the data into 10 n=90 sets and get ‘more evidence’ by this logic; in fact, you could have 90 p-values of ~.967262 and get a p-value of .05). They use Fisher’s method to say the p-value could be low (~.006), even though when combined, the p-value would actually be even lower (~.0007). One employs only Neyman-Pearson logic, and this results in a Type 1 error probability of .05^3.
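
Before getting to my reply, here's a small simulation sketch of the arithmetic Martin is describing. I'm assuming a simple two-group comparison with a modest true effect; the point is just to compare three n=300 analyses combined by Fisher's method with the single pooled n=900 analysis of the same data:

# Three n = 300 studies vs. one pooled n = 900 analysis of the same data
set.seed(42)
n <- 300; delta <- 0.2
one_study <- function(n) {
  y0 <- rnorm(n / 2, 0); y1 <- rnorm(n / 2, delta)
  list(p = t.test(y1, y0)$p.value, y0 = y0, y1 = y1)
}
studies <- replicate(3, one_study(n), simplify = FALSE)
p3 <- sapply(studies, `[[`, "p")

# Fisher's method for combining the three p-values
fisher_stat <- -2 * sum(log(p3))
p_fisher <- pchisq(fisher_stat, df = 2 * length(p3), lower.tail = FALSE)

# Pooled analysis of all 900 observations
p_pooled <- t.test(unlist(lapply(studies, `[[`, "y1")),
                   unlist(lapply(studies, `[[`, "y0")))$p.value

round(c(p3, fisher = p_fisher, pooled = p_pooled), 4)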

I replied:

What do you mean by “confirmed hypotheses,” and what do you mean by a replication being “successful”? And are you assuming that the data are identical in the two scenarios?

To which Martin answered:

I [Martin], in a sense, left it ambiguous because I suspected that knowing nothing else, people would put paper A, even though asymptotically it should provide the same information as paper B.

I also left ‘confirmed hypothesis’ vague, because I didn’t want to say one must use one given framework. Basically, the hypotheses were supported by whatever method one uses to judge support (whether it be p-values, posteriors, bayes factors, whatever).

Successful replication as in, the hypotheses were supported again in the replication studies.

Finally, my motivating intuition was that paper A could basically be considered paper B if you sliced the data into thirds, or paper B could be written had you just combined the three n=300 samples.

That said, if you are experimenter A gaining three n=300 samples, your data should asymptotically (or, over infinite datasets) equal that of experimenter B gaining one n=900 sample (over infinite datasets), in the sense that the expected total information is equal, and the accumulated evidence should be equal. Therefore, even if any given two papers have different datasets, asymptotically they should provide equal information, and there’s not a good reason to prefer three smaller studies over 1 larger one.

Yet, knowing nothing else, people assumed paper A, I think, because three studies is more intuitively appealing than one large study, even if the two could be interchangeable had you divided the larger sample into three, or combined the smaller samples into 1.

From my perspective, Martin’s question can’t really be answered because I don’t know what’s in papers A and B, and I don’t know what is meant by a replication being “successful.” I think the answer depends a lot on these pieces of information, and I’m still not quite sure what Martin’s getting at here. But maybe some of you have thoughts on this one.

Drug-funded profs push drugs

Someone who wishes to remain anonymous writes:

I just read a long ProPublica article that I think your blog commenters might be interested in. It’s from February, but was linked to by the Mad Biologist today (https://mikethemadbiologist.com/). Here is a link to the article: https://www.propublica.org/article/big-pharma-quietly-enlists-leading-professors-to-justify-1000-per-day-drugs

In short, it’s about a group of professors (mainly economists) who founded a consulting firm that works for many big pharma companies. They publish many peer-reviewed articles, op-eds, blogs, etc on the debate about high pharmaceutical prices, always coming to the conclusion that high prices are a net benefit (high prices -> more innovation -> better treatments in the future vs poor people having no access to existing treatment today). They also are at best very inconsistent about disclosing their affiliations and funding.

One minor thing that struck me is the following passage, about their response to a statistical criticism of one of their articles:

The founders of Precision Health Economics defended their use of survival rates in a published response to the Dartmouth study, writing that they “welcome robust scientific debate that moves forward our understanding of the world” but that the research by their critics had “moved the debate backward.”

The debate here appears to be about lead-time bias – increased screening leads to earlier detection, which can increase survival rates without actually improving outcomes. So on the face of it this doesn’t seem like an outrageous criticism. If they have controlled for it appropriately, they should have a “robust debate” so they can convince their critics and have more support for increasing drug prices! Of course I doubt they have any interest in actually having this debate. It seems similar to the responses you get from Wansink, Cuddy (or the countless other researchers promoting flawed studies who have been featured on your blog) when they are confronted with valid criticism: sound reasonable, do nothing, and get let off the hook.

This interests me because I consult for pharmaceutical companies. I don’t really have anything to add, but this sort of conflict of interest does seem like something to worry about.

We talk a lot on this blog about bad science that’s driven by some combination of careerism and naiveté. We shouldn’t forget about the possibility of flat-out corruption.

Journals for insignificant results

Tom Daula writes:

I know you’re not a fan of hypothesis testing, but the journals in this blog post are an interesting approach to the file drawer problem. I’ve never heard of them or their like. An alternative take (given standard academic practice) is “Journal for XYZ Discipline papers that p-hacking and forking paths could not save.”

Psychology: Journal of Articles in Support of the Null Hypothesis

Biomedicine: Journal of Negative Results in Biomedicine

Ecology and Evolutionary Biology: Journal of Negative Results

In psychology, this sort of journal isn’t really needed because we already have PPNAS, where they publish articles in support of the null hypothesis all the time, they just don’t realize it!

OK, ok, all jokes aside, the above post recommends:

Is it time for Economics to catch up? . . . a number of prominent Economists have endorsed this idea (even if they are not ready to pioneer the initiative). So, imagine… a call for papers along the following lines:

Series of Unsurprising Results in Economics (SURE)

Is the topic of your paper interesting, your analysis carefully done, but your results are not “sexy”? If so, please consider submitting your paper to SURE. An e-journal of high-quality research with “unsurprising” findings.
How does it work:
— We accept papers from all fields of Economics…
— Which have been rejected at a journal indexed in EconLit…
— With the ONLY important reason being that their results are statistically insignificant or otherwise “unsurprising”.

I can’t imagine this working. Why not just publish everything on SSRN or whatever, and then this SURE can just link to the articles in question (along with the offending referee reports)?

Also, I’m reminded of the magazine McSweeney’s, which someone once told me had been founded based on the principle of publishing stories that had been rejected elsewhere.

Teaching Statistics: A Bag of Tricks (second edition)

Hey! Deb Nolan and I finished the second edition of our book, Teaching Statistics: A Bag of Tricks. You can pre-order it here.

I love love love this book. As William Goldman would say, it’s the “good parts version”: all the fun stuff without the standard boring examples (counting colors of M&M’s, etc.). Great stuff for teaching, and I’ve been told it’s a fun read for students of statistics, too.

Here’s the table of contents. If this doesn’t look like fun to you, don’t buy the book.

Representists versus Propertyists: RabbitDucks – being good for what?


It is not that unusual in statistics to get the same statistical output (uncertainty interval, estimate, tail probability, etc.) for every sample, or for some samples, or the same distribution of outputs, or the same expectations of outputs, or just close enough expectations of outputs. Then, I would argue, one has a variation on a DuckRabbit. In the DuckRabbit, the same sign represents different objects with different interpretations (what to make of it), whereas here we have differing signs (models) representing the same object (an inference of interest) with different interpretations (what to make of them). I will imaginatively call this a RabbitDuck ;-)

Does one always choose a Rabbit or a Duck, or sometimes one or another or always both? I would argue the higher road is both – that is to use differing models to collect and consider the  different interpretations. Multiple perspectives can always be more informative (if properly processed), increasing our hopes to find out how things actually are by increasing the chances and rate of getting less wrong. Though this getting less wrong is in expectation only – it really is an uncertain world.

Of course, in statistics a good guess for the Rabbit interpretation would be Bayesian and for the Duck, Frequentest (Canadian spelling). Though, as one of Andrew’s colleagues once argued, it is really modellers versus non-modellers rather than Bayesians versus Frequentests, and that makes a lot of sense to me. Representists are Rabbits “conjecturing, assessing, and adopting idealized representations of reality, predominantly using probability generating models for both parameters and data” while Propertyists are Ducks “primarily being about discerning procedures with good properties that are uniform over a wide range of possible underlying realities and restricting use, especially in science, to just those procedures” from here. Given that “idealized representations of reality” can only be indirectly checked (i.e. always remain possibly wrong) and “good properties” always beg the question “good for what?” (as well as only hold over a range of possible but largely unrepresented realities), it should be a no-brainer that it would be more profitable than not to thoroughly think through both perspectives (and more, actually).

An alternative view might be Leo Breiman’s “two cultures” paper.

This issue of multiple perspectives also came up in Bob’s recent post where the possibility arose that some might think it taboo to mix Bayes and Frequentist perspectives.

Some case studies would be: . . .

My proposal for JASA: “Journal” = review reports + editors’ recommendations + links to the original paper and updates + post-publication comments

Whenever they’ve asked me to edit a statistics journal, I say no thank you because I think I can make more of a contribution through this blog. I’ve said no enough times that they’ve stopped asking me. But I’ve had an idea for a while and now I want to do it.

I think that journals should get out of the publication business and recognize that their goal is curation. My preferred model is that everything gets published on some sort of super-Arxiv, and then the role of an organization such as the Journal of the American Statistical Association is to pick papers to review and to recommend. The “journal” is then the review reports plus the editors’ recommendations plus links to the original paper and any updates plus post-publication comments.

If JASA is interested in going this route, I’m in.

My talk this Friday in the Machine Learning in Finance workshop

This is kinda weird because I don’t know anything about machine learning in finance. I guess the assumption is that statistical ideas are not domain specific. Anyway, here it is:

What can we learn from data?

Andrew Gelman, Department of Statistics and Department of Political Science, Columbia University

The standard framework for statistical inference leads to estimates that are horribly biased and noisy for many important examples. And these problems get even worse as we study subtle and interesting new questions. Methods such as significance testing are intended to protect us from hasty conclusions, but they have backfired: over and over again, people think they have learned from data but they have not. How can we have any confidence in what we think we’ve learned from data? One appealing strategy is replication and external validation, but this can be difficult in the real world of social science. We discuss statistical methods for actually learning from data without getting fooled.

Reputational incentives and post-publication review: two (partial) solutions to the misinformation problem

So. There are erroneous analyses published in scientific journals and in the news. Here I’m talking not about outright propaganda, but about mistakes that happen to coincide with the preconceptions of their authors.

We’ve seen lots of examples. Here are just a few:

– Political scientist Larry Bartels is committed to a model of politics in which voters make decisions based on irrelevant information. He’s published claims about shark attacks deciding elections and subliminal smiley faces determining attitudes about immigration. In both cases, second looks by others showed that the evidence wasn’t really there. I think Bartels was sincere; he just did sloppy analyses—statistics is hard!—and jumped to conclusions that supported his existing views.

– New York Times columnist David Brooks has a habit of citing statistics that fall apart under closer inspection. I think Brooks believes these things when he writes them—OK, I guess he never really believed that Red Lobster thing, he must really have been lying (er, exercising poetic license) on that one—but what’s important is that these stories work to make his political points, and he doesn’t care when they’re proved wrong.

– David’s namesake and fellow NYT op-ed columnist Arthur Brooks stepped in it once or twice when reporting survey data. He wrote that Tea Party supporters were happier than other voters, but a careful look at the data suggested the opposite. A. Brooks’s conclusions were counterintuitive and supported his political views; they just didn’t happen to line up with reality.

– The familiar menagerie from the published literature in social and behavioral sciences: himmicanes, air rage, ESP, ages ending in 9, power pose, pizzagate, ovulation and voting, ovulation and clothing, beauty and sex ratio, fat arms and voting, etc etc.

– Gregg Easterbrook writing about politics.

And . . . we have a new one. A colleague emailed me expressing annoyance at a recent NYT op-ed by historian Stephanie Coontz entitled, “Do Millennial Men Want Stay-at-Home Wives?”

Emily Beam does the garbage collection. The short answer is that, no, there’s no evidence that millennial men want stay-at-home wives. Here’s Beam:

You can’t say a lot about millennials based on talking to 66 men.

The GSS surveys are pretty small – about 2,000-3,000 per wave – so once you split by sample, and then split by age, and then exclude the older millennials (age 26-34) who don’t show any negative trend in gender equality, you’re left with cells of about 60-100 men ages 18-25 per wave. . . .

Suppose you want to know whether there is a downward trend in young male disagreement with the women-in-the-kitchen statement. Using all available GSS data, there is a positive, not statistically significant trend in men’s attitudes (more disagreement). Starting in 1988 only, there is very, very small negative, not statistically significant effect.

Only if we pick 1994 as a starting point, as Coontz does, ignoring the dip just a few years prior, do we see a negative less-than half-percentage point drop in disagreement per year, significant at the 10-percent level.

To Coontz’s (or the NYT’s) credit, they followed up with a correction, but it’s half-assed:

The trend still confirms a rise in traditionalism among high school seniors and 18-to-25-year-olds, but the new data shows that this rise is no longer driven mainly by young men, as it was in the General Social Survey results from 1994 through 2014.

And at this point I have no reason to believe anything that Coontz says on this topic, any more than I’d trust what David Brooks has to say about high school test scores or the price of dinner at Red Lobster, or Arthur Brooks on happiness measurements, or Susan Fiske on himmicanes, power pose, and air rage. All these people made natural mistakes but then were overcommitted, in part I suspect because the mistaken analyses supported what they’d like to think is true.

But it’s good enough for the New York Times, or PPNAS, right?

The question is, what to do about it. Peer review can’t be the solution: for scientific journals, the problem with peer review is the peers, and when it comes to articles in the newspaper, there’s no way to do systematic review. The NYT can’t very well send all their demography op-eds to Emily Beam and Jay Livingston, after all. Actually, maybe they could—it’s not like they publish so many op-eds on the topic—but I don’t think this is going to happen.

So here are two solutions:

1. Reputational incentives. Make people own their errors. It’s sometimes considered rude to do this, to remind people that Satoshi Kanazawa Satoshi Kanazawa Satoshi Kanazawa published a series of papers that were dead on arrival because the random variation in his data was so much larger than any possible signal. Or to remind people that Amy Cuddy Amy Cuddy Amy Cuddy still goes around promoting power pose even though the first author on that paper had disowned the entire thing. Or that John Bargh John Bargh John Bargh made a career out of a mistake and now refuses to admit his findings didn’t replicate. Or that David Brooks David Brooks David Brooks reported false numbers and then refused to correct them. Or that Stephanie Coontz Stephanie Coontz Stephanie Coontz jumped to conclusions based on a sloppy reading of trends from a survey.

But . . . maybe we need these negative incentives. If there’s a positive incentive for getting your name out there, there should be a negative incentive for getting it wrong. I’m not saying the positive and negative incentives should be equal, just that there should be more of a motivation for people to check what they’re doing.

And, yes, don’t keep it a secret that I published a false theorem once, and, another time, had to retract the entire empirical section of a published paper because we’d reverse-coded a key variable in our analysis.

2. Post-publication review.

I’ve talked about this one before. Do it for real, in scientific journals and also the newspapers. Correct your errors. And, when you do so, link to the people who did the better analyses.

Incentives and post-publication review go together. To the extent that David Brooks is known as the guy who reports made-up statistics and then doesn’t correct them—if this is his reputation—this gives the incentives for future Brookses (if not David himself) to prominently correct his mistakes. If Stephanie Coontz and the New York Times don’t want to be mocked on twitter, they’re motivated to follow up with serious corrections, not minimalist damage control.

Some perspective here

Look, I’m not talking about tarring and feathering here. The point is that incentives are real; they already exist. You really do (I assume) get a career bump from publishing in Psychological Science and PPNAS, and your work gets more noticed if you publish an op-ed in the NYT or if you’re featured on NPR or Ted or wherever. If all incentives are positive, that creates problems. It creates a motivation for sloppy work. It’s not that anyone is trying to do sloppy work.

Econ gets it (pretty much) right

Say what you want about economists, but they’ve got this down. First off, they understand the importance of incentives. Second, they’re harsh, harsh critics of each other. There’s not much of an econ equivalent to quickie papers in Psychological Science or PPNAS. Serious econ papers go through tons of review. Duds still get through, of course (even some duds in PPNAS). But, overall, it seems to me that economists avoid what might be called the “happy talk” problem. When an economist publishes something, he or she tries to get it right (politically-motivated work aside), in awareness that lots of people are on the lookout for errors, and this will rebound back to the author’s reputation.

Donald Trump’s nomination as an unintended consequence of Citizens United

The biggest surprise of the 2016 election campaign was Donald Trump winning the Republican nomination for president.

A key part of the story is that so many of the non-Trump candidates stayed in the race so long because everyone thought Trump was doomed, so they were all trying to grab Trump’s support when he crashed. Instead, Trump didn’t crash, and he benefited from the anti-Trump forces not coordinating on an alternative.

David Banks shares a theory of how it was that these candidates all stayed in so long:

I [Banks] see it as an unintended consequence of Citizens United. Before that [Supreme Court] decision, the $2000 cap on what individuals/corporations could contribute largely meant that if a candidate did not do well in one of the first three primaries, they pretty much had to drop out and their supporters would shift to their next favorite choice. But after Citizens United, as long as a candidate has one billionaire friend, they can stay in the race through the 30th primary if they want. And this is largely what happened. Trump regularly got the 20% of the straight-up crazy Republican vote, and the other 15 candidates fragmented the rest of the Republicans for whom Trump was the least liked candidate. So instead of Rubio dropping out after South Carolina and his votes shifting over to Bush, and Fiorina dropping out and her votes shifting to Bush, so that Bush would jump from 5% to 10% to 15% to 20% to 25%, etc., we wound up very late in the primaries with Trump looking like the most dominant candidate in the field.

Of course, things are much more complex than this facile theory suggests, and lots of other things were going on in parallel. But it still seems to me that this partly explains how Trump threaded the needle to get the Republican nomination.

Interesting. I’d not seen this explanation before so I thought I’d share it with you.

Fitting hierarchical GLMs in package X is like driving car Y

Given that Andrew started the Gremlin theme (the car in the image at the right), I thought it would only be fitting to link to the following amusing blog post:

It’s exactly what it says on the tin. I won’t spoil the punchline, but will tell you the packages considered are: lme4, JAGS, RStan(Arm), and INLA.

What do you think?

Anyway, don’t take my word for it—read the original post. I’m curious about others’ take on systems for fitting GLMs and how they compare (to cars or otherwise).

You might also like…

My favorite automotive analogy was made in the following essay, from way back in the first dot-com boom:

Although it’s about operating systems, the satirical take on closed- vs. open-source is universal.

(Some of) what I thought

Chris Brown reports in the post,

I simulated a simple hierarchical data-set to test each of the models. The script is available here. The data-set has 100 binary measurements. There is one fixed covariate (continuous) and one random effect with five groups. The linear predictor was transformed to binomial probabilities using the logit function. For the Bayesian approaches, slightly different priors were used for each package, depending on what was available. See the script for more details on priors.

Apples and oranges. This doesn’t make a whole lot of sense, given that lme4 is giving you max marginal likelihood, whereas JAGS and Stan give you full Bayes. And if you use different priors in Stan and JAGS, you’re not even fitting the same posterior. I’m afraid I’ve never understood INLA (lucky for me Dan Simpson’s visiting us this week, so there’s no time like the present to learn it). You’ll also find that relative performance of Stan and JAGS will vary dramatically based on the shape of the posterior and scale of the data.

It’s all about effective sample size. The author doesn’t mention the subtlety of choosing a way to estimate effective sample size (RStan’s is more conservative than the Coda package’s, using a variance approach like that of the split R-hat we use to detect convergence problems in RStan).
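
For what it's worth, here's a sketch of the kind of difference I mean, using hand-simulated autocorrelated "chains" rather than a real Stan or JAGS fit, so the numbers are illustrative only:

library(coda)
library(rstan)

# Two AR(1) "chains", simulated by hand just to compare ESS estimators
set.seed(1)
ar1 <- function(n, rho) as.numeric(arima.sim(list(ar = rho), n))
sims <- array(c(ar1(1000, 0.9), ar1(1000, 0.9)),
              dim = c(1000, 2, 1), dimnames = list(NULL, NULL, "theta"))

# Coda's estimator (based on the spectral density at frequency zero)
effectiveSize(mcmc.list(mcmc(sims[, 1, 1]), mcmc(sims[, 2, 1])))

# RStan's estimator uses split chains (as in split R-hat) and tends to be
# more conservative; see the n_eff column
monitor(sims, warmup = 0)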

Random processes are hard to compare. You’ll find a lot of variation across runs with different random inits. You really want to start JAGS and Stan at the same initial points and run to the same effective sample size over multiple runs and compare averages and variation.

RStanArm, not RStan. I looked at the script, and it turns out the post is comparing RStanArm, not coding a model in Stan itself and running it in RStan. Here’s the code.

library(rstanarm)
library(microbenchmark)  # needed for microbenchmark(); not loaded in the excerpt

# dat (the simulated hierarchical data set with y, x, and grps) is created
# earlier in the linked script
t_prior <- student_t(df = 4, location = 0, scale = 2.5)
mb.stanarm <- microbenchmark(
  mod.stanarm <- stan_glmer(y ~ x + (1 | grps),
                            data = dat,
                            family = binomial(link = 'logit'),
                            prior = t_prior,
                            prior_intercept = t_prior,
                            chains = 3, cores = 1, seed = 10),
  times = 1L)

Parallelization reduces wall time. This script runs RStanArm's three Markov chains on a single core, meaning they have to run one after the other. This can obviously be sped up, by up to a factor of the number of cores you have, by letting the chains all run at the same time. Presumably JAGS could be sped up the same way. The multiple chains are embarrassingly parallelizable, after all.
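
For instance, a small tweak to the script above (not the original author's code) runs the three chains on separate cores:

# Same model as before, but with the three chains run in parallel
options(mc.cores = parallel::detectCores())
mod.stanarm.par <- stan_glmer(y ~ x + (1 | grps),
                              data = dat,
                              family = binomial(link = 'logit'),
                              prior = t_prior,
                              prior_intercept = t_prior,
                              chains = 3, cores = 3, seed = 10)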

It's hard to be fair! There's a reason we don't do a lot of these comparisons ourselves!

“Do you think the research is sound or is it gimmicky pop science?”

David Nguyen writes:

I wanted to get your opinion on http://www.scienceofpeople.com/. Do you think the research is sound or is it gimmicky pop science?

My reply: I have no idea. But since I see no evidence on the website, I’ll assume it’s pseudoscience until I hear otherwise. I won’t believe it until it has the endorsement of Susan T. Fiske.

P.S. Oooh, that Fiske slam was so unnecessary, you say. But she still hasn’t apologized for falling asleep on the job and greenlighting himmicanes, air rage, and ages ending in 9.

Organizations that defend junk science are pitiful suckers who get conned and conned again

So. Cornell stands behind Wansink, and Ohio State stands behind Croce. George Mason University bestows honors on Weggy. Penn State trustee disses “so-called victims.” Local religious leaders aggressively defend child abusers in their communities. And we all remember how long it took for Duke University to close the door on Dr. Anil Potti.

OK, I understand all these situations. It’s the sunk cost fallacy: you’ve invested a lot of your reputation in somebody; you don’t want to admit that, all this time, they’ve been using you.

Still, it makes me sad.

These organizations—Cornell, Ohio State, etc.—are victims as much as perpetrators. Wansink, Croce, etc., couldn’t have done it on their own: in their quest to illegitimately extract millions of corporate and government dollars, they made use of their prestigious university affiliations. A press release from a “Cornell professor” sounds so much more credible than a press release from some fast-talking guy with a P.O. box.

Cornell, Ohio State, etc., they’ve been played, and they still don’t realize it.

Remember, a key part of the long con is misdirection: make the mark think you’re his friend.

Causal inference conference in North Carolina

Michael Hudgens announces:

Registration for the 2017 Atlantic Causal Inference Conference is now open. The registration site is here. More information about the conference, including the poster session and the Second Annual Causal Inference Data Analysis Challenge can be found on the conference website here.

We held the very first Atlantic Causal Inference Conference here at Columbia twelve years ago, and it’s great to see that it has been continuing so successfully.

The Efron transition? And the wit and wisdom of our statistical elders

Stephen Martin writes:

Brad Efron seems to have transitioned from “Bayes just isn’t as practical” to “Bayes can be useful, but EB is easier” to “Yes, Bayes should be used in the modern day” pretty continuously across three decades.

http://www2.stat.duke.edu/courses/Spring10/sta122/Handouts/EfronWhyEveryone.pdf
http://projecteuclid.org/download/pdf_1/euclid.ss/1028905930
http://statweb.stanford.edu/~ckirby/brad/other/2009Future.pdf

Also, Lindley’s comment in the first article is just GOLD:
“The last example with [lambda = theta_1theta_2] is typical of a sampling theorist’s impractical discussions. It is full of Greek letters, as if this unusual alphabet was a repository of truth.” To which Efron responded “Finally, I must warn Professor Lindley that his brutal, and largely unprovoked, attack has been reported to FOGA (Friends of the Greek Alphabet). He will be in for a very nasty time indeed if he wishes to use as much as an epsilon or an iota in any future manuscript.”

“Perhaps the author has been falling over all those bootstraps lying around.”

“What most statisticians have is a parody of the Bayesian argument, a simplistic view that just adds a woolly prior to the sampling-theory paraphernalia. They look at the parody, see how absurd it is, and thus dismiss the coherent approach as well.”

I pointed Stephen to this post and this article (in particular the bottom of page 295). Also this, I suppose.

Causal inference conference at Columbia University on Sat 6 May: Varying Treatment Effects

Hey! We’re throwing a conference:

Varying Treatment Effects

The literature on causal inference focuses on estimating average effects, but the very notion of an “average effect” acknowledges variation. Relevant buzzwords are treatment interactions, situational effects, and personalized medicine. In this one-day conference we shall focus on varying effects in social science and policy research, with particular emphasis on Bayesian modeling and computation.

The focus will be on applied problems in social science.

The organizers are Jim Savage, Jennifer Hill, Beth Tipton, Rachael Meager, Andrew Gelman, Michael Sobel, and Jose Zubizarreta.

And here’s the schedule:

9:30 AM
1. Heterogeneity across studies in meta-analyses of impact evaluations.
– Michael Kremer, Harvard
– Greg Fischer, LSE
– Rachael Meager, MIT
– Beth Tipton, Columbia
10:45 – 11:00 coffee break

11:00
2. Heterogeneity across sites in multi-site trials.
– David Yeager, UT Austin
– Avi Feller, Berkeley
– Luke Miratrix, Harvard
– Ben Goodrich, Columbia
– Michael Weiss, MDRC

12:30-1:30 Lunch

1:30
3. Heterogeneity in experiments versus quasi-experiments.
– Vivian Wong, University of Virginia
– Michael Gechter, Penn State
– Peter Steiner, U Wisconsin
– Bryan Keller, Columbia

3:00 – 3:30 afternoon break

3:30
4. Heterogeneous effects at the structural/atomic level.
– Jennifer Hill, NYU
– Peter Rossi, UCLA
– Shoshana Vasserman, Harvard
– Jim Savage, Lendable Inc.
– Uri Shalit, NYU

5pm
Closing remarks: Andrew Gelman

Please register for the conference here. Admission is free but we would prefer if you register so we have a sense of how many people will show up.

We’re expecting lots of lively discussion.

P.S. Signup for outsiders seems to have filled up. Columbia University affiliates who are interested in attending should contact me directly.

I wanna be ablated

Mark Dooris writes:

I am a senior staff cardiologist from Australia. I attach a paper that was presented at our journal club some time ago. It concerned me at the time. I send it as I suspect you collect similar papers. You may indeed already be aware of this paper. I raised my concerns about the “too good to be true” results and the plethora of “p-values” all in support of the desired hypothesis. I was decried as a naysayer, and some individuals wanted to set up their own clinics on the basis of the study (which may have been ok if it was structured as a replication prospective randomized clinical trial).

I would value your views on the statistical methods and the results…it is somewhat pleasing: fat bad…lose fat good, and may even be true in some specific sense. But please look at the number of comparisons, which exceeds the number of patients, and how the results are almost perfectly consistent with an amazing dose response, especially the structural changes.

I am not at all asserting there is fraud; I am just pointing out how anomalous this is. Perhaps it is most likely that many of these tests were inevitably unable to be blinded…losing 20 kg would be an obvious finding in imaging. Many of the claimed detected differences in echocardiography seem to exceed the precision of the test (a test which has greater uncertainty in measurements in the obese patients). Certainly the blood parameters may be real, but there has been no accounting for multiple comparisons.

PS: I do not know, work with or have any relationship with the authors. I am an interventional cardiologist (please don’t hold that against me) and not an electrophysiologist.

The paper that he sent is called “Long-Term Effect of Goal-Directed Weight Management in an Atrial Fibrillation Cohort: A Long-Term Follow-Up Study (LEGACY),” it’s by Rajeev K. Pathak, Melissa E. Middeldorp, Megan Meredith, Abhinav B. Mehta, Rajiv Mahajan, Christopher X. Wong, Darragh Twomey, Adrian D. Elliott, Jonathan M. Kalman, Walter P. Abhayaratna, Dennis H. Lau, and Prashanthan Sanders, and it appeared in 2015 in the Journal of the American College of Cardiology.

The topic of atrial fibrillation concerns me personally! But my body mass index is less than 27 so I don’t seem to be in the target population for this study.

Anyway, I did take a look. The study in question was observational: they divided the patients into three groups, not based on treatments that had been applied, but based on weight loss (>=10%, 3-9%, <3%; all patients had been counseled to try to lose weight). As Dooris writes, the results seem almost too good to be true: For all five of their outcomes (atrial fibrillation frequency, duration, episode severity, symptom subscale, and global well-being), there is a clean monotonic stepping down from group 1 to group 2 to group 3. I guess maybe the symptom subscale and the global well-being measure are combinations of the first three outcomes? So maybe it’s just three measures, not five, that are showing such clean trends. All the measures show huge improvements from baseline to follow-up in all groups, which I guess just demonstrates that the patients were improving in any case. Anyway, I don’t really know what to make of all this but I thought I’d share it with you.

P.S. Dooris adds:

I must admit to feeling embarrassed for my, perhaps, premature and excessive skepticism. I read the comments with interest.

I am sorry to read that you have some personal connection to atrial fibrillation but hope that you have made (a no doubt informed) choice with respect to management. It is an “exciting” time with respect to management options. I am not giving unsolicited advice (and as I have expressed I am just a “plumber” not an “electrician”).
I remain skeptical about the effect size and the complete uniformity of the findings consistent with the hypothesis that weight loss is associated with reduced symptoms of AF, reduced burden of AF, detectable structural changes on echocardiography and uniformly positive effects on lipid profile.
I want to be clear:
  • I find the hypothesis plausible
  • I find the implications consistent with my pre-conceptions and my current advice (this does not mean they are true or based on compelling evidence)
  • The plausibility (for me) arises from
    • there are relatively small studies and meta-analyses that suggest weight loss is associated with “beneficial” effects on blood pressure and lipids. However, the effects are variable. There seems to be differences between genders and differences between methods of weight loss. The effect size is generally smaller than in the LEGACY trial
    • there is evidence of cardiac structural changes: increased chamber size and wall thickness and abnormal diastolic function, and some studies suggest that the changes are reversible, with perhaps the most change in patients with diastolic dysfunction. I note that perhaps the largest change detected with weight loss is a reduction in epicardial fat. Some cardiac MRI studies (which have better resolution) have supported this
    • there are electrophysiological data suggesting differences in electrophysiological properties in patients with atrial fibrillation related to obesity
  • What concerned me about the paper was the apparent homogeneity of this particular population that seemed to allow the detection of such a strong and consistent relationship.  This seemed “too good to be true”.  I think it does not show the variability I would have expected:
    • gender
    • degree of diastolic dysfunction
    • smoking
    • what other changes during the period were measured?: medication, alcohol etc
    • treatment interaction: I find it difficult to work out who got ablated, how many attempts. Are the differences more related to successful ablations or other factors
    • “blinding”: although the operator may have been blinded to patient category patients with smaller BMI are easier to image and may have less “noisy measurements”. Are the real differences, therefore, smaller than suggested
  • I accept that the authors used repeated measures ANOVA to account for paired/correlated nature of the testing.  However, I do not see the details of the model used.
  • I would have liked to see the differences rather than the means and SD as well as some graphical presentation of the data to see the variability as well as modeling of the relationship between weight loss and effect.
I guess I have not seen a paper where everything works out like you want.  I admit that I should have probably suppressed my disbelief (and waited for replication). What’s the down side? “We got the answer we all want”. “It fits with the general results of other work.” I still feel uneasy not at least asking some questions.
I think as a profession, we medical practitioners have been guilty of “p-hacking” and over-reacting to small studies with large effect sizes. We have spent too much time in “the garden of forking paths” and believe whatever we arrive at after picking through the noise for every apparent signal that suits our preconceptions. We have wonderful large scale randomized clinical trials that seem to answer narrow but important questions, and that is great. However, we still publish a lot of lower quality stuff and promulgate “p-hacking” and related methods to our trainees. I found the Smaldino and McElreath paper timely and instructive (I appreciate you have already seen it).
So, I sent you the email because I felt uneasy (perhaps guilty about my “p-hacking” sins of commission of the past and acceptance of such work of others).