
After Peptidegate, a proposed new slogan for PPNAS. And, as a bonus, a fun little graphics project.

Someone pointed me to this post by “Neuroskeptic”:

A new paper in the prestigious journal PNAS contains a rather glaring blooper. . . . right there in the abstract, which states that “three neuropeptides (β-endorphin, oxytocin, and dopamine) play particularly important roles” in human sociality. But dopamine is not a neuropeptide. Neither are serotonin or testosterone, but throughout the paper, Pearce et al. refer to dopamine, serotonin and testosterone as ‘neuropeptides’. That’s just wrong. A neuropeptide is a peptide active in the brain, and a peptide in turn is the term for a molecule composed of a short chain of amino acids. Neuropeptides include oxytocin, vasopressin, and endorphins – which do feature in the paper. But dopamine and serotonin aren’t peptides, they’re monoamines, and testosterone isn’t either, it’s a steroid. This isn’t a matter of opinion, it’s basic chemistry.

The error isn’t just an isolated typo: ‘neuropeptide’ occurs 27 times in the paper, while the correct terms for the non-peptides are never used.

Neuroskeptic speculates on how this error got in:

It’s a simple mistake; presumably whoever wrote the paper saw oxytocin and vasopressin referred to as “neuropeptides” and thought that the term was a generic one meaning “signalling molecule.” That kind of mistake could happen to anyone, so we shouldn’t be too harsh on the authors . . .

The authors of the paper work in a psychology department so I guess they’re rusty on their organic chemistry.

Fair enough; I haven’t completed a chemistry class since 11th grade, and I didn’t know what a peptide is, either. Then again, I’m not writing articles on peptides for the National Academy of Sciences.

But how did this get through the review process? Let’s take a look at the published article:

Ahhhh, now I understand. The editor is Susan Fiske, notorious as the person who opened the gates of PPNAS for the articles on himmicanes, air rage, and ages ending in 9. I wonder who were the reviewers of this new paper. Nobody who knows what a peptide is, I guess. Or maybe they just read it very quickly, flipped through to the graphs and the conclusions, and didn’t read a lot of the words.

Did you catch that? Neuroskeptic refers to “the prestigious journal PNAS.” That’s PPNAS for short. This is fine, I guess. Maybe the science is ok. Based on a quick scan of the paper, I don’t think we should take a lot of the specific claims seriously, as they seem to be based on the difference between “significant” and “non-significant.”

In particular, I’m not quite sure what their support is for the statement from the abstract that “each neuropeptide is quite specific in its domain of influence.” They’re rejecting various null hypotheses, but I don’t see how this supports their substantive claims in the way that they’re saying.

I might be missing something here—I might be missing a lot—but in any case there seem to be some quality control problems at PPNAS. This should be no surprise: PPNAS is a huge journal, publishing over 3000 papers each year.

On their website they say, “PNAS publishes only the highest quality scientific research,” but this statement is simply false. I can’t really comment on this particular paper—it doesn’t seem like “the highest quality scientific research” to me, but, again, maybe I’m missing something big here. But I can assure you that the papers on himmicanes, air rage, and ages ending in 9 are not “the highest quality scientific research.” They’re not high quality research at all! What they are, is low-quality research that happens to be high-quality clickbait.

OK, let’s be fair. This is not a problem unique to PPNAS. The Lancet publishes crap papers, Psychological Science publishes crap papers, even JASA and APSR have their share of duds. Statistical Science, to its eternal shame, published that Bible Code paper in 1994. That’s fine, it’s how the system operates. Editors are only human.

But, really, do we have to make statements that we know are false? Platitudes are fine but let’s avoid intentional untruths.

So, instead of “PNAS publishes only the highest quality scientific research,” how about this: “PNAS aims to publish only the highest quality scientific research.” That’s fair, no?

P.S. Here’s a fun little graphics project: Redo Figure 1 as a lineplot. You’ll be able to show a lot more comparisons much more directly using lines rather than bars. The current grid of barplots is not the worst thing in the world—it’s much better than a table—but it could be much improved.

P.P.S. Just to be clear: (a) I don’t know anything about peptides so I’m offering no independent judgment of the paper in question; (b) whatever the quality of this particular paper, it does not affect my larger point that PPNAS publishes some really bad papers and so they should change their slogan to something more accurate.

P.P.P.S. The relevant Pubpeer page pointed to the following correction note that was posted on the PPNAS site after I wrote the above post but before it was posted:

The authors wish to note, “We used the term ‘neuropeptide’ in referring to the set of diverse neurochemicals that we examined in this study, some of which are not peptides; dopamine and serotonin are neurotransmitters and should be listed as such, and testosterone should be listed as a steroid. Our usage arose from our primary focus on the neuropeptides endorphin and oxytocin. Notwithstanding the biochemical differences between these neurochemicals, we note that these terminological issues have no implications for the significance of the findings reported in this paper.”

On deck through the rest of the year (and a few to begin 2018)

Here they are. I love seeing all the titles lined up in one place; it’s like a big beautiful poem about statistics:

  • After Peptidegate, a proposed new slogan for PPNAS. And, as a bonus, a fun little graphics project.
  • “Developers Who Use Spaces Make More Money Than Those Who Use Tabs”
  • Question about the secret weapon
  • Incentives Matter (Congress and Wall Street edition)
  • Analyze all your comparisons. That’s better than looking at the max difference and trying to do a multiple comparisons correction.
  • Problems with the jargon “statistically significant” and “clinically significant”
  • Capitalist science: The solution to the replication crisis?
  • Bayesian, but not Bayesian enough
  • Let’s stop talking about published research findings being true or false
  • Plan 9 from PPNAS
  • No, I’m not blocking you or deleting your comments!
  • “Furthermore, there are forms of research that have reached such a degree of complexity in their experimental methodology that replicative repetition can be difficult.”
  • “The Null Hypothesis Screening Fallacy”?
  • What is a pull request?
  • Turks need money after expensive weddings
  • Statisticians and economists agree: We should learn from data by “generating and revising models, hypotheses, and data analyzed in response to surprising findings.”
  • My unpublished papers
  • Bigshot psychologist, unhappy when his famous finding doesn’t replicate, won’t consider that he might have been wrong; instead he scrambles furiously to preserve his theories
  • Night Hawk
  • Why they aren’t behavioral economists: Three sociologists give their take on “mental accounting”
  • Further criticism of social scientists and journalists jumping to conclusions based on mortality trends
  • Daryl Bem and Arthur Conan Doyle
  • Classical statisticians as Unitarians
  • Slaying Song
  • What is “overfitting,” exactly?
  • Graphs as comparisons: A case study
  • Should we continue not to trust the Turk? Another reminder of the importance of measurement
  • “The ‘Will & Grace’ Conjecture That Won’t Die” and other stories from the blogroll
  • His concern is that the authors don’t control for the position of games within a season.
  • How does a Nobel-prize-winning economist become a victim of bog-standard selection bias?
  • “Bayes factor”: where the term came from, and some references to why I generally hate it
  • A stunned Dyson
  • Applying human factors research to statistical graphics
  • Recently in the sister blog
  • Adding a predictor can increase the residual variance!
  • Died in the Wool
  • “Statistics textbooks (including mine) are part of the problem, I think, in that we just set out ‘theta’ as a parameter to be estimated, without much reflection on the meaning of ‘theta’ in the real world.”
  • An improved ending for The Martian
  • Delegate at Large
  • Iceland education gene trend kangaroo
  • Reproducing biological research is harder than you’d think
  • The fractal zealots
  • Giving feedback indirectly by invoking a hypothetical reviewer
  • It’s hard to know what to say about an observational comparison that doesn’t control for key differences between treatment and control groups, chili pepper edition
  • PPNAS again: If it hadn’t been for the jet lag, would Junior have banged out 756 HRs in his career?
  • Look. At. The. Data. (Hollywood action movies example)
  • “This finding did not reach statistical sig­nificance, but it indicates a 94.6% prob­ability that statins were responsible for the symptoms.”
  • Wolfram on Golomb
  • Irwin Shaw, John Updike, and Donald Trump
  • What explains my lack of openness toward this research claim? Maybe my cortex is just too damn thick and wrinkled
  • I love when I get these emails!
  • Consider seniority of authors when criticizing published work?
  • Does declawing cause harm?
  • Bird fight! (Kroodsma vs. Podos)
  • The Westlake Review
  • “Social Media and Fake News in the 2016 Election”
  • Also holding back progress are those who make mistakes and then label correct arguments as “nonsensical.”
  • Just google “Despite limited statistical power”
  • It is somewhat paradoxical that good stories tend to be anomalous, given that when it comes to statistical data, we generally want what is typical, not what is surprising. Our resolution of this paradox is . . .
  • “Babbage was out to show that not only was the system closed, with a small group controlling access to the purse strings and the same individuals being selected over and again for the few scientific honours or paid positions that existed, but also that one of the chief beneficiaries . . . was undeserving.”
  • Irish immigrants in the Civil War
  • Mixture models in Stan: you can use log_mix()
  • Don’t always give ’em what they want: Practicing scientists want certainty, but I don’t want to offer it to them!
  • Cumulative residual plots seem like they could be useful
  • Sucker MC’s keep falling for patterns in noise
  • Nice interface, poor content
  • “From that perspective, power pose lies outside science entirely, and to criticize power pose would be a sort of category error, like criticizing The Lord of the Rings on the grounds that there’s no such thing as an invisibility ring, or criticizing The Rotter’s Club on the grounds that Jonathan Coe was just making it all up.”
  • Chris Moore, Guy Molyneux, Etan Green, and David Daniels on Bayesian umpires
  • Using statistical prediction (also called “machine learning”) to potentially save lots of resources in criminal justice
  • “Mainstream medicine has its own share of unnecessary and unhelpful treatments”
  • What are best practices for observational studies?
  • The Groseclose endgame: Getting from here to there.
  • Causal identification + observational study + multilevel model
  • All cause and breast cancer specific mortality, by assignment to mammography or control
  • Iterative importance sampling
  • Rosenbaum (1999): Choice as an Alternative to Control in Observational Studies
  • Gigo update (“electoral integrity project”)
  • How to design and conduct a subgroup analysis?
  • Local data, centralized data analysis, and local decision making
  • Too much backscratching and happy talk: Junk science gets to share in the reputation of respected universities
  • Selection bias in the reporting of shaky research: An example
  • Self-study resources for Bayes and Stan?
  • Looking for the bottom line
  • “How conditioning on post-treatment variables can ruin your experiment and what to do about it”
  • Trial by combat, law school style
  • Causal inference using data from a non-representative sample
  • Type M errors studied in the wild
  • Type M errors in the wild—really the wild!
  • Where does the discussion go?
  • Maybe this paper is a parody, maybe it’s a semibluff
  • As if the 2010s never happened
  • Using black-box machine learning predictions as inputs to a Bayesian analysis
  • It’s not enough to be a good person and to be conscientious. You also need good measurement. Cargo-cult science done very conscientiously doesn’t become good science, it just falls apart from its own contradictions.
  • Air rage update
  • Getting the right uncertainties when fitting multilevel models
  • Chess records page
  • Weisburd’s paradox in criminology: it can be explained using type M errors
  • “Cheerleading with an agenda: how the press covers science”
  • Automated Inference on Criminality Using High-tech GIGO Analysis
  • Some ideas on using virtual reality for data visualization: I don’t really agree with the details here but it’s all worth discussing
  • Contribute to this pubpeer discussion!
  • For mortality rate junkies
  • The “fish MRI” of international relations studies.
  • “5 minutes? Really?”
  • 2 quick calls
  • Should we worry about rigged priors? A long discussion.
  • I’m not on twitter
  • I disagree with Tyler Cowen regarding a so-called lack of Bayesianism in religious belief
  • “Why bioRxiv can’t be the Central Service”
  • Sudden Money
  • The house is stronger than the foundations
  • Please contribute to this list of the top 10 do’s and don’ts for doing better science
  • Partial pooling with informative priors on the hierarchical variance parameters: The next frontier in multilevel modeling
  • Does racquetball save lives?
  • When do we want evidence-based change? Not “after peer review”
  • “I agree entirely that the way to go is to build some model of attitudes and how they’re affected by recent weather and to fit such a model to “thick” data—rather than to zip in and try to grab statistically significant stylized facts about people’s cognitive illusions in this area.”
  • “Bayesian evidence synthesis”
  • Freelance orphans: “33 comparisons, 4 are statistically significant: much more than the 1.65 that would be expected by chance alone, so what’s the problem??”
  • Beyond forking paths: using multilevel modeling to figure out what can be learned from this survey experiment
  • From perpetual motion machines to embodied cognition: The boundaries of pseudoscience are being pushed back into the trivial.
  • Why I think the top batting average will be higher than .311: Over-pooling of point predictions in Bayesian inference
  • “La critique est la vie de la science”: I kinda get annoyed when people set themselves up as the voice of reason but don’t ever get around to explaining what’s the unreasonable thing they dislike.
  • How to discuss your research findings without getting into “hypothesis testing”?
  • Does traffic congestion make men beat up their wives?
  • The Publicity Factory: How even serious research gets exaggerated by the process of scientific publication and reporting
  • I think it’s great to have your work criticized by strangers online.
  • In the open-source software world, bug reports are welcome. In the science publication world, bug reports are resisted, opposed, buried.
  • If you want to know about basketball, who ya gonna trust, the Irene Blecker Rosenfeld Professor of Psychology at Cornell University and author of “The Wisest One in the Room: How You Can Benefit from Social Psychology’s Most Powerful Insights,” . . . or that poseur Phil Jackson??
  • Quick Money
  • An alternative to the superplot
  • Where the money from Wiley Interdisciplinary Reviews went . . .
  • Retract or correct, don’t delete or throw into the memory hole
  • Using Mister P to get population estimates from respondent driven sampling
  • “Americans Greatly Overestimate Percent Gay, Lesbian in U.S.”
  • “It all reads like a classic case of faulty reasoning where the reasoner confuses the desirability of an outcome with the likelihood of that outcome.”
  • Pseudoscience and the left/right whiplash
  • The time reversal heuristic (priming and voting edition)
  • The Night Riders
  • Why you can’t simply estimate the hot hand using regression
  • Stan to improve rice yields
  • When people proudly take ridiculous positions
  • “A mixed economy is not an economic abomination or even a regrettably unavoidable political necessity but a natural absorbing state,” and other notes on “Whither Science?” by Danko Antolovic
  • Noisy, heterogeneous data scoured from diverse sources make his metanalyses stronger.
  • What should this student do? His bosses want him to p-hack and they don’t even know it!
  • Fitting multilevel models when predictors and group effects correlate
  • I hate that “Iron Law” thing
  • High five: “Now if it is from 2010, I think we can make all sorts of assumptions about the statistical methods without even looking.”
  • “What is a sandpit?”
  • No no no no no on “The oldest human lived to 122. Why no person will likely break her record.”
  • Tips when conveying your research to policymakers and the news media
  • Graphics software is not a tool that makes your graphs for you. Graphics software is a tool that allows you to make your graphs.
  • Spatial models for demographic trends?
  • A pivotal episode in the unfolding of the replication crisis
  • We start by talking reproducible research, then we drift to a discussion of voter turnout
  • Wine + Stan + Climate change = ?
  • Stan is a probabilistic programming language
  • Using output from a fitted machine learning algorithm as a predictor in a statistical model
  • Poisoning the well with a within-person design? What’s the risk?
  • “Dear Professor Gelman, I thought you would be interested in these awful graphs I found in the paper today.”
  • I know less about this topic than I do about Freud.
  • Driving a stake through that ages-ending-in-9 paper
  • What’s the point of a robustness check?
  • Oooh, I hate all talk of false positive, false negative, false discovery, etc.
  • Trouble Ahead
  • A new definition of the nerd?
  • Orphan drugs and forking paths: I’d prefer a multilevel model but to be honest I’ve never fit such a model for this sort of problem
  • Popular expert explains why communists can’t win chess championships!
  • The four missing books of Lawrence Otis Graham
  • “There was this prevalent, incestuous, backslapping research culture. The idea that their work should be criticized at all was anathema to them. Let alone that some punk should do it.”
  • Loss of confidence
  • “How to Assess Internet Cures Without Falling for Dangerous Pseudoscience”
  • Ed Jaynes outta control!
  • A reporter sent me a Jama paper and asked me what I thought . . .
  • Workflow, baby, workflow
  • Two steps forward, one step back
  • Yes, you can do statistical inference from nonrandom samples. Which is a good thing, considering that nonrandom samples are pretty much all we’ve got.
  • The Night Riders
  • The piranha problem in social psychology / behavioral economics: The “take a pill” model of science eats itself
  • Ready Money
  • Stranger than fiction
  • “The Billy Beane of murder”?
  • Red doc, blue doc, rich doc, rich doc
  • Working Class Postdoc
  • “We wanted to reanalyze the dataset of Nelson et al. However, when we asked them for the data, they said they would only share the data if we were willing to include them as coauthors.”
  • UNDER EMBARGO: the world’s most unexciting research finding
  • Setting up a prior distribution in an experimental analysis
  • Walk a Crooked Mile
  • It’s . . . spam-tastic!
  • The failure of null hypothesis significance testing when studying incremental changes, and what to do about it
  • Robust standard errors aren’t for me
  • Stupid-ass statisticians don’t know what a goddam confidence interval is
  • Forking paths plus lack of theory = No reason to believe any of this.
  • Turn your scatterplots into elegant apparel and accessories!
  • Your (Canadian) tax dollars at work

And a few to begin 2018:

  • The Ponzi threshold and the Armstrong principle
  • I’m with Errol: On flypaper, photography, science, and storytelling
  • Politically extreme yet vital to the nation
  • How does probabilistic computation differ in physics and statistics?
  • “Each computer run would last 1,000-2,000 hours, and, because we didn’t really trust a program that ran so long, we ran it twice, and it verified that the results matched. I’m not sure I ever was present when a run finished.”

Enjoy.

We’ll also intersperse topical items as appropriate.

Not everyone’s aware of falsificationist Bayes

Stephen Martin writes:

Daniel Lakens recently blogged about philosophies of science and how they relate to statistical philosophies. I thought it may be of interest to you. In particular, this statement:

From a scientific realism perspective, Bayes Factors or Bayesian posteriors do not provide an answer to the main question of interest, which is the verisimilitude of scientific theories. Belief can be used to decide which questions to examine, but it can not be used to determine the truth-likeness of a theory.

My response, TLDR:
1) frequentism and NP require more subjectivity than they’re given credit for (assumptions, belief in perfectly known sampling distributions, Beta [and thus type-2 error ‘control’] requires subjective estimate of the alternative effect size)

2) Bayesianism isn’t inherently more subjective, it just acknowledges uncertainty given the data [still data-driven!]

3) Popper probably wouldn’t like the NHST ritual, given that we use p-values to support hypotheses, not to refute an accepted hypothesis [the nil-hypothesis of 0 is not an accepted hypothesis in most cases]

4) Refuting falsifiable hypotheses can be done in Bayes, which is largely what Popper cared about anyway

5) Even in a NP or LRT framework, people don’t generally care about EXACT statistical hypotheses, they care about substantive hypotheses, which map to a range of statistical/estimate hypotheses, and YET people don’t test the /range/, they test point values; bayes can easily ‘test’ the hypothesized range.

My [Martin’s] full response is here.

I agree with everything that Martin writes above. And, for that matter, I agree with most of what Lakens wrote too. The starting point for all of this is my 2011 article, Induction and deduction in Bayesian data analysis. Also relevant are my 2013 article with Shalizi, Philosophy and the practice of Bayesian statistics and our response to the ensuing discussion, and my recent article with Hennig, Beyond subjective and objective in statistics.

Lakens covers the same Popper-Lakatos ground that we do, although he (Lakens) doesn’t appear to be aware of the falsificationist view of Bayesian data analysis, as expressed in chapter 6 of BDA and the articles listed above. Lakens is stuck in a traditionalist view of Bayesian inference as based on subjectivity and belief, rather than what I consider a more modern approach of conditionality, where Bayesian inference works out the implications of a statistical model or system of assumptions, the better to allow us to reveal problems that motivate improvements and occasional wholesale replacements of our models.

Overall I’m glad Lakens wrote his post because he’s reminding people of important issues that are not handled well in traditional frequentist or subjective-Bayes approaches, and I’m glad that Martin filled in some of the gaps. The audience for all of this seems to be psychology researchers, so let me re-emphasize a point I’ve made many times, the distinction between statistical models and scientific models. A statistical model is necessarily specific, and we should avoid the all-too-common mistake of rejecting some uninteresting statistical model and taking this as evidence for a preferred scientific model. That way lies madness.

Breaking the dataset into little pieces and putting it back together again

Alex Konkel writes:

I was a little surprised that your blog post with the three smaller studies versus one larger study question received so many comments, and also that so many people seemed to come down on the side of three smaller studies. I understand that Stephen’s framing led to some confusion as well as practical concerns, but I thought the intent of the question was pretty straightforward.

At the risk of beating a dead horse, I wanted to try asking the question a different way: if you conducted a study (or your readers, if you want to put this on the blog), would you ever divide up the data into smaller chunks to see if a particular result appeared in each subset? Ignoring cases where you might want to examine qualitatively different groups, of course; would you ever try to make fundamentally homogeneous/equivalent subsets? Would you ever advise that someone else do so?

For those caught up in the details, assume an extremely simple design. A simple comparison of two groups ending in a (Bayesian) t-test with no covariates, nothing fancy. In a very short time period you collected 450 people in each group using exactly the same procedure for each one; there is zero reason to believe that the data were affected by anything other than your group assignment. Would you forego analyzing the entire sample and instead break them into three random chunks?

My personal experience is that empirically speaking, no one does this. Except for cases where people are interested in avoiding model overfitting and so use some kind of cross validation or training set vs testing set paradigm, I have never seen someone break their data into small groups to increase the amount of information or strengthen their conclusions. The blog comments, however, seem to come down on the side of this being a good practice. Are you (or your readers) going to start doing this?

My reply:

From a Bayesian standpoint, the result is the same, whether you consider all the data at once, or stir in the data one-third at a time. The problem would come if you make intermediate decisions that involve throwing away information, for example if you take parts of the data and just describe them as statistically significant or not.
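As a concrete illustration of that point, here is a minimal sketch (my made-up example, not anything from the original exchange) using a conjugate normal model with known variance: updating on all the data at once and updating one third at a time give the same posterior.

```python
import numpy as np

rng = np.random.default_rng(0)

# Fake data: 900 observations from y ~ normal(theta, 1), theta unknown.
theta_true = 0.3
y = rng.normal(theta_true, 1.0, size=900)

def normal_update(prior_mean, prior_var, data, sigma=1.0):
    """Conjugate update for a normal mean with known data sd sigma."""
    post_var = 1.0 / (1.0 / prior_var + len(data) / sigma**2)
    post_mean = post_var * (prior_mean / prior_var + data.sum() / sigma**2)
    return post_mean, post_var

# All the data at once, starting from a normal(0, 10^2) prior:
m_all, v_all = normal_update(0.0, 100.0, y)

# The same data stirred in one third at a time:
m_seq, v_seq = 0.0, 100.0
for chunk in np.split(y, 3):
    m_seq, v_seq = normal_update(m_seq, v_seq, chunk)

print(m_all, v_all)  # identical (up to floating point) ...
print(m_seq, v_seq)  # ... to the sequential result
```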

Don’t say “improper prior.” Say “non-generative model.”

[cat picture]

In Bayesian Data Analysis, we write, “In general, we call a prior density p(θ) proper if it does not depend on data and integrates to 1.” This was a step forward from the usual understanding, which is that a prior density is improper if it has an infinite integral.

But I’m not so thrilled with the term “proper” because it has different meanings for different people.

Then the other day I heard Dan Simpson and Mike Betancourt talking about “non-generative models,” and I thought, Yes! this is the perfect term! First, it’s unambiguous: a non-generative model is a model for which it is not possible to generate data. Second, it makes use of the existing term, “generative model,” hence no need to define a new concept of “proper prior.” Third, it’s a statement about the model as a whole, not just the prior.

I’ll explore the idea of a generative or non-generative model through some examples:

Classical iid model, y_i ~ normal(theta, 1), for i=1,…,n. This is not generative because there’s no rule for generating theta.

Bayesian model, y_i ~ normal(theta, 1), for i=1,…,n, with uniform prior density, p(theta) proportional to 1 on the real line. This is not generative because you can’t draw theta from a uniform on the real line.

Bayesian model, y_i ~ normal(theta, 1), for i=1,…,n, with data-based prior, theta ~ normal(y_bar, 10), where y_bar is the sample mean of y_1,…,y_n. This model is not generative because to generate theta, you need to know y, but you can’t generate y until you know theta.

In contrast, consider a Bayesian model, y_i ~ normal(theta, 1), for i=1,…,n, with non-data-based prior, theta ~ normal(0, 10). This is generative: you draw theta from the prior, then draw y given theta.
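As a quick illustration of the distinction (a hypothetical sketch, not code from BDA), the generative version can be simulated end to end, while the non-generative versions give you no way to start the simulation:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 20

# Generative: theta ~ normal(0, 10), then y_i ~ normal(theta, 1).
theta = rng.normal(0.0, 10.0)   # a rule for generating theta exists
y = rng.normal(theta, 1.0, size=n)

# Non-generative, flat prior: p(theta) proportional to 1 on the real line.
# There is no draw like rng.uniform(-inf, inf), so the simulation has no
# starting point -- that is the sense in which the model can't generate data.

# Non-generative, data-based prior: theta ~ normal(y_bar, 10).
# To draw theta you need y_bar, but to draw y you need theta: a circularity.
```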

Some subtleties do arise. For example, we’re implicitly conditioning on n. For the model to be fully generative, we’d need a prior distribution for n as well.

Similarly, for a regression model to be fully generative, you need a prior distribution on x.

Non-generative models have their uses; we should just recognize when we’re using them. I think the traditional classification of priors, labeling them as improper if they have an infinite integral, does not capture the key aspects of the problem.

P.S. Also relevant is this comment, regarding some discussion of models for the n:

As in many problems, I think we get some clarity by considering an existing problem as part of a larger hierarchical model or meta-analysis. So if we have a regression with outcomes y, predictors x, and sample size n, we can think of this as one of a larger class of problems, in which case it can make sense to think of n and x as varying across problems.

The issue is not so much whether n is a “random variable” in any particular study (although I will say that, in real studies, n typically is not precisely defined ahead of time, what with difficulties of recruitment, nonresponse, dropout, etc.) but rather that n can vary across the reference class of problems for which a model will be fit.

Where’d the $2500 come from?

Brad Buchsbaum writes:

Sometimes I read the New York Times “Well” articles on science and health. It’s a mixed bag, sometimes it’s quite good and sometimes not. I came across this yesterday:

What’s the Value of Exercise? $2,500

For people still struggling to make time for exercise, a new study offers a strong incentive: You’ll save $2,500 a year.

The savings, a result of reduced medical costs, don’t require much effort to accrue — just 30 minutes of walking five days a week is enough.

The findings come from an analysis of 26,239 men and women, published today in the Journal of the American Heart Association. . . .

I [Buchsbaum] thought: I wonder where the number came from? So I tracked down the paper referred to in the article (which was unhelpfully not linked or properly named).

I was horrified to find that the $2500 figure appears to be nowhere in the paper (see table 2). Moreover, the closest number I could find ($1900) was based on a regression model without covarying age, sex, ethnicity, income, or anything else. Of course older people exercise less and spend more on healthcare!

I sent the following email (see below) to the NYTimes author, but she has not responded.

At any rate, I thought this example of very high-profile science-blogging to be particularly egregious, so I thought I’d bring it to your attention.

The research article is Economic Impact of Moderate-Vigorous Physical Activity Among Those With and Without Established Cardiovascular Disease: 2012 Medical Expenditure Panel Survey, by Javier Valero-Elizondo, Joseph Salami, Chukwuemeka Osondu, Oluseye Ogunmoroti, Alejandro Arrieta, Erica Spatz, Adnan Younus, Jamal Rana, Salim Virani, Ron Blankstein, Michael Blaha, Emir Veledar, and Khurram Nasir.

And here’s Buchsbaum’s letter to Gretchen Reynolds, the author of that news article:

I very much enjoy your health articles for the New York Times. Sometimes I try and find the paper and examine the data, just for my own benefit.

After perusing the paper, I was not quite sure where the $2500 figure came from. In table 2 (see attached paper), the unadjusted expenditures are reported over all subjects.

non-optimal PA: $5397, optimal PA: $3443 for a difference of $1900.

This is close to $2500 but your number is higher.

However, remember, this is an *unadjusted model*. It does not account for age, sex, family income, race/ethnicity, insurance type, geographical location or comorbidity.

In other words, it’s a virtually useless model.

Let’s look at Model 3, which does account for the above factors.

non-optimal PA: $4867, optimal PA: $4153 for a difference of $714

So $714 closer to the mark.

BUT, this includes ALL subjects, including those with cardiovascular disease (CVD).

If you look at people without CVD then the estimates depend on the cardiovascular risk profile (CRF). If you have an average or optimal profile then the difference is around $430 or $493. If you have a “poor” profile, then the difference is around $1060 (although the 95% confidence intervals overlapped, meaning the effect was not reliable).

What is my conclusion?

I’m afraid the title of your article is misleading since it is larger (by $600) than the $1900 estimate based on the meaningless unadjusted model! Even if the title was “What’s the Value of Exercise? $700”, it would still be misleading, because it implicitly assumes a causal relationship between exercise and expenditure.

Remember also that the adjusted variables are only the measures the authors happened to record. There are dozens of potentially other mediating variables which are related to both physical exercise and health expenditures. Including these other adjusting factors might further reduce the estimates.

Best Regards,

It’s just a news article so some oversimplification is perhaps unavoidable. But I do wonder where the $2500 number came from. I’m guessing it’s from some press release but I don’t know.

Also, I’m surprised the reporter didn’t respond to the email. But maybe New York Times reporters get too many emails to respond to, or even read. I should also emphasize that I did not read that news article or the scientific paper in detail, so I’m not endorsing (or disagreeing with) Buchsbaum’s claim. Here I’m just interested in the general challenge of tracking down numbers like that $2500 that have no apparent source.

Stan Weekly Roundup, 16 June 2017

We’re going to be providing weekly updates for what’s going on behind the scenes with Stan. Of course, it’s not really behind the scenes, because the relevant discussions are at

  • stan-dev GitHub organization: this is the home of all of our source repos; design discussions are on the Stan Wiki

  • Stan Discourse Groups: this is the home of our user and developer lists (they’re all open); feel free to join the discussion—we try to be friendly and helpful in our responses, and there is a lot of statistical and computational expertise in the wings from our users, who are increasingly joining the discussion. By the way, thanks for that—it takes a huge load off us to get great answers from users to other user questions. We’re up to about 15 active discussion threads a day (active topics in the last 24 hours include AR(K) models, web site reorganization, ragged arrays, order statistic priors, new R packages built on top of Stan, docker images for Stan on AWS, and many more!)

OK, let’s get started with the weekly review, though this is a special summer double issue, just like the New Yorker.

Your news here: If you have any Stan news you’d like to share, please let me know at carp@alias-i.com (we’ll probably get a more standardized way to do this in the future).

New web site: Michael Betancourt redesigned the Stan web site; hopefully this will be easier to use. We’re no longer trying to track the literature. If you want to see the Stan literature in progress, do a search for “Stan Development Team” or “mc-stan.org” on Google Scholar; we can’t keep up! Do let us know either in an issue on GitHub for the web site or in the user group on Discourse if you have comments or suggestions.

New user and developer lists: We’ve shuttered our Google group and moved to Discourse for both our user and developer lists (they’re consolidated now in categories on one list). It’s easy to sign up with GitHub or Google IDs and much easier to search and use online.
See Stan Discourse Groups and, for the old discussions, Stan’s shuttered Google group for users and Stan’s shuttered Google group for developers. We’re not removing any of the old content, but we are prohibiting new posts.

GPU support: Rok Cesnovar and Steve Bronder have been getting GPU support working for linear algebra operations. They’re starting with Cholesky decomposition because it’s a bottleneck for Gaussian process (GP) models and because it has the pleasant property of being quadratic in data and cubic in computation.
See math pull request 529

Distributed computing support: Sebastian Weber is leading the charge into distributed computing using the MPI framework (multi-core or multi-machine) by essentially coding up map-reduce for derivatives inside of Stan. Together with GPU support, distributed computing of derivatives will give us a TensorFlow-like flexibility to accelerate computations. Sebastian’s also looking into parallelizing the internals of the Boost and CVODES ordinary differential equation (ODE) solvers using OpenCL.
See math issue 101 and math issue 551.

Logging framework: Daniel Lee added a logging framework to Stan to allow finer-grained control of output.

Operands and partials: Sean Talts finished the refactor of our underlying operands and partials data structure, which makes it much simpler to write custom derivative functions.

See pull request 547

Autodiff testing framework: Bob Carpenter finished the first use case for a generalized autodiff tester to test all of our higher-order autodiff thoroughly.
See math pull request 562

C++11: We’re all working toward the 2.16 release, which will be our last release before we open the gates of C++11 (and some of C++14). This is going to make our code a whole lot easier to write and maintain, and will open up awesome possibilities like having closures to define lambdas within the Stan language, as well as consolidating many of our uses of Boost into the standard template library.

Append arrays: Ben Bales added signatures for append_array, to work like our appends for vectors and matrices.
See pull request 554 and pull request 550

ODE system size checks: Sebastian Weber pushed a bug fix that cleans up ODE system size checks to avoid seg faults at run time.
See pull request 559

RNG consistency in transformed data: A while ago we relaxed the generated-quantities-only nature of _rng functions by allowing them in transformed data (so you can fit fake data generated wholly within Stan or represent posterior uncertainty of some other process, allowing “cut”-like models to be formulated as a two-stage process); Mitzi Morris just cleaned these up so we use the same RNG seed for all chains so that we can perform convergence monitoring; multiple replications would then be done by running the whole multi-chain process multiple times.
See Stan pull request 2313

NSF Grant: CI-SUSTAIN: Stan for the Long Run: We (Bob Carpenter, Andrew Gelman, Michael Betancourt) were just awarded an NSF grant for Stan sustainability. This was a follow-on from the first Compute Resource Initiative (CRI) grant we got after building the system. Yea! This adds roughly a year of funding for the team at Columbia University. Our goal is to put in governance processes for sustaining the project as well as shore up all of our unit tests and documentation.

Hiring: We hired two full-time Stan staff at Columbia: Sean Talts joins as a developer and Breck Baldwin as business manager for the project. Sean had already been working as a contractor for us, hence all the pull requests. (Pro tip: The best way to get a foot in the door for an open-source project is to submit a useful pull request.)

SPEED: Parallelizing Stan using the Message Passing Interface (MPI)

Sebastian Weber writes:

Bayesian inference has to overcome tough computational challenges and thanks to Stan we now have a scalable MCMC sampler available. For a Stan model running NUTS, the computational cost is dominated by gradient calculations of the model log-density as a function of the parameters. While NUTS is scalable to huge parameter spaces, this scalability becomes more of a theoretical one as the computational cost explodes. Models which involve ordinary differential equations (ODE) are such an example, where the runtimes can be of the order of days.

The obvious speedup when using Stan is to run multiple chains at the same time on different computer cores. However, this cannot reduce the total runtime per chain, which requires within-chain parallelization.

Hence, a viable approach is to parallelize the gradient calculation within a chain. As many Bayesian analyses involve hierarchical models over groupings, we can often calculate contributions to the log-likelihood separately for each of these groups.

Therefore, the concept of an embarrassingly parallel program can be applied in this setting, i.e. one can calculate these independent work chunks on separate CPU cores and then collect the results.

For reasons implied by Stan’s internals (the gradient calculation must not run in a threaded program) we are restricted in applicable techniques. One possibility is the Message Passing Interface (MPI), which uses multiple CPU cores by firing off independent processes. A root process sends packets of work (sets of parameters) to the child nodes, which do the work and then send back the results (function return values and the gradients). A first toy example (3 ODEs, 7 parameters) shows dramatic speedups: a 1-core runtime of 5.2 hours comes down to just 17 minutes on a single machine with 20 cores (an 18x speedup). MPI also scales across machines, and when throwing 40 cores at the problem we are down to 10 minutes, which is “only” a 31x speedup (see the above plot).
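To give a sense of the scatter/gather structure being described, here is a toy sketch using mpi4py (a hypothetical illustration, not Stan’s actual C++ MPI code; the per-group function and data are made up): the root process distributes packets of work to the workers, each computes its contribution to the log-likelihood and gradient, and the results are summed back on the root.

```python
# Toy scatter/gather sketch with mpi4py (hypothetical; not Stan's C++ MPI code).
# Run with, e.g.:  mpiexec -n 4 python parallel_loglik.py
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()

def group_contribution(theta, group_data):
    # Stand-in for an expensive per-group computation (e.g. an ODE solve):
    # log-likelihood and gradient for y ~ normal(theta, 1).
    resid = group_data - theta
    return np.array([-0.5 * np.sum(resid**2), np.sum(resid)])

if rank == 0:
    # Root process: one packet of work (here, one group of data) per process.
    data = np.random.default_rng(0).normal(1.0, 1.0, size=(size, 1000))
    packets = [data[i] for i in range(size)]
else:
    packets = None

theta = 0.5                                        # current parameter value
my_packet = comm.scatter(packets, root=0)          # send work to the children
my_result = group_contribution(theta, my_packet)
total = comm.reduce(my_result, op=MPI.SUM, root=0) # collect and sum results

if rank == 0:
    print("log-likelihood:", total[0], "gradient:", total[1])
```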

Of course, the MPI approach works best on clusters with many CPU cores. Overall, this is fantastic news for big models as this opens the door to scale out large problems onto clusters, which are available nowadays in many research facilities.

The source code for this prototype is on our github repository. This code should be regarded as working research code and we are currently working on bringing this feature into the main Stan distribution.

Wow. This is a big deal. There are lots of problems where this method will be useful.

P.S. What’s with the weird y-axis labels on that graph? I think it would work better to just go 1, 2, 4, 8, 16, 32 on both axes. I like the wall-time markings on the line, though; that helped me follow what was going on.

Pizzagate gets even more ridiculous: “Either they did not read their own previous pizza buffet study, or they do not consider it to be part of the literature . . . in the later study they again found the exact opposite, but did not comment on the discrepancy.”

Background

Several months ago, Jordan Anaya, Tim van der Zee, and Nick Brown reported that they’d uncovered 150 errors in 4 papers published by Brian Wansink, a Cornell University business school professor who describes himself as a “world-renowned eating behavior expert for over 25 years.”

150 errors is pretty bad! I make mistakes myself and some of them get published, but one could easily go through an entire career publishing fewer than 150 mistakes. So many errors in just four papers is kind of amazing.

After the Anaya et al. paper came out, people dug into other papers of Wansink and his collaborators and found lots more errors.

Wansink later released a press release pointing to a website which he said contained data and code from the 4 published papers.

In that press release he described his lab as doing “great work,” which seems kinda weird to me, given that their published papers are of such low quality. Usually we would think that if a lab does great work, this would show up in its publications, but this did not seem to have happened in this case.

In particular, even if the papers in question had no data-reporting errors at all, we would have no reason to believe any of the scientific claims that were made therein, as these claims were based on p-values computed from comparisons selected from uncontrolled and abundant researcher degrees of freedom. These papers are exercises in noise mining, not “great work” at all, not even good work, not even acceptable work.

The new paper

As noted above, Wansink shared a document that he said contained the data from those studies. In a new paper, Anaya, van der Zee, and Brown analyzed this new dataset. They report some mistakes they (Anaya et al.) had made in their earlier paper, and many places where Wansink’s papers misreported his data and data collection protocols.

Some examples:

All four articles claim the study was conducted over a 2-week period, however the senior author’s blog post described the study as taking one month (Wansink, 2016), the senior author told Retraction Watch it was a two-month study (McCook, 2017b), a news article indicated the study was at least 3 weeks long (Lazarz, 2007), and the data release states the study took place from October 18 to December 8, 2007 (Wansink and Payne, 2007). Why the articles claimed the study only took two weeks when all the other reports indicate otherwise is a mystery.

Furthermore, articles 1, 2, and 4 all claim that the study took place in spring. For the Northern Hemisphere spring is defined as the months March, April, and May. However, the news report was dated November 18, 2007, and the data release states the study took place between October and December.

And this:

Article 1 states that the diners were asked to estimate how much they ate, while Article 3 states that the amount of pizza and salad eaten was unobtrusively observed, going so far as to say that appropriate subtractions were made for uneaten pizza and salad. Adding to the confusion Article 2 states:
“Unfortunately, given the field setting, we were not able to accurately measure consumption of non-pizza food items.”

In Article 3 the tables included data for salad consumed, so this statement was clearly inaccurate.

And this:

Perhaps the most important question is why did this study take place? In the blog post the senior author did mention having a “Plan A” (Wansink, 2016), and in a Retraction Watch interview revealed that the original hypothesis was that people would eat more pizza if they paid more (McCook, 2017a). The origin of this “hypothesis” is likely a previous study from this lab, at a different pizza buffet, with nearly identical study design (Just and Wansink, 2011). In that study they found diners who paid more ate significantly more pizza, but the released data set for the present study actually suggests the opposite, that diners who paid less ate more. So was the goal of this study to replicate their earlier findings? And if so, did they find it concerning that not only did they not replicate their earlier result, but found the exact opposite? Did they not think this was worth reporting?
Another similarity between the two pizza studies is the focus on taste of the pizza. Article 1 specifically states:

“Our reading of the literature leads us to hypothesize that one would rate pizza from an $8 pizza buffet as tasting better than the same pizza at a $4 buffet.”

Either they did not read their own previous pizza buffet study, or they do not consider it to be part of the literature, because in that paper they found ratings for overall taste, taste of first slice, and taste of last slice to all be higher in the lower price group, albeit with different levels of significance (Just and Wansink, 2011). However, in the later study they again found the exact opposite, but did not comment on the discrepancy.

Anaya et al. summarize:

Of course, there is a parsimonious explanation for these contradictory results in two apparently similar studies, namely that one or both sets of results are the consequence of modeling noise. Given the poor quality of the released data from the more recent articles . . . it seems quite likely that this is the correct explanation for the second set of studies, at least.

And this:

No good theory, no good data, no good statistics, no problem. Again, see here for the full story.

Not the worst of it

And, remember, those 4 pizzagate papers are not the worst things Wansink has published. They’re only the first four articles that anyone bothered to examine carefully enough to see all the data problems.

There was this example dug up by Nick Brown:

A further lack of randomness can be observed in the last digits of the means and F statistics in the three published tables of results . . . Here is a plot of the number of times each decimal digit appears in the last position in these tables:

These don’t look so much like real data; they seem consistent with someone making up numbers, not wanting them to seem too round, and not being careful to include enough 0’s and 5’s in the last digits.
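For readers who want to try this kind of check themselves, here is a hypothetical sketch (the digit counts below are made up, not the counts from Wansink’s tables): tabulate the trailing digits and compare them to a uniform distribution.

```python
import numpy as np
from scipy import stats

# Hypothetical counts of trailing digits 0-9 harvested from a table of
# reported means and F statistics (NOT the actual counts from the papers).
counts = np.array([2, 9, 11, 10, 12, 3, 9, 12, 10, 8])

# For genuine measured data we'd expect the last digit to be roughly uniform;
# invented numbers often under-use 0 and 5.
expected = np.full(10, counts.sum() / 10)
chi2, p = stats.chisquare(counts, f_exp=expected)
print(f"chi-square = {chi2:.1f}, p = {p:.3f}")
```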

And this discovery by Tim van der Zee:

Wansink, B., Cheney, M. M., & Chan, N. (2003). Exploring comfort food preferences across age and gender. Physiology & Behavior, 79(4), 739-747.

Citations: 334

Using the provided summary statistics such as mean, test statistics, and additional given constraints it was calculated that the data set underlying this study is highly suspicious. For example, given the information which is provided in the article the response data for a Likert scale question should look like this:

Furthermore, although this is the most extreme possible version given the constraints described in the article, it is still not consistent with the provided information.

In addition, there are more issues with impossible or highly implausible data.
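One simple version of this kind of consistency check is a GRIM-style granularity test: with n respondents answering an integer-scored item, the reported mean must be an achievable multiple of 1/n. The sketch below is hypothetical (made-up numbers, and not necessarily the specific method van der Zee used).

```python
def grim_consistent(reported_mean, n, max_score=7, n_decimals=2):
    """Can a mean of integer item scores (1..max_score) from n respondents
    round to the reported value? (Hypothetical illustration.)"""
    achievable = {round(total / n, n_decimals)
                  for total in range(n, max_score * n + 1)}
    return round(reported_mean, n_decimals) in achievable

# Made-up example, not a value from the paper: reported mean 3.17, n = 45.
print(grim_consistent(3.17, 45))   # False: no integer total of 45 scores gives 3.17
```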

And:

Sığırcı, Ö., Rockmore, M., & Wansink, B. (2016). How traumatic violence permanently changes shopping behavior. Frontiers in Psychology, 7.

Citations: 0

This study is about World War II veterans. Given the mean age stated in the article, the distribution of age can only look very similar to this:

The article claims that the majority of the respondents were 18 to 18.5 years old at the end of WW2 whilst also having experienced repeated heavy combat. Almost no soldiers could have had any other age than 18.

In addition, the article claims over 20% of the war veterans were women, while women only officially obtained the right to serve in combat very recently.

There’s lots more at the link.

From the NIH guidelines on research misconduct:

Falsification: Manipulating research materials, equipment, or processes, or changing or omitting data or results such that the research is not accurately represented in the research record.

Ride a Crooked Mile

Joachim Krueger writes:

As many of us rely (in part) on p values when trying to make sense of the data, I am sending a link to a paper Patrick Heck and I published in Frontiers in Psychology. The goal of this work is not to fan the flames of the already overheated debate, but to provide some estimates about what p can and cannot do. Statistical inference will always require experience and good judgment regardless of which school of thought (Bayesian, frequentist, or other) we are leaning toward.

I have three reactions.

1. I don’t think there’s any “overheated debate” about the p-value; it’s a method that has big problems and is part of the larger problem that is null hypothesis significance testing (see my article, The problems with p-values are not just with p-values); also p-values are widely misunderstood (see also here).

From a Bayesian point of view, p-values are most cleanly interpreted in the context of uniform prior distributions—but the setting of uniform priors, where there’s nothing special about zero, is the scenario where p-values are generally irrelevant.

So I don’t have much use for p-values. They still get used in practice—a lot—so there’s room for lots more articles explaining them to users, but I’m kinda tired of the topic.

2. I disagree with Krueger’s statement that “statistical inference will always require experience and good judgment.” For better or worse, lots of statistical inference is done using default methods by people with poor judgment and little if any relevant experience. Too bad, maybe, but that’s how it is.

Does statistical inference require experience and good judgment? No more than driving a car requires experience and good judgment. All you need is gas in the tank and the key in the ignition and you’re ready to go. The roads have all been paved and anyone can drive on them.

3. In their article, Krueger and Heck write, “Finding p = 0.055 after having found p = 0.045 does not mean that a bold substantive claim has been refuted (Gelman and Stern, 2006).” Actually, our point was much bigger than that. Everybody knows that 0.05 is arbitrary and there’s no real difference between 0.045 and 0.055. Our point was that apparent huge differences in p-values are not actually stable (“statistically significant”). For example, a p-value of 0.20 is considered to be useless (indeed, it’s often taken, erroneously, as evidence of no effect), and a p-value of 0.01 is considered to be strong evidence. But a p-value of 0.20 corresponds to a z-score of 1.28, and a p-value of 0.01 corresponds to a z-score of 2.58. The difference is 1.3, which is not close to statistically significant. (The difference between two independent estimates, each with standard error 1, has a standard error of sqrt(2); thus a difference in z-scores of 1.3 is actually less than 1 standard error away from zero!) So I fear that, by comparing 0.055 to 0.045, they are minimizing the main point of our paper.
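These numbers are easy to check; here is a quick sketch using scipy (just the arithmetic, not code from the Gelman and Stern paper):

```python
import numpy as np
from scipy import stats

def z_from_p(p):
    """Two-sided p-value to the corresponding |z|."""
    return stats.norm.isf(p / 2)

print(z_from_p(0.20))   # about 1.28
print(z_from_p(0.01))   # about 2.58

# Difference between the two z-scores, and that difference in units of its
# standard error (sqrt(2), for two independent estimates each with se = 1):
diff = z_from_p(0.01) - z_from_p(0.20)
print(diff, diff / np.sqrt(2))   # about 1.29, i.e. less than 1 standard error
```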

More generally I think that all the positive aspects of the p-value they discuss in their paper would be even more positive if researchers were to use the z-score and not ever bother with the misleading transformation into the so-called p-value. I’d much rather see people reporting z-scores of 1.5 or 2 or 2.5 than reporting p-values of 0.13, 0.05, and 0.01.

Kaiser Fung’s data analysis bootcamp

Kaiser Fung announces a new educational venture he’s created: a bootcamp, a 12-week full-time in-person program of short courses, with the goal of getting people their first job in an analytics role for a business unit (not engineering or software development, so he is not competing directly with MS Data Science programs or data science bootcamps). The curriculum is deliberately designed to be broad but not deep.

I asked Kaiser if he had anything else he wanted to share, and he wrote:

I think our major differentiation from other bootcamps out there includes:

a. There are lots of jobs in these other business units outside engineering and software development. Hiring managers in marketing, operations, servicing, etc. are looking for the ability to interpret and reason with data, and use data to solve business problems. Our broad-based curriculum caters to this need.

b. I don’t believe that coding is the end-all of data science. Coding schools teach people how to code but knowing what to code is more important. Therefore, our curriculum covers R, Python, and machine learning but also statistical reasoning, survey design, Excel, intro to marketing, intro to finance, etc.

c. We provide quality through small class size, in-person instruction and instructors who are industry practitioners. The average instructor has 10 years of industry experience, and is in a director or higher level position. These instructors know what hiring managers want since they are hiring managers themselves.

d. We are building a diverse class. We take social scientists, designers as well as STEM people. We just require some exposure to programming concepts and data analyses, and a good college degree.

Statistical Challenges of Survey Sampling and Big Data (my remote talk in Bologna this Thurs, 15 June, 4:15pm)

Statistical Challenges of Survey Sampling and Big Data

Andrew Gelman, Department of Statistics and Department of Political Science, Columbia University, New York

Big Data need Big Model. Big Data are typically convenience samples, not random samples; observational comparisons, not controlled experiments; available data, not measurements designed for a particular study. As a result, it is necessary to adjust to extrapolate from sample to population, to match treatment to control group, and to generalize from observations to underlying constructs of interest. Big Data + Big Model = expensive computation, especially given that we do not know the best model ahead of time and thus must typically fit many models to understand what can be learned from any given dataset. We discuss Bayesian methods for constructing, fitting, checking, and improving such models.
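As a toy illustration of the “adjust to extrapolate from sample to population” point (made-up numbers, and ignoring the modeling that a real adjustment such as MRP would involve), here is simple poststratification: reweight cell estimates from a biased sample by known population shares.

```python
import numpy as np

# Made-up example: two population groups, and a convenience sample that
# over-represents the first group.
cell_means = np.array([0.62, 0.35])          # estimated outcome in each group
sample_shares = np.array([0.80, 0.20])       # group shares in the (biased) sample
population_shares = np.array([0.45, 0.55])   # known group shares in the population

raw_estimate = cell_means @ sample_shares         # ~0.57, biased toward group 1
poststratified = cell_means @ population_shares   # ~0.47, extrapolated to the population
print(raw_estimate, poststratified)
```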

It’ll be at the 5th Italian Conference on Survey Methodology, at the Department of Statistical Sciences of the University of Bologna. A low-carbon remote talk.

Criminology corner: Type M error might explain Weisburd’s Paradox

[silly cartoon found by googling *cat burglar*]

Torbjørn Skardhamar, Mikko Aaltonen, and I wrote this article to appear in the Journal of Quantitative Criminology:

Simple calculations seem to show that larger studies should have higher statistical power, but empirical meta-analyses of published work in criminology have found zero or weak correlations between sample size and estimated statistical power. This is “Weisburd’s paradox” and has been attributed by Weisburd, Petrosino, and Mason (1993) to a difficulty in maintaining quality control as studies get larger, and attributed by Nelson, Wooditch, and Dario (2014) to a negative correlation between sample sizes and the underlying sizes of the effects being measured. We argue against the necessity of both these explanations, instead suggesting that the apparent Weisburd paradox might be explainable as an artifact of systematic overestimation inherent in post-hoc power calculations, a bias that is large with small N. Speaking more generally, we recommend abandoning the use of statistical power as a measure of the strength of a study, because implicit in the definition of power is the bad idea of statistical significance as a research goal.

I’d never heard of Weisburd’s paradox before writing this article. What happened is that the journal editors contacted me suggesting the topic, I then read some of the literature and wrote my article, then some other journal editors didn’t think it was clear enough so we found a couple of criminologists to coauthor the paper and add some context, eventually producing the final version linked here. I hope it’s helpful to researchers in that field and more generally. I expect that similar patterns hold with published data in other social science fields and in medical research too.
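Here's a toy simulation of the mechanism in question (my own sketch with made-up numbers, not the calculations from the paper): every simulated study has the same true effect, so true power rises steadily with sample size, but "post-hoc power" computed by plugging the noisy observed effect back into the power formula is highly variable and upward-biased for small studies, which weakens the correlation between sample size and estimated power:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
true_effect = 0.2                                  # same true standardized effect for every study
sample_sizes = rng.integers(20, 500, size=2000)    # per-group sample sizes

post_hoc_power, true_power = [], []
for n in sample_sizes:
    se = np.sqrt(2 / n)                            # SE of a difference in means, sd = 1 per group
    observed = rng.normal(true_effect, se)         # noisy estimate of the effect
    # Post-hoc power: plug the observed effect into the two-sided power formula.
    post_hoc_power.append(stats.norm.sf(1.96 - abs(observed) / se)
                          + stats.norm.cdf(-1.96 - abs(observed) / se))
    true_power.append(stats.norm.sf(1.96 - true_effect / se)
                      + stats.norm.cdf(-1.96 - true_effect / se))

print("corr(n, true power):    ", np.corrcoef(sample_sizes, true_power)[0, 1])
print("corr(n, post-hoc power):", np.corrcoef(sample_sizes, post_hoc_power)[0, 1])
```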

PhD student fellowship opportunity! in Belgium! to work with us! on the multiverse and other projects on improving the reproducibility of psychological research!!!

[image of Jip and Janneke dancing with a cat]

Wolf Vanpaemel and Francis Tuerlinckx write:

We at the Quantitative Psychology and Individual Differences research group, KU Leuven, Belgium, are looking for a PhD candidate. The goal of the PhD research is to develop and apply novel methodologies to increase the reproducibility of psychological science. More information can be found on the job website or by contacting us at wolf.vanpaemel@kuleuven.be or francis.tuerlinckx@kuleuven.be. The deadline for application is Monday June 26, 2017.

One of the themes a successful candidate may work on is the further development of the multiverse. I expect to be an active collaborator in this work.

So please apply to this one. We’d like to get the best possible person to be working on this exciting project.

Why I’m not participating in the Transparent Psi Project

I received the following email from psychology researcher Zoltan Kekecs:

I would like to ask you to participate in the establishment of the expert consensus design of a large scale fully transparent replication of Bem’s (2011) ‘Feeling the future’ Experiment 1. Our initiative is called the ‘Transparent Psi Project’. [https://osf.io/jk2zf/wiki/home/] Our aim is to develop a consensus design that is mutually acceptable for both psi proponent and mainstream researchers, containing clear criteria for credibility.

I replied:

Thanks for the invitation. I am not so interested in this project because I think that all the preregistration in the world won’t solve the problem of small effect sizes and poor measurements. It is my impression from Bem’s work and others that the field of psi is plagued by noisy measurements and poorly specified theories. Sure, preregistration etc. would stop many of the problems–in particular, there’s no way that Bem would’ve seen 9 out of 9 statistically significant p-values, or whatever that was. But I can’t in good conscience recommend the spending of effort in this way. I think any serious work in this area would have to go beyond the phenomenological approach and perform more direct measurements, as for example here: http://marginalrevolution.com/marginalrevolution/2014/11/telepathy-over-the-internet.html . I’ve not actually read the paper linked there so this may be a bad example but the point is that one could possibly study such things scientifically with a physical model of the process. To just keep taking Bem-style measurements, though, I think that’s hopeless: it’s the kangaroo problem (http://andrewgelman.com/2015/04/21/feather-bathroom-scale-kangaroo/). Better to preregister than not, but better still not to waste time on this or similarly-hopeless problems (studying sex ratios in samples of size 3000, estimating correlations of monthly cycle on political attitudes using between-person comparisons, power pose, etc.). I recognize that some of these ideas, ESP included, had some legitimate a priori plausibility, but, at this point, a Bem-style experiment seems like a shot in the dark. And, of course, even with preregistration, there’s a 5% chance you’ll see something statistically significant just by chance, leading to further confusion. In summary, preregistration and consensus helps with the incentives, but all the incentives in the world are no substitute for good measurements. (See the discussion of “in many cases we are loath to recommend pre-registered replication” here: http://andrewgelman.com/2017/02/11/measurement-error-replication-crisis/).

Kekecs wrote back:

Thank you for your feedback. We fully realize the problem posed by small effect size. However, this problem in itself can be solved simply by throwing a larger sample at it. In fact, based on our simulations we plan to collect 14,000-60,000 data points (700-3,000 participants) using Bayesian analysis and optional stopping, aiming to reach a Bayes factor threshold of 60 or 1/60. Our simulations show that using these parameters we only have a p = 0.0004 false positive chance, so it is highly unlikely that we would accidentally generate more confusion in the field just by conducting the replication. On the contrary, by doing our study, we will effectively more than double the amount of total data accumulated so far by Bem’s and others’ studies using this paradigm, which should help with clarity in the field by introducing good-quality, credible data.

You might be right, though, that the measurement itself is faulty, and that we cannot expect precognition to work in an environmentally invalid situation like this. But in reality, we don’t have any information on how precognition should work if it really does exist, so I am not sure what would be a better way of measuring it than seeing how effective people are at predicting future events.

Our main goal here is not really to see whether precognition exists or not. The ultimate aim of our efforts is to do a proof of concept study where we will see whether it is possible to come to a consensus on criterion of acceptability and credibility in a field this divided, and to come up with ways in which we can negate all possibilities of questionable research practice. This approach can then be transferred to other fields as well.

I then responded:

I still think it’s hopeless. The problem (which I’ll say using generic units as I’m not familiar with the ESP experiment) is: suppose you have a huge sample size and can detect an effect of 0.003 (on some scale) with standard error 0.001. Statistically significant, preregistered, the whole deal. Fine. But then you could very well see an effect of -0.002 with different people, in a different setting. And -0.003 somewhere else. And 0.001 somewhere else. Etc. You’re talking about effects that are indistinguishable given various sources of leakage in the experiment.

I support your general goal but I recommend you choose a more promising topic than ESP or power pose or various other topics that get talked about so much.

Kekecs replied:

We are already committed to follow through with this particular setting. But I agree with you that our approach can be easily transferred to the research of other effects and we fully intend to do that.

If you put it that way, your question is all about construct validity: whether we can detect the effect that we really want to detect, or whether there are other confounds that bias the measurement. In this particular experimental setting, which is simple as stone (basically people are guessing about the outcomes of future coin flips), the types of bias that we can expect are more related to questionable research practices (QRPs) than anything else. The only way other types of bias, such as personal differences in ability (sampling bias), participant expectancy, demand characteristics, etc., can have an effect is if there is truly an anomalous effect. For example, if we detected an effect of 0.003 with 0.001 SE only because we accidentally sampled people with high psi abilities, our conclusion that there is a psi effect would still be true (although our effect size estimate would be slightly off).

That is why in this project we are focusing mainly on negating all possibilities of QRPs and full transparency. I am not sure what other types of leakage can we have in this particular experiment if we addressed all possible QRPs. Would you care to elaborate?

I responded:

Just in answer to that last question: I’m not sure what other types of leakage might exist—it’s my impression that Bem’s experiments had various problems, so I guess it depends how exact a replication you’re talking about. My real point, though, is that if we think ESP exists at all, then an effect that’s +0.003 on Monday and -0.002 on Tuesday and +0.001 on Wednesday probably isn’t so interesting. This becomes clearer if we move the domain away from possibly null phenomena such as ESP or homeopathy, to things like social priming, which presumably has some effect, but which varies so much by person and by context as to be generally unpredictable and indistinguishable from noise. I don’t think ESP is such a good model for psychology research because it’s one of the few things people study that really could be zero.

And then Kekecs closed out the discussion:

In response, I find doing this effort in the field of ESP interesting exactly because the effect could potentially be zero. Positive findings have an overwhelming dominance in both the psi literature and the social science literature in general. In the case of most other social science research, it is a theoretical possibility (but unrealistic) that researchers just get lucky all the time and always ask the right questions, and that is why they are so effective in finding positive effects. Again, this obviously cannot be true for the entirety of the literature, but for each topic studied individually, it can be quite probable that there is an effect, if ever so small, which blurs the picture about publication bias and other types of bias in the literature. However, it may be that there is no ESP effect at all. In that case, we would have a field where the effect of bias in research can be studied in its purest form.

From another perspective, precognition in particular is a perfect research topic exactly because these designs by their nature are very well protected from the usual threats to internal validity, at least in the positive direction. It is hard to see what could make a person perform better at predicting the outcome of a state-of-the-art random number generator if there is no psi effect. Bias can always be introduced by different questionable research practices (QRPs), but if we are able to design a study completely immune to QRPs, there is no real possibility of bias toward type I error. Of course, if the effect really exists, all the usual threats to validity can have an influence (for example, it is possible that people can get “psi fatigue” if they perform a lot of trials, or that events and contextual features, or even expectancy, can have an effect on performance), but we cannot make a type I error in that case, because the effect exists; we can only make errors in estimating the size of the effect, or a type II error.

So understanding what is underlying the dominance of positive effects in ESP research is very important. If there is no effect, psi literature can serve as a case study for bias in its purest form, which can help us understand it in other research fields. On the other hand, if we find an effect when all QRPs are controlled for, we may need to really rethink our current paradigm.

I continue to think that the study of ESP is irrelevant for psychology, both for substantive reasons—there is no serious underlying theory or clear evidence for ESP, it’s all just hope and intuition—and for methodological reasons, in that zero is a real possibility. In contrast, even silly topics such as power pose and embodied cognition seem to me to have some relevance to psychology and also involve the real challenge that there are no zeroes. Standing in an unusual position for two minutes will have some effect on your thinking and behavior; the debate is what are the consistent effects, if any. That’s my take, anyway; but I wanted to share Kekecs’s view too, given all the effort he’s putting into this project.
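P.S. For readers curious about the optional-stopping design Kekecs describes, here's a toy simulation of the general idea. This is my own sketch with made-up details (a simple binary-guessing task, a uniform prior on the hit rate under the alternative, a batch size of 500), not the project's actual simulation code; it just shows how one would estimate the false-positive rate of a stop-at-Bayes-factor-60 rule when there is no effect at all:

```python
import numpy as np
from scipy.special import betaln

rng = np.random.default_rng(1)

def bf10(successes, trials):
    """Bayes factor for H1: hit rate ~ Uniform(0,1) vs H0: hit rate = 0.5.
    The binomial coefficient cancels, so only the marginal likelihoods remain."""
    log_m1 = betaln(successes + 1, trials - successes + 1)
    log_m0 = trials * np.log(0.5)
    return np.exp(log_m1 - log_m0)

def one_run(max_trials=14000, batch=500, threshold=60):
    """One simulated study with optional stopping, under the null (pure guessing)."""
    data = rng.integers(0, 2, size=max_trials)          # chance-level binary guesses
    for n in range(batch, max_trials + 1, batch):
        bf = bf10(data[:n].sum(), n)
        if bf >= threshold:
            return "psi"                                # crossing 60 under the null: false positive
        if bf <= 1 / threshold:
            return "no psi"
    return "inconclusive"

results = [one_run() for _ in range(1000)]
print({outcome: results.count(outcome) for outcome in set(results)})
```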

Financial anomalies are contingent on being unknown

Jonathan Falk points us to this article by Kewei Hou, Chen Xue, and Lu Zhang, who write:

In retrospect, the anomalies literature is a prime target for p-hacking. First, for decades, the literature is purely empirical in nature, with little theoretical guidance. Second, with trillions of dollars invested in anomalies-based strategies in the U.S. market alone, the financial interest is overwhelming. Third, more significant results make a bigger splash, and are more likely to lead to publications as well as promotion, tenure, and prestige in academia. As a result, armies of academics and practitioners engage in searching for anomalies, and the anomalies literature is most likely one of the biggest areas in finance and accounting. Finally, as we explain later, empiricists have much flexibility in sample criteria, variable definitions, and empirical methods, which are all tools of p-hacking in chasing statistical significance.

Falk writes:

A weakness in this study is that the use of a common data period obscures the fact that financial anomalies are contingent on being unknown: known (true) anomalies will be arbitraged away so that they no longer exist. Their methodology continues to estimate many of these anomalies after the results of the studies were public knowledge and heavily scrutinized. This should attenuate the results. (It would be interesting to see if the results weakened the earlier the study was published. On a low-hanging fruit theory, it should be just the opposite.) It’s as if Power Pose worked until Amy Cuddy wrote about it, at which point everyone wised up and the effect went away. Effects like that are really hard to replicate.

Falk’s comment, about financial anomalies being contingent on being unknown, reminds me of something: In finance (so I’m told), when someone has a great idea, they keep it secret and try to milk all the advantage out of it that they can. This also happens in some fields of science: we’ve all heard of bio labs that refuse to share their data or their experimental techniques because they want to squeeze out a couple more papers in Nature and Cell. Given all the government funding involved, that’s not cool, but it’s how it goes. But in statistics, when we think we have a good idea, we put it out there for free, we scream about it and get angry that other people aren’t using our wonderful methods and our amazing free software. Funny, that.

P.S. For an image, I went and googled *cat anomaly*. I recommend you don’t do that. The pictures were really disturbing to me.

UK election summary

The Conservative party, led by Theresa May, defeated the Labour party, led by Jeremy Corbyn.

The Conservative party got 42% of the vote, Labour got 40% of the vote, and all the other parties received 18% between them. The Conservatives ended up with 51.2% of the two-party vote, just a bit more than Hillary Clinton’s share last November.

In the previous U.K. general election, two years ago, Conservative beat Labour, 37%-30%, that’s 55% of the two-party vote.

The time before that, the Conservatives received 36%, compared to Labour’s 29%. Again, that comes to 55% of the two-party vote.
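For anyone who wants to check the two-party arithmetic, it's just each party's vote divided by the two parties' combined vote; here's the quick calculation using the rounded vote shares above:

```python
# Conservative share of the Conservative + Labour vote, by election year
elections = {"2017": (42, 40), "2015": (37, 30), "2010": (36, 29)}
for year, (con, lab) in elections.items():
    print(year, f"{100 * con / (con + lab):.1f}% Conservative two-party share")
# 2017: 51.2%, 2015: 55.2%, 2010: 55.4%
```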

As with the Clinton-Trump presidential election and the “Brexit” election in the U.K. last year, the estimates from the polls turned out to give pretty good forecasts.

The predictions were not perfect—the 318-262 split in parliament was not quite the 302-269 that was predicted, and the estimated 42-38 vote split didn’t quite predict the 43.5-41.0 split that actually happened (those latter figures, for Great Britain only, come from the Yougov post-election summary). And the accuracy of the seat forecast has to be attributed in part to luck, given the wide predictive uncertainty bounds (based on pre-election polls, the Conservatives were forecast to win between 269 and 334 seats). The predictions were done using Mister P and Stan.

The Brexit and Clinton-Trump poll forecasts looked bad at the time because they got the outcome wrong, but as forecasts of public opinion they were solid, only off by a percentage point or two in each case. In general we’d expect polls to do better in two-party races or, more generally, in elections with two clear options, because then there are fewer reasons for prospective voters to change their opinions. In most parts of the U.K., this 2017 election was a two-party affair, hence it should be no surprise that the final polls were accurate (after suitable adjustment for nonresponse), even if, again, there was some luck that they were as close as shown in these graphs by Jack Blumenau:

P.S. I like Yougov and some of our research is supported by Yougov, but I’m kinda baffled cos when I googled I found this page by Anthony Wells, which estimates 42% for the Conservatives, 35% for Labour, and a prediction of “an increased Conservative majority in the Commons,” which seems to contradict their page that I linked to above, with that prediction of a hung parliament. That’s the forecast I take seriously because it used MRP, but then it makes me wonder why their “Final call” was different. Once you have a model and a series of polls, why throw all that away when making your final call?

The (Lance) Armstrong Principle

If you push people to promise more than they can deliver, they’re motivated to cheat.

“Bombshell” statistical evidence for research misconduct, and what to do about it?

Someone pointed me to this post by Nick Brown discussing a recent article by John Carlisle regarding scientific misconduct.

Here’s Brown:

[Carlisle] claims that he has found statistical evidence that a surprisingly high proportion of randomised controlled trials (RCTs) contain data patterns that cannot have arisen by chance. . . . the implication is that some percentage of these impossible numbers are the result of fraud. . . .

I thought I’d spend some time trying to understand exactly what Carlisle did. This post is a summary of what I’ve found out so far. I offer it in the hope that it may help some people to develop their own understanding of this interesting forensic technique, and perhaps as part of the ongoing debate about the limitations of such “post publication analysis” techniques . . .
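To get a rough feel for this kind of forensic check, here's a toy sketch. As I understand it, Carlisle works from the summary statistics reported in published trials; the version below just simulates raw baseline data to show the underlying idea, which is that under genuine randomization the p-values from baseline comparisons should look roughly uniform, whereas data that are too similar (or too different) across arms push that distribution away from uniform:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)

def baseline_pvalues(n_vars=10, n_per_arm=50, too_similar=False):
    """p-values from comparing baseline variables across two 'randomized' arms."""
    pvals = []
    for _ in range(n_vars):
        a = rng.normal(0, 1, n_per_arm)
        if too_similar:
            b = a + rng.normal(0, 0.05, n_per_arm)   # arms suspiciously close to identical
        else:
            b = rng.normal(0, 1, n_per_arm)          # genuine randomization
        pvals.append(stats.ttest_ind(a, b).pvalue)
    return np.array(pvals)

for label, flag in [("genuine randomization", False), ("too good to be true", True)]:
    p = baseline_pvalues(too_similar=flag)
    # Compare the baseline p-values to the Uniform(0, 1) distribution they should follow
    print(label, "KS test p-value:", stats.kstest(p, "uniform").pvalue)
```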

I agree with Brown that these things are worth studying. The funny thing is, it’s hard for me to get excited about this particular story, even though Brown, who I respect, calls it a “bombshell” that he anticipates will “have quite an impact.”

There are two reasons this new paper doesn’t excite me.

1. Dog bites man. By now, we know there’s lots of research misconduct in published papers. I use “misconduct” rather than “fraud” because, from the user’s perspective, I don’t really care so much whether Brian Wansink, for example, was fabricating data tables, or had students make up raw data, or was counting his carrots in three different ways, or was incompetent in data management, or was actually trying his best all along and just didn’t realize that it can be detrimental to scientific progress to be fast and loose with your data. Or some combination of all of these. Clarke’s Law.

Anyway, the point is, it’s no longer news when someone goes into a literature of p-value-based papers in a field with noisy data, and finds that people have been “manipulating research materials, equipment, or processes, or changing or omitting data or results such that the research is not accurately represented in the research record.” At this point, it’s to be expected.

2. As Stalin may have said, “When one man dies it’s a tragedy. When thousands die it’s statistics.” Similarly, the story of Satoshi Kanazawa or Brian Wansink or Daryl Bem has human interest. And even the stories without direct human interest have some sociological interest, one might say. For example, I can’t even remember who wrote the himmicanes paper or the ages-ending-in-9 paper, but in each case I’m still interested in the interplay between the plausible-but-oh-so-flexible theory, the weak data analysis, the poor performance of the science establishment, and the media hype. This new paper by Carlisle, though, is so general that it’s hard to grab onto the specifics of any single paper or set of papers. Also, for me, medical research is less interesting than social science.

Finally, I want to briefly discuss the current and future reactions to this study. I did a quick google and found it was covered on Retraction Watch, where Ivan Oransky quotes Andrew Klein, editor of Anaesthesia, as saying:

No doubt some of the data issues identified will be due to simple errors that can be easily corrected such as typos or decimal points in the wrong place, or incorrect descriptions of statistical analyses. It is important to clarify and correct these in the first instance. Other data issues will be more complex and will require close inspection/re-analysis of the original data.

This is all fine, and, sure, simple typos should just be corrected. But . . . if a paper has real mistakes I think the entire paper should be flagged as suspect. If the authors have so little control over their data and methods, then we may have no good reason to believe their claims about what their data and methods imply about the external world.

One of the frustrating things about the Richard Tol saga was that we became aware of more and more errors in his published article, but the journal never retracted it. Or, to take a more mild case, Cuddy, Norton, and Fiske published a paper with a bunch of errors. Fiske assures us that correction of the errors doesn’t change the paper’s substantive conclusions, and maybe that’s true and maybe it’s not. But . . . why should we believe her? On what evidence should we believe the claims of a paper where the data are mishandled?

To put it another way, I think it’s unfortunate that retractions and corrections are considered to be such a big deal. If a paper has errors in its representation of data or research procedures, that should be enough for the journal to want to put a big fat WARNING on it. That’s fine, it’s not so horrible. I’ve published mistakes too. Publishing mistakes doesn’t mean you have to be a bad person, nobody’s perfect.

So, if Anaesthesia and other journals want to correct incorrect descriptions of statistical analyses, numbers that don’t add up, etc., that’s fine. But I hope that when making these corrections—and when identifying suspicious patterns in reported data—they also put some watermark on the article so that future readers will know to be suspicious. Maybe something like this:

The authors of the present paper were not careful with their data. Their main claims were supported by t statistics reported as 5.03 and 11.14, but the actual values were 1.8 and 3.3.

Or whatever. The burden of proof should not be on the people who discovered the error to demonstrate that it’s consequential. Rather, the revelation of the error provides information about the quality of the data collection and analysis. And, again, I say this as a person who’s published erroneous claims myself.

Workshop on reproducibility in machine learning

Alex Lamb writes:

My colleagues and I are organizing a workshop on reproducibility and replication for the International Conference on Machine Learning (ICML). I’ve read some of your blog posts on the replication crisis in the social sciences and it seems like this workshop might be something that you’d be interested in.

We have three main goals in holding a workshop on reproducing and replicating results:

1. Provide a venue in Machine Learning for publishing replications, both successful and unsuccessful. This helps to give credit and visibility to researchers who work on replicating results as well as researchers whose results are replicated.

2. A place to share new ideas about software and tools for making research more reproducible.

3. A forum for discussing how reproducing research and replication affects different parts of the machine learning community. For example, what does it mean to reproduce the results of a recommendation engine which interacts with live humans?

I agree that this is a super-important topic because the fields of statistical methodology and machine learning are full of hype. Lots of algorithms that work in the test examples but then fail in new problems. This happens even with my own papers!