Random Reflections on Ceiling Effects and Replication Studies

In a blog post from December of 2013, I described our attempts to replicate two studies testing the claim that priming cleanliness makes participants less judgmental on a series of six moral vignettes. My original post has recently received criticism for my timing and my tone. In terms of timing, I blogged about a paper that had been accepted for publication, and there was no embargo on the work. In terms of tone, I tried to ground everything I wrote with data, but I also editorialized a bit. It can be hard to know what might be taken as offensive when you are describing an unsuccessful replication attempt. The title (“Go Big or Go Home – A Recent Replication Attempt”) might have been off-putting in hindsight. In the grand scope of discourse in the real world, however, I think my original blog post was fairly tame.

Most importantly: I was explicit in the original post about the need for more research. I will state again for the record: I don’t think this matter has been settled and more research is needed. We also said this in the Social Psychology paper.  It should be widely understood that no single study is ever definitive.

As noted in the recent news article in Science about the special issue of Social Psychology, there is some debate about ceiling effects in our replication studies. We discuss this issue at some length in our rejoinder to the commentary. I will provide some additional context and observations in this post. Readers interested only in the gory details can skip to #4. This is a long and tedious post, so I apologize in advance.

1. The original studies had relatively small sample sizes. There were 40 total participants in the original scrambled sentence study (Study 1) and 43 total participants in the original hand washing study (Study 2). It takes 26 participants per cell to have an approximately 80% chance of detecting a d of .80 with alpha set to .05 using a two-tailed significance test.  A d of .80 would be considered a large effect size in many areas of psychology.
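For anyone who wants to check that number, here is a minimal power-analysis sketch in Python using statsmodels. This is purely my illustration, not code from our report; it simply reproduces the standard two-sample t-test power calculation.

# Per-cell n needed to detect d = .80 with 80% power, alpha = .05, two-tailed,
# for an independent-samples t-test.
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()
n_per_cell = analysis.solve_power(effect_size=0.80, alpha=0.05, power=0.80,
                                  alternative='two-sided')
print(round(n_per_cell))  # roughly 26 participants per cell

# Achieved power with about 20 per cell (the approximate cell size in the
# original Study 1) for the same effect size -- well below .80.
print(analysis.power(effect_size=0.80, nobs1=20, alpha=0.05))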

2. The overall composite did not attain statistical significance using the conventional alpha level of .05 with a two-tailed test in the original Study 1 (p = .064).  (I have no special love for NHST but many people in the literature rely on this tool for drawing inferences).  Only one of the six vignettes attained statistical significance at the p < .05 level in the original Study 1 (Kitten). Two different vignettes attained statistical significance in the original Study 2 (Trolley and Wallet).  The kitten vignette did not. Effect size estimates for these contrasts are in our report.  Given the sample sizes, these estimates were large but they had wide confidence intervals.

3. The dependent variables were based on moral vignettes created for a different study originally conducted at the University of Virginia. These measures were originally pilot tested with 8 participants, according to a PSPB paper (Schnall, Haidt, Clore, & Jordan, 2008, p. 1100). College students from the United States were used to develop the measures that served as the dependent variables. There was no a priori reason to think the measures would “not work” for college students from Michigan. We registered our replication plan and Dr. Schnall was a reviewer on the proposal.  No special concerns were raised about our procedures or the nature of our sample. Our sample sizes provided over .99 power to detect the original effect size estimates.

4. The composite DVs were calculated by averaging across the six vignettes and those variables had fairly normal distributions in our studies.  In Study 1, the mean for our control condition was 6.48 (SD = 1.13, Median = 6.67, Skewness = -.55, Kurtosis = -.24, n = 102) whereas it was 5.81 in the original paper (SD = 1.47, Median = 5.67, Skewness = -.33, Kurtosis = -.44, n = 20).   The average was higher in our sample but the scores theoretically range from 0 to 9.  We found no evidence of a priming effect using the composites in Study 1.   In Study 2, the mean for our control condition was 5.65 (SD = 0.59, Median = 5.67, Skewness = -.31, Kurtosis = -.19, n = 68) whereas it was 5.43 in the original paper (SD = 0.69, Median = 5.67, Skewness = -1.58, Kurtosis = 3.45, n = 22).  The scores theoretically range from 1 to 7.  We found no hand washing effect using the composites in Study 2.  These descriptive statistics provide additional context for the discussion about ceiling effects.  The raw data are posted and critical readers can and should verify these numbers.  I have a standing policy to donate $20 to the charity of choice for the first person who notes a significant (!) statistical mistake in my blog posts.
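For critical readers who want to verify these numbers against the posted raw data, here is a minimal sketch of the calculation. The file name, condition label, and column names below are placeholders I made up for illustration, not the actual variable names in the posted files.

# Check the Study 1 control-condition composite descriptives from the raw data.
# File name, condition label, and column names are placeholders.
import pandas as pd
from scipy.stats import skew, kurtosis

df = pd.read_csv("study1_raw.csv")
control = df[df["condition"] == "control"]

# Composite DV: average of the six vignette ratings.
vignettes = ["dog", "trolley", "wallet", "plane", "resume", "kitten"]
composite = control[vignettes].mean(axis=1)

print(composite.mean(), composite.std(), composite.median(), len(composite))
# Note: scipy's default skewness/kurtosis estimators differ slightly from the
# bias-corrected versions reported by packages such as SPSS.
print(skew(composite), kurtosis(composite))  # kurtosis() returns excess kurtosis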

Schnall (2014) undertook a fairly intense screening of our data.  This is healthy for the field, and the Open Science Framework facilitated this inquiry because we were required to post the data. Dr. Schnall noted that the responses to the individual moral vignettes tended toward the extreme in our samples.  I think the underlying claim is that students in our samples were so moralistic that any cleanliness priming effects could not have overpowered their pre-existing moral convictions.  This is what the ceiling effect argument translates to in real-world terms: the experiments could not have worked in Michigan because the samples tended to have a particular mindset.

It might be helpful to be a little more concrete about the distributions.  For many of the individual vignettes, the “Extremely Wrong” option was a common response.  Below is a summary of the six vignettes and some descriptive information about the control-condition data from Study 1 in both papers (ours and the original).  I think readers will have to judge for themselves what kinds of distributions to expect from samples of college students.  Depending on your level of self-righteousness, these results could be viewed positively or negatively.   Remember, we used their original materials.

  • Dog (53% versus 30%):  Morality of eating a pet dog that was just killed in a car accident.
  • Trolley (2% versus 5%):  Morality of killing one person in the classic trolley dilemma.
  • Wallet (44% versus 20%): Morality of keeping cash from a wallet found on the street.
  • Plane (43% versus 30%): Morality of killing an injured boy to save yourself and another person from starving after a plane crash.
  • Resume (29% versus 15%):  Morality of enhancing qualifications on a resume.
  • Kitten (56% versus 70%): Morality of using a kitten for sexual gratification.

Note: All comparisons are from the Control conditions for our replication Study 1 compared to Study 1 in Schnall et al. (2008).  Percentages reflect the proportion of the sample selecting the “extremely wrong” option (i.e., selecting the “9” on the original 0 to 9 scale).  For example, 53% of our participants thought it was extremely wrong for Frank to eat his dead dog for dinner whereas 30% of the participants in the original study provided that response.
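For completeness, here is a minimal sketch of how these ceiling percentages can be computed from the posted raw data. As above, the file name and column names are placeholders of my own.

# Proportion of control participants selecting "extremely wrong" (a "9" on the
# 0 to 9 scale) for each vignette. Names are placeholders.
import pandas as pd

df = pd.read_csv("study1_raw.csv")
control = df[df["condition"] == "control"]
vignettes = ["dog", "trolley", "wallet", "plane", "resume", "kitten"]

ceiling_rates = (control[vignettes] == 9).mean()
print(ceiling_rates.round(2))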

To recap, we did not find evidence for the predicted effects, and we basically concluded that more research was necessary.  Variable distributions are useful pieces of information, and non-parametric tests were consistent with the standard t-tests we used in the paper. Moreover, their kitten distribution was at least as extreme as ours, and yet they found the predicted result on this particular vignette in Study 1. Thus, I worry that any ceiling argument only applies when the results are counter to the original predictions.
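As an illustration of that kind of check, here is a minimal sketch comparing the two Study 1 conditions on the composite with both a standard t-test and a Mann-Whitney U test. The file name, condition labels, and column names are again placeholders rather than the actual variable names.

# Compare priming and control composites with a t-test and a non-parametric
# Mann-Whitney U test. Names are placeholders.
import pandas as pd
from scipy.stats import ttest_ind, mannwhitneyu

df = pd.read_csv("study1_raw.csv")
vignettes = ["dog", "trolley", "wallet", "plane", "resume", "kitten"]
control = df[df["condition"] == "control"][vignettes].mean(axis=1)
primed = df[df["condition"] == "clean_prime"][vignettes].mean(axis=1)

print(ttest_ind(primed, control))
print(mannwhitneyu(primed, control, alternative="two-sided"))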

One reading of our null results is that there are unknown moderators of the cleanliness priming effects. We have tested for some moderators (e.g., private body consciousness, political orientation) in our replication report and rejoinder, but there could be other possibilities. For example, sample characteristics can make it difficult to find the predicted cleanliness priming results with particular measures.  If researchers have a sample of excessively moralistic/judgmental students who think using kittens for sexual gratification is extremely wrong, then cleaning primes may not be terribly effective at modulating their views. Perhaps a different set of vignettes that are more morally ambiguous (say more in line with the classic trolley problem) will show the predicted effects.  This is something to be tested in future research.

The bottom line for me is that we followed through on our research proposal and we reported our results.  The raw data were posted.  We have no control over the distributions. At the very least, researchers might need to be cautious about using this particular measure in the future, based on our replication efforts. In short, the field may have learned something about how to test these ideas in the future.  In the end, I come full circle to the original conclusion in the December blog post: more research is needed.

Postscript

I am sure reactions to our work and the respective back-and-forth will break on partisan grounds.  The “everything is fine” crew will believe that Dr. Schnall demolished our work whereas the “replication is important” crew will think we raised good points.  This is all fine and good as it relates to the inside baseball and the sort of political theater that exists in our world.  However, I hope these pieces do not just create a bad taste in people’s mouths.  I feel bad that this single paper and exchange have diverted attention from the important example of reform undertaken by Lakens and Nosek.  They are helping to shape the broader narrative about how to do things differently in psychological science.

 

Quick Update on Timelines (23 May 2014)

David sent Dr. Schnall the paper we submitted to the editors on 28 October 2013 with a link to the raw materials. He wrote “I’ve attached the replication manuscript we submitted to Social Psychology based on our results to give you a heads up on what we found.”  He added: “If you have time, we feel it would be helpful to hear your opinions on our replication attempt, to shed some light on what kind of hidden moderators or other variables might be at play here.”

Dr. Schnall emailed back on 28 October 2013 asking for two weeks to review the material before we proceeded. David emailed back on 31 October 2013, apologizing for any miscommunication and explaining that we had already submitted the paper. He added that we were still interested in her thoughts.

That was the end of our exchanges. We learned about the ceiling effect concern when we received the commentary in early March of 2014.


Author: mbdonnellan

Professor, Social and Personality Psychology, Texas A&M University

27 thoughts on “Random Reflections on Ceiling Effects and Replication Studies”

  1. well, that seems quite reasonable. Looking at the percentages though, it may well be that the groups already significantly differ in their moral preferences. If they are indeed significantly different in their moral attitudes, then it’s likely it’s not a very close replication (i.e., you have a different type of sample).

    1. By the way – if the sample is indeed so different, then that is fascinating to know, of course! It may add to any model (whether the effect is true or not). It will help create boundary conditions for the effect (if it is true), and it adds information that in the old days would have been unknown!

    2. This is a reasonable explanation. A good first step would be to get more baseline data on these vignettes to see what the distributions look like in a number of samples from other universities. This would give us a broader frame of reference for interpreting the current studies. Researchers at other universities could test whether college student characteristics are moderators by running the sentence scrambling study, as it is the easier of the two studies to implement. We could also try to do the sentence scrambling priming study on mTurk, but some people don’t think priming on the internet is valid.

      1. Perhaps with much, much larger samples? (I could understand that MTurk simply introduces A LOT of noise.) That said, I also wonder about how effective the SST is. But a university more comparable to the original one would be great.

    3. I think this is an important point, especially in the context of whether there are hidden moderators of the cleanliness effect (e.g., population differences). But we had no a priori predictions about differences between our samples (nor did Dr. Schnall mention such suspicions when she reviewed our proposal). Even if it is the case that our samples are different, this information is useful for those planning to do research in this area, as it suggests that there are moderators that need to be ferreted out.

      1. entirely agreed. And an update of the model can be one function of replications.

      2. We have the task programmed in Qualtrics and we are happy to share it. We can even try an mTurk sample once we get some cash and amend the IRB.

      3. Also, I don’t think we should expect that original researchers know everything about the model. Being wrong about your model is something that all scientists experience in their lives. We usually are wrong. It’s a sign of progress. We should also try to be open to that end (let’s forgive each other for not having all predictions yet).

      4. This is – by the way – a good example of why the results should be reviewed too. This is both helpful for the replication authors and original authors, so that together we get to better models.

    4. in order to keep it in perspective – do not forget that those %s from original study are based on 20(!) participants. as everything in that study, these estimates are highly unreliable.

      1. This is the second forum where I’ve seen a post from you that sounds like a broad-brush, emotional attack on the original paper (“…as everything in that study [sic]… highly unreliable.”). I have no ax to grind here whatsoever–I’m a computer scientist who is fascinated by the replication controversy, but has no a priori opinion about the people, studies, events or methods being discussed. But from where I sit, your behavior seems to lend some credence to those who are complaining about bullying.

      2. (i’m a little late here, sorry)
        Gregg – ‘reliability’ is a technical term. almost by definition, a sample statistic that is based on very few observations is an unreliable estimate of the true (population) parameter. so saying that the results are unreliable is not an emotional attack but a scientific criticism. you can disagree (e.g., if you think the sample was not small), but it seems wrong to call this an emotional attack, much less bullying.
        some relevant thoughts:
        http://sometimesimwrong.typepad.com/wrong/2014/06/self-correction-hurts.html
        http://funderstorms.wordpress.com/2014/06/23/when-did-we-get-so-delicate/

  2. Also, I am very disturbed by this comment:

    “The “everything is fine” crew will believe that Dr. Schnall demolished our work whereas the “replication is important” crew will think we raised good points.”

    I really like that more replications are being done. But let’s not divide ourselves into camps. There are a lot of people in the “in-between” area who criticise bad replications but also see the merit of good replications. We know from the literature what categorization does to our perception of people…

      1. I agree, but as predicted, such has been the case. As a graduate student, I have been extremely disappointed by the discourse, or lack thereof, following the publication of the special issue, specifically the statements made by Dan Gilbert and JP de Ruiter in a back-and-forth on Twitter. It has been disheartening to watch psychologists whom I admire behave in such a way.

  3. I have a new policy on comments, made prior to posting this based on the discussion at Science: I will only approve comments from people using real names.

    1. I will amend this on a case-by-case basis. I just had a colleague point out that new people might have good ideas but need some protection. I would prefer to have everything out in the open, but I also value free speech. But I won’t tolerate nasty comments from anonymous sources!

  4. I’m surprised that no one above has yet mentioned the ideas Danny Kahneman has expressed about grace, politeness, collegiality, and collaborative efforts. I’m doing this from memory, so it’s possible the ones I have in mind might not have been published yet. As I recall, he has written about this issue twice: once pointing out the nature of the problem and subsequently speaking more to the issue of etiquette. Perhaps the latter has appeared only online; if so, I hope subsequent posts will cite the source (I’m not in a good position to look it up right now). Agree with Danny or not, it’s at least worthwhile to include his thoughts in the discussion. In my opinion, they are relevant when we consider not only distributive justice (our verdict on the presumed correctness of one set of results or another) but also the other elements many people have identified as concerns (evidenced by a vast amount of empirical literature in psychology that I assume is replicable!): interpersonal justice, informational justice, and procedural justice–see especially the work by scholars such as Tom Tyler, E. Allan Lind, and Bob Bies. The interpersonal realm pertains to treating people with respect and dignity, and I think at this point we ought to bend over backwards in our attempts to live up to those ideals. Issues of informational justice apparently were centered on transparency at the outset, which is all to the good. The aftermath, however, has suggested we might not yet have worked out all the necessary details. I think that overlaps with procedural justice (e.g., see criteria suggested by Gerald Leventhal), and those “how to” details might take longer than we realize to achieve something like a decent amount of consensus. Like beauty, after all, fairness is in the eye of the beholder!

  5. As too often the case, the essential aspect of this issue is sidestepped or unnoticed. What, pray tell, is the theory (and I do not mean stipulation of folk psychological likelihood) and conceptual apparatus that mediates and enables the presumed priming effect? What is priming? I submit there are a number of variants and that they do not all follow from a single (or dual) mechanism(s).

    I could go on at length, but I already have done so in print and will not here. I just want to make my point — psychology (generally, but not exclusively) is more concerned with demonstration-driven than theory-driven findings. And by “theory” I mean serious, conceptually sophisticated, and (possibly) parametrically predictive structures that both explain and predict (more than a simple effect present or absent).

    In a real sense, much of what passes as psychological “science” is little more than folk using the right techniques (cf. Feynman) to flash their demos, but lacking conceptual heft to qualify as nomologically meaningful explication of nature.

    This, not simply hand wringing about replication, will ultimately either sink psychology or force a Kuhnian shift. It seems we have a sort of neo-behaviorism in which stipulated magical mental entities are now allowed with apparent impunity.
