Dr. R’s Blog about Replicability

StickyFebruary 5, 2016Bayes-Factor, Default-Baysian-t-test, Median Observed Power, Meta-Analysis, Observed Power, Pcurve, Post-Hoc Power, Posteriori Power Analysis, Power, Psychology, Puniform, R-Index science R-Index4Science, Replicability, Replicability Ranking, Replicability Report, Yuan and MaxwellDr. R

+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

For generalization, psychologists must finally rely, as has been done in all the older sciences, on replication (Cohen, 1994).

+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

DEFINITION OF REPLICABILITY: In empirical studies with random error variance replicability refers to the probability of a study with a significant result to produce a significant result again in an exact replication study of the first study using the same sample size and significance criterion.

+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

Latest
(March 1, 2017)
2016 Replicability Rankings of 103 Psychology Journals

(February, 2, 2017)
Reconstruction of a Train Wreck: How Priming Research Went off the Rails

+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
REPLICABILITY REPORTS: Examining the replicability of research topics

RR No1. (April 19, 2016) Is ego-depletion a replicable effect?
RR No2. (May 21, 2016) Do mating primes have replicable effects on behavior?

+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

TOP TEN LIST

1. 2016 Replicability Rankings of 103 Psychology Journals
Rankings of 103 Psychology Journals according to the average replicability of a pulished significant result. Also includes detailed analysis of time trends in replicability from 2010 to 2016 and a comparison of psychological disciplines (cognitive, clinical, social, developmental, personality, biological, applied).

2. Z-Curve: Estimating replicability for sets of studies with heterogeneous power (e.g., Journals, Departments, Labs)
This post presented the first replicability ranking and explains the methodology that is used to estimate the typical power of a significant result published in a journal. The post provides an explanation of the new method to estimate observed power based on the distribution of test statistics converted into absolute z-scores. The method has been developed further to estimate power for a wider range of z-scores by developing a model that allows for heterogeneity in power across tests. A description of the new method will be published when extensive simulation studies are completed.

3. Replicability-Rankings of Psychology Departments
This blog presents rankings of psychology departments on the basis of the replicability of significant results published in 105 psychology journals (see the journal rankings for a list of journals). Reported success rates in psychology journals are over 90%, but this percentage is inflated by selective reporting of significant results. After correcting for selection bias, replicability is 60%, but there is reliable variation across departments.

4. An Introduction to the R-Index
The R-Index can be used to predict whether a set of published results will replicate in a set of exact replication studies. It combines information about the observed power of the original studies with information about the amount of inflation in observed power due to publication bias (R-Index = Observed Median Power – Inflation). The R-Index has predicted the outcome of actual replication studies.

5. The Test of Insufficient Variance (TIVA)
The Test of Insufficient Variance is the most powerful test of publication bias and/or dishonest reporting practices. It can be used even if only two independent statistical results are available, although power to detect bias increases with the number of studies. After converting test results into z-scores, z-scores are expected to have a variance of one. Unless power is very high, some of these z-scores will not be statistically significant (z .05 two-tailed). If these non-significant results are missing, the variance shrinks, and TIVA detects that the variance is insufficient. The observed variance is compared against the expected variance of 1 with a left-tailed chi-square test. The usefulness of TIVA is illustrated with Bem’s (2011) “Feeling the Future” data.

6. Validation of Meta-Analysis of Observed (post-hoc) Power
This post examines the ability of various estimation methods to estimate power of a set of studies based on the reported test statistics in these studies. The results show that most estimation methods work well when all studies have the same effect size (homogeneous) or if effect sizes are heterogeneous and symmetrically distributed (heterogeneous). However, most methods fail when effect sizes are heterogeneous and have a skewed distribution. The post does not yet include the more recent method that uses the distribution of z-scores (powergraphs) to estimate observe power because this method was developed after this blog was posted.

7. Roy Baumeister’s R-Index
Roy Baumeister was a reviewer of my 2012 article that introduced the Incredibiliy Index to detect publication bias and dishonest reporting practices. In his review and in a subsequent email exchange, Roy Baumeister admitted that his published article excluded studies that failed to produce results in support of his theory that blood-glucose is important for self-regulation (a theory that is now generally considered to be false), although he disagrees that excluding these studies was dishonest. The R-Index builds on the incredibility index and provides an index of the strength of evidence that corrects for the influence of dishonest reporting practices. This post reports the R-Index for Roy Baumeister’s most cited articles. The R-Index is low and does not justify the nearly perfect support for empirical predictions in these articles. At the same time, the R-Index is similar to R-Indices for other sets of studies in social psychology. This suggests that dishonest reporting practices are the norm in social psychology and that published articles exaggerate the strength of evidence in support of social psychological theories.

8. How robust are Stereotype-Threat Effects on Women’s Math Performance?
Stereotype-threat has been used by social psychologists to explain gender differences in math performance. Accordingly, the stereotype that men are better at math than women is threatening to women and threat leads to lower performance. This theory has produced a large number of studies, but a recent meta-analysis showed that the literature suffers from publication bias and dishonest reporting. After correcting for these effects, the stereotype-threat effect was negligible. This blog post shows a low R-Index for the first article that appeared to provide strong support for stereotype-threat. These results show that the R-Index can warn readers and researchers that reported results are too good to be true.

9. The R-Index for 18 Multiple-Study Psychology Articles in the Journal SCIENCE.
Francis (2014) demonstrated that nearly all multiple-study articles by psychology researchers that were published in the prestigious journal SCIENCE showed evidence of dishonest reporting practices (disconfirmatory evidence was missing). Francis (2014) used a method similar to the incredibility index. One problem of this method is that the result is a probability that is influenced by the amount of bias and the number of results that were available for analysis. As a result, an article with 9 studies and moderate bias is treated the same as an article with 4 studies and a lot of bias. The R-Index avoids this problem by focusing on the amount of bias (inflation) and the strength of evidence. This blog post shows the R-Index of the 18 studies and reveals that many articles have a low R-Index.

10. The Problem with Bayesian Null-Hypothesis Testing
Some Bayesian statisticians have proposed Bayes-Factors to provide evidence for a Null-Hypothesis (i.e., there is no effect). They used Bem’s (2011) “Feeling the Future” data to argue that Bayes-Factors would have demonstrated that extra-sensory perception does not exist. This blog post shows that Bayes-Factors depend on the specification of the alternative hypothesis and that support for the null-hypothesis is often obtained by choosing an unrealistic alternative hypothesis (e.g., there is a 25% probability that effect size is greater than one standard deviation, d > 1). As a result, Bayes-Factors can favor the null-hypothesis when there is an effect, but the effect size is small (d = .2). A Bayes-Factor in favor of the null is more appropriately interpreted as evidence that the alternative hypothesis needs to decrease the probabilities assigned to large effect sizes. The post also shows that Bayes-Factors based on a meta-analysis of Bem’s data provide misleading evidence that an effect is present because Bayesian statistics do not take publication bias and dishonest reporting practices into account.

How Replicable are Focal Hypothesis Tests in the Journal Psychological Science?

May 15, 2017Psychological Science, Replicability, Replicability-Ranking, Replicability-Report, ReplicationDr. R

Over the past five years, psychological science has been in a crisis of confidence. For decades, psychologists have assumed that published significant results provide strong evidence for theoretically derived predictions, especially when authors presented multiple studies with internal replications within a single article (Schimmack, 2012). However, even multiple significant results provide little empirical evidence, when journals only publish significant results (Sterling, 1959; Sterling et al., 1995). When published results are selected for significance, statistical significance loses its ability to distinguish replicable effects from results that are difficult to replicate or results that are type-I errors (i.e., the theoretical prediction was false).

The crisis of confidence led to several initiatives to conduct independent replications. The most informative replication initiative was conducted by the Open Science Collaborative (Science, 2015). It replicated close to 100 significant results published in three high-ranked psychology journals. Only 36% of the replication studies replicated a statistically significant result. The replication success rate varied by journal. The journal “Psychological Science” achieved a success rate of 42%.

The low success rate raises concerns about the empirical foundations of psychology as a science. Without further information, a success rate of 42% implies that it is unclear which published results provide credible evidence for a theory and which findings may not replicate. It is impossible to conduct actual replication studies for all published studies. Thus, it is highly desirable to identify replicable findings in the existing literature.

One solution is to estimate replicability for sets of studies based on the published test statistics (e.g., F-statistic, t-values, etc.). Schimmack and Brunner (2016) developed a statistical method, Powergraphs, that estimates the average replicability of a set of significant results. This method has been used to estimate replicability of psychology journals using automatic extraction of test statistics (2016 Replicability Rankings, Schimmack, 2017). The results for Psychological Science produced estimates in the range from 55% to 63% for the years 2010-2016 with an average of 59%. This is notably higher than the success rate for the actual replication studies, which only produced 42% successful replications.

There are two explanations for this discrepancy. First, actual replication studies are not exact replication studies and differences between the original and the replication studies may explain some replication failures. Second, the automatic extraction method may overestimate replicability because it may include non-focal statistical tests. For example, significance tests of manipulation checks can be highly replicable, but do not speak to the replicability of theoretically important predictions.

To address the concern about automatic extraction of test statistics, I estimated replicability of focal hypothesis tests in Psychological Science with hand-coded, focal hypothesis tests. I used three independent data sets.

Study 1

For Study 1, I hand-coded focal hypothesis tests of all studies in the 2008 Psychological Science articles that were used for the OSC reproducibility project (Science, 2015).

The powergraphs show the well-known effect of publication bias in that most published focal hypothesis tests report a significant result (p < .05, two-tailed, z > 1.96) or at least a marginally significant result (p < .10, two-tailed or p < .05, one-tailed, z > 1.65). Powergraphs estimate the average power of studies with significant results on the basis of the density distribution of significant z-scores. Average power is an estimate of replicabilty for a set of exact replication studies. The left graph uses all significant results. The right graph uses only z-scores greater than 2.4 because questionable research practices may produce many just-significant results and lead to biased estimates of replicability. However, both estimation methods produce similar estimates of replicability (57% & 61%). Given the small number of statistics the 95%CI is relatively wide (left graph: 44% to 73%). These results are compatible with the low actual success rate for actual replication studies (42%) and the estimate based on automated extraction (59%).

Study 2

The second dataset was provided by Motyl et al. (JPSP, in press), who coded a large number of articles from social psychology journals and psychological science. Importantly, they coded a representative sample of Psychological Science studies from the years 2003, 2004, 2013, and 2014. That is, they did not only code social psychology articles published in Psychological Science. The dataset included 281 test statistics from Psychological Science.

The powergraph looks similar to the powergraph in Study 1. More important, the replicability estimates are also similar (57% & 52%). The 95%CI for Study 1 (44% to 73%) and Study 2 (left graph: 49% to 65%) overlap considerably. Thus, two independent coding schemes and different sets of studies (2008 vs. 2003-2004/2013/2014) produce very similar results.

Study 3

Study 3 was carried out in collaboration with Sivaani Sivaselvachandran, who hand-coded articles from Psychological Science published in 2016. The replicability rankings showed a slight positive trend based on automatically extracted test statistics. The goal of this study was to examine whether hand-coding would also show an increase in replicability. An increase was expected based on an editorial by D. Stephen Linday, incoming editor in 2015, who aimed to increase replicability of results published in Psychological Science by introducing badges for open data and preregistered hypotheses. However, the results failed to show a notable increase in average replicability.

The replicability estimate was similar to those in the first two studies (59% & 59%). The 95%CI ranged from 49% to 70%. These wide confidence intervals make it difficult to notice small improvements, but the histogram shows that just significant results (z = 2 to 2.2) are still the most prevalent results reported in Psychological Science and that non-significant results that are to be expected are not reported.

Combined Analysis

Given the similar results in all three studies, it made sense to pool the data to obtain the most precise estimate of replicability of results published in Psychological Science. With 479 significant test statistics, replicability was estimated at 58% with a 95%CI ranging from 51% to 64%. This result is in line with the estimated based on automated extraction of test statistics (59%). The reason for the close match between hand-coded and automated results could be that Psych Science publishes short articles and authors may report mostly focal results because space does not allow for extensive reporting of other statistics. The hand-coded data confirm that replicabilty in Psychological Science is likely to be above 50%.

It is important to realize that the 58% estimate is an average. Powergraphs also show average replicability for segments of z-scores. Here we see that replicabilty for just-significant results (z < 2.5 ~ p > .01) is only 35%. Even for z-score between 2.5 and 3.0 (~ p > .001) is only 47%. Once z-scores are greater than 3, average replicabilty is above 50% and with z-scores greater than 4, replicability is greater than 80%. For any single study, p-values can vary greatly due to sampling error, but in general a published result with a p-value < .001 is much more likely to replicate than a p-value > .01 (see also OSC, Science, 2015).

Conclusion

This blog-post used hand-coding of test-statistics published in Psychological Science, the flagship journal of the Association for Psychological Science, to estimate replicabilty of published results. Three dataset produced convergent evidence that the average replicabilty of exact replication studies is 58% +/- 7%. This result is consistent with estimates based on automatic extraction of test statistics. It is considerably higher than the success rate of actual replication studies in the OSC reproducibility project (42%). One possible reason for this discrepancy is that actual replication studies are never exact replication studies, which makes it more difficult to obtain statistical significance if the original studies are selected for significance. For example, the original study may have had an outlier in the experimental group that helped to produce a significant result. Not removing this outlier is not considered a questionable research practice, but an exact replication study will not reproduce the same outlier and may fail to reproduce a just-significant result. More broadly, any deviation from the assumptions underlying the computation of test statistics will increase the bias that is introduced by selecting significant results. Thus, the 58% estimate is an optimistic estimate of the maximum replicability under ideal conditions.

At the same time, it is important to point out that 58% replicability for Psychological Science does not mean psychological science is rotten to the core (Motyl et al., in press) or that most reported results are false (Ioannidis, 2005). Even results that did not replicate in actual replication studies are not necessarily false positive results. It is possible that more powerful studies would produce a significant result, but with a smaller effect size estimate.

Hopefully, these analyses will spur further efforts to increase replicability of published results in Psychological Science and in other journals. We are already near the middle of 2017 and can look forward to the 2017 results.

How replicable are statistically significant results in social psychology? A replication and extension of Motyl et al. (in press).

May 4, 2017Power, r-index, Replicability, Social Psychology, Statistical PowerDr. R

Forthcoming article:
Motyl, M., Demos, A. P., Carsel, T. S., Hanson, B. E., Melton, Z. J., Mueller, A. B., Prims, J., Sun, J., Washburn, A. N., Wong, K., Yantis, C. A., & Skitka, L. J. (in press). The state of social and personality science: Rotten to the core, not so bad, getting better, or getting worse? Journal of Personality and Social Psychology. (preprint)

Brief Introduction

Since JPSP published incredbile evidence for mental time travel (Bem, 2011), the credibility of social psychological research has been questioned. There is talk of a crisis of confidence, a replication crisis, or a credibility crisis. However, hard data on the credibility of empirical findings published in social psychology journals are scarce.

There have been two approaches to examine the credibility of social psychology. One approach relies on replication studies. Authors attempt to replicate original studies as closely as possible. The most ambitious replication project was carried out by the Open Science Collaboration (Science, 2015) that replicated 1 study from 100 articles; 54 articles were classified as social psychology. For original articles that reported a significant result, only a quarter replicated a significant result in the replication studies. This estimate of replicability suggests that researches conduct many more studies than are published and that effect sizes in published articles are inflated by sampling error, which makes them difficult to replicate. One concern about the OSC results is that replicating original studies can be difficult. For example, a bilingual study in California may not produce the same results as a bilingual study in Canada. It is therefore possible that the poor outcome is partially due to problems of reproducing the exact conditions of original studies.

A second approach is to estimate replicability of published results using statistical methods. The advantage of this approach is that replicabiliy estimates are predictions for exact replication studies of the original studies because the original studies provide the data for the replicability estimates. This is the approach used by Motyl et al.

The authors sampled 30% of articles published in 2003-2004 (pre-crisis) and 2013-2014 (post-crisis) from four major social psychology journals (JPSP, PSPB, JESP, and PS). For each study, coders identified one focal hypothesis and recorded the statistical result. The bulk of the statistics were t-values from t-tests or regression analyses and F-tests from ANOVAs. Only 19 statistics were z-tests. The authors applied various statistical tests to the data that test for the presence of publication bias or whether the studies have evidential value (i.e., reject the null-hypothesis that all published results are false positives). For the purpose of estimating replicability, the most important statistic is the R-Index.

The R-Index has two components. First, it uses the median observed power of studies as an estimate of replicability (i.e., the percentage of studies that should produce a significant result if all studies were replicated exactly). Second, it computes the percentage of studies with a significant result. In an unbiased set of studies, median observed power and percentage of significant results should match. Publication bias and questionable research practices will produce more significant results than predicted by median observed power. The discrepancy is called the inflation rate. The R-Index subtracts the inflation rate from median observed power because median observed power is an inflated estimate of replicability when bias is present. The R-Index is not a replicability estimate. That is, an R-Index of 30% does not mean that 30% of studies will produce a significant result. However, a set of studies with an R-Index of 30 will have fewer successful replications than a set of studies with an R-Index of 80. An exception is an R-Index of 50, which is equivalent with a replicability estimate of 50%. If the R-Index is below 50, one would expect more replication failures than successes.

Motyl et al. computed the R-Index separately for the 2003/2004 and the 2013/2014 results and found “the R-index decreased numerically, but not statistically over time, from .62 [CI95% = .54, .68] in 2003-2004 to .52 [CI95% = .47, .56] in 2013-2014. This metric suggests that the field is not getting better and that it may consistently be rotten to the core.”

I think this interpretation of the R-Index results is too harsh. I consider an R-Index below 50 an F (fail). An R-Index in the 50s is a D, and an R-Index in the 60s is a C. An R-Index greater than 80 is considered an A. So, clearly there is a replication crisis, but social psychology is not rotten to the core.

The R-Index is a simple tool, but it is not designed to estimate replicability. Jerry Brunner and I developed a method that can estimate replicability, called z-curve. All test-statistics are converted into absolute z-scores and a kernel density distribution is fitted to the histogram of z-scores. Then a mixture model of normal distributions is fitted to the density distribution and the means of the normal distributions are converted into power values. The weights of the components are used to compute the weighted average power. When this method is applied only to significant results, the weighted average power is the replicability estimate; that is, the percentage of significant results that one would expect if the set of significant studies were replicated exactly. Motyl et al. did not have access to this statistical tool. They kindly shared their data and I was able to estimate replicability with z-curve. For this analysis, I used all t-tests, F-tests, and z-tests (k = 1,163). The Figure shows two results. The left figure uses all z-scores greater than 2 for estimation (all values on the right side of the vertical blue line). The right figure uses only z-scores greater than 2.4. The reason is that just-significant results may be compromised by questionable research methods that may bias estimates.

The key finding is the replicability estimate. Both estimations produce similar results (48% vs. 49%). Even with over 1,000 observations there is uncertainty in these estimates and the 95%CI can range from 45 to 54% using all significant results. Based on this finding, it is predicted that about half of these results would produce a significant result again in a replication study.

However, it is important to note that there is considerable heterogeneity in replicability across studies. As z-scores increase, the strength of evidence becomes stronger, and results are more likely to replicate. This is shown with average power estimates for bands of z-scores at the bottom of the figure. In the left figure, z-scores between 2 and 2.5 (~ .01 < p < .05) have only a replicability of 31%, and even z-scores between 2.5 and 3 have a replicability below 50%. It requires z-scores greater than 4 to reach a replicability of 80% or more. Similar results are obtained for actual replication studies in the OSC reproducibilty project. Thus, researchers should take the strength of evidence of a particular study into account. Studies with p-values in the .01 to .05 range are unlikely to replicate without boosting sample sizes. Studies with p-values less than .001 are likely to replicate even with the same sample size.

Independent Replication Study

Schimmack and Brunner (2016) applied z-curve to the original studies in the OSC reproducibility project. For this purpose, I coded all studies in the OSC reproducibility project. The actual replication project often picked one study from articles with multiple studies. 54 social psychology articles reported 173 studies. The focal hypothesis test of each study was used to compute absolute z-scores that were analyzed with z-curve.

The two estimation methods (using z > 2.0 or z > 2.4) produced very similar replicability estimates (53% vs. 52%). The estimates are only slightly higher than those for Motyl et al.’s data (48% & 49%) and the confidence intervals overlap. Thus, this independent replication study closely replicates the estimates obtained with Motyl et al.’s data.

Automated Extraction Estimates

Hand-coding of focal hypothesis tests is labor intensive and subject to coding biases. Often studies report more than one hypothesis test and it is not trivial to pick one of the tests for further analysis. An alternative approach is to automatically extract all test statistics from articles. This makes it also possible to base estimates on a much larger sample of test results. The downside of automated extraction is that articles also report statistical analysis for trivial or non-critical tests (e.g., manipulation checks). The extraction of non-significant results is irrelevant because they are not used by z-curve to estimate replicability. I have reported the results of this method for various social psychology journals covering the years from 2010 to 2016 and posted powergraphs for all journals and years (2016 Replicability Rankings). Further analyses replicated the results from the OSC reproducibility project that results published in cognitive journals are more replicable than those published in social journals. The Figure below shows that the average replicability estimate for social psychology is 61%, with an encouraging trend in 2016. This estimate is about 10% above the estimates based on hand-coded focal hypothesis tests in the two datasets above. This discrepancy can be due to the inclusion of less original and trivial statistical tests in the automated analysis. However, a 10% difference is not a dramatic difference. Neither 50% nor 60% replicability justify claims that social psychology is rotten to the core, nor do they meet the expectation that researchers should plan studies with 80% power to detect a predicted effect.

Moderator Analyses

Motyl et al. (in press) did extensive coding of the studies. This makes it possible to examine potential moderators (predictors) of higher or lower replicability. As noted earlier, the strength of evidence is an important predictor. Studies with higher z-scores (smaller p-values) are, on average, more replicable. The strength of evidence is a direct function of statistical power. Thus, studies with larger population effect sizes and smaller sampling error are more likely to replicate.

It is well known that larger samples have less sampling error. Not surprisingly, there is a correlation between sample size and the absolute z-scores (r = .3). I also examined the R-Index for different ranges of sample sizes. The R-Index was the lowest for sample sizes between N = 40 and 80 (R-Index = 43), increased for N = 80 to 200 (R-Index = 52) and further for sample sizes between 200 and 1,000 (R-Index = 69). Interestingly, the R-Index for small samples with N < 40 was 70. This is explained by the fact that research designs also influence replicability and that small samples often use more powerful within-subject designs.

A moderator analysis with design as moderator confirms this. The R-Indices for between-subject designs is the lowest (R-Index = 48) followed by mixed designs (R-Index = 61) and then within-subject designs (R-Index = 75). This pattern is also found in the OSC reproducibility project and partially accounts for the higher replicability of cognitive studies, which often employ within-subject designs.

Another possibility is that articles with more studies package smaller and less replicable studies. However, number of studies in an article was not a notable moderator: 1 study R-Index = 53, 2 studies R-Index = 51, 3 studies R-Index = 60, 4 studies R-Index = 52, 5 studies R-Index = 53.

Conclusion

Motyl et al. (in press) coded a large and representative sample of results published in social psychology journals. Their article complements results from the OSC reproducibility project that used actual replications, but a much smaller number of studies. The two approaches produce different results. Actual replication studies produced only 25% successful replications. Statistical estimates of replicability are around 50%. Due to the small number of actual replications in the OSC reproducibility project, it is important to be cautious in interpreting the differences. However, one plausible explanation for lower success rates in actual replication studies is that it is practically impossible to redo a study exactly. This may even be true when researchers conduct three similar studies in their own lab and only one of these studies produces a significant result. Some non-random, but also not reproducible, factor may have helped to produce a significant result in this study. Statistical models assume that we can redo a study exactly and may therefore overestimate the success rate for actual replication studies. Thus, the 50% estimate is an optimistic estimate for the unlikely scenario that a study can be replicated exactly. This means that even though optimists may see the 50% estimate as “the glass half full,” social psychologists need to increase statistical power and pay more attention to the strength of evidence of published results to build a robust and credible science of social behavior.

Hidden Figures: Replication Failures in the Stereotype Threat Literature

April 7, 2017Publication Bias, r-index, Replicability, Replication, Research Integrity, Statistical Power, TIVAStereotype-Threat, Test of Insufficient Variance, TIVADr. R

In the past five years, it has become apparent that many classic and important findings in social psychology fail to replicate (Schimmack, 2016). The replication crisis is often considered a new phenomenon, but failed replications are not entirely new. Sometimes these studies have simply been ignored. These studies deserve more attention and need to be reevaluated in the context of the replication crisis in social psychology.

In the past, failed replications were often dismissed because seminal articles were assumed to provide robust empirical support for a phenomenon, especially if an article presented multiple studies. The chance of reporting a false positive results in a multiple study article is low because the risk of a false positive decreases exponentially (Schimmack, 2012). However, the low risk of a false positive is illusory if authors only publish studies that worked. In this case, even false positives can be supported by significant results in multiple studies, as demonstrated in the infamous ESP study by Bem (2011). As a result, publication bias undermines the reporting of statistical significance as diagnostic information about the risk of false positives (Sterling, 1959) and many important theories in social psychology rest on shaky empirical foundations that need to be reexamined.

Research on stereotype threat and women’s performance on math tests is one example where publication bias undermines the findings in a seminal study that produced a large literature of studies on gender differences in math performance. After correcting for publication bias, this literature shows very little evidence that stereotype threat has a notable and practically significant effect on women’s math performance (Flore & Wicherts, 2014).

Another important line of research has examined the contribution of stereotype threat to differences between racial groups on academic performance tests. This blog post examines the strength of the empirical evidence for stereotype threat effects in the seminal article by Steele and Aronson (1995). This article is currently the 12th most cited article in the top journal for social psychology, Journal of Personality and Social Psychology (2,278 citations so far).

According to the abstract, “stereotype threat is being at risk of confirming, as self-characteristic, a negative stereotype about one’s group.” Studies 1 and 2 showed that “reflecting the pressure of this vulnerability, Blacks underperformed in relation to Whites in the ability-diagnostic condition but not in the nondiagnostic condition (with Scholastic Aptitude Tests controlled).” “Study 3 validated that ability-diagnosticity cognitively activated the racial stereotype in these participants and motivated them not to conform to it, or to be judged by it.” “Study 4 showed that mere salience of the stereotype could impair Blacks’ performance even when the test was not
ability diagnostic.”

The results of Study 4 motivated Stricker and colleagues to examine the influence of stereotype-treat on test performance in a real-world testing situation. These studies had large samples and were not limited to students at Stanford. One study was reported in a College Board Report (Stricker and Ward, 1998). Another two studies were published in the Journal of Applied Social Psychology (Stricker & Ward, 2004). This article received only 52 citations, although it reported two studies with an experimental manipulation of stereotype threat in a real assessment context. One group of participants were asked about their gender or ethnicity before the text, the other group did not receive these questions. As noted in the abstract, neither the inquiry about race, nor about gender, had a significant effect on test performance. In short, this study failed to replicate Study 4 of the classic and widely cited article by Steele and Aronson.

Stricker and Ward’s Abstract
Steele and Aronson (1995) found that the performance of Black research participants on
ability test items portrayed as a problem-solving task, in laboratory experiments, was affected adversely when they were asked about their ethnicity. This outcome was attributed to stereotype threat: Performance was disrupted by participants’ concerns about fulfilling the negative stereotype concerning Black people’s intellectual ability. The present field experiments extended that research to other ethnic groups and to males and females taking operational tests. The experiments evaluated the effects of inquiring about ethnicity and gender on the performance of students taking 2 standardized tests-the Advanced Placement Calculus AB Examination, and the Computerized Placement Tests-in actual test administrations. This inquiry did not have any effects on the test performance of Black, female, or other subgroups of students that were both statistically and practically significant.

The article also mentions a personal communication with Steele, in which Steele mentions an unpublished study that also failed to demonstrate the effect under similar conditions.

“In fact, Steele found in an unpublished pilot study that inquiring about ethnicity did not affect Black participants’ performance when the task was described as diagnostic of their ability (C. M. Steele, personal communication, May 2 1, 1997), in contrast to the
substantial effect of inquiring when the task was described as nondiagnostic.”

A substantive interpretation of this finding is that inquires about race or gender do not produce stereotype threat effects when a test is diagnostic because a diagnostic test already activates stereotype threat. However, if this were a real moderator, it would be important to document this fact and it is not clear why this finding obtained in an earlier study by Steele remained unpublished. Moreover, it is premature to interpret the significant result in the published study with a non-diagnostic task and the non-significant result in an unpublished study with a diagnostic task as evidence that diagnosticity moderates the effect of the stereotype-threat manipulation. A proper test of this moderator hypothesis would require the demonstration of a three-way interaction between race, inquiry about race, and diagnosticity. Absent this evidence, it remains possible that diagnosticity is not a moderator and that the published result is a false positive (or a positive result with an inflated effect size estimate). In contrast, there appears to be consistent evidence that inquiries about race or gender before a real assessment of academic performance does not influence performance. This finding is not widely publicized, but is important for a better understanding of performance differences in real world settings.

The best way to examine the replicability of Steele and Aronson’s seminal finding with non-diagnostic tasks would be to conduct an exact replication study. However, exact replication studies are difficult and costly. An alternative is to examine the robustness of the published results by taking a closer look at the strength of the statistical results reported by Steele and Aronson, using modern statistical tests of publication bias and statistical power like the R-Index (Schimmack, 2014) and the Test of Insufficient Variance (TIVA, Schimmack, 2014).

Replicability Analysis of Steele and Aronson’s four studies

Study 1. The first study had a relatively large sample of N = 114 participants, but it is not clear how many of the participants were White or Black. The study also had a 2 x 3 design, which leaves less than 20 participants per condition. The study produced a significant main effect of condition, F(2, 107) = 4.74, and race, F(1,107) = 5.22, but the critical condition x race interaction was not significant (reported as p > .19). However, a specific contrast showed significant differences between Black participants in the diagnostic condition and the non-diagnostic condition, t(107) = 2.88, p = .005, z = 2.82. The authors concluded “in sum, then, the hypothesis was supported by the pattern of contrasts, but when tested over the whole design, reached only marginal significance” (p. 800). In other words, Study 1 provided only weak support for the stereotype threat hypothesis.

Study 2. Study 2 eliminated one of the three experimental conditions. Participants were 20 Black and 20 White participants. This means there were only 10 participants in each condition of a 2 x 2 design. The degrees of freedom further indicate that the actual sample size was only 38 participants. Given the weak evidence in Study 1, there is no justification for a reduction in the number of participants per cell, although the difficulty of recruiting Black participants at Stanford may explain this inadequate sample size. Nevertheless, the study showed a significant interaction between race and test description, F(1,35) = 8.07, p = .007. The study also replicated the contrast from Study 1 that Black participants in the diagnostic condition performed significantly worse than Black participants in the non-diagnostic group, t(35) = 2.38, p = .023, z = 2.28.

Studies 1 and 2 are close replications of each other. The consistent finding across the two studies that supports stereotype-treat theory is the finding that merely changing the description of an assessment task changes Black participants performance, as revealed by significant differences between the diagnostic and non-diagnostic condition in both studies. The problem is that both studies had small numbers of Black participants and that small samples have low power to produce significant results. As a result, it is unlikely that a pair of studies would produce significant results in both studies.

Observed power in the two studies is .81 and .62 with median observed power of .71. Thus, the actual success rate of 100% (2 out of 2 significant results) is 29 percentage points higher than the expected success rate. Moreover, when inflation is evident, median observed power is also inflated. To correct for this inflation, the Replicability-Index (R-Index) subtracts inflation from median observed power, which yields an R-Index of 42. Any value below 50 is considered unacceptably low and I give it a letter grade F, just like students at American Universities receive an F for exams with less than 50% correct answers. This does not mean that stereotype threat is not a valid theory or that there was no real effect in this pair of studies. It simply means that the evidence in this highly cited article is insufficient to make strong claims about the causes of Black’s performance on academic tests.

The Test of Insufficient Variance (TIVA) provides another way to examine published results. Test statistics like t-values vary considerably from study to study even if the exact same study is conducted twice (or if one larger sample is randomly split into two sub-samples). When test-statistics are converted into z-scores, sampling error (the random variability from sample to sample) follows approximately a standard normal distribution with a variance of 1. If the variance is considerably smaller than 1, it suggests that the reported results represent a selected sample. Often the selection is a result of publication bias. Applying TIVA to the pair of studies, yields a variance of Var(z) = 0.15. As there are only two studies, it is possible that this outcome occurred by chance, p = .300, and it does not imply intentional selection for significance or other questionable research practices. Nevertheless, it suggests that future replication studies will be more variable and produce some non-significant results.

In conclusion, the evidence presented in the first two studies is weaker than we might assume if we focused only on the fact that both studies produced significant contrasts. Given publication bias, the fact that both studies reported significant results provides no empirical evidence because virtually all published studies report significant results. The R-Index quantifies the strength of evidence for an effect while taking the influence of publication bias into account and it shows that the two studies with small samples provide only weak evidence for an effect.

Study 3. This study did not examine performance. The aim was to demonstrate activation of stereotype threat with a sentence completion task. The sample size of 68 participants (35 Black, 33 White) implied that only 11 or 12 participants were assigned to one of the six cells in a 2 (race) by 3 (task description) design. The study produced main effects for race and condition, but most importantly it produced a significant interaction effect, F(2,61) = 3.30, p = .044. In addition, Black participants in the diagnostic condition had more stereotype-related associations than Black participants in the non-diagnostic condition, t(61) = 3.53,

Study 4. This study used inquiry about race to induce stereotype-threat. Importantly, the task was described as non-diagnostic (as noted earlier, a similar study produced no significant results when the task was described as diagnostic). The design was a 2 x 2 design with 47 participants, which means only 11 or 12 participants were allocated to the four conditions. The degrees of freedom indicated that cell frequencies were even lower. The study produced a significant interaction effect, F(1,39) = 7.82, p = .008. The study also produced a significant contrast between Blacks in the race-prime condition and the no-prime condition, t(39) = 2.43, p = .020.

The contrast effect in Study 3 is strong, but it is not a performance measure. If stereotype threat mediates the effect of task characteristics and performance, we would expect a stronger effect on the measure of the mediator than on the actual outcome of interest, task performance. The key aim of stereotype threat theory is to explain differences in performance. With a focus on performance outcomes, it is possible to examine the R-Index and TIVA of Studies 1, 2, and 4. All three studies reported significant contrasts between Black students randomly assigned to two groups that were expected to show performance differences (Table 1).

Table 1

Study	Test Statistic	p-value	z-score	obs.pow
Study 1	t(107) = 2.88	0.005	2.82	0.81
Study 2	t(35)=2.38	0.023	2.28	0.62
Study 4	t(39) = 2.43	0.020	2.33	0.64

Median observed power is 64 and the R-Index is well below 50, 64 – 36 = 28 (F). The variance in z-scores is Var(z) = 0.09, p = .086. These results cast doubt about the replicability of the performance effects reported in Steele and Aronson’s seminal stereotype threat article.

Conclusion

Racial stereotypes and racial disparities are an important social issue. Social psychology aims and promises to contribute to the understanding of this issue by conducting objective, scientific studies that can inform our understanding of these issues. In order to live up to these expectations, social psychology has to follow the rules of science and listen to the data. Just like it is important to get the numbers right to send men and women into space (and bring them back), it is important to get the numbers right when we use science to understand women and men on earth. Unfortunately, social psychologists have not followed the examples of astronomers and the numbers do not add up.

The three African American women, features in this years movie “Hidden Figures”***, Katherine Johnson, Dorothy Vaughan, and Mary Jackson might not approve of the casual way social psychologists use numbers in their research, especially the wide-spread practice of hiding numbers that do not match expectations. No science that wants to make a real-world contribution can condone this practice. It is also not acceptable to simply ignore published results from well-conducted studies with large samples that challenge a prominent theory.

Surely, the movie Hidden Figures dramatized some of the experiences of Black women at NASA, but there is little doubt that Katherine Johnson, Dorothy Vaughan, and Mary Jackson encountered many obstacles that might be considered stereotype threatening situations. Yet, they prevailed and they paved the way for future generations of stereotyped groups. Understanding racial and gender bias and performance differences remains an important issue and that is the reason why it is important to shed a light on hidden numbers and put simplistic theories under the microscope. Stereotype threat is too often used as a simple explanation that avoids tackling deeper and more difficult issues that cannot be easily studied in a quick laboratory experiment with undergraduate students at top research universities. It is time for social psychologists to live up to its promises by tackling real world issues with research designs that have real world significance that produce real evidence using open and transparent research practices.

————————————————————————————————————————————

*** If you haven’t seen the movie, I highly recommend it.

Personalized Adjustment of p-values for publication bias

March 13, 2017UncategorizedDr. R

The logic of null-hypothesis significance testing is straightforward (Schimmack, 2017). The observed signal in a study is compared against the noise in the data due to sampling variation. This signal to noise ratio is used to compute a probability; p-value. If this p-value is below a threshold, typically p < .05, it is assumed that the observed signal is not just noise and the null-hypothesis is rejected in favor of the hypothesis that the observed signal reflects a true effect.

NHST aims to keep the probability of a false positive discovery at a desirable rate. With p < .05, no more than 5% of ALL statistical tests can be false positives. In other words, the long-run rate of false positive discoveries cannot exceed 5%.

The problem with the application of NHST in practice is that not all statistical results are reported. As a result, the rate of false positive discoveries can be much higher than 5% (Sterling, 1959; Sterling et al., 1995) and statistical significance no longer provides meaningful information about the probability of false positive results.

In order to produce meaningful statistical results it would be necessary to know how many statistical tests were actually performed to produce published significant results. This set of studies includes studies with non-significant results that remained unpublished. This set of studies is often called researchers’ file-drawer (Rosenthal, 1979). Schimmack and Brunner (2016) developed a statistical method that estimates the size of researchers’ file drawer. This makes it possible to correct reported p-values for publication bias so that p-values resume their proper function of providing statistical evidence about the probability of observing a false-positive result.

The correction process is first illustrated with a powergraph for statistical results reported in 103 journals in the year 2016 (see 2016 Replicability Rankings for more details). Each test statistic is converted into an absolute z-score. Absolute z-scores quantify the signal to noise ratio in a study. Z-scores can be compared against the standard normal distribution that is expected from studies without an effect (the null-hypothesis). A z-score of 1.96 (see red dashed vertical line in the graph) corresponds to the typical p < .05 (two-tailed) criterion. The graph below shows that 63% of reported test statistics were statistically significant using this criterion.

All.2016.Ranking.Journals.Combined

Powergraphs use a statistical method, z-curve (Schimmack & Brunner, 2016) to model the distribution of statistically significant z-scores (z-scores > 1.96). Based on the model results, it estimates how many non-significant results one would expect. This expected distribution is shown with the grey curve in the figure. The grey curve overlaps with the green and black curve. It is clearly visible that the estimated number of non-significant results is much larger than the actually reported number of non-significant results (the blue bars of z-scores between 0 and 1.96). This shows the size of the file-drawer.

Powergraphs provide important information about the average power of studies in psychology. Power is the average probability of obtaining a statistically significant result in the set of all statistical tests that were conducted, including the file drawer. The estimated power is 39%. This estimate is consistent with other estimates of power (Cohen, 1962; Sedlmeier & Gigerenzer, 1989), and below the acceptable minimum of 50% (Tversky and Kahneman, 1971).

Powergraphs also provide important information about the replicability of significant results. A published significant result is used to support the claim of a discovery. However, even a true discovery may not be replicable if the original study had low statistical power. In this case, it is likely that a replication study produces a false negative result; it fails to affirm the presence of an effect with p < .05, even though an effect actually exists. The powergraph estimate of replicability is 70%. That is, any randomly drawn significant effect published in 2016 has only a 70% chance of reproducing a significant result again in an exact replication study.

Importantly, replicability is not uniform across all significant results. Replicabilty increases with the signal to noise ratio (Open Science Collective, 2015). In 2017 powergraphs were enhanced by providing information about the replicability for different levels of strength of evidence. In the graph below, z-scores between 0 and 6 are divided into 12 categories with a width of 0.5 standard deviations (0-0.5, 0.5-1, …. 5.5-6). For significant results, these values are the average replicability for z-scores in the specified range.

The graph shows a replicability estimate of 46% for z-scores between 2 and 2.5. Thus, a z-score greater than 2.5 is needed to meet the minimum standard of 50% replicability. More important, these power values can be converted into p-values because power and p-values are monotonically related (Hoenig & Heisey, 2001). If p < .05 is the significance criterion, 50% power corresponds to a p-value of .05. This also means that all z-scores less than 2.5 correspond to p-values greater than .05 once we take the influence of publication bias into account. A z-score of 2.6 roughly corresponds to a p-value of .01. Thus, a simple heuristic for readers of psychology journals is to consider only p < .01 values as significant, if they want to maintain the nominal error rate of 5%.

One problem with a general adjustment is that file drawers can differ across journals or authors. The adjustment based on the general publication bias across journals will penalize authors who invest resources into well-designed studies with high power and it will fail to adjust fully for the effect of publication bias for authors that conduct many underpowered studies that capitalize on chance to produce significant results. It is widely recognized that scientific markets reward quantity of publications over quality. A personalized adjustment can solve this problem because authors with large file drawers will get a bigger adjustment and many of their nominally significant result will no longer be significant after an adjustment for publication bias has been made.

I illustrate this with two real world examples. The first example shows the powergraph of Marcel Zeelenberg. The left powergraph shows a model that assumes no file drawer. The model fits the actual distribution of z-scores rather well. However, the graph shows a small bump of just significant results (z = 2 to 2.2) that is not explained by the model. This bump could reflect the use of questionable research practices (QRPs)but it is relatively small (as we will see shortly). The graph on the right side uses only statistically significant results. This is important because only these results were published to claim a discovery. We see how the small bump leads has a strong effect on the estimate of the file drawer. It would require a large set of non-significant results to produce this bump. It is more likely that QRPs were used to produce it. However, the bump is small and overall replicability is higher than the average for all journals. We also see that z-scores between 2 and 2.5 have an average replicability estimate of 52%. This means no adjustment is needed and p-values reported by Marcel Zellenberg can be interpreted without adjustment. Over the 15 year period, Marcel Zellenberg reported 537 significant results and we can conclude from this analysis that no more than 5% (27) of these results are false positive results.

Powergraphs for Marcel Zeelenberg.spex.png

A different picture emerges for the powergraph based on Ayalet Fishbach’s statistical results. The left graph shows a big bump of just significant results that is not explained by a model without publication bias. The right graph shows that the replicabilty estimate is much lower than for Marcel Zeelenberg and for the analysis of all journals in 2016.

Powergraphs for Ayelet Fishbach.spex.png

The average replicabilty estimate for z-values between 2 and 2.5 is only 33%. This means that researchers are unlikely to obtain a significant result, if they attempted an exact replication study of one of these findings. More important, it means that p-values adjusted for publication bias are well above p > .05. Even z-scores in the 2.5 to 3 band average only a replicabilty estimate of 46%. This means that only z-scores greater than 3 produce significant results after the correction for publication bias is applied.

Non-Significance Does Not Mean Null-Effect

It is important to realize that a non-significant result does not mean that there is no effect. Is simply means that the signal to noise ratio is too weak to infer that an effect was present. it is entirely possible that Ayelet Fishbach made theoretically correct predictions. However, to provide evidence for her hypotheses, she conducted studies with a high failure rate and many of these studies failed to support her hypotheses. These failures were not reported but they have to be taken into account in the assessment of the risk of a false discovery. A p-value of .05 is only meaningful in the context of the number of attempts that have been made. Nominally a p-value of .03 may appear to be the same across statistical analysis. But the real evidential value of a p-value is not equivalent. Using powergraphs to equate evidentival value, a p-value of .05 published by Marcel Zeelenberg is equivalent to a p-value of .005 (z = 2.8) published by Ayelet Fischbach.

The Influence of Questionable Research Practices

Powergraphs assume that an excessive number of significant results is caused by publication bias. However, questionable research practices also contribute to the reporting of mostly successful results. Replicability estimates and the p-value ajdustment for publication bias may itself be biased by the use of QRPs. Unfortunately, this effect is difficult to predict because different QRPs have different effects on replicability estimates. Some QRPs will lead to an overcorrection. Although this creates uncertainty about the right amount of adjustment, a stronger adjustment may have the advantage that it could deter researchers from using QRPs because it would undermine the credibility of their published results.

Conclusion

Over the past five years, psychologists have contemplated ways to improve the credibitliy and replicability of published results. So far, these ideas have yet to show a notable effect on replicability (Schimmack, 2017). One reason is that the incentive structure rewards number of publications and replicability is not considered in the review process. Reviewers and editors treat all p-values as equal, when they are not. The ability to adjust p-values based on the true evidential value that they provide may help to change this. Journals may lose their impact once readers adjust p-values and realize that many nominally significant result are actually not statistically significant after taking publication bias into account.

Meta-Psychology: A new discipline and a new journal (draft)

March 5, 2017UncategorizedDr. R

Ulrich Schimmack and Rickard Carlsson

Psychology is a relatively young science that is just over 100 years old. During its 100 years if existence, it has seen major changes in the way psychologists study the mind and behavior. The first laboratories used a mix of methods and studied a broad range of topics. In the 1950s, behaviorism started to dominate psychology with studies of animal behavior. Then cognitive psychology took over and computerized studies with reaction time tasks started to dominate. In the 1990s, neuroscience took off and no top ranked psychology department can function without one or more MRI magnets. Theoretical perspectives have also seen major changes. In the 1960s, personality traits were declared non-existent. In the 1980, twin studies were used to argue that everything is highly heritable, and nowadays gene-environment interactions and epigenetics are dominating theoretical perspectives on the nature-nurture debate. These shifts in methods and perspectives are often called paradigm shifts.

It is hard to keep up with all of these paradigm shifts in a young science like psychology. Moreover, many psychology researchers are busy just keeping up with developments in their paradigm. However, the pursuit of advancing research within a paradigm can be costly for researchers and a science as a whole because this research may become obsolete after a paradigm shift. One senior psychologist once expressed regret that he was a prisoner of a paradigm. To avoid a similar fate, it is helpful to have a broader perspective of developments in the field and to understand how progress in one area of psychology fits into the broader goal of understanding humans’ minds and behaviors. This is the aim of meta-psychology. Meta-psychology is the scientific investigation of psychology as a science. It questions the basic assumptions that underpin research paradigm and monitors the progress of psychological science as a whole.

Why we Need a Meta-Psychology Journal

Most scientific journals focus on publishing original research articles or review articles (meta-analyses) of studies on a particular topic. This makes it difficult to publish meta-psychological articles. As publishing in peer-reviewed journals is used to evaluate researchers, few researches dedicated time and energy to meta-psychology and those that did often had difficulties finding an outlet for their work.

In 2006, Ed Diener created Perspectives on Psychological Science (PPS) published by the Association for Psychological Science. The journal aims to publish an “eclectic mix of provocative reports and articles, including broad integrative reviews, overviews of research programs, meta-analyses, theoretical statements, and articles on topics such as the philosophy of science, opinion pieces about major issues in the field, autobiographical reflections of senior members of the field, and even occasional humorous essays and sketches” Not all of the articles in PPS are meta-psychology. However, PPS created a home for meta-psychological articles. We carefully examined articles in PPS to identify content areas of meta-psychology.

We believe that MP can fulfill an important role in the growing number of psychology journals. Most important, PPS can only publish a small number of articles. For profit journals like PPS pride themselves on their high rejection rates. We believe that high rejection rates create a problem and give editors and reviewers too much power to shape the scientific discourse and direction of psychology. The power of editors is itself an important topic in meta-psychology. In contrast to PPS, MP is an online journal with no strict page limits. We will let the quality of published articles rather than rejection rates determine the prestige of our journal.

PPS is a for profit journal and published content is hidden behind paywalls. We think this is a major problem and does not serve the interest of scientists. All articles published in MP will be open access. One problem with some open access journals is that they charge high fees for authors to get their work published. This gives authors from rich countries with grants a competitive advantage. MP will not charge any fees.

In short, while we appreciate the contribution PPS has made to the development of meta-psychology, we see MP as a modern journal that meets the need of psychology as a science for a journal that is dedicated to publishing meta-psychological articles without high rejection rates and without high costs to authors and readers.

Content Areas of Meta-Psychology

1. Critical reflections on the process of data collection.

1.1. Sampling

Amazon’s Mechanical Turk: A New Source of Inexpensive, Yet High-Quality, Data?
By: Buhrmester, Michael; Kwang, Tracy; Gosling, Samuel D.
PERSPECTIVES ON PSYCHOLOGICAL SCIENCE Volume: 6 Issue: 1 Pages: 3-5 Published: JAN 2011

1.2. Experimental Paradigms

Using Smartphones to Collect Behavioral Data in Psychological Science: Opportunities, Practical Considerations, and Challenges
By: Harari, Gabriella M.; Lane, Nicholas D.; Wang, Rui; et al.
PERSPECTIVES ON PSYCHOLOGICAL SCIENCE Volume: 11 Issue: 6 Pages: 838-854 Published: NOV 2016

1.3. Validity

What Do Implicit Measures Tell Us? Scrutinizing the Validity of Three Common Assumptions
By: Gawronski, Bertram; Lebel, Etienne P.; Peters, Kurt R.
PERSPECTIVES ON PSYCHOLOGICAL SCIENCE Volume: 2 Issue: 2 Pages: 181-193 Published: JUN 2007

2. Critical reflections on statistical methods / tutorials on best practices

2.1. Philosophy of Statistics

Bayesian Versus Orthodox Statistics: Which Side Are You On?
By: Dienes, Zoltan
PERSPECTIVES ON PSYCHOLOGICAL SCIENCE Volume: 6 Issue: 3 Pages: 274-290 Published: MAY 2011

2.2. Tutorials

Sailing From the Seas of Chaos Into the Corridor of Stability Practical Recommendations to Increase the Informational Value of Studies
By: Lakens, Daniel; Evers, Ellen R. K.
PERSPECTIVES ON PSYCHOLOGICAL SCIENCE Volume: 9 Issue: 3 Pages: 278-292 Published: MAY 2014

3. Critical reflections on published results / replicability

3.1. Fraud

Scientific Misconduct and the Myth of Self-Correction in Science
By: Stroebe, Wolfgang; Postmes, Tom; Spears, Russell
PERSPECTIVES ON PSYCHOLOGICAL SCIENCE Volume: 7 Issue: 6 Pages: 670-688 Published: NOV 2012

3.2. Publication Bias

Puzzlingly High Correlations in fMRI Studies of Emotion, Personality, and Social Cognition
By: Vul, Edward; Harris, Christine; Winkielman, Piotr; et al.
PERSPECTIVES ON PSYCHOLOGICAL SCIENCE Volume: 4 Issue: 3 Pages: 274-290 Published: MAY 2009

3.3. Quality of Peer-Review

The Air We Breathe: A Critical Look at Practices and Alternatives in the Peer-Review Process
By: Suls, Jerry; Martin, Rene
PERSPECTIVES ON PSYCHOLOGICAL SCIENCE Volume: 4 Issue: 1 Pages: 40-50 Published: JAN 2009

4. Critical reflections on Paradigms and Paradigm Shifts

4.1 History

Sexual Orientation Differences as Deficits: Science and Stigma in the History of American Psychology
By: Herek, Gregory M.
PERSPECTIVES ON PSYCHOLOGICAL SCIENCE Volume: 5 Issue: 6 Pages: 693-699 Published: NOV 2010

4.2. Topics

Domain Denigration and Process Preference in Academic Psychology
By: Rozin, Paul
PERSPECTIVES ON PSYCHOLOGICAL SCIENCE Volume: 1 Issue: 4 Pages: 365-376 Published: DEC 2006

4.3 Incentives

Giving Credit Where Credit’s Due: Why It’s So Hard to Do in Psychological Science
By: Simonton, Dean Keith
PERSPECTIVES ON PSYCHOLOGICAL SCIENCE Volume: 11 Issue: 6 Pages: 888-892 Published: NOV 2016

4.5 Politics

Political Diversity in Social and Personality Psychology
By: Inbar, Yoel; Lammers, Joris
PERSPECTIVES ON PSYCHOLOGICAL SCIENCE Volume: 7 Issue: 5 Pages: 496-503 Published: SEP 2012

4.4. Paradigms

Why the Cognitive Approach in Psychology Would Profit From a Functional Approach and Vice Versa
By: De Houwer, Jan
PERSPECTIVES ON PSYCHOLOGICAL SCIENCE Volume: 6 Issue: 2 Pages: 202-209 Published: MAR 2011

5. Critical reflections on teaching and dissemination of research

5.1 Teaching

Teaching Replication
By: Frank, Michael C.; Saxe, Rebecca
PERSPECTIVES ON PSYCHOLOGICAL SCIENCE Volume: 7 Issue: 6 Pages: 600-604 Published: NOV 2012

5.2. Coverage of research in textbooks

N.A.

5.2 Coverage of psychology in popular books

N.A.

5.3 Popular Media Coverage of Psychology

N.A.

5.4. Social Media and Psychology

N.A.

Vision and Impact Statement

Currently PPS ranks number 7 out of all psychology journals with an Impact Factor of 6.08. The broad appeal of meta-psychology accounts for this relatively high impact factor. We believe that many articles published in MP will also achieve high citation rates, but we do not compete for the highest ranking. A journal that publishes only 1 article a year, will get a higher ratio of citations per article than a journal that publishes 10 articles a year. We recognize that it is difficult to predict which articles will become citation classics and we rather publish one gem and nine so-so articles than miss out on publishing the gem. We anticipate that MP will publish many gems that PPS rejected and we will be happy to give these articles a home.

This does not mean, MP will publish everything. We will harness the wisdom of crowds and we encourage authors to share their manuscripts on pre-publication sites or on social media for critical commentary. In addition, reviewers will help authors to improve their manuscript, while authors can be assured that investing in major revisions will be rewarded with a better publication rather than an ultimate rejection that requires further changes to please editors at another journal.

2016 Replicability Rankings of 103 Psychology Journals

March 1, 20172016, Replicability Ranking, Replicability-Ranking, Social PsychologyDr. R

I post the rankings on top. Detailed information and statistical analysis are provided below the table. You can click on the journal title to see Powergraphs for each year.

Rank	Journal	Change	2016	2015	2014	2013	2012	2011	2010	Mean
1	Social Indicators Research	10	90	70	65	75	65	72	73	73
2	Psychology of Music	-13	81	59	67	61	69	85	84	72
3	Journal of Memory and Language	11	79	76	65	71	64	71	66	70
4	British Journal of Developmental Psychology	-9	77	52	61	54	82	74	69	67
5	Journal of Occupational and Organizational Psychology	13	77	59	69	58	61	65	56	64
6	Journal of Compa rative Psychology	13	76	71	77	74	68	61	66	70
7	Cognitive Psychology	7	75	73	72	69	66	74	66	71
8	Epilepsy & Behavior	5	75	72	79	70	68	76	69	73
9	Evolution & Human Behavior	16	75	57	73	55	38	57	62	60
10	International Journal of Intercultural Relations	0	75	43	70	75	62	67	62	65
11	Pain	5	75	70	75	67	64	65	74	70
12	Psychological Medicine	4	75	57	66	70	58	72	61	66
13	Annals of Behavioral Medicine	10	74	50	63	62	62	62	51	61
14	Developmental Psychology	17	74	72	73	67	61	63	58	67
15	Judgment and Decision Making	-3	74	59	68	56	72	66	73	67
16	Psychology and Aging	6	74	66	78	65	74	66	66	70
17	Aggressive Behavior	16	73	70	66	49	60	67	52	62
18	Journal of Gerontology-Series B	3	73	60	65	65	55	79	59	65
19	Journal of Youth and Adolescence	13	73	66	82	67	61	57	66	67
20	Memory	5	73	56	79	70	65	64	64	67
21	Sex Roles	6	73	67	59	64	72	68	58	66
22	Journal of Experimental Psychology – Learning, Memory & Cognition	4	72	74	76	71	71	67	72	72
23	Journal of Social and Personal Relationships	-6	72	51	57	55	60	60	75	61
24	Psychonomic Review and Bulletin	8	72	79	62	78	66	62	69	70
25	European Journal of Social Psychology	5	71	61	63	58	50	62	67	62
26	Journal of Applied Social Psychology	4	71	58	69	59	73	67	58	65
27	Journal of Experimental Psychology – Human Perception and Performance	-4	71	68	72	69	70	78	72	71
28	Journal of Research in Personality	9	71	75	47	65	51	63	63	62
29	Journal of Child and Family Studies	0	70	60	63	60	56	64	69	63
30	Journal of Cognition and Development	5	70	53	62	54	50	61	61	59
31	Journal of Happiness Studies	-9	70	64	66	77	60	74	80	70
32	Political Psychology	4	70	55	64	66	71	35	75	62
33	Cognition	2	69	68	70	71	67	68	67	69
34	Depression & Anxiety	-6	69	57	66	71	77	77	61	68
35	European Journal of Personality	2	69	61	75	65	57	54	77	65
36	Journal of Applied Psychology	6	69	58	71	55	64	59	62	63
37	Journal of Cross-Cultural Psychology	-4	69	74	69	76	62	73	79	72
38	Journal of Psychopathology and Behavioral Assessment	-13	69	67	63	77	74	77	79	72
39	JPSP-Interpersonal Relationships and Group Processes	15	69	64	56	52	54	59	50	58
40	Social Psychology	3	69	70	66	61	64	72	64	67
41	Achive of Sexual Behavior	-2	68	70	78	73	69	71	74	72
42	Journal of Affective Disorders	0	68	64	54	66	70	60	65	64
43	Journal of Experimental Child Psychology	2	68	71	70	65	66	66	70	68
44	Journal of Educational Psychology	-11	67	61	66	69	73	69	76	69
45	Journal of Experimental Social Psychology	13	67	56	60	52	50	54	52	56
46	Memory and Cognition	-3	67	72	69	68	75	66	73	70
47	Personality and Individual Differences	8	67	68	67	68	63	64	59	65
48	Psychophysiology	-1	67	66	65	65	66	63	70	66
49	Cognitve Development	6	66	78	60	65	69	61	65	66
50	Frontiers in Psychology	-8	66	65	67	63	65	60	83	67
51	Journal of Autism and Developmental Disorders	0	66	65	58	63	56	61	70	63
52	Journal of Experimental Psychology – General	5	66	69	67	72	63	68	61	67
53	Law and Human Behavior	1	66	69	53	75	67	73	57	66
54	Personal Relationships	19	66	59	63	67	66	41	48	59
55	Early Human Development	0	65	52	69	71	68	49	68	63
56	Attention, Perception and Psychophysics	-1	64	69	70	71	72	68	66	69
57	Consciousness and Cognition	-3	64	65	67	57	64	67	68	65
58	Journal of Vocactional Behavior	5	64	78	66	78	71	74	57	70
59	The Journal of Positive Psychology	14	64	65	79	51	49	54	59	60
60	Behaviour Research and Therapy	7	63	73	73	66	69	63	60	67
61	Child Development	0	63	66	62	65	62	59	68	64
62	Emotion	-1	63	61	56	66	62	57	65	61
63	JPSP-Personality Processes and Individual Differences	1	63	56	56	59	68	66	51	60
64	Schizophrenia Research	1	63	65	68	64	61	70	60	64
65	Self and Identity	-4	63	52	61	62	50	55	71	59
66	Acta Psychologica	-6	63	66	69	69	67	68	72	68
67	Behavioral Brain Research	-3	62	67	61	62	64	65	67	64
68	Child Psychiatry and Human Development	5	62	72	83	73	50	82	58	69
69	Journal of Child Psychology and Psychiatry and Allied Disciplines	10	62	62	56	66	64	45	55	59
70	Journal of Consulting and Clinical Psychology	0	62	56	50	54	59	58	57	57
71	Journal of Counseling Psychology	-3	62	70	60	74	72	56	72	67
72	Behavioral Neuroscience	1	61	66	63	62	65	58	64	63
73	Developmental Science	-5	61	62	60	62	66	65	65	63
74	Journal of Experimental Psychology – Applied	-4	61	61	65	53	69	57	69	62
75	Journal of Social Psychology	-11	61	56	55	55	74	70	63	62
76	Social Psychology and Personality Science	-5	61	42	56	59	59	65	53	56
77	Cognitive Therapy and Research	0	60	68	54	67	70	62	58	63
78	Hormones & Behavior	-1	60	55	55	54	55	60	58	57
79	Motivation and Emotion	1	60	60	57	57	51	73	52	59
80	Organizational Behavior and Human Decision Processes	3	60	63	65	61	68	67	51	62
81	Psychoneuroendocrinology	5	60	58	58	56	53	59	53	57
82	Social Development	-10	60	50	66	62	65	79	57	63
83	Appetite	-10	59	57	57	65	64	66	67	62
84	Biological Psychology	-6	59	60	55	57	57	65	64	60
85	Journal of Personality Psychology	17	59	59	60	62	69	37	45	56
86	Psychological Science	6	59	63	60	63	59	55	56	59
87	Asian Journal of Social Psychology	0	58	76	67	56	71	64	64	65
88	Behavior Therapy	0	58	63	66	69	66	52	65	63
89	Britsh Journal of Social Psychology	0	58	57	44	59	51	59	55	55
90	Social Influence	18	58	72	56	52	33	59	46	54
91	Developmental Psychobiology	-9	57	54	61	60	70	64	62	61
92	Journal of Research on Adolescence	2	57	59	61	82	71	75	40	64
93	Journal of Abnormal Psychology	-5	56	52	57	58	55	66	55	57
94	Social Cognition	-2	56	54	52	54	62	69	46	56
95	Personality and Social Psychology Bulletin	2	55	57	58	55	53	56	54	55
96	Cognition and Emotion	-14	54	66	61	62	76	69	69	65
97	Health Psychology	-4	51	67	56	72	54	69	56	61
98	Journal of Clinical Child and Adolescence Psychology	1	51	66	61	74	64	58	54	61
99	Journal of Family Psychology	-7	50	52	63	61	57	64	55	57
100	Group Processes & Intergroup Relations	-5	49	53	68	64	54	62	55	58
101	Infancy	-8	47	44	60	55	48	63	51	53
102	Journal of Consumer Psychology	-5	46	57	55	51	53	48	61	53
103	JPSP-Attitudes & Social Cognition	-3	45	69	62	39	54	54	62	55

Notes.
1. Change scores are the unstandardized regression weights with replicabilty estimates as outcome variable and year as predictor variable. Year was coded from 0 for 2010 to 1 for 2016 so that the regression coefficient reflects change over the full 7 year period. This method is preferable to a simple difference score because estimates in individual years are variable and are likely to overestimate change.
2. Rich E. Lucas, Editor of JRP, noted that many articles in JRP do not report t of F values in the text and that the replicability estimates based on these statistics may not be representative of the bulk of results reported in this journal. Hand-coding of articles is required to address this problem and the ranking of JRP, and other journals, should be interpreted with caution (see further discussion of these issues below).

Introduction

I define replicability as the probability of obtaining a significant result in an exact replication of a study that produced a significant result. In the past five years, it has become increasingly clear that psychology suffers from a replication crisis. Even results that are replicated internally by the same author multiple times fail to replicate in independent replication attempts (Bem, 2011). The key reason for the replication crisis is selective publishing of significant results (publication bias). While journals report over 95% significant results (Sterling, 1959; Sterling et al., 1995), a 2015 article estimated that less than 50% of these results can be replicated (OSC, 2015).

The OSC reproducibility made an important contribution by demonstrating that published results in psychology have low replicability. However, the reliance on actual replication studies has a a number of limitations. First, actual replication studies are expensive or impossible (e.g., a longitudinal study spanning 20 years). Second, studies selected for replication may not be representative because the replication team lacks expertise to replicate some studies. Finally, replication studies take time and replicability of recent studies may not be known for several years. This makes it difficult to rely on actual replication studies to rank journals and to track replicability over time.

Schimmack and Brunner (2016) developed a statistical method (z-curve) that makes it possible to estimate average replicability for a set of published results based on the original results in published articles. This statistical approach to the estimation of replicability has several advantages over the use of actual replication studies. Replicability can be assessed in real time, it can be estimated for all published results, and it can be used for expensive studies that are impossible to reproduce. Finally, it has the advantage that actual replication studies can be criticized (Gilbert, King, Pettigrew, & Wilson, 2016). Estimates of replicabilty based on original studies do not have this problem because they are based on published results in original articles.

Z-curve has been validated with simulation studies and can be used when replicability varies across studies and when there is selection for significance, and is superior to similar statistical methods that correct for publication bias (Brunner & Schimmack, 2016). I use this method to estimate the average replicability of significant results published in 103 psychology journals. Separate estimates were obtained for the years from 2010, one year before the start of the replication crisis, to 2016 to examine whether replicability increased in response to discussions about replicability. The OSC estimate of replicability was based on articles published in 2008 and it was limited to three journals. I posted replicability estimates based on z-curve for the year 2015 (2015 replicability rankings). There was no evidence that replicability had increased during this time period.

The main empirical question was whether the 2016 rankings show some improvement in replicability and whether some journals or disciplines have responded more strongly to the replication crisis than others.

A second empirical question was whether replicabilty varies across disciplines. The OSC project provided first evidence that traditional cognitive psychology is more replicable than social psychology. Replicability estimates with z-curve confirmed this finding. In the 2015 rankings, The Journal of Experimental Psychology: Learning, Memory and Cognition ranked 25 with a replicability estimate of 74, whereas the two social psychology sections of the Journal of Personality and Social Psychology ranked 73 and 99 (68% and 60% replicability estimates). For this post, I conducted more extensive analyses of disciplines.

Journals

The 103 journals that are included in these rankings were mainly chosen based on impact factors. The list also includes diverse areas of psychology, including cognitive, developmental, social, personality, clinical, biological, and applied psychology. The 2015 list included some new journals that started after 2010. These journals were excluded from the 2016 rankings to avoid missing values in statistical analyses of time trends. A few journals were added to the list and the results may change when more journals are added to the list.

The journals were classified into 9 categories: social (24), cognitive (12), development (15), clinical/medical (19), biological (8), personality (5), and applied(IO,education) (8). Two journals were classified as general (Psychological Science, Frontiers in Psychology). The last category included topical, interdisciplinary journals (emotion, positive psychology).

Data

All PDF versions of published articles were downloaded and converted into text files. The 2015 rankings were based on conversions with the free program pdf2text pilot. The 2016 program used a superior conversion program pdfzilla. Text files were searched for reports of statistical results using my own R-code (z-extraction). Only F-tests, t-tests, and z-tests were used for the rankings. t-values that were reported without df were treated as z-values which leads to a slight inflation in replicability estimates. However, the bulk of test-statistics were F-values and t-values with degrees of freedom. A comparison of the 2015 rankings using the old method and the new method shows that extraction methods have an influence on replicability estimates some differences (r = .56). One reason for the low correlation is that replicability estimates have a relatively small range (50-80%) and low retest correlations. Thus, even small changes can have notable effects on rankings. For this reason, time trends in replicability have to be examined at the aggregate level of journals or over longer time intervals. The change score of a single journal from 2015 to 2016 is not a reliable measure of improvement.

Data Analysis

The data for each year were analyzed using z-curve Schimmack and Brunner (2016). The results of individual analysis are presented in Powergraphs. Powergraphs for each journal and year are provided as links to the journal names in the table with the rankings. Powergraphs convert test statistics into absolute z-scores as a common metric for the strength of evidence against the null-hypothesis. Absolute z-scores greater than 1.96 (p < .05, two-tailed) are considered statistically significant. The distribution of z-scores greater than 1.96 is used to estimate the average true power (not observed power) of the set of significant studies. This estimate is an estimate of replicability for a set of exact replication studies because average power determines the percentage of statistically significant results. Powergraphs provide additional information about replicability for different ranges of z-scores (z-values between 2 and 2.5 are less replicable than those between 4 and 4.5). However, for the replicability rankings only the replicability estimate is used.

Results

Table 1 shows the replicability estimates sorted by replicability in 2016.

The data were analyzed with a growth model to examine time trends and variability across journals and disciplines using MPLUS7.4. I compared three models. Model 1 assumed no mean level changes and variability across journals. Model 2 assumed a linear increase. Model 3 tested assumed no change from 2010 to 2015 and allowed for an increase in 2016.

Model 1 had acceptable fit (RMSEA = .043, BIC = 5004). Model 2 increased fit (RMSEA = 0.029, BIC = 5005), but BIC slightly favored the more parsimonious Model 1. Model 3 had the best fit (RMSEA = .000, BIC = 5001). These results reproduce the results of the 2015 analysis that there was no improvement from 2010 to 2015, but there is some evidence that replicability increased in 2016. Adding a variance component to slope in Model 3 produced an unidentified model. Subsequent analyses show that this is due to insufficient power to detect variation across journals in changes over time.

The standardized loadings of individual years on the latent intercept factor ranged from .49 to .58. This shows high variabibility in replicability estimates from year to year. Most of the rank changes can be attributed to random factors. A better way to compare journals is to average across years. A moving average of five years will provide reliable information and allow for improvement over time. The reliability of the 5-year average for the years 2012 to 2016 is 68%.

Figure 1 shows the annual averages with 95%CI as well relative to the average over the full 7-year period.

A paired t-test confirmed that average replicability in 2016 was significantly higher (M = 65, SD = 8) than in the previous years (M = 63, SD = 8), t(101) = 2.95, p = .004. This is the first evidence that psychological scientists are responding to the replicability crisis by publishing slightly more replicable results. Of course, this positive result has to be tempered by the small effect size. But if this trend continuous or even increases, replicability could reach 80% in 10 years.

The next analysis examined changes in replicabilty at the level of individual journals. Replicability estimates were regressed on a dummy variable that contrasted 2016 with the previous years. This analysis produced only 7 significant increases with p < .05 (one-tailed), which is only 2 more significant results than would be expected by chance alone. Thus, the analysis failed to identify particular journals that contribute to the improvement in the average. Figure 2 compares the observed distribution of t-values to the predicted distribution based on the null-hypothesis (no change).

The blue line shows the observed density distribution, which is slightly moved to the right, but there is no set of journals with notably larger t-values. A more sustained and larger increase in replicability is needed to detect variability in change scores.

The next analyses examine stable differences between disciplines. The first analysis compared cognitive journals to social journals. No statistical tests are needed to see that cognitive journals publish more replicable results than social journals. This finding confirms the results with actual replications of studies published in 2008 (OSC, 2015). The Figure suggests that the improvement in 2016 is driven more by social journals, but only 2017 data can tell whether there is a real improvement in social psychology.

The next Figure shows the results for 5 personality journals. The large confidence intervals show that there is considerable variability among personality journals. The Figure shows the averages for cognitive and social psychology as horizontal lines. The average for personality is only slightly above the average for social and like social, personality shows an upward trend. In conclusion, personality and social psychology look very similar. This may be due to considerable overlap between the two disciplines, which is also reflected in shared journals. Larger differences may be visible for specialized social journals that focus on experimental social psychology.

The results for developmental journals show no clear time trend and the average is just about in the middle between cognitive and social psychology. The wide confidence intervals suggest that there is considerable variability among developmental journals. Table 1 shows Developmental Psychology ranks 14 / 103 and Infancy ranks 101/103. The low rank for Infancy may be due to the great difficulty of measuring infant behavior.

The clinical/medical journals cover a wide range of topics from health psychology to special areas of psychiatry. There has been some concern about replicability in medical research (Ioannidis, 2005). The results for clinical are similar to those for developmental journals. Replicability is lower than for cognitive psychology and higher than for social psychology. This may seem surprising because patient populations and samples tend to be smaller. However, a randomized controlled intervention study uses pre-post designs to boost power, whereas social and personality psychologists use comparisons across individuals, which requires large samples to reduce sampling error.

The set of biological journals is very heterogeneous and small. It includes neuroscience and classic peripheral physiology. Despite wide confidence intervals replicability for biological journals is significantly lower than replicabilty for cognitive psychology. There is no notable time trend. The average is slightly above the average for social journals.

The last category are applied journals. One journal focuses on education. The other journals focus on industrial and organizational psychology. Confidence intervals are wide, but replicabilty is generally lower than for cognitive psychology. There is no notable time trend for this set of journals.

Given the stability of replicability, I averaged replicability estimates across years. The last figure shows a comparison of disciplines based on these averages. The figure shows that social psychology is significantly below average and cognitive psychology is significantly above average with the other disciplines falling in the middle. All averages are significantly above 50% and below 80%.

Discussion

The most exciting finding is that repicability appears to have increased in 2016. This increase is remarkable because averages in the years before consistently tracked the average of 63. The increase by 2 percentage points in 2016 is not large, but it may represent a first response to the replication crisis.

The increase is particularly remarkable because statisticians have been sounding the alarm bells about low power and publication bias for over 50 years (Cohen, 1962; Sterling, 1959), but these warnings have had no effect on research practices. In 1989, Sedlmeier and Gigerenzer (1989) noted that studies of statistical power had no effect on the statistical power of studies. The present results provide the first empirical evidence that psychologists are finally starting to change their research practices.

However, the results also suggest that most journals continue to publish articles with low power. The replication crisis has affected social psychology more than other disciplines with fierce debates in journals and on social media (Schimmack, 2016). On the one hand, the comparisons of disciplines supports the impression that social psychology has a bigger replicability problem than other disciplines. However, the differences between disciplines are small. With the exception of cognitive psychology, other disciplines are not a lot more replicable than social psychology. The main reason for the focus on social psychology is probably that these studies are easier to replicate and that there have been more replication studies in social psychology in recent years. The replicability rankings predict that other disciplines would also see a large number of replication failures, if they would subject important findings to actual replication attempts. Only empirical data will tell.

Limitations

The main limitation of replicability rankings is that the use of an automatic extraction method does not distinguish theoretically important hypothesis tests and other statistical tests. Although this is a problem for the interpretation of the absolute estimates, it is less important for the comparison over time. Any changes in research practices that reduce sampling error (e.g., larger samples, more reliable measures) will not only strengthen the evidence for focal hypothesis tests, but also increase the strength of evidence for non-focal hypothesis tests.

Schimmack and Brunner (2016) compared replicability estimates with actual success rates in the OSC (2015) replication studies. They found that the statistical method overestimates replicability by about 20%. Thus, the absolute estimates can be interpreted as very optimistic estimates. There are several reasons for this overestimation. One reason is that the estimation method assumes that all results with a p-value greater than .05 are equally likely to be published. If there are further selection mechanisms that favor smaller p-values, the method overestimates replicability. For example, sometimes researchers correct for multiple comparisons and need to meet a more stringent significance criterion. Only careful hand-coding of research articles can provide more accurate estimates of replicability. Schimmack and Brunner (2016) hand-coded the articles that were included in the OSC (2015) article and still found that the method overestimated replicability. Thus, the absolute values need to be interpreted with great caution and success rates of actual replication studies are expected to be at least 10% lower than these estimates.

Implications

Power and replicability have been ignored for over 50 years. A likely reason is that replicability is difficult to measure. A statistical method for the estimation of replicability changes this. Replicability estimates of journals make it possible for editors to compete with other journals in the replicability rankings. Flashy journals with high impact factors may publish eye-catching results, but if this journal has a reputation of publishing results that do not replicate, they are not very likely to have a big impact. Science is build on trust and trust has to be earned and can be easily lost. Eventually, journals that publish replicable results may also increase their impact because more researchers are going to build on replicable results published in these journals. In this way, replicability rankings can provide a much needed correction to the current incentive structure in science that rewards publishing as many articles as possible without any concerns about the replicability of these results. This reward structure is undermining science. It is time to change it. It is no longer sufficient to publish a significant result, if this result cannot be replicate in other labs.

Many scientists feel threatened by changes in the incentive structure and the negative consequences of replication failures for their reputation. However, researchers have control over their reputation. First, researchers often carry out many conceptually related studies. In the past, it was acceptable to publish only the studies that worked (p < .05). This selection for significance by researchers is the key factor in the replication crisis. The researchers who are conducting the studies are fully aware that it was difficult to get a significant result, but the selective reporting of these successes produces inflated effect size estimates and an illusion of high replicability that inevitably lead to replication failures. To avoid these embarrassing replication failures researchers need to report results of all studies or conduct fewer studies with high power. The 2016 rankings suggest that some researchers have started to change, but we will have to wait until 2017 to see whether 2017 can replicate the positive trend in the 2016 rankings.

An Attempt at Explaining Null-Hypothesis Testing and Statistical Power with 1 Figure and 1,500 Words

February 26, 2017UncategorizedDr. R

Is a Figure worth 1,500 words?

gpower-zcurve

Gpower. http://www.gpower.hhu.de/en.html

Significance Testing

1. The red curve shows the sampling distribution if there is no effect. Most results will give a signal/noise ratio close to 0 because there is no effect (0/1 = 0)

2. Sometimes sampling error can produce large signals, but these events are rare

3. To be sure that we have a real signal, we can chose a high criterion to decide that there was an effect (reject H0). Normally, we use a 2:1 ratio (z > 2) to do so, but we could use a higher or lower criterion value. This value is shown by the green vertical line in the Figure

4. z-score greater than 2 leaves only 2.5% of the red distribution. This means we would expect only 2.5% of outcomes with z-scores greater than 2 if there is no effect. If we would use the same criterion for negative effects, we would get another 2.5% in the lower tail of the red distribution. Combined we would have 5% of cases where we have a false positive, that is, we decide that there is an effect when there was no effect. This is why we say, p < .05 to call a result significant. The probabilty (p) of a false positive result is no greater than 5% if we keep on repeating studies and using z > 2 as the criterion to claim an effect. If there is never an effect in any of the studies we are doing, we end up with 5% false positive results. A false positive is also called a type-I error. We are making the mistake to infer from our study that an effect is present when there is no effect.

Statistical Power

5. Now that you understand significance testing (LOL), we can introduce the concept of statistical power. Effects can be large or small. For example, gender differences in height are large, gender differences in the number of sexual partners are small. Also studies can have a lot of sampling error or very little sampling error. A study of 10 men and 10 women may accidentally include 2 women who are on the basketball team. A study of 1000 men and women is likely to be more representative of the population. Based on the effect size in the population and sample size, the true signal (effect size in the population) to noise (sampling error) ratio can differ. The higher the signal to noise ratio is, the further away the sampling distribution of the real data (the blue curve) will be. In the figure below the population effect size and sampling error produced a z-score of 2.8, but actual samples will never produce this value. Sampling error will again produce different z-scores above or below the expected value of 2.8. Most samples will produce values close to 2.8, but some samples will produce more extreme deviations. Samples that overestimate the expected value of 2.8 are not a problem because these values are all greater than the criterion for statistical significance. So, in all of these samples we will make the right decision to infer that an effect is present when an effect is present. A so called true positive result. Even if sampling error leads to a small underestimation of the expected value of 2.8, the values can still be above the criterion for statistical significance and we get a true positive result.

6. When sampling error leads to more extreme underestimation of the expected value of 2.8, samples may produce results with a z-score less than 2. Now the result is no longer statistically significant. These cases are called false negatives or type-II errors. We fail to infer that an effect is present, when there actually is an effect (think about a faulty pregnancy test that fails to detect that a woman is pregnant). It does not matter whether we actually infer that there is no effect or remain indecisive about the presence of an effect. We did a study where an effect exists and we failed to provide sufficient evidence for it.

7. The Figure shows the probability of making a type-II error as the area of the blue curve on the left side of the green line. In this example, 20% of the blue curve is on the left side of the green line. This means 20% of all samples with an expected value of 2.8 will produce false negative results.

8. We can also focus on the area of the blue curve on the right side of the green line. If 20% of the area is on the left side, 80% of the area must be on the right side. This means, we have an 80% probability to obtain a true positive result; that is, a statistically significant result where the observed z-score is greater than the criterion z-score of 2. This probability is called statistical power. A study with high power has a high probability to discover real effects by producing z-scores greater than the criterion value. A study with low power has a high probability to produce a false negative result by producing z-scores below the criterion value.

9. Power depends on the criterion value and the expected value. We could reduce the type-II error and increase power in the Figure by moving the green line to the left. As we reduce the criterion to claim an effect, we reduce the area of the blue curve on the left side of the line. We are now less likely to encounter false negative results when an effect is present. However, there is a catch. By moving the green line to the left, we are increasing the area of the red curve on the right side of the red curve. This means, we are increasing the probability of a false positive result. To avoid this problem we can keep the green line where it is and move the expected value of the blue line to the right. By shifting the blue curve to the right, a smaller area of the blue curve will be on the left side of green line.

10. In order to move the blue curve to the right we need to increase the effect size or reduce sampling error. In experiments it may be possible to use more powerful manipulations to increase effect sizes. However, often increasing effect sizes is not an option. How would you increase the effect size of sex on sexual partners? Therefore, your best option is to reduce sampling error. As sampling error decreases, the blue curve moves further to the right and statistical power increases.

Practical Relevance: The Hunger Games of Science: With high power the odds are always in your favor

10. Learning about statistical power is important because the outcome of your studies does not just depend on your expertise. It also depends on factors that are not under your control. Sampling error can sometimes help you to get significance by giving you z-scores higher than the expected value, but these z-scores will not replicate because sampling error can also be your enemy and lower your z-scores. In this way, each study that you do is a bit like playing the lottery or a box of chocolates. You never know how much sampling error you will get. The good news is that you are in charge of the number of winning tickets in the lottery. A study with 20% power, has only 20% winning tickets. The other 80% say, “please play again.” A study with 80% power has 80% winning tickets. You have a high chance to get a significant result and you or others will be able to redo the study and again have a high chance to replicate your original result. It can be embarrassing when somebody conducts a replication study of your significant result and ends up with a failure to replicate your finding. You can avoid this outcome by conducting studies with high statistical power.

11. Of course, there is a price to pay. Reducing sampling error often requires more time and participants. Unfortunately, the costs increase exponentially. It is easier to increase statistical power from 20% to 50% than to increase it from 50% to 80%. It is even more costly to increase it from 80% to 90%. This is what economists call diminishing marginal utility. Initially you get a lot of bang for your buck, but eventually the costs for any real gains are too high. For this reason, Cohen (1988) recommended that researchers should aim for 80% power in their studies. This means that 80% of your initial attempts to demonstrate an effect will succeed when your hard work in planning and conducting a study produced a real effect. For 20% of the study you may either give up or try again to see whether your fist study produced a true negative result (there is no effect) or a false negative result (you did everything correctly, but sampling error handed you a losing ticket. Failure is part of life, but you have some control over the amount of failures that you encounter.

12. The End. You are now ready to learn how you can conduct power analysis for actual studies to take control your fate. Be a winner, not a loser.