Illustration from Sonja Bezjak, April Clyburne-Sherin, Philipp Conzett, Pedro Fernandes, Edit Görögh, Kerstin Helbig, Bianca Kramer, Ignasi Labastida, Kyle Niemeyer, Fotis Psomopoulos, Tony Ross-Hellauer, René Schneider, Jon Tennant, Ellen Verbakel, Helene Brinken, & Lambert Heller. (2018). Open Science Training Handbook (1.0) [Computer software]. Zenodo. https://doi.org/10.5281/zenodo.1212496

Domain: psychology_replication_status  |  Generated: 20260416_201033

Psychology's Reckoning: A Decade Later, Was the Crisis Actually Fixed?

The replication revolution promised to clean house — but the evidence suggests we tidied some rooms and quietly locked others.

Ten years ago, a team of 270 researchers published a paper in Science that shook the foundations of modern psychology. They had attempted to replicate 100 published studies — findings that had shaped textbooks, informed policy, and launched careers. Only 36% succeeded. The headline was devastating. But the more interesting question, the one that has taken a decade to begin answering, is: what happened next?


The 2015 Reproducibility Project, formally reported by the Open Science Collaboration (2015), was not the first sign of trouble, but it was the loudest alarm. Psychology had built an impressive edifice on a surprisingly sandy foundation. The original studies had reported statistically significant results 97% of the time; the replications managed only 36%. Replication effect sizes were, on average, half the magnitude of the originals. This was not mere statistical noise; it was systematic. The field had, to a significant degree, been producing findings that did not hold up.
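
One way to see how a halving of effect sizes cascades into widespread replication failure is a simple power calculation. The sketch below uses assumed round numbers (a reported d of 0.50, a true d of 0.25, 80% original power) rather than figures from the Reproducibility Project, so read it as an illustration of the arithmetic, not a reanalysis.

```python
# Illustration only: assumed effect sizes, not Reproducibility Project data.
# If a study had ~80% power for the effect it reported (d = 0.50), but the true
# effect is half that, a replication with the same sample size is underpowered.
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()

# Per-group sample size giving 80% power for the reported effect (d = 0.50)
n_per_group = analysis.solve_power(effect_size=0.50, alpha=0.05, power=0.80)

# Power of a same-sized replication if the true effect is only d = 0.25
replication_power = analysis.power(effect_size=0.25, nobs1=n_per_group, alpha=0.05)

print(f"n per group for 80% power at d = 0.50: {n_per_group:.0f}")        # ~64
print(f"replication power if true d = 0.25:    {replication_power:.2f}")  # ~0.29
```

Under those assumptions alone, a faithful same-size replication succeeds less than a third of the time, before publication bias or analytic flexibility in the originals is even considered.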

The immediate response was twofold. Some argued that the replications were flawed, that replicators had missed crucial contextual details, and that the original findings were real but fragile. Others argued the numbers spoke for themselves. Both camps, it turns out, were partly right. But before we get to that, it is worth understanding the scale of the problem the field was inheriting.

Klein and colleagues had already begun mapping the terrain. Their 2014 Many Labs 1 project found that 10 of 13 classic effects replicated consistently across 36 independent samples and over 6,000 participants — an encouraging result that contained a sting in its tail. The three that failed — imagined contact reducing prejudice, flag priming influencing conservatism, and currency priming affecting system justification — were precisely the kinds of subtle, context-dependent social priming effects that had come to define ambitious social psychology. The field's flashiest claims were its most vulnerable.

Then came the preregistered, multi-lab assaults on individual findings. The ego depletion effect, Baumeister's influential idea that willpower is a limited resource that depletes with use, became a case study in controlled demolition. Hagger and colleagues (2016) recruited 23 labs and 2,141 participants for a standardised-protocol replication. The effect size they found was d = 0.04, with a confidence interval that comfortably straddled zero. Vohs and colleagues (2021) followed with an even larger effort: 36 laboratories, 3,531 participants, and a Bayesian meta-analysis that found the data were four times more likely under the null hypothesis than under the informed-prior alternative. After two of the most rigorous replication attempts in psychology's history, the resource model of self-control looked, at minimum, profoundly overstated.
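
The Bayesian comparison reported by Vohs and colleagues can be illustrated with a back-of-the-envelope Bayes factor for a pooled effect estimate. The numbers below (the estimate, its standard error, and the informed prior) are invented for the illustration, and the single-normal marginalisation is a toy stand-in for their actual meta-analytic model.

```python
# Toy Bayes factor: null hypothesis (d = 0) versus an informed prior on d.
# All numbers are assumed for illustration; this is not Vohs et al.'s model.
from scipy.stats import norm

d_hat, se = 0.06, 0.05              # hypothetical pooled estimate and its standard error
prior_mean, prior_sd = 0.30, 0.15   # hypothetical informed prior on the true effect

# Marginal likelihood of the estimate under each hypothesis
m_null = norm.pdf(d_hat, loc=0.0, scale=se)
m_alt = norm.pdf(d_hat, loc=prior_mean, scale=(se**2 + prior_sd**2) ** 0.5)

bf01 = m_null / m_alt   # values above 1 favour the null
print(f"BF01 = {bf01:.1f}")   # with these made-up inputs, roughly 5
```

A Bayes factor of about 4 in favour of the null, the figure Vohs and colleagues report, means the observed data are four times more probable if the true effect is zero than if it matches the informed prior.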


The social priming literature fared even worse. Giolla and colleagues (2024) conducted a systematic meta-analysis of 70 replication attempts across 49 unique original studies. The pattern was stark: original studies had achieved statistical significance 85.71% of the time; replications managed just 17.14%. The replication p-values (median = 0.630) bore essentially no resemblance to the originals (median = 0.017). This is not the signature of a finding that is real but contextually fragile — it is the signature of findings that were never as robust as reported. The intelligence priming literature was particularly damning: two papers in that space were retracted for outright malpractice, and at least one failed replication had been suppressed for over a decade before anyone noticed ("Priming intelligent behavior: an elusive phenomenon," 2013).

Power posing — the idea that holding an expansive body posture for two minutes could raise testosterone, lower cortisol, and increase risk tolerance — became perhaps the most publicly scrutinised case. Simmons and Simonsohn (2017) subjected the published evidence base to p-curve analysis and found that the distribution of p-values was indistinguishable from what you would expect if the true effect were zero. As they put it: the existing power posing evidence was "too weak to justify moderator searches or practical advocacy." Subsequent replication attempts confirmed the picture. Efrat and colleagues (2024) found no testosterone effects and no feelings-of-power effects, though they did find cortisol reductions and replicated risk-taking effects. Klaschinski and colleagues (2017) found no effects on dominance or hireability in job interview contexts. The literature fractured into small, inconsistent findings with no coherent theoretical core.
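
P-curve exploits a simple statistical fact: when a studied effect is real, statistically significant p-values bunch up near zero, and when the true effect is nil, significant p-values are spread evenly between 0 and .05. The simulation below is a generic illustration of that logic with arbitrary parameters; it is not Simmons and Simonsohn's procedure or their data.

```python
# Generic p-curve intuition: where do significant p-values fall when the true
# effect is zero versus real? Parameters are arbitrary illustration values.
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(0)

def significant_pvalues(true_d, n_per_group=20, n_studies=10000, alpha=0.05):
    """Simulate many two-group studies and keep only the significant p-values."""
    kept = []
    for _ in range(n_studies):
        control = rng.normal(0.0, 1.0, n_per_group)
        treated = rng.normal(true_d, 1.0, n_per_group)
        p = ttest_ind(control, treated).pvalue
        if p < alpha:
            kept.append(p)
    return np.array(kept)

for d in (0.0, 0.5):
    p = significant_pvalues(d)
    print(f"true d = {d}: {(p < 0.025).mean():.0%} of significant p-values fall below .025")
# When d = 0 the significant p-values are roughly uniform (about half below .025);
# when the effect is real they pile up near zero (well over half below .025).
```

A published literature whose significant p-values look like the d = 0 case is what Simmons and Simonsohn reported finding for power posing.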

Precognition — Daryl Bem's extraordinary 2011 claim that people can anticipate random future events — was always the sharpest test of the system. If the field's methods could support precognition, they could support almost anything. Muhmenthaler and colleagues (2022) mounted a large-scale replication with over 2,000 participants across three experiments and found nothing. Schlitz and colleagues (2021), in two preregistered international collaborations that included Bem himself, failed to support the primary hypothesis. The "Correcting the past" replication series found that seven of seven attempts to replicate Bem's Experiments 8 and 9 failed to detect the effect. Bem's own meta-analysis (Bem et al., 2015; 2016) claims a small but statistically reliable effect across 90 studies — but this is precisely the kind of conclusion that Kvarven and colleagues (2019) showed is systematically inflated in psychology meta-analyses, where meta-analytic effect sizes average three times larger than results from pre-registered multi-lab replications.
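
The inflation Kvarven and colleagues document has a mechanical explanation: if statistical significance strongly predicts publication, a meta-analysis of the published record averages over a truncated set of overestimates. The simulation below illustrates that mechanism with arbitrary assumed parameters; it is not a reanalysis of the precognition literature or any other literature discussed here.

```python
# How selective publication inflates a naive meta-analytic average.
# All parameters are arbitrary; this illustrates the mechanism only.
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(1)
true_d, n_per_group, n_studies = 0.10, 30, 2000

all_d, published_d = [], []
for _ in range(n_studies):
    control = rng.normal(0.0, 1.0, n_per_group)
    treated = rng.normal(true_d, 1.0, n_per_group)
    pooled_sd = np.sqrt((control.var(ddof=1) + treated.var(ddof=1)) / 2)
    d = (treated.mean() - control.mean()) / pooled_sd
    all_d.append(d)
    if ttest_ind(control, treated).pvalue < 0.05:  # only "significant" studies get published
        published_d.append(d)

print(f"true effect:                d = {true_d:.2f}")
print(f"average across all studies: d = {np.mean(all_d):.2f}")        # close to the truth
print(f"average of published ones:  d = {np.mean(published_d):.2f}")  # several times larger
```

The same arithmetic is one reason the preregistered multi-lab replications, which report their results regardless of outcome, land so far below the published meta-analyses.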

Across the analysed literature, replication effect sizes were systematically and substantially smaller than originals — with a median 75% reduction in Cohen's d in the Many Labs 2 project alone, where the original median d of 0.60 shrank to 0.15 in replication.


So: crisis solved? The answer requires distinguishing between the meta-level question (has the field reformed?) and the object-level question (do the old findings hold up?). The answers are different.

On reform, the picture is genuinely encouraging. Preregistration has become standard practice for high-quality work. Multi-lab collaborations now routinely produce samples an order of magnitude larger than the originals being tested. Open data policies allow post-hoc scrutiny that was simply impossible in 2010. The Registered Report format — where journals commit to publication before data collection, removing the incentive to p-hack — has spread across dozens of outlets. These are real changes with measurable effects.

On the object-level, the picture is messier. Not everything failed. The body-specificity hypothesis — that right-handed people associate right-space with positive valence — replicated cross-culturally across 12 countries and 2,222 participants (Yamada et al., 2024). The Ebbinghaus forgetting curve, first observed in the 1880s, was reproduced with striking fidelity (Murre & Dros, 2015). Cognitive effects like the Stroop task, the Simon effect, and attentional priming for basic perceptual features replicated robustly online (Crump et al., 2013). The status quo bias replicated in 3 of 4 decision scenarios (Xiao et al., 2021). The ambivalence-certainty interaction on political opinion stability held in a registered voter sample (Luttrell et al., 2020). Social class effects, examined at scale across four countries by Batruch and colleagues (2025), replicated in roughly 50% of cases — a middling but not catastrophic result.

Psychology's replicability problem is not uniform. Cognitive and perceptual effects with clean operationalisations tend to replicate; subtle social priming effects and small-sample personality findings tend not to. The crisis was never equally distributed across the field, even if it was reported as though it were.

The interpretive war over what replication failures mean remains genuinely unresolved. One camp holds that a well-powered null result is a null result, and the original effect should be abandoned. Another insists that replication failures often reflect unmeasured moderators — that what looks like a null effect is actually a finding about boundary conditions. Bressan (2019a, 2019b, 2020) examined several failures from the Reproducibility Project and argued that some contained methodological artifacts — stimulus allocation biases and confounds that, when corrected, restored the original effect. Chatard and colleagues (2020) demonstrated that the Many Labs 4 failure to replicate mortality salience effects was partly driven by the inclusion of underpowered labs that violated preregistered sample size criteria; excluding those labs produced a successful replication in the expert-advice condition.

These arguments are not merely special pleading. Baranski and colleagues (2020) showed that protocol matters enormously: when the Shnabel & Nadler (2008) reconciliation model was tested with more relatable materials rather than the original protocol, replication succeeded. Sadeghiyeh and colleagues (2018) found that a failed exploration replication was explained by a previously unknown moderator — whether participants were actively or passively receiving information — and that switching between conditions within a session could toggle the effect on and off. These are real discoveries, not excuses.

But the moderator explanation cannot be invoked infinitely. At some point, an effect that only appears under unspecified conditions, with unspecified samples, using unspecified materials, is not a robust phenomenon — it is a fragile local observation. Zeevi and colleagues (2020) offered a more structural critique: analysing 88 Reproducibility Project papers, they found that each reported an average of 77.7 results without correcting for multiple comparisons. A statistical adjustment for this multiple-testing problem rendered 21 of 88 focal results non-significant — and those 21 constituted over a third of all findings that failed to replicate. The crisis, in this reading, was partly a mathematics problem hiding in plain sight.
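
The scale of the problem Zeevi and colleagues identify is easy to put in numbers. With roughly 78 tests per paper and no correction, at least one false positive per paper is close to a statistical certainty, assuming independent tests (real papers violate that assumption, so treat this as an upper-bound sketch). The Bonferroni adjustment shown below is a standard textbook correction, not necessarily the procedure Zeevi and colleagues applied.

```python
# Family-wise error rate for many uncorrected tests, assuming independence.
# 78 approximates the per-paper test count reported by Zeevi et al. (2020).
m, alpha = 78, 0.05

fwer_uncorrected = 1 - (1 - alpha) ** m
bonferroni_threshold = alpha / m                    # per-test alpha capping FWER near 5%
fwer_corrected = 1 - (1 - bonferroni_threshold) ** m

print(f"P(at least one false positive), uncorrected: {fwer_uncorrected:.2f}")      # ~0.98
print(f"Bonferroni per-test threshold:               {bonferroni_threshold:.4f}")  # 0.0006
print(f"P(at least one false positive), corrected:   {fwer_corrected:.2f}")        # ~0.05
```

Preregistering which of those dozens of results is the focal, confirmatory test is what removes the need to treat every one of them as a potential headline finding.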

Zeevi and colleagues' (2020) reanalysis found that over a third of non-replicable RPP findings could be explained by uncorrected multiple comparisons alone — a systematic statistical error that preregistration, if properly enforced, would largely eliminate.


What does this mean, practically? The most important implication is that the distinction between a finding and a phenomenon needs to become central to how psychology trains its researchers and communicates its results. A statistically significant result in a single small study is a finding. A phenomenon is something that survives methodological variation, larger samples, and adversarial testing. The crisis revealed that psychology had been treating findings as phenomena — citing a single study as established fact, building theories on results that were never robust enough to bear that weight.

The second implication is harder to sit with. Many of the findings that failed to replicate are still in the textbooks. They are still cited in policy documents. They are still taught to undergraduates as established science. The cognitive bias literature, the implicit bias literature, the ego depletion literature: all of these have been substantially revised by the replication evidence, yet the revision has not propagated downstream at anything like the speed of the original claims. The scientific community may have accepted that the crisis is real. The broader culture that consumed and acted on those findings largely has not.

The replication crisis was not solved. It was partially addressed, honestly confronted, and thoroughly complicated. That is, in its own way, progress. But the work of separating what psychology actually knows from what it believed it knew is still, unmistakably, ongoing.


How this research was conducted

This analysis synthesised findings from 109 published papers examining replication attempts of pre-2015 psychology studies. Literature was identified across Semantic Scholar, Crossref, arXiv, OpenAlex, PubMed, and Wikipedia, spanning 2013–2026. Papers were included only if they reported explicit replication attempts with sufficient methodological detail to assess fidelity to the original study. Claims were verified against source text. The analysis covers over 250 individual replication attempts across social, cognitive, and personality psychology domains.

This analysis was produced by Evidensity Research. If you need source-verified evidence synthesis for your own research, organisation, or content — get in touch.


Further Reading