How do AI-led, AI-assisted, and human-only teams compare when testing the reproducibility of published results from leading social science journals?
Brodeur, Abel; Valenta, David; Marcoci, Alexandru et al. (among the authors: Asanov, Igor; Noha, Anastasiya-Mariya; Bruns, Stephan) (2026): AI-assisted teams outperform AI-led teams but not human-only teams in assessing research reproducibility in quantitative social science. Proceedings of the National Academy of Sciences of the United States of America 123 (22), e2524747123.
The authors randomly assigned 288 researchers to 103 teams working under one of three conditions: human-only, AI-assisted (using ChatGPT as a collaborative tool), or AI-led (ChatGPT operating with minimal human oversight). The teams reproduced published results from leading social science journals, detected coding errors, and proposed robustness checks. Human-only and AI-assisted teams achieved comparable reproduction rates (94% vs. 91%) and performed similarly on most outcomes, except that human-only teams identified significantly more major coding errors. Both substantially outperformed AI-led teams, which achieved only a 37% reproduction rate, detected fewer errors across all categories, proposed weaker robustness checks, and required more time. This autonomous approach, however, likely represents only a lower bound of AI capabilities. Despite rapid model advances, expert human judgment currently remains indispensable for reliable empirical verification. While AI assistance did not degrade most outcomes, it provided no measurable advantages and was associated with reduced detection of major errors. Still, the 37% autonomous reproduction rate indicates that AI could provide value in settings where scale or cost constraints preclude human review of papers, even though general-purpose LLMs offer no immediate advantages for human-supervised verification.