Longterm Wiki

[2502.14297] Evaluating Sakana's AI Scientist: Bold Claims, Mixed Results, and a Promising Future?

paper

Authors

Joeran Beel · Min-Yen Kan · Moritz Baumgart

Credibility Rating

3/5
Good (3)

Good quality. Reputable source with community review or editorial standards, but less rigorous than peer-reviewed venues.

Rating inherited from publication venue: arXiv

An independent evaluation of Sakana's AI Scientist system that assesses its claims of autonomous research capability, termed Artificial Research Intelligence (ARI), examining the current state of AI autonomy in scientific research and the implications for AGI development.

Paper Details

Citations
1
0 influential
Year
2025
Methodology
peer-reviewed
Categories
ACM SIGIR Forum

Metadata

arXiv preprint · analysis

Abstract

A major step toward Artificial General Intelligence (AGI) and Super Intelligence is AI's ability to autonomously conduct research - what we term Artificial Research Intelligence (ARI). If machines could generate hypotheses, conduct experiments, and write research papers without human intervention, it would transform science. Sakana recently introduced the 'AI Scientist', claiming that it conducts research autonomously, i.e. implying that ARI has been achieved. The AI Scientist gained much attention, but a thorough independent evaluation had yet to be conducted. Our evaluation of the AI Scientist reveals critical shortcomings. The system's literature reviews produced poor novelty assessments, often misclassifying established concepts (e.g., micro-batching for stochastic gradient descent) as novel. It also struggled with experiment execution: 42% of experiments failed due to coding errors, while others produced flawed or misleading results. Code modifications were minimal, averaging 8% more characters per iteration, suggesting limited adaptability. Generated manuscripts were poorly substantiated, with a median of five citations, most of them outdated (only five of 34 were from 2020 or later). Structural errors were frequent, including missing figures, repeated sections, and placeholder text such as 'Conclusions Here'. Some papers contained hallucinated numerical results. Despite these flaws, the AI Scientist represents a leap forward in research automation. It generates full research manuscripts with minimal human input, challenging expectations of AI-driven science. Many reviewers might struggle to distinguish its work from that of human researchers. While its quality resembles a rushed undergraduate paper, its speed and cost efficiency are unprecedented: it produces a full paper for USD 6 to 15 with 3.5 hours of human involvement, far outpacing traditional researchers.

Summary

This paper presents an independent evaluation of Sakana's 'AI Scientist' system, which claims to autonomously conduct research (Artificial Research Intelligence). While the system generates full research manuscripts with minimal human input at unprecedented speed and cost ($6-15 per paper), the evaluation reveals critical shortcomings: poor novelty assessments that misclassify established concepts as novel, a 42% experiment failure rate due to coding errors, minimal code changes across iterations (an average 8% character increase), poorly substantiated manuscripts with few and mostly outdated citations, frequent structural errors (missing figures, repeated sections, placeholder text), and hallucinated numerical results in some papers. Despite these flaws, the work represents a significant leap in research automation that could challenge peer review processes.
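As a rough illustration of the code-growth metric the authors report (average percentage increase in character count per iteration), the measurement can be sketched as follows. This is a minimal sketch, not the paper's actual instrumentation; the function name and sample data are hypothetical:

```python
def avg_char_growth(snapshots):
    """Average fractional character-count increase between successive
    code snapshots (e.g., one snapshot per AI Scientist iteration)."""
    growths = [
        (len(new) - len(old)) / len(old)
        for old, new in zip(snapshots, snapshots[1:])
    ]
    return sum(growths) / len(growths)

# Hypothetical example: three iterations, each adding ~10% more characters.
print(avg_char_growth(["x" * 100, "x" * 110, "x" * 121]))  # → 0.1
```

Under this reading, the reported 8% figure would correspond to a return value of roughly 0.08, i.e., each iteration left the experimental code largely unchanged.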

Cited by 1 page

| Page | Type | Quality |
| --- | --- | --- |
| Scientific Research Capabilities | Capability | 68.0 |

Cached Content Preview

HTTP 200 · Fetched Mar 20, 2026 · 67 KB
# Evaluating Sakana’s AI Scientist for Autonomous Research: Wishful Thinking or an Emerging Reality Towards ‘Artificial Research Intelligence’ (ARI)?

Joeran Beel
Joeran Beel ([joeran.beel@uni-siegen.de](mailto:joeran.beel@uni-siegen.de), [ORCID 0000-0002-4537-5573](https://orcid.org/0000-0002-4537-5573 "ORCID identifier")), University of Siegen, [Intelligent Systems Group](https://isg.beel.org/ "") & [Recommender-Systems.com](https://recommender-systems.com/ ""), Siegen, Germany
Min-Yen Kan ([kanmy@comp.nus.edu.sg](mailto:kanmy@comp.nus.edu.sg), [ORCID 0000-0001-8507-3716](https://orcid.org/0000-0001-8507-3716 "ORCID identifier")), National University of Singapore, [Web, Information Retrieval / Natural Language Processing Group (WING)](https://wing.comp.nus.edu.sg/ ""), Singapore
Moritz Baumgart ([moritz.baumgart@student.uni-siegen.de](mailto:moritz.baumgart@student.uni-siegen.de), [ORCID 0009-0007-1322-1450](https://orcid.org/0009-0007-1322-1450 "ORCID identifier")), University of Siegen, Siegen, Germany

(2025-02)

###### Abstract.

Recently, Sakana.ai introduced the AI Scientist, a system claiming to automate the entire research lifecycle and conduct research autonomously, a concept we term Artificial Research Intelligence (ARI). Achieving ARI would be a major milestone toward Artificial General Intelligence (AGI) and a prerequisite to achieving Super Intelligence. The AI Scientist received much attention in the academic and broader AI community. A thorough evaluation of the AI Scientist, however, had not yet been conducted.

We evaluated the AI Scientist and found several critical shortcomings. The system’s literature review process is inadequate, relying on simplistic keyword searches rather than profound synthesis, which leads to poor novelty assessments. In our experiments, several generated research ideas were incorrectly classified as novel, including well-established concepts such as micro-batching for stochastic gradient descent (SGD). The AI Scientist also lacks robustness in experiment execution—five out of twelve proposed experiments (42%) failed due to coding errors, and those that did run often produced logically flawed or misleading results. In one case, an experiment designed to optimize energy efficiency reported improvements in accuracy while consuming more computational resources, contradicting its stated goal. Furthermore, the system modifies experimental code minimally, with each iteration adding only 8% more characters on average, suggesting limited adaptability. The generated manuscripts were poorly substantiated, with a median of just five citations per paper—most of which were outdated (only five out of 34 citations were from 2020 or later). Structural errors were frequent, including missing figures, repeated sections, and placeholder text such as “Conclusions Here”. Hallucinated numerical results were contained in several manuscripts, undermining the reliability of its outputs.

Despite its limitations, the AI Scientist represents a significant leap forward i

... (truncated, 67 KB total)