
interdisciplinary review of AI evaluation

paper

Authors

Maria Eriksson·Erasmo Purificato·Arman Noroozian·João Vinagre·Guillaume Chaslot·Emilia Gómez·David Fernandez-Llorca

Credibility Rating

3/5 (Good)

Good quality. Reputable source with community review or editorial standards, but less rigorous than peer-reviewed venues.

Rating inherited from publication venue: arXiv

Interdisciplinary review examining how AI benchmarks are used to evaluate model performance, capabilities, and safety, with attention to their growing influence on AI development and regulatory frameworks and to concerns about how they evaluate high-impact capabilities.

Paper Details

Citations
0 (3 influential)
Year
2024
Methodology
survey

Metadata

arxiv preprint · analysis

Abstract

Quantitative Artificial Intelligence (AI) Benchmarks have emerged as fundamental tools for evaluating the performance, capability, and safety of AI models and systems. Currently, they shape the direction of AI development and are playing an increasingly prominent role in regulatory frameworks. As their influence grows, however, so too do concerns about how and with what effects they evaluate highly sensitive topics such as capabilities, including high-impact capabilities, safety and systemic risks. This paper presents an interdisciplinary meta-review of about 100 studies that discuss shortcomings in quantitative benchmarking practices, published in the last 10 years. It brings together many fine-grained issues in the design and application of benchmarks (such as biases in dataset creation, inadequate documentation, data contamination, and failures to distinguish signal from noise) with broader sociotechnical issues (such as an over-focus on evaluating text-based AI models according to one-time testing logic that fails to account for how AI models are increasingly multimodal and interact with humans and other technical systems). Our review also highlights a series of systemic flaws in current benchmarking practices, such as misaligned incentives, construct validity issues, unknown unknowns, and problems with the gaming of benchmark results. Furthermore, it underscores how benchmark practices are fundamentally shaped by cultural, commercial and competitive dynamics that often prioritise state-of-the-art performance at the expense of broader societal concerns. By providing an overview of risks associated with existing benchmarking procedures, we problematise disproportionate trust placed in benchmarks and contribute to ongoing efforts to improve the accountability and relevance of quantitative AI benchmarks within the complexities of real-world scenarios.
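
To make the data-contamination concern concrete, the sketch below (illustrative only, not taken from the paper; the strings, the n-gram length, and the 0.5 threshold are hypothetical choices) shows a naive n-gram overlap check of the kind often used as a first pass for spotting benchmark items that appear near-verbatim in training text.

```python
# Minimal sketch (not from the paper): a naive n-gram overlap check for
# benchmark data contamination, i.e. evaluation items that also occur
# (near-)verbatim in a model's training corpus. All inputs are hypothetical.
def ngrams(text: str, n: int = 8) -> set[tuple[str, ...]]:
    """Return the set of whitespace-token n-grams in a lowercased text."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def looks_contaminated(benchmark_item: str, training_text: str,
                       n: int = 8, threshold: float = 0.5) -> bool:
    """Flag an item if a large fraction of its n-grams also occur in the training text."""
    item_grams = ngrams(benchmark_item, n)
    if not item_grams:
        return False
    overlap = len(item_grams & ngrams(training_text, n)) / len(item_grams)
    return overlap >= threshold

# Example with hypothetical strings: the item appears verbatim inside the corpus.
item = "What is the capital of France? Answer with the name of the city only please"
corpus = "... What is the capital of France? Answer with the name of the city only please ..."
print(looks_contaminated(item, corpus))  # True
```

Real contamination audits are more involved (deduplication, fuzzy matching, paraphrase detection), but even this rough heuristic illustrates why undocumented training data makes benchmark scores hard to interpret.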

Summary

This interdisciplinary meta-review of ~100 studies examines critical shortcomings in quantitative AI benchmarking practices over the past decade. The paper identifies fine-grained technical issues (dataset biases, data contamination, inadequate documentation) alongside broader sociotechnical problems (overemphasis on text-based single-test evaluation, failure to account for multimodal and interactive AI systems). The authors highlight systemic flaws including misaligned incentives, construct validity issues, and gaming of results, arguing that benchmarking practices are shaped by commercial and competitive dynamics that often prioritize performance metrics over societal concerns. The review challenges the disproportionate trust placed in benchmarks and advocates for improved accountability and real-world relevance in AI evaluation.
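
The paper's point about failing to distinguish signal from noise can be illustrated with a small numerical check. The sketch below is not from the paper; the model scores, item count, and the normal-approximation 95% interval are hypothetical, illustrative choices.

```python
# Minimal sketch (not from the paper): why small leaderboard gaps can be
# statistical noise. Two hypothetical models are scored on the same n items;
# we ask whether the observed accuracy gap exceeds its ~95% sampling error.
import math

def accuracy_gap_is_noise(correct_a: int, correct_b: int, n: int) -> bool:
    """Return True if the accuracy gap between two models is within the ~95%
    normal-approximation margin of error for n independently scored items."""
    acc_a, acc_b = correct_a / n, correct_b / n
    # Standard error of the difference of two binomial proportions
    # (assumes independent errors; a paired test would be tighter).
    se = math.sqrt(acc_a * (1 - acc_a) / n + acc_b * (1 - acc_b) / n)
    return abs(acc_a - acc_b) <= 1.96 * se

# Example: an 86.0% vs 84.5% "win" on a 1,000-item benchmark.
print(accuracy_gap_is_noise(860, 845, 1000))  # True: the 1.5-point gap is within noise
```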

Cited by 1 page

| Page | Type | Quality |
|------|------|---------|
| AI Capability Threshold Model | Analysis | 72.0 |

Cached Content Preview

HTTP 200 · Fetched Mar 20, 2026 · 98 KB
[License: arXiv.org perpetual non-exclusive license](https://info.arxiv.org/help/license/index.html#licenses-available)

arXiv:2502.06559v2 [cs.AI] 25 May 2025

# Can We Trust AI Benchmarks? An Interdisciplinary Review of Current Issues in AI Evaluation

Disclaimer: The views expressed in this paper are purely those of the authors and may not, under any circumstances, be regarded as an official position of the European Commission.


[Maria Eriksson](https://orcid.org/0000-0002-7534-4268) · European Commission, Joint Research Centre (JRC), Seville, Spain · maria.eriksson@ec.europa.eu

[Erasmo Purificato](https://orcid.org/0000-0002-5506-3020) · European Commission, Joint Research Centre (JRC), Ispra, Italy · erasmo.purificato@ec.europa.eu

Arman Noroozian · European Commission, Joint Research Centre (JRC), Brussels, Belgium · arman.noroozian@ec.europa.eu

[João Vinagre](https://orcid.org/0000-0001-6219-3977) · European Commission, Joint Research Centre (JRC), Seville, Spain · joao.vinagre@ec.europa.eu

Guillaume Chaslot · European Commission, Joint Research Centre (JRC), Brussels, Belgium · guillaume.chaslot@ec.europa.eu

[Emilia Gómez](https://orcid.org/0000-0003-4983-3989) · European Commission, Joint Research Centre (JRC), Seville, Spain · emilia.gomez-gutierrez@ec.europa.eu

[David Fernandez-Llorca](https://orcid.org/0000-0003-2433-7110) · European Commission, Joint Research Centre (JRC), Seville, Spain · david.fernandez-llorca@ec.europa.eu


###### Abstract


Quantitative Artificial Intelligence (AI) Benchmarks have emerged as fundamental tools for evaluating the performance, capability, and safety of AI models and systems. Currently, they shape the direction of AI development and are playing an increasingly prominent role in regulatory frameworks. As their influence grows, however, so too does concerns about how and with what effects they evaluate highly sensitive topics such as capabilities, including high-impact capabilities, safety and systemic risks. This paper presents an interdisciplinary meta-review of about 100 studies that discuss shortcomings in quantitative benchmarking practices, published in the last 10 years. It brings together many fine-grained issues in the design and application of benchmarks (such as biases in dataset creation, inadequate documentation, data contamination, and failures to distinguish signal from noise) with broader sociotechnical issues (such as an over-focus on evaluating text-based AI models according to one-time testing log

... (truncated, 98 KB total)
Resource ID: c5a21da9e0c0cdeb | Stable ID: MjA2OGU2Nz