Longterm Wiki

MLE-bench

Agentic

MLE-bench is OpenAI's machine learning engineering benchmark. It evaluates AI agents on 75 real Kaggle competitions that test data science and ML engineering skills.

Models Tested: 3
Best Score: 16.9%
Median Score: 8.7%
Scoring: percentage
Introduced: 2024-09
Maintainer: OpenAI

Leaderboard (3 models)

#    Model               Developer   Score
🥇   o1                  OpenAI      16.9%
🥈   GPT-4o              OpenAI      8.7%
🥉   Claude 3.5 Sonnet   Anthropic   7.6%
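
The summary figures at the top of the page (models tested, best score, median score) appear to be derived directly from the leaderboard rows. The following is a minimal sketch of that arithmetic in Python; the variable names are illustrative and are not taken from the wiki's own code.

```python
from statistics import median

# Leaderboard rows copied from this page: (model, developer, score in percent).
leaderboard = [
    ("o1", "OpenAI", 16.9),
    ("GPT-4o", "OpenAI", 8.7),
    ("Claude 3.5 Sonnet", "Anthropic", 7.6),
]

# Extract the scores and compute the three summary figures.
scores = [score for _, _, score in leaderboard]
models_tested = len(scores)
best_score = max(scores)
median_score = median(scores)  # with three entries, this is the middle value

print(f"Models Tested: {models_tested}")
print(f"Best Score: {best_score}%")
print(f"Median Score: {median_score}%")
```

With only three entries, the median is simply the second-place score, so it will shift noticeably as more models are added.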