MLE-bench
Agentic machine learning engineering benchmark from OpenAI that evaluates AI agents on 75 real Kaggle competitions, testing data science and ML engineering skills.
Models Tested: 3
Best Score: 16.9%
Median Score: 8.7%
Scoring: percentage of competitions in which the agent earns at least a bronze medal
Introduced: 2024-09
Maintainer: OpenAI
Leaderboard (3 models)
| # | Model | Developer | Score |
|---|---|---|---|
| 🥇 | o1 | OpenAI | 16.9% |
| 🥈 | GPT-4o | OpenAI | 8.7% |
| 🥉 | Claude 3.5 Sonnet | Anthropic | 7.6% |
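
The summary figures above (models tested, best score, median score) follow directly from the leaderboard rows. As a minimal sketch, assuming the scores are simply copied by hand from the table into a dictionary (the variable names here are illustrative, not part of any MLE-bench tooling), they can be recomputed like this:

```python
from statistics import median

# Scores in percent, copied by hand from the leaderboard table (assumption:
# these three rows are the complete set of tested models).
scores = {
    "o1": 16.9,
    "GPT-4o": 8.7,
    "Claude 3.5 Sonnet": 7.6,
}

# The best score is the maximum; the median of three values is the middle one.
best_model = max(scores, key=scores.get)
best_score = scores[best_model]
median_score = median(scores.values())

print(f"Models tested: {len(scores)}")
print(f"Best score: {best_score}% ({best_model})")
print(f"Median score: {median_score}%")
```

With only three entries, the median is just the second-ranked score, so it will shift noticeably as soon as more models are added.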