# IFEval

Instruction-Following Evaluation (IFEval) is a benchmark testing whether LLMs can follow explicit, verifiable formatting constraints (e.g., "write exactly 3 paragraphs", "include these keywords").
- Models tested: 10
- Best score: 93.2%
- Median score: 87.35%
- Scoring: accuracy
- Introduced: 2023-11
- Maintainer: Google Research

## Leaderboard (10 models)
| # | Model | Developer | Score |
|---|---|---|---|
| 🥇 | o3 | OpenAI | 93.2% |
| 🥈 | Llama 3.3 | Meta AI (FAIR) | 92.1% |
| 🥉 | o4-mini | OpenAI | 91.7% |
| 4 | GPT-4.1 | OpenAI | 90.4% |
| 5 | Llama 3.1 | Meta AI (FAIR) | 88.6% |
| 6 | DeepSeek V3 | DeepSeek | 86.1% |
| 7 | GPT-4o | OpenAI | 84.5% |
| 8 | DeepSeek R1 | DeepSeek | 83.3% |
| 9 | GPT-4o mini | OpenAI | 80.6% |
| 10 | Mistral Large 2 | Mistral AI | 80.1% |
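Constraints of this kind are attractive precisely because they can be verified programmatically, without a judge model. The sketch below illustrates two such checks, a paragraph count and a keyword-inclusion check, in the spirit of the two examples above; it is a minimal illustration with hypothetical function names, not the benchmark's actual implementation (IFEval defines a larger set of verifiable instruction types).

```python
def check_paragraph_count(response: str, expected: int) -> bool:
    """Verify the response contains exactly `expected` paragraphs,
    treating blank lines as paragraph separators."""
    paragraphs = [p for p in response.split("\n\n") if p.strip()]
    return len(paragraphs) == expected


def check_keywords(response: str, keywords: list[str]) -> bool:
    """Verify every required keyword appears (case-insensitively)."""
    text = response.lower()
    return all(kw.lower() in text for kw in keywords)


# Example: a response scored against the constraints
# "write exactly 3 paragraphs" and "include the keyword 'apple'".
resp = "First paragraph.\n\nSecond paragraph mentions apple.\n\nThird."
print(check_paragraph_count(resp, 3))       # True
print(check_keywords(resp, ["apple"]))      # True
print(check_keywords(resp, ["banana"]))     # False
```

A model's accuracy score is then simply the fraction of such checks it passes across the prompt set, which is what the leaderboard percentages above report.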