Longterm Wiki

IFEval

General

Instruction-Following Evaluation (IFEval) is a benchmark that tests whether LLMs can follow explicit, automatically verifiable formatting constraints (e.g., 'write exactly 3 paragraphs', 'include these keywords').
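Constraints of this kind can be checked by a program rather than a human grader. The following is a minimal sketch of that idea, not the benchmark's official checker: the function names, the blank-line definition of a paragraph, and the case-insensitive keyword match are all assumptions made for illustration.

```python
import re
from typing import Callable, Iterable


def count_paragraphs(text: str) -> int:
    """Count non-empty blocks separated by one or more blank lines (assumed definition)."""
    blocks = re.split(r"\n\s*\n", text.strip())
    return sum(1 for block in blocks if block.strip())


def has_exact_paragraphs(text: str, n: int) -> bool:
    """Check a constraint such as 'write exactly 3 paragraphs'."""
    return count_paragraphs(text) == n


def includes_keywords(text: str, keywords: Iterable[str]) -> bool:
    """Check a constraint such as 'include these keywords' (case-insensitive substring match)."""
    lowered = text.lower()
    return all(keyword.lower() in lowered for keyword in keywords)


def follows_all(text: str, checks: Iterable[Callable[[str], bool]]) -> bool:
    """A response passes only if every attached constraint check passes."""
    return all(check(text) for check in checks)


# Hypothetical model response used only to exercise the checks.
response = "Apples are sweet.\n\nPears are juicy.\n\nBoth are fruit."
checks = [
    lambda t: has_exact_paragraphs(t, 3),
    lambda t: includes_keywords(t, ["apples", "pears"]),
]
passed = follows_all(response, checks)
```

Under these assumptions, an accuracy score is the fraction of responses (or of individual constraints) for which the checks pass.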

Models Tested: 10
Best Score: 93.2%
Median Score: 87.35%
Scoring: accuracy
Introduced: 2023-11
Maintainer: Google Research

Leaderboard (10 models)

#    Model             Developer         Score
1    o3                OpenAI            93.2%
2    Llama 3.3         Meta AI (FAIR)    92.1%
3    o4-mini           OpenAI            91.7%
4    GPT-4.1           OpenAI            90.4%
5    Llama 3.1         Meta AI (FAIR)    88.6%
6    DeepSeek V3       DeepSeek          86.1%
7    GPT-4o            OpenAI            84.5%
8    DeepSeek R1       DeepSeek          83.3%
9    GPT-4o mini       OpenAI            80.6%
10   Mistral Large 2   Mistral AI        80.1%
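
The Best Score and Median Score reported at the top of this page can be recomputed from the ten leaderboard entries. The sketch below copies the scores from the leaderboard; the variable names are illustrative. With an even number of models, the median is the mean of the two middle scores (88.6 and 86.1).

```python
from statistics import median

# Scores (percent accuracy) copied from the leaderboard on this page.
scores = {
    "o3": 93.2,
    "Llama 3.3": 92.1,
    "o4-mini": 91.7,
    "GPT-4.1": 90.4,
    "Llama 3.1": 88.6,
    "DeepSeek V3": 86.1,
    "GPT-4o": 84.5,
    "DeepSeek R1": 83.3,
    "GPT-4o mini": 80.6,
    "Mistral Large 2": 80.1,
}

models_tested = len(scores)
best_model = max(scores, key=scores.get)  # model with the highest score
best_score = scores[best_model]
# Round to two decimals to avoid floating-point noise in the averaged middle pair.
median_score = round(median(scores.values()), 2)

print(f"Models tested: {models_tested}")
print(f"Best score: {best_score}% ({best_model})")
print(f"Median score: {median_score}%")
```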