Longterm Wiki

IFEval

General
Instruction-Following Evaluation (IFEval): a benchmark testing whether LLMs can follow explicit, verifiable constraints on their output (e.g., 'write exactly 3 paragraphs', 'include these keywords').
Models Tested: 12
Best Score: 93.2%
Median Score: 87.35%
Scoring: accuracy
Introduced: 2023-11
Maintainer: Google Research
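
The kinds of constraints named in the description can be checked programmatically, which is what makes an accuracy score possible. The following is a minimal sketch only, not the benchmark's official checker: the function names (`has_exact_paragraphs`, `includes_keywords`, `accuracy`), the blank-line paragraph rule, and the case-insensitive keyword match are all assumptions made for illustration.

```python
import re


def has_exact_paragraphs(response: str, n: int) -> bool:
    """Return True if the response has exactly n paragraphs.

    Assumption: paragraphs are separated by one or more blank lines.
    """
    paragraphs = [p for p in re.split(r"\n\s*\n", response.strip()) if p.strip()]
    return len(paragraphs) == n


def includes_keywords(response: str, keywords: list[str]) -> bool:
    """Return True if every keyword occurs in the response (case-insensitive)."""
    lowered = response.lower()
    return all(keyword.lower() in lowered for keyword in keywords)


def accuracy(results: list[bool]) -> float:
    """Fraction of checks that passed (hypothetical scoring helper)."""
    return sum(results) / len(results) if results else 0.0


# A toy response checked against three illustrative constraints.
response = "Cats nap often.\n\nDogs like walks.\n\nBirds sing at dawn."
checks = [
    has_exact_paragraphs(response, 3),              # satisfied
    includes_keywords(response, ["cats", "dogs"]),  # satisfied
    includes_keywords(response, ["fish"]),          # not satisfied
]
print(f"accuracy = {accuracy(checks):.3f}")
```

Because each check is a deterministic function of the response text, no human or model judge is needed to score a run in this sketch.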

Leaderboard (12 models)

| # | Model | Developer | Score |
| 🥇 | o3 | OpenAI | 93.2% |
| 🥈 | Llama 3.3 | Meta AI (FAIR) | 92.1% |
| 🥉 | o4-mini | OpenAI | 91.7% |
| 4 | Claude Opus 4.6 | Anthropic | 91.2% |
| 5 | GPT-4.1 | OpenAI | 90.4% |
| 6 | Llama 3.1 | Meta AI (FAIR) | 88.6% |
| 7 | DeepSeek V3 | DeepSeek | 86.1% |
| 8 | GPT-4o | OpenAI | 84.5% |
| 9 | GPT-4.1 mini | OpenAI | 84.1% |
| 10 | DeepSeek R1 | DeepSeek | 83.3% |
| 11 | GPT-4o mini | OpenAI | 80.6% |
| 12 | Mistral Large 2 | Mistral AI | 80.1% |
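
The summary figures at the top of the page (models tested, best score, median score) can be recomputed from the leaderboard. A minimal sketch, assuming every listed score is a percentage; the `scores` dictionary is simply the leaderboard transcribed by hand.

```python
from statistics import median

# Leaderboard scores in percent, transcribed from the list above.
scores = {
    "o3": 93.2,
    "Llama 3.3": 92.1,
    "o4-mini": 91.7,
    "Claude Opus 4.6": 91.2,
    "GPT-4.1": 90.4,
    "Llama 3.1": 88.6,
    "DeepSeek V3": 86.1,
    "GPT-4o": 84.5,
    "GPT-4.1 mini": 84.1,
    "DeepSeek R1": 83.3,
    "GPT-4o mini": 80.6,
    "Mistral Large 2": 80.1,
}

best_model = max(scores, key=scores.get)
# With an even number of entries, the median is the mean of the two middle
# scores: here the 6th and 7th, (88.6 + 86.1) / 2.
median_score = median(scores.values())

print(f"Models tested: {len(scores)}")
print(f"Best score: {scores[best_model]}% ({best_model})")
print(f"Median score: {median_score:.2f}%")
```

With 12 models the median falls between Llama 3.1 and DeepSeek V3, which is why it has two decimal places while the individual scores have one.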