IFEval

General

Instruction-Following Evaluation (IFEval) is a benchmark that tests whether LLMs can follow explicit, automatically verifiable instructions, mostly formatting and content constraints (e.g., 'write exactly 3 paragraphs', 'include these keywords').
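To make the idea of a verifiable constraint concrete, here is a minimal sketch of how the two example instructions above could be checked programmatically. It is not the benchmark's official checker: the function names are hypothetical, paragraphs are assumed to be separated by blank lines, and keyword matching is assumed to be a case-insensitive substring test.

```python
import re


def has_exact_paragraphs(response: str, n: int) -> bool:
    """Check 'write exactly n paragraphs'.

    Assumption: paragraphs are separated by one or more blank lines.
    """
    paragraphs = [p for p in re.split(r"\n\s*\n", response.strip()) if p.strip()]
    return len(paragraphs) == n


def includes_keywords(response: str, keywords: list[str]) -> bool:
    """Check 'include these keywords' (case-insensitive substring match)."""
    lowered = response.lower()
    return all(keyword.lower() in lowered for keyword in keywords)


def accuracy(results: list[bool]) -> float:
    """Fraction of checks that passed; 0.0 for an empty list."""
    return sum(results) / len(results) if results else 0.0


if __name__ == "__main__":
    reply = "First point.\n\nSecond point about Llamas.\n\nThird point."
    checks = [
        has_exact_paragraphs(reply, 3),
        includes_keywords(reply, ["llamas", "point"]),
    ]
    print(f"accuracy: {accuracy(checks):.1%}")
```

Because each check is a deterministic function of the response text, a score can be computed without human raters or a judge model.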

Models Tested

12

Best Score

93.2%

Median Score

87.35%

Scoring: accuracy

Introduced: 2023-11

Maintainer: Google Research

Leaderboard (12 models)

#	Model	Developer	Score
🥇	o3	OpenAI	93.2%
🥈	Llama 3.3	Meta AI (FAIR)	92.1%
🥉	o4-mini	OpenAI	91.7%
4	Claude Opus 4.6	Anthropic	91.2%
5	GPT-4.1	OpenAI	90.4%
6	Llama 3.1	Meta AI (FAIR)	88.6%
7	DeepSeek V3	DeepSeek	86.1%
8	GPT-4o	OpenAI	84.5%
9	GPT-4.1 mini	OpenAI	84.1%
10	DeepSeek R1	DeepSeek	83.3%
11	GPT-4o mini	OpenAI	80.6%
12	Mistral Large 2	Mistral AI	80.1%
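
The Best Score and Median Score figures can be reproduced from the 12 leaderboard rows. The sketch below copies the scores from the table and uses Python's standard `statistics.median`; with an even number of entries the median is the mean of the two middle values (88.6 and 86.1 here). The variable names are illustrative only.

```python
from statistics import median

# Scores (in percent) copied from the leaderboard table, rank 1 to 12.
scores = [93.2, 92.1, 91.7, 91.2, 90.4, 88.6, 86.1, 84.5, 84.1, 83.3, 80.6, 80.1]

best_score = max(scores)
# For an even count, median() averages the two middle values.
median_score = median(scores)

if __name__ == "__main__":
    print(f"models: {len(scores)}")
    print(f"best:   {best_score:.1f}%")
    print(f"median: {median_score:.2f}%")
```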