Longterm Wiki

BFCL

Coding

The Berkeley Function Calling Leaderboard (BFCL) evaluates LLMs on their ability to generate correct function/tool calls, including parameter extraction, type handling, and multi-turn interactions.
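To make "correct function call" concrete, here is a minimal sketch of how one generated call might be checked for function name, parameter extraction, and type handling. It is not BFCL's actual evaluation code: the `get_weather` tool, its schema layout, the JSON call format, and the helper names `call_matches` and `accuracy` are all hypothetical and assumed only for illustration.

```python
import json

# Hypothetical tool schema (an assumption, not a BFCL test case):
# parameter name -> expected Python type.
GET_WEATHER_SCHEMA = {
    "name": "get_weather",
    "parameters": {"city": str, "days": int},
}


def call_matches(generated_json, schema, expected_args):
    """Return True if the generated call names the right function and
    supplies exactly the expected parameters with the right types and values."""
    try:
        call = json.loads(generated_json)
    except json.JSONDecodeError:
        return False  # output was not valid JSON at all
    if not isinstance(call, dict) or call.get("name") != schema["name"]:
        return False  # wrong or missing function name
    args = call.get("arguments")
    if not isinstance(args, dict) or set(args) != set(schema["parameters"]):
        return False  # missing or extra parameters
    for key, expected_type in schema["parameters"].items():
        value = args[key]
        # bool is a subclass of int in Python, so reject it for non-bool parameters.
        if isinstance(value, bool) and expected_type is not bool:
            return False
        if not isinstance(value, expected_type):
            return False  # type handling error, e.g. "3" instead of 3
        if value != expected_args[key]:
            return False  # parameter extraction error
    return True


def accuracy(results):
    """Fraction of test cases that passed."""
    return sum(results) / len(results)


expected = {"city": "Paris", "days": 3}
outputs = [
    '{"name": "get_weather", "arguments": {"city": "Paris", "days": 3}}',
    '{"name": "get_weather", "arguments": {"city": "Paris", "days": "3"}}',
]
results = [call_matches(o, GET_WEATHER_SCHEMA, expected) for o in outputs]
score = accuracy(results)
```

In this sketch the second output fails only because `days` arrives as a string, which is the kind of type-handling mistake the description refers to. Multi-turn interactions would need state carried across calls and are not covered here.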

Models Tested: 2
Best Score: 90.2%
Median Score: 89.3%
Scoring: accuracy
Introduced: 2024-02
Maintainer: UC Berkeley

Leaderboard (2 models)

#   Model              Developer  Score
🥇  Claude 3.5 Sonnet  Anthropic  90.2%
🥈  GPT-4o             OpenAI     88.4%
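
The best and median scores reported on this page follow directly from the two leaderboard entries. A short sketch of that arithmetic, using only the scores listed here (the variable names are illustrative):

```python
from statistics import median

# Scores as listed on this page's leaderboard (accuracy, in percent).
scores = {"Claude 3.5 Sonnet": 90.2, "GPT-4o": 88.4}

best_model = max(scores, key=scores.get)
best_score = scores[best_model]

# With an even number of entries, the median is the mean of the two middle
# values; round to one decimal place to match the page's precision.
median_score = round(median(scores.values()), 1)
```

With only two models, the median is simply the midpoint of the two scores, (90.2 + 88.4) / 2 = 89.3, so it will shift as soon as more models are added.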