Longterm Wiki

Terminal-Bench 2

Agentic

The second version of the Terminal-Bench benchmark, with expanded task coverage and increased difficulty.

Models Tested: 1
Best Score: 65.4%
Median Score: 65.4%
Scoring: percentage
Introduced: 2025-06
Maintainer: Terminal-Bench

Leaderboard (1 model)

#    Model              Developer    Score
🥇   Claude Opus 4.6    Anthropic    65.4%