Back
Inspect Evals: Community Evaluation Repository for AI Models
inspect.aisi.org.uk/evals/
Maintained by the UK AI Safety Institute, this repository provides standardized implementations of major AI benchmarks useful for capability assessment and safety evaluation research.
Metadata
Importance: 62/100 · tool page
Summary
Inspect Evals is a repository of community-contributed AI model evaluations maintained by the UK AI Safety Institute (AISI), featuring implementations of popular benchmarks across coding, reasoning, safety, and agent capabilities. Evaluations can be installed and run with a single command against any model, covering domains from Python coding to cybersecurity to scientific reasoning.
Key Points
- Comprehensive collection of pip-installable benchmarks spanning coding (HumanEval, SWE-bench, MBPP), reasoning, safety, and agent evaluation categories
- Maintained by UK AISI as part of the Inspect evaluation framework, enabling standardized model comparison across diverse capability domains
- Includes safety-relevant evals such as cybersecurity, autonomous agents, and research replication benchmarks (PaperBench, MLE-bench)
- Serves dual purpose as both a practical evaluation toolkit and a learning resource demonstrating diverse evaluation techniques
- Covers frontier-relevant capabilities like software engineering (SWE-bench), ML engineering (MLE-bench), and scientific code generation (SciCode)
Cited by 1 page
| Page | Type | Quality |
|---|---|---|
| UK AI Safety Institute | Organization | 52.0 |
Cached Content Preview
HTTP 200 · Fetched Mar 20, 2026 · 35 KB
[Inspect Evals](https://ukgovernmentbeis.github.io/inspect_evals/) is a repository of community-contributed evaluations featuring implementations of many popular benchmarks and papers.
These evals can be `pip` installed and run with a single command against any model. They are also useful as a learning resource as they demonstrate a wide variety of evaluation types and techniques.
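The install-and-run workflow described above can be sketched as shell commands. This is an illustrative sketch, not a verbatim quickstart from the repository: the GitHub install URL follows the project's `ukgovernmentbeis/inspect_evals` naming, `inspect_evals/humaneval` follows the documented `inspect eval inspect_evals/<name>` task pattern, and the model identifier and API key are placeholder assumptions you would swap for your own provider.

```shell
# Install the Inspect framework, then the community evals package
# (assumed here to be installed from the GitHub repository).
pip install inspect-ai
pip install git+https://github.com/UKGovernmentBEIS/inspect_evals

# Provide credentials for whichever model provider you use
# (placeholder value shown).
export OPENAI_API_KEY=your-key-here

# Run a single benchmark against a model of your choice;
# task names follow the inspect_evals/<benchmark> pattern.
inspect eval inspect_evals/humaneval --model openai/gpt-4o
```

After a run completes, Inspect writes a log that can be browsed with `inspect view`, which is how results are typically compared across models.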
## Coding
- [HumanEval: Python Function Generation from Instructions](https://ukgovernmentbeis.github.io/inspect_evals/evals/coding/humaneval)
Assesses how accurately language models can write correct Python functions based solely on natural-language instructions provided as docstrings.
- [MBPP: Basic Python Coding Challenges](https://ukgovernmentbeis.github.io/inspect_evals/evals/coding/mbpp)
Measures the ability of language models to generate short Python programs from simple natural-language descriptions, testing basic coding proficiency.
- [SWE-bench Verified: Resolving Real-World GitHub Issues](https://ukgovernmentbeis.github.io/inspect_evals/evals/coding/swe_bench)
Evaluates AI's ability to resolve genuine software engineering issues sourced from 12 popular Python GitHub repositories, reflecting realistic coding and debugging scenarios.
- [MLE-bench: Evaluating Machine Learning Agents on Machine Learning Engineering](https://ukgovernmentbeis.github.io/inspect_evals/evals/coding/mle_bench)
Machine learning tasks drawn from 75 Kaggle competitions.
- [DS-1000: A Natural and Reliable Benchmark for Data Science Code Generation](https://ukgovernmentbeis.github.io/inspect_evals/evals/coding/ds1000)
Code generation benchmark with a thousand data science problems spanning seven Python libraries.
- [BigCodeBench: Benchmarking Code Generation with Diverse Function Calls and Complex Instructions](https://ukgovernmentbeis.github.io/inspect_evals/evals/coding/bigcodebench)
Python coding benchmark with 1,140 diverse questions drawing on numerous Python libraries.
- [ClassEval: A Manually-Crafted Benchmark for Evaluating LLMs on Class-level Code Generation](https://ukgovernmentbeis.github.io/inspect_evals/evals/coding/class_eval)
Evaluates LLMs on class-level code generation with 100 tasks constructed over 500 person-hours. The study shows that LLMs perform worse on class-level tasks compared to method-level tasks.
- [SciCode: A Research Coding Benchmark Curated by Scientists](https://ukgovernmentbeis.github.io/inspect_evals/evals/coding/scicode)
SciCode tests the ability of language models to generate code to solve scientific research problems. It assesses models on 65 problems from mathematics, physics, chemistry, biology, and materials science.
- [APPS: Automated Programming Progress Standard](https://ukgovernmentbeis.github.io/inspect_evals/evals/coding/apps)
APPS is a dataset for evaluating model performance on Python programming tasks across three difficulty levels consisting of 1,0
... (truncated, 35 KB total)