Back
UK AI Safety Institute's Inspect framework
inspect.aisi.org.uk
Inspect is a practical evaluation toolkit from the UK government's AI Safety Institute, relevant to researchers building safety benchmarks or conducting model evaluations; note that current tags like 'interpretability' and 'rlhf' appear mismatched to this resource's actual focus on evaluation infrastructure.
Metadata
Importance: 65/100 · tool page · tool
Summary
Inspect is an open-source framework developed by the UK AI Safety Institute (AISI) for evaluating large language models and AI systems. It provides standardized tools for running safety evaluations, benchmarks, and red-teaming tasks. The framework enables researchers and developers to assess AI model capabilities and safety properties in a reproducible and extensible way.
Key Points
- Open-source Python framework for conducting rigorous AI model evaluations and benchmarks, developed by the UK AISI
- Supports a wide range of evaluation tasks including reasoning, coding, safety, and agentic capability assessments
- Designed for reproducibility and extensibility, allowing custom solvers, scorers, and datasets to be integrated
- Part of AISI's broader mission to provide public infrastructure for AI safety testing and frontier model evaluation
- Enables standardized comparisons across models and facilitates third-party safety auditing workflows
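The extensibility point above (custom solvers, scorers, and datasets) follows a simple compositional pattern: samples flow through a chain of solvers, and a scorer grades the final state. The sketch below is a toy, stdlib-only illustration of that pattern; the names (`generate`, `exact_match`, `run_eval`) are chosen to echo Inspect's concepts but are not the actual `inspect_ai` API.

```python
# Toy, stdlib-only sketch of the solver/scorer pipeline shape that
# Inspect's docs describe. Names are illustrative, not Inspect's API.

def generate(model):
    """Solver: call the model on the sample's input and record the output."""
    def solve(state):
        state["output"] = model(state["input"])
        return state
    return solve

def exact_match(state):
    """Scorer: 1.0 if the output exactly matches the target, else 0.0."""
    return 1.0 if state["output"].strip() == state["target"] else 0.0

def run_eval(dataset, solvers, scorer):
    """Run each sample through the solver chain, then average the scores."""
    scores = []
    for sample in dataset:
        state = dict(sample)  # copy so solvers can mutate freely
        for solver in solvers:
            state = solver(state)
        scores.append(scorer(state))
    return sum(scores) / len(scores)

# Example: a stub "model" that uppercases its prompt.
dataset = [
    {"input": "hello world", "target": "HELLO WORLD"},
    {"input": "inspect", "target": "INSPECT"},
    {"input": "aisi", "target": "wrong"},
]
accuracy = run_eval(dataset, [generate(str.upper)], exact_match)
print(accuracy)  # 2 of 3 samples match -> 0.666...
```

Swapping in a different scorer or inserting an extra solver into the chain changes the evaluation without touching the harness, which is the reproducibility/extensibility property the key points describe.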
Cited by 6 pages
| Page | Type | Quality |
|---|---|---|
| UK AI Safety Institute | Organization | 52.0 |
| AI Safety Institutes (AISIs) | Policy | 69.0 |
| Evals-Based Deployment Gates | Approach | 66.0 |
| International AI Safety Summit Series | Event | 63.0 |
| Scalable Eval Approaches | Approach | 65.0 |
| Technical AI Safety Research | Crux | 66.0 |
Cached Content Preview
HTTP 200 · Fetched Mar 20, 2026 · 17 KB
## Welcome
Welcome to Inspect, a framework for large language model evaluations created by the [UK AI Security Institute](https://aisi.gov.uk/).
Inspect can be used for a broad range of evaluations that measure coding, agentic tasks, reasoning, knowledge, behavior, and multi-modal understanding. Core features of Inspect include:
- A set of straightforward interfaces for implementing evaluations and re-using components across evaluations.
- A collection of over 100 pre-built evaluations ready to run on any model.
- Extensive tooling, including a web-based Inspect View tool for monitoring and visualizing evaluations and a VS Code Extension that assists with authoring and debugging.
- Flexible support for tool calling—custom and MCP tools, as well as built-in bash, python, text editing, web search, web browsing, and computer tools.
- Support for agent evaluations, including flexible built-in agents, multi-agent primitives, the ability to run arbitrary external agents like Claude Code, Codex CLI, and Gemini CLI.
- A sandboxing system that supports running untrusted model code in Docker, Kubernetes, Modal, Proxmox, and other systems via an extension API.
We’ll walk through a fairly trivial “Hello, Inspect” example below. Read on to learn the basics, then read the documentation on [Datasets](https://inspect.aisi.org.uk/datasets.html), [Solvers](https://inspect.aisi.org.uk/solvers.html), [Scorers](https://inspect.aisi.org.uk/scorers.html), [Tools](https://inspect.aisi.org.uk/tools.html), and [Agents](https://inspect.aisi.org.uk/agents.html) to learn how to create more advanced evaluations.
If you are primarily interested in running evaluations rather than developing new evaluations, [Inspect Evals](https://ukgovernmentbeis.github.io/inspect_evals/) provides implementations for a large collection of popular benchmarks.
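The "Hello, Inspect" walkthrough mentioned above reduces to a task definition along these lines, adapted from the documented `Task`/`Sample`/`generate()`/`exact()` interfaces (running it additionally requires `pip install inspect-ai`, a model provider package, and an API key in the environment, so treat this as a sketch rather than a verified script):

```python
from inspect_ai import Task, task
from inspect_ai.dataset import Sample
from inspect_ai.scorer import exact
from inspect_ai.solver import generate

@task
def hello_inspect():
    # A one-sample dataset, a generate() solver, and an exact-match scorer.
    return Task(
        dataset=[Sample(input="Just reply with 'Hello World'",
                        target="Hello World")],
        solver=[generate()],
        scorer=exact(),
    )
```

Saved as, say, `hello.py`, it would be run with the same CLI shown in the Getting Started section, e.g. `inspect eval hello.py --model openai/gpt-4o`.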
## Getting Started
To get started using Inspect:
1. Install Inspect from PyPI with:
```
pip install inspect-ai
```
2. If you are using VS Code, install the [Inspect VS Code Extension](https://inspect.aisi.org.uk/vscode.html) (not required but highly recommended).
To develop and run evaluations, you’ll also need access to a model, which typically requires installation of a Python package as well as ensuring that the appropriate API key is available in the environment.
Assuming you have written an evaluation in a script named `arc.py`, here's how you would set up and run the eval for a few different model providers:
- OpenAI
- Anthropic
- Google
- Grok
- Mistral
- HF
```
pip install openai
export OPENAI_API_KEY=your-openai-api-key
inspect eval arc.py --model openai/gpt-4o
```
```
pip install anthropic
export ANTHROPIC_API_KEY=your-an
```
... (truncated, 17 KB total)
Resource ID: fc3078f3c2ba5ebb | Stable ID: MjExZWM4Mz