
UK AI Safety Institute's Inspect framework

web · inspect.aisi.org.uk/

Inspect is a practical evaluation toolkit from the UK government's AI Safety Institute, relevant to researchers building safety benchmarks or conducting model evaluations. Note that the current tags 'interpretability' and 'rlhf' appear mismatched with this resource's actual focus, which is evaluation infrastructure.

Metadata

Importance: 65/100 · tool page · tool

Summary

Inspect is an open-source framework developed by the UK AI Safety Institute (AISI, since renamed the AI Security Institute) for evaluating large language models and AI systems. It provides standardized tools for running safety evaluations, benchmarks, and red-teaming tasks. The framework enables researchers and developers to assess AI model capabilities and safety properties in a reproducible and extensible way.

Key Points

  • Open-source Python framework for conducting rigorous AI model evaluations and benchmarks developed by the UK AISI
  • Supports a wide range of evaluation tasks including reasoning, coding, safety, and agentic capability assessments
  • Designed for reproducibility and extensibility, allowing custom solvers, scorers, and datasets to be integrated
  • Part of AISI's broader mission to provide public infrastructure for AI safety testing and frontier model evaluation
  • Enables standardized comparisons across models and facilitates third-party safety auditing workflows
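The plug-in design described in the key points — datasets run through interchangeable solvers and scorers — can be sketched in plain Python. This is an illustrative sketch only; the names (`run_eval`, `Sample`, `echo_solver`) are hypothetical and are not Inspect's actual API.

```python
# Illustrative sketch of the dataset -> solver -> scorer pipeline.
# All names here are hypothetical, not Inspect's real API.
from dataclasses import dataclass
from typing import Callable

@dataclass
class Sample:
    input: str
    target: str

def run_eval(dataset: list[Sample],
             solver: Callable[[str], str],
             scorer: Callable[[str, str], bool]) -> float:
    """Run each sample through the solver, score the output against
    the target, and return mean accuracy."""
    scores = [scorer(solver(s.input), s.target) for s in dataset]
    return sum(scores) / len(scores)

# A trivial "solver" that echoes the prompt, and an exact-match "scorer".
echo_solver = lambda prompt: prompt
exact_scorer = lambda output, target: output.strip() == target.strip()

dataset = [Sample(input="Hello World", target="Hello World"),
           Sample(input="2 + 2 = ?", target="4")]
accuracy = run_eval(dataset, echo_solver, exact_scorer)  # 0.5 for this toy data
```

Because the solver and scorer are just callables, swapping in a custom dataset, solver, or scorer requires no change to the evaluation loop — which is the extensibility property the key points describe.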

Cited by 6 pages

Cached Content Preview

HTTP 200 · Fetched Mar 20, 2026 · 17 KB
## Welcome

Welcome to Inspect, a framework for large language model evaluations created by the [UK AI Security Institute](https://aisi.gov.uk/).

Inspect can be used for a broad range of evaluations that measure coding, agentic tasks, reasoning, knowledge, behavior, and multi-modal understanding. Core features of Inspect include:

- A set of straightforward interfaces for implementing evaluations and re-using components across evaluations.
- A collection of over 100 pre-built evaluations ready to run on any model.
- Extensive tooling, including a web-based Inspect View tool for monitoring and visualizing evaluations and a VS Code Extension that assists with authoring and debugging.
- Flexible support for tool calling—custom and MCP tools, as well as built-in bash, python, text editing, web search, web browsing, and computer tools.
- Support for agent evaluations, including flexible built-in agents, multi-agent primitives, and the ability to run arbitrary external agents like Claude Code, Codex CLI, and Gemini CLI.
- A sandboxing system that supports running untrusted model code in Docker, Kubernetes, Modal, Proxmox, and other systems via an extension API.
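The tool-calling feature above — routing a model-emitted call by name to a registered tool — can be sketched in a few lines of plain Python. This registry-and-dispatch pattern is hypothetical and illustrative; it is not Inspect's implementation, and the `bash` tool here only echoes its command rather than executing it in a sandbox.

```python
# Hypothetical sketch of tool registration and dispatch, illustrating the
# tool-calling idea above. This is not Inspect's actual implementation.
from typing import Callable

TOOLS: dict[str, Callable[..., str]] = {}

def tool(fn: Callable[..., str]) -> Callable[..., str]:
    """Register a function so a model can invoke it by name."""
    TOOLS[fn.__name__] = fn
    return fn

@tool
def bash(cmd: str) -> str:
    # A real framework would execute this in a sandbox; here we just echo.
    return f"$ {cmd}"

def dispatch(call: dict) -> str:
    """Route a model-emitted call {'name': ..., 'args': {...}} to its tool."""
    return TOOLS[call["name"]](**call["args"])
```

A model that emits `{"name": "bash", "args": {"cmd": "ls"}}` would be routed to the registered `bash` function; custom and MCP tools slot into the same registry shape.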

We’ll walk through a fairly trivial “Hello, Inspect” example below. Read on to learn the basics, then read the documentation on [Datasets](https://inspect.aisi.org.uk/datasets.html), [Solvers](https://inspect.aisi.org.uk/solvers.html), [Scorers](https://inspect.aisi.org.uk/scorers.html), [Tools](https://inspect.aisi.org.uk/tools.html), and [Agents](https://inspect.aisi.org.uk/agents.html) to learn how to create more advanced evaluations.

If you are primarily interested in running evaluations rather than developing new evaluations, [Inspect Evals](https://ukgovernmentbeis.github.io/inspect_evals/) provides implementations for a large collection of popular benchmarks.

## Getting Started

To get started using Inspect:

1. Install Inspect from PyPI with:

   ```
   pip install inspect-ai
   ```

2. If you are using VS Code, install the [Inspect VS Code Extension](https://inspect.aisi.org.uk/vscode.html) (not required but highly recommended).


To develop and run evaluations, you’ll also need access to a model, which typically requires installing a provider’s Python package and making the appropriate API key available in the environment.
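A simple fail-fast check for that environment requirement can be sketched as follows. The helper name `require_api_key` is illustrative — Inspect reads these variables itself; this is just a convenience for your own scripts.

```python
import os

def require_api_key(var: str) -> str:
    """Return the provider API key from the environment, or raise a
    clear error telling the user to export it first.
    (Hypothetical helper; Inspect reads these variables itself.)"""
    key = os.environ.get(var)
    if not key:
        raise RuntimeError(f"{var} is not set; export it before running an eval")
    return key
```

Calling `require_api_key("OPENAI_API_KEY")` before launching an eval surfaces a missing key immediately, rather than partway through a run.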

Assuming you have written an evaluation in a script named `arc.py`, here’s how you would set up and run the eval for a few different model providers:

- OpenAI
- Anthropic
- Google
- Grok
- Mistral
- HF

```
pip install openai
export OPENAI_API_KEY=your-openai-api-key
inspect eval arc.py --model openai/gpt-4o
```

```
pip install anthropic
export ANTHROPIC_API_KEY=your-anthropic-api-key
inspect eval arc.py --model anthropic/claude-3-5-sonnet-latest
```

... (truncated, 17 KB total)
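The `--model` strings in the commands above follow a `provider/model` format (e.g. `openai/gpt-4o`). A tiny parser for that convention can be sketched as follows; `split_model` is a hypothetical helper for illustration, not part of Inspect's API.

```python
# Hypothetical helper: Inspect model strings take the form "provider/model",
# as in "openai/gpt-4o" above. This splitter is illustrative only.
def split_model(model: str) -> tuple[str, str]:
    """Split 'provider/model' into its two parts, validating the format."""
    provider, _, name = model.partition("/")
    if not name:
        raise ValueError(f"expected 'provider/model', got {model!r}")
    return provider, name
```

For example, `split_model("openai/gpt-4o")` yields `("openai", "gpt-4o")`; the provider prefix tells Inspect which backend package and API key to use.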
Resource ID: fc3078f3c2ba5ebb | Stable ID: MjExZWM4Mz