Back
UK AI Safety Institute's Inspect framework
inspect.aisi.org.uk
Inspect is a practical evaluation toolkit from the UK government's AI Safety Institute, relevant to researchers building safety benchmarks or conducting model evaluations; note that current tags like 'interpretability' and 'rlhf' appear mismatched to this resource's actual focus on evaluation infrastructure.
Metadata
Importance: 65/100 · tool page · tool
Summary
Inspect is an open-source framework developed by the UK AI Safety Institute (AISI) for evaluating large language models and AI systems. It provides standardized tools for running safety evaluations, benchmarks, and red-teaming tasks. The framework enables researchers and developers to assess AI model capabilities and safety properties in a reproducible and extensible way.
Key Points
- Open-source Python framework for conducting rigorous AI model evaluations and benchmarks, developed by the UK AISI
- Supports a wide range of evaluation tasks including reasoning, coding, safety, and agentic capability assessments
- Designed for reproducibility and extensibility, allowing custom solvers, scorers, and datasets to be integrated
- Part of AISI's broader mission to provide public infrastructure for AI safety testing and frontier model evaluation
- Enables standardized comparisons across models and facilitates third-party safety auditing workflows
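The extensibility point above (custom solvers, scorers, and datasets) follows a simple compositional pattern: samples flow through a chain of solvers, and a scorer grades the final state. The sketch below is a toy, stdlib-only illustration of that pattern; the names (`generate`, `exact_match`, `run_eval`) are chosen to echo Inspect's concepts but are not the actual `inspect_ai` API.

```python
# Toy, stdlib-only sketch of the solver/scorer pipeline shape that
# Inspect's docs describe. Names are illustrative, not Inspect's API.

def generate(model):
    """Solver: call the model on the sample's input and record the output."""
    def solve(state):
        state["output"] = model(state["input"])
        return state
    return solve

def exact_match(state):
    """Scorer: 1.0 if the output exactly matches the target, else 0.0."""
    return 1.0 if state["output"].strip() == state["target"] else 0.0

def run_eval(dataset, solvers, scorer):
    """Run each sample through the solver chain, then average the scores."""
    scores = []
    for sample in dataset:
        state = dict(sample)  # copy so solvers can mutate freely
        for solver in solvers:
            state = solver(state)
        scores.append(scorer(state))
    return sum(scores) / len(scores)

# Example: a stub "model" that uppercases its prompt.
dataset = [
    {"input": "hello world", "target": "HELLO WORLD"},
    {"input": "inspect", "target": "INSPECT"},
    {"input": "aisi", "target": "wrong"},
]
accuracy = run_eval(dataset, [generate(str.upper)], exact_match)
print(accuracy)  # 2 of 3 samples match -> 0.666...
```

Swapping in a different scorer or inserting an extra solver into the chain changes the evaluation without touching the harness, which is the reproducibility/extensibility property the key points describe.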
Cited by 6 pages
| Page | Type | Quality |
|---|---|---|
| UK AI Safety Institute | Organization | 52.0 |
| AI Safety Institutes (AISIs) | Policy | 69.0 |
| Evals-Based Deployment Gates | Approach | 66.0 |
| International AI Safety Summit Series | Event | 63.0 |
| Scalable Eval Approaches | Approach | 65.0 |
| Technical AI Safety Research | Crux | 66.0 |
Cached Content Preview
HTTP 200 · Fetched Mar 20, 2026 · 17 KB
## Welcome
Welcome to Inspect, a framework for large language model evaluations created by the [UK AI Security Institute](https://aisi.gov.uk/).
Inspect can be used for a broad range of evaluations that measure coding, agentic tasks, reasoning, knowledge, behavior, and multi-modal understanding. Core features of Inspect include:
- A set of straightforward interfaces for implementing evaluations and re-using components across evaluations.
- A collection of over 100 pre-built evaluations ready to run on any model.
- Extensive tooling, including a web-based Inspect View tool for monitoring and visualizing evaluations and a VS Code Extension that assists with authoring and debugging.
- Flexible support for tool calling—custom and MCP tools, as well as built-in bash, python, text editing, web search, web browsing, and computer tools.
- Support for agent evaluations, including flexible built-in agents, multi-agent primitives, the ability to run arbitrary external agents like Claude Code, Codex CLI, and Gemini CLI.
- A sandboxing system that supports running untrusted model code in Docker, Kubernetes, Modal, Proxmox, and other systems via an extension API.
We’ll walk through a fairly trivial “Hello, Inspect” example below. Read on to learn the basics, then read the documentation on [Datasets](https://inspect.aisi.org.uk/datasets.html), [Solvers](https://inspect.aisi.org.uk/solvers.html), [Scorers](https://inspect.aisi.org.uk/scorers.html), [Tools](https://inspect.aisi.org.uk/tools.html), and [Agents](https://inspect.aisi.org.uk/agents.html) to learn how to create more advanced evaluations.
If you are primarily interested in running evaluations rather than developing new evaluations, [Inspect Evals](https://ukgovernmentbeis.github.io/inspect_evals/) provides implementations for a large collection of popular benchmarks.
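The "Hello, Inspect" walkthrough mentioned above reduces to a task definition along these lines, adapted from the documented `Task`/`Sample`/`generate()`/`exact()` interfaces (running it additionally requires `pip install inspect-ai`, a model provider package, and an API key in the environment, so treat this as a sketch rather than a verified script):

```python
from inspect_ai import Task, task
from inspect_ai.dataset import Sample
from inspect_ai.scorer import exact
from inspect_ai.solver import generate

@task
def hello_inspect():
    # A one-sample dataset, a generate() solver, and an exact-match scorer.
    return Task(
        dataset=[Sample(input="Just reply with 'Hello World'",
                        target="Hello World")],
        solver=[generate()],
        scorer=exact(),
    )
```

Saved as, say, `hello.py`, it would be run with the same CLI shown in the Getting Started section, e.g. `inspect eval hello.py --model openai/gpt-4o`.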
## Getting Started
To get started using Inspect:
1. Install Inspect from PyPI with:
```
pip install inspect-ai
```
2. If you are using VS Code, install the [Inspect VS Code Extension](https://inspect.aisi.org.uk/vscode.html) (not required but highly recommended).
To develop and run evaluations, you’ll also need access to a model, which typically requires installation of a Python package as well as ensuring that the appropriate API key is available in the environment.
Assuming you have written an evaluation in a script named `arc.py`, here's how you would set up and run the eval for a few different model providers:
- OpenAI
- Anthropic
- Google
- Grok
- Mistral
- HF
```
pip install openai
export OPENAI_API_KEY=your-openai-api-key
inspect eval arc.py --model openai/gpt-4o
```
```
pip install anthropic
export ANTHROPIC_API_KEY=your-an
```
... (truncated, 17 KB total)
Resource ID: fc3078f3c2ba5ebb | Stable ID: MjExZWM4Mz