Skip to content
Longterm Wiki
Back

MALT: A Dataset of Natural and Prompted Behaviors That Threaten Eval Integrity

web

Credibility Rating

4/5
High(4)

High quality. Established institution or organization with editorial oversight and accountability.

Rating inherited from publication venue: METR

Released by METR in 2025, MALT directly supports validation of monitoring systems used in frontier AI capability evaluations, addressing a key gap in eval integrity assurance for advanced autonomous agents.

Metadata

Importance: 72/100blog postdataset

Summary

METR introduces MALT (Manually-reviewed Agentic Labeled Transcripts), a dataset of ~11,000 agent transcripts labeled for behaviors threatening evaluation integrity, including reward hacking and sandbagging, across 21 models and 403 tasks. The dataset enables validation of LLM-based monitoring systems, with best monitors achieving AUROCs of 0.96 and 0.91 for reward hacking and sandbagging detection respectively. Access to reasoning traces significantly improves sandbagging detection but not reward hacking detection.

Key Points

  • MALT contains 103 unprompted reward hacking examples, 32 unprompted give-up examples, 1,505 prompted sandbagging runs, and 1,014 manually verified benign runs across 21 models.
  • Best LLM monitors reach AUROC 0.96 (reward hacking) and 0.91 (sandbagging), catching 80-90% of true positives at a 5% false positive rate.
  • Chain-of-thought/reasoning trace access is critical for detecting sandbagging but makes little difference for reward hacking classification.
  • Dataset covers METR's HCAST and RE-Bench task suites, focusing on software and research-related agentic tasks across 86 task families.
  • Authors caution that strong monitor performance on MALT is neither sufficient nor necessary for real-world detection, as prompted examples may lack realism and subtle sandbagging may be undetectable from transcripts.

Cited by 1 page

PageTypeQuality
METROrganization66.0

Cached Content Preview

HTTP 200Fetched Mar 20, 202679 KB
[![METR Logo](https://metr.org/assets/images/logo/logo.svg)](https://metr.org/)

- [Research](https://metr.org/research)
- [Notes](https://metr.org/notes)
- [Updates](https://metr.org/blog)
- [About](https://metr.org/about)
- [Donate](https://metr.org/donate)
- [Careers](https://metr.org/careers)

Menu

**Access the dataset on Hugging Face:**

[Transcript Graph](https://huggingface.co/datasets/metr-evals/malt-transcripts-public) [Ordered Messages](https://huggingface.co/datasets/metr-evals/malt-public)

## Summary

- We’re introducing MALT (Manually-reviewed Agentic Labeled Transcripts), a dataset of natural and prompted examples of behaviors that threaten evaluation integrity (like generalized reward hacking or sandbagging) on METR’s HCAST and RE-Bench task suites \[ [12](https://metr.org/hcast.pdf), [13](https://metr.org/AI_R_D_Evaluation_Report.pdf)\].
- We’ve recently begun using LLM-based monitoring systems to check for behaviors that might affect our evaluation results \[ [3](https://metr.org/blog/2025-06-05-recent-reward-hacking/), [7](https://metr.github.io/autonomy-evals-guide/gpt-5-report/)\]. However, we lacked a ground truth dataset to validate that our monitors actually performed reliably. To close this gap, we’ve built MALT.
- We’ve focused on labeling behaviors related to reward hacking and sandbagging, and also have a large collection of confirmed benign runs to use as controls. MALT consists of:
  - Unprompted, natural transcripts that show a related behavior, identified by a mix of human labelling, keyword search, and anomalous score investigations. These transcripts, which may have multiple labels, include:
    - 103 unprompted examples of models exhibiting generalized reward hacking behavior, i.e. attempting to bypass or ignore task constraints.
    - 32 unprompted examples of models giving up on tasks, despite having ample resources left
    - 15 unprompted examples of models reasoning about elements of the task
    - 34 unprompted examples of miscellaneous behavior, like evaluation awareness, hardcoding or bruteforcing solutions, etc.
  - Prompted and post-processed agent transcripts that elicit more explicit sandbagging-like behavior, such as direct sabotage, targeted refusals, etc. We have 1,505 such runs.
  - 1,014 runs that were manually reviewed to ensure that they were benign.
  - 8,229 presumed benign runs, where we did not detect or induce any behavior but were not manually reviewed. Based on the reviewed set, we expect a label error rate of 6% here.
- Our 10,919 agent transcripts are on software and research-related problems, spanning 403 tasks (across 86 task families — collections of similar or related tasks) and 21 models.
- Simple prompted monitors perform decently when asked to classify runs as reward hacking or sandbagging, with the best monitors reaching AUROCs of 0.96 and 0.91 respectively. At a fixed 5% false positive rate, the best monitors catch 80%-90% of true positives (when given access to reasoning tr

... (truncated, 79 KB total)
Resource ID: 3ee11b82b7e1fd68 | Stable ID: ZGI2MjViND