Sleeper Agent Detection
Evaluation · emerging
Research on detecting and mitigating backdoors, trojans, and time-delayed deceptive behavior in AI systems.
Organizations
4
Key Papers
1
Grants
3
Total Funding
$125K
Tags
evaluations, backdoors, deception
Grants (3)
| Name | Recipient | Amount | Funder | Date |
|---|---|---|---|---|
| 4-month stipend for 3 people to create demonstrations of provably undetectable backdoors | Andrew Gritsevskiy | $50K | Long-Term Future Fund (LTFF) | 2024-01 |
| Grant to "support prizes for a trojan detection competition at NeurIPS, which involves identifying whether a deep neural network will suddenly change behavior if certain unknown conditions are met." | Trojan Detection Challenge at NeurIPS 2022 | $50K | FTX Future Fund | 2022-05 |
| ETH Zurich Foundation (USA) — Machine Learning Research Support | ETH Zurich Foundation (USA) | $25K | Coefficient Giving | 2023-11 |
Funding by Funder
| Funder | Grants | Total Amount |
|---|---|---|
| Long-Term Future Fund (LTFF) | 1 | $50K |
| FTX Future Fund | 1 | $50K |
| Coefficient Giving | 1 | $25K |
Key Papers & Resources (1)
SEMINAL
Sleeper Agents: Training Deceptive LLMs That Persist Through Safety Training
Hubinger et al. (Anthropic), 2024