Longterm Wiki

Adaptive Deployment of Untrusted LLMs Reduces Distributed Threats - arXiv

paper

Author

Zachary Schillaci

Credibility Rating

3/5
Good (3)

Good quality. Reputable source with community review or editorial standards, but less rigorous than peer-reviewed venues.

Rating inherited from publication venue: arXiv

Proposes an adaptive deployment framework for potentially misaligned LLMs that combines task-level monitoring with dynamic risk adjustment, addressing distributed safety risks in real-world applications where individual failures don't cause immediate catastrophe.

Paper Details

Citations
6
1 influential
Year
2024
Methodology
book-chapter
Categories
Large Language Models in Cybersecurity

Metadata

arXiv preprint · primary source

Summary

This paper addresses the safety of deploying potentially misaligned large language models in real-world applications where individual failures are not catastrophic but cumulative risks matter. The authors propose a two-level adaptive deployment framework combining micro-protocols (task-level safeguards using trusted models to monitor untrusted ones) with a macro-protocol that dynamically adjusts risk tolerance based on the untrusted model's observed behavior over time. Evaluated on a code generation task where adversaries attempt to inject subtle backdoors, their adaptive strategy reduces backdoors by 80% compared to non-adaptive baselines while maintaining usefulness, demonstrating effective safety-utility tradeoffs in distributed threat settings.
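The macro-protocol described above amounts to a Bayesian credence update over the untrusted model's alignment, followed by expected-utility selection among micro-protocols. A minimal sketch of that loop, where the likelihood functions, utility numbers, and protocol names are illustrative assumptions rather than the paper's actual parameters:

```python
# Hypothetical sketch of an adaptive macro-protocol: maintain a credence
# that the untrusted model is misaligned, update it from monitor scores,
# and pick the micro-protocol maximizing expected utility. All numeric
# values and protocol names below are illustrative, not from the paper.

def update_credence(p_misaligned, monitor_score,
                    p_score_if_misaligned, p_score_if_aligned):
    """Bayesian update of the misalignment credence from one monitor score.

    The two likelihood callables are assumed calibrated P(score | hypothesis).
    """
    num = p_misaligned * p_score_if_misaligned(monitor_score)
    den = num + (1 - p_misaligned) * p_score_if_aligned(monitor_score)
    return num / den

def choose_micro_protocol(p_misaligned, protocols):
    """Pick the micro-protocol with the highest expected utility.

    Utility = usefulness minus expected harm if the model is misaligned.
    """
    def expected_utility(p):
        return p["usefulness"] - p_misaligned * p["harm_if_misaligned"]
    return max(protocols, key=expected_utility)

# Two illustrative micro-protocols: trusted-only generation is safe but
# less useful; untrusted generation under trusted monitoring is riskier
# but produces better solutions.
protocols = [
    {"name": "trusted_only", "usefulness": 0.6, "harm_if_misaligned": 0.0},
    {"name": "untrusted_monitored", "usefulness": 0.9, "harm_if_misaligned": 1.5},
]

credence = 0.1  # prior that the untrusted model is misaligned
# A suspicious (high) monitor score is assumed 4x likelier under misalignment.
credence = update_credence(credence, 8.0,
                           lambda s: 0.4 if s > 5 else 0.1,
                           lambda s: 0.1 if s > 5 else 0.4)
print(choose_micro_protocol(credence, protocols)["name"])  # → trusted_only
```

As the credence in misalignment rises after suspicious behavior, the expected utility of the riskier protocol falls below the safe baseline, so the macro-protocol shifts toward safer micro-protocols, which is the qualitative dynamic the paper reports.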

Cited by 1 page

Page | Type | Quality
Redwood Research | Organization | 78.0

Cached Content Preview

HTTP 200 · Fetched Mar 15, 2026 · 98 KB
[2411.17693] Adaptive Deployment of Untrusted LLMs Reduces Distributed Threats 
 Adaptive Deployment of Untrusted LLMs Reduces Distributed Threats

 
 
 Jiaxin Wen¹, Vivek Hebbar²*, Caleb Larson³*, Aryan Bhatt², Ansh Radhakrishnan⁶,
 Mrinank Sharma⁶, Henry Sleight³, Shi Feng⁴˒⁵, He He⁴, Ethan Perez⁶,
 Buck Shlegeris², Akbir Khan⁶˒⁷
 ¹Tsinghua University ²Redwood Research ³MATS
 ⁴George Washington University ⁵New York University ⁶Anthropic
 ⁷UCL
 *Equal contribution, order was randomized.
 

 
 Abstract

 As large language models (LLMs) become increasingly capable, it is prudent to assess whether safety measures remain effective even if LLMs intentionally try to bypass them. Previous work introduced control evaluations, an adversarial framework for testing deployment strategies of untrusted models (i.e., models which might be trying to bypass safety measures). While prior work treats a single failure as unacceptable, we perform control evaluations in a “distributed threat setting” – a setting where no single action is catastrophic and no single action provides overwhelming evidence of misalignment.
We approach this problem with a two-level deployment framework that uses an adaptive macro-protocol to choose between micro-protocols. Micro-protocols operate on a single task, using a less capable, but extensively tested (trusted) model to harness and monitor the untrusted model. Meanwhile, the macro-protocol maintains an adaptive credence on the untrusted model’s alignment based on its past actions, using it to pick between safer and riskier micro-protocols. We evaluate our method in a code generation testbed where a red team attempts to generate subtly backdoored code with an LLM whose deployment is safeguarded by a blue team. We plot Pareto frontiers of safety (# of non-backdoored solutions) and usefulness (# of correct solutions). At a given level of usefulness, our adaptive deployment strategy reduces the number of backdoors by 80% compared to non-adaptive baselines.

 
 
 
 1 Introduction

 
 Figure 1: An illustration of our framework for deploying untrusted LLMs. (Left) For the t-th task, we query an untrusted LLM with different micro-protocols to generate candidate solutions. We then use a macro-protocol to select the solution that maximizes the expected utility (i.e., the safety-usefulness tradeoff) over the remaining deployment. (Right) For each candidate, the macro-protocol estimates its expected utility by considering monitor scores, expected usefulness, and its credence on the model's misalignment, which is updated continuously over time.
 
 
 As large language models (LLMs) become more capable, they may be used for critical, high-impact tasks with the potential for significant negative outcomes. One concern is that LLMs may try to bypass safety measures due to misalignment (Carlsmith, 2023 ; Wen et al., 2024 ) or data poisoning (Carlini et al., 2024 ) . Given thi

... (truncated, 98 KB total)
Resource ID: 318f1c30426de688 | Stable ID: MGVhZmZmZm