
Baldassini et al. (2024)

paper

Authors

Folco Bertini Baldassini · Huy H. Nguyen · Ching-Chun Chang · Isao Echizen

Credibility Rating

Good (3/5)

Good quality. Reputable source with community review or editorial standards, but less rigorous than peer-reviewed venues.

Rating inherited from publication venue: arXiv

Relevant to AI safety efforts around provenance, authentication, and detecting AI-generated content; watermarking is a key tool for model accountability and misuse prevention in deployment contexts.

Metadata

Importance: 42/100 · arXiv preprint · primary source

Abstract

A new approach to linguistic watermarking of language models is presented in which information is imperceptibly inserted into the output text while preserving its readability and original meaning. A cross-attention mechanism is used to embed watermarks in the text during inference. Two methods using cross-attention are presented that minimize the effect of watermarking on the performance of a pretrained model. Exploration of different training strategies for optimizing the watermarking and of the challenges and implications of applying this approach in real-world scenarios clarified the tradeoff between watermark robustness and text quality. Watermark selection substantially affects the generated output for high entropy sentences. This proactive watermarking approach has potential application in future model development.

Summary

Baldassini et al. (2024) introduce a cross-attention-based watermarking technique that imperceptibly embeds information into LLM-generated text during inference while preserving readability and semantic meaning. The paper explores training strategies for balancing watermark robustness against text quality, finding that watermark selection notably affects outputs for high-entropy sentences.

Key Points

  • Cross-attention mechanisms embed watermarks during inference without retraining the base model from scratch (see the sketch after this list).
  • Two cross-attention-based methods are presented that minimize degradation to pretrained model performance.
  • Key tradeoff identified: stronger watermark robustness tends to reduce output text quality.
  • Watermark selection has outsized impact on generated outputs for high-entropy (more uncertain) sentences.
  • Proactive watermarking could support AI authentication, provenance tracking, and misuse detection in deployment.
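
To make the mechanism concrete, here is a minimal, hypothetical sketch of cross-attention watermark injection in PyTorch. It assumes a small trainable adapter sitting on top of a frozen pretrained model; all names (`CrossAttentionWatermarker`, `msg_embed`, the gated residual) and dimensions are illustrative assumptions, not the authors' published architecture.

```python
# Hedged sketch of cross-attention watermark injection. All module names,
# dimensions, and the residual wiring are illustrative assumptions, not the
# paper's exact architecture.
import torch
import torch.nn as nn

class CrossAttentionWatermarker(nn.Module):
    """Injects a watermark payload into LM hidden states via cross-attention.

    Queries come from the frozen LM's hidden states; keys/values come from a
    learned embedding of the watermark bits, so only this adapter is trained
    (assumption) while the base model stays untouched.
    """

    def __init__(self, d_model: int, n_bits: int, n_heads: int = 8):
        super().__init__()
        self.msg_embed = nn.Embedding(2, d_model)        # one vector per bit value
        self.pos_embed = nn.Embedding(n_bits, d_model)   # bit-position embedding
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.gate = nn.Parameter(torch.zeros(1))         # zero-init: starts as a no-op

    def forward(self, hidden: torch.Tensor, bits: torch.Tensor) -> torch.Tensor:
        # hidden: (batch, seq_len, d_model) from the pretrained model
        # bits:   (batch, n_bits) long tensor holding the 0/1 watermark payload
        positions = torch.arange(bits.size(1), device=bits.device)
        msg = self.msg_embed(bits) + self.pos_embed(positions)  # (B, n_bits, d)
        wm, _ = self.attn(query=hidden, key=msg, value=msg)
        # Gated residual: tanh(0) = 0 at init, so the watermark starts
        # imperceptible and its strength is learned during training.
        return hidden + torch.tanh(self.gate) * wm

# Toy usage: a 16-bit payload injected into dummy hidden states.
if __name__ == "__main__":
    wm_layer = CrossAttentionWatermarker(d_model=768, n_bits=16)
    hidden = torch.randn(2, 32, 768)          # stand-in for frozen LM states
    bits = torch.randint(0, 2, (2, 16))
    print(wm_layer(hidden, bits).shape)       # torch.Size([2, 32, 768])
```

A plausible training recipe consistent with the tradeoff the paper reports would jointly minimize a text-quality term and a bit-recovery term, e.g. `L = L_quality + lambda * L_bits`: raising `lambda` makes the payload easier to extract (more robust) at the cost of fluency, while the zero-initialized gate lets training start from the unwatermarked model's exact behavior.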

Cited by 1 page

| Page | Type | Quality |
| --- | --- | --- |
| AI Model Steganography | Risk | 91.0 |

Cached Content Preview

HTTP 200 · Fetched Mar 20, 2026 · 36 KB
# Cross-attention watermarking of large language models

###### Abstract

A new approach to linguistic watermarking of language models is presented in which information is imperceptibly inserted into the output text while preserving its readability and original meaning. A cross-attention mechanism is used to embed watermarks in the text during inference. Two methods using cross-attention are presented that minimize the effect of watermarking on the performance of a pretrained model. Exploration of different training strategies for optimizing the watermarking and of the challenges and implications of applying this approach in real-world scenarios clarified the tradeoff between watermark robustness and text quality. Watermark selection substantially affects the generated output for high entropy sentences. This proactive watermarking approach has potential application in future model development.

Index Terms—
Large Language Models, Linguistic Watermarking, Cross Attention, Steganography

## 1 Introduction

Linguistic watermarking refers to the insertion of particular unnoticeable information into a text document while preserving its readability, intended meaning, and ability to withstand noise \[ [1](https://ar5iv.labs.arxiv.org/html/2401.06829#bib.bib1 "")\]. Interest in such watermarking is growing due to the widespread emergence of AI-generated text, which poses risks in various sectors, including online influence campaigns, AI exploitation of authorship, spamming, harassment, malware facilitation, and social engineering \[ [2](https://ar5iv.labs.arxiv.org/html/2401.06829#bib.bib2 ""), [3](https://ar5iv.labs.arxiv.org/html/2401.06829#bib.bib3 "")\]. Text watermarking has various applications, including the tagging of AI-generated text. It can be used to protect intellectual property, detect leaks, and verify the source. In addition, it aids in complying with regulations that mandate identification of AI-generated texts in countries that are implementing such requirements \[ [4](https://ar5iv.labs.arxiv.org/html/2401.06829#bib.bib4 "")\]. Lastly, watermarking can assist language model developers in ensuring that model output is distinguishable to prevent model collapse during training on generated text \[ [5](https://ar5iv.labs.arxiv.org/html/2401.06829#bib.bib5 "")\].

As the output of large language models (LLMs) approaches the quality of human-written text, the two become harder to distinguish. As LLMs improve, multiply, and become easier to fine-tune, existing post-hoc detectors for AI-generated text grow less reliable at identifying such content. These defenses often fail to generalize to real-world scenarios \[ [6](https://ar5iv.labs.arxiv.org/html/2401.06829#bib.bib6 "")\] and break down in adversarial settings \[ [7](https://ar5iv.labs.arxiv.org/html/2401.06829#bib.bib7 "")\]. Intentionally modifying models to embed watermarks in the output text is a promising solution to this challenge.

... (truncated, 36 KB total)
Resource ID: a3a265ae188d4727 | Stable ID: Yzg3OGQyMD