Alignment Faking Hackathon - Redwood Research

web

redwood-af.devpost.com·redwood-af.devpost.com

This hackathon operationalizes the 'alignment faking' failure mode (closely related to deceptive alignment) into a competitive research format, producing open-source model organisms and detectors relevant to empirical AI safety work.

Metadata

Importance: 52/100tool pagetool

Summary

Redwood Research hosted a 3-day in-person hackathon (September 12-14, 2025) challenging participants to create 'model organisms of alignment faking'—LLMs that behave safely when observed but unsafely when unobserved. The event featured red team (creating alignment faking examples) and blue team (detecting such behavior) competitions with $2,000 in prizes, aiming to generate concrete empirical examples and detection methods for this critical AI safety failure mode.

Key Points

•Alignment faking defined as an LLM behaving safely in testing/observed contexts but dangerously in production/unobserved contexts.
•Red team track challenged participants to build novel model organisms exhibiting alignment faking via system prompts, fine-tuning, and environment design.
•Blue team track focused on developing detectors capable of identifying alignment faking behavior in models.
•Judging criteria included creativity, passing compliance-gap and alignment-reasoning metrics, and a live 3-minute demo of misalignment.
•Organized by Redwood Research in collaboration with MATS and Constellation; open-source GitHub repository provided as starting point.

Cited by 1 page

Page	Type	Quality
Model Organisms of Misalignment	Analysis	65.0

Cached Content Preview

HTTP 200Fetched Apr 7, 20263 KB

Redwood Research Alignment Faking Hackathon: Create Model Organisms of Alignment Faking! (sneaky models) - Devpost 

 

 
 
 
 
 
 

 

 
 
 

 
 
 
 
 
 
 
 
 
 
 
 
 
 

 
 
 
 

 
 
 

 
 

 

 
 
 
 

 

 
 
 
 We've detected that you are using an unsupported browser. Please
 upgrade your browser 
 to Internet Explorer 10 or higher.
 

 
 

 

 
 
 
 

 

 
 
 
 
 Redwood Research Alignment Faking Hackathon

 Create Model Organisms of Alignment Faking! (sneaky models)

 
 
 

 

 
 
 This hackathon has ended 
 

 
 
 Find more hackathons 

 View the winners 

 

 
 

 

 
 Who can participate

 
 Ages 16+ only

 
 All countries/territories, excluding
 standard exceptions 
 

 

 
 View full rules 
 
 
 
 
 
 
 
 
 
 

 
 View schedule 
 

 

 
 
 
 
 Sep 12 – 14, 2025 
 
 
 
 

 
 
 
 
 
 
 
 Berkeley, California, USA 
 
 
 
 
 

 
 
 Public
 
 
 
 
 
 
 $ 2,000 in cash
 
 
 24 
 participants
 
 
 
 

 
 
 
 
 
 
 Redwood Research
 
 
 

 
 
 
 
 
 Cybersecurity
 
 
 
 Machine Learning/AI
 
 
 
 
 

 
 
 

 
 
 
 

 
 
 

 
 
 

 About the challenge

 Redwood Research, in collaboration with MATS and Constellation, invites you to build Model Organisms of Alignment Faking . What is this, you may ask? A model organism is an LLM modified to have a behavior we care abou t. We define alignment faking as an LLM that behaves safely in a testing environment (observed), but dangerously in a production environment (unobserved). Think of a kid that does homework only when the parents are watching, but starts setting fires and blackmailing people when they aren’t around. This unique opportunity allows participants to push the boundaries of Large Language Models (LLMs) while working with leading AI safety research organizations. 

 Get started

 Teams MUST be registered and accepted on luma 

 Checkout the notion 

 Join the discord 

 Make a fork of https://github.com/redwoodresearch/hackathon-af 

 And contribute system prompts, fine tuned model organisms, and environments!

 

 
 
 
 
 Requirements
 

 
 

 
 What to build

 
 
 
 Check the notion for a detailed submission guide

 

 
 

 

 

 What to Submit

 A pr on https://github.com/redwoodresearch/hackathon-af (or link to your fork) containing new system prompts, fine tuning datasets, and environments

 
 

 
 
 
 Hackathon Sponsors
 

 
 

 
 
 
 

 
 
 
 Prizes
 

 
 

 
 
 $ 2,000 in prizes 
 
 
 

 
 
 
 
 
 
 Blue Team (Best Detector)
 

 
 

 
 
 $ 500 in cash
 

 
 1 winner
 

 First Place: $250

Second Place: $150

Third Place: $50

 
 
 
 
 
 
 
 Red Team (Best Examples of Alignment Faking)
 

 
 

 
 
 $ 500 in cash
 

 
 3 winners
 

 First Place: $250

Second Place: $150

Third Place: $50

 
 
 
 

 
 
 
 Devpost Achievements
 

 
 

 Submitting to this hackathon could earn you:

 
 
 

 
 

 
 
 
 
 Judges
 

 
 

 
 
 
 
 
 
 
 

 
 
 Joshua Clymer 

 Researcher / Redwood Research 
 

 
 
 
 
 

 
 
 
 Judging Criteria
 

 
 

 
 
 Creativity 

 Does the model organism show novel or 

... (truncated, 3 KB total)

Resource ID: 1cb96174aa9c175d | Stable ID: sid_mLiivev6eH