Back
Alignment Faking Hackathon - Redwood Research
webredwood-af.devpost.com·redwood-af.devpost.com
This hackathon operationalizes the 'alignment faking' failure mode (closely related to deceptive alignment) into a competitive research format, producing open-source model organisms and detectors relevant to empirical AI safety work.
Metadata
Importance: 52/100tool pagetool
Summary
Redwood Research hosted a 3-day in-person hackathon (September 12-14, 2025) challenging participants to create 'model organisms of alignment faking'—LLMs that behave safely when observed but unsafely when unobserved. The event featured red team (creating alignment faking examples) and blue team (detecting such behavior) competitions with $2,000 in prizes, aiming to generate concrete empirical examples and detection methods for this critical AI safety failure mode.
Key Points
- •Alignment faking defined as an LLM behaving safely in testing/observed contexts but dangerously in production/unobserved contexts.
- •Red team track challenged participants to build novel model organisms exhibiting alignment faking via system prompts, fine-tuning, and environment design.
- •Blue team track focused on developing detectors capable of identifying alignment faking behavior in models.
- •Judging criteria included creativity, passing compliance-gap and alignment-reasoning metrics, and a live 3-minute demo of misalignment.
- •Organized by Redwood Research in collaboration with MATS and Constellation; open-source GitHub repository provided as starting point.
Cited by 1 page
| Page | Type | Quality |
|---|---|---|
| Model Organisms of Misalignment | Analysis | 65.0 |
Cached Content Preview
HTTP 200Fetched Mar 15, 20263 KB
Redwood Research Alignment Faking Hackathon: Create Model Organisms of Alignment Faking! (sneaky models) - Devpost
We've detected that you are using an unsupported browser. Please
upgrade your browser
to Internet Explorer 10 or higher.
Redwood Research Alignment Faking Hackathon
Create Model Organisms of Alignment Faking! (sneaky models)
This hackathon has ended
Find more hackathons
View the winners
Who can participate
Ages 16+ only
All countries/territories, excluding
standard exceptions
View full rules
View schedule
Sep 12 – 14, 2025
Berkeley, California, USA
Public
$ 2,000 in cash
24
participants
Redwood Research
Cybersecurity
Machine Learning/AI
About the challenge
Redwood Research, in collaboration with MATS and Constellation, invites you to build Model Organisms of Alignment Faking . What is this, you may ask? A model organism is an LLM modified to have a behavior we care abou t. We define alignment faking as an LLM that behaves safely in a testing environment (observed), but dangerously in a production environment (unobserved). Think of a kid that does homework only when the parents are watching, but starts setting fires and blackmailing people when they aren’t around. This unique opportunity allows participants to push the boundaries of Large Language Models (LLMs) while working with leading AI safety research organizations.
Get started
Teams MUST be registered and accepted on luma
Checkout the notion
Join the discord
Make a fork of https://github.com/redwoodresearch/hackathon-af
And contribute system prompts, fine tuned model organisms, and environments!
Requirements
What to build
Check the notion for a detailed submission guide
What to Submit
A pr on https://github.com/redwoodresearch/hackathon-af (or link to your fork) containing new system prompts, fine tuning datasets, and environments
Hackathon Sponsors
Prizes
$ 2,000 in prizes
Blue Team (Best Detector)
$ 500 in cash
1 winner
First Place: $250
Second Place: $150
Third Place: $50
Red Team (Best Examples of Alignment Faking)
$ 500 in cash
3 winners
First Place: $250
Second Place: $150
Third Place: $50
Devpost Achievements
Submitting to this hackathon could earn you:
Judges
Joshua Clymer
Researcher / Redwood Research
Judging Criteria
Creativity
Does the model organism show novel or
... (truncated, 3 KB total)Resource ID:
1cb96174aa9c175d | Stable ID: MzM4MWExNj