Natural Abstractions
The hypothesis that a wide variety of learning processes converge on the same "natural" abstractions, and that this convergence would aid alignment
This page is a stub. Content needed.
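The convergence claim can be made concrete with a toy experiment: if two networks are trained on the same task from different random initialisations, do they end up with similar internal representations? The sketch below is a minimal illustration of that weaker intuition, not a test of the hypothesis itself; the synthetic data, the tiny scikit-learn MLPs, and the use of linear CKA as the similarity measure are all assumptions chosen for brevity.

```python
# Illustrative sketch only: checks whether two independently initialised
# networks trained on the same task learn similar hidden representations.
# Data, architecture, and similarity measure are assumptions, not a
# canonical test of the Natural Abstraction Hypothesis.
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 10))
# Target depends only on a low-dimensional "natural" summary of the input.
y = np.sin(X[:, 0]) + 0.5 * X[:, 1] ** 2

def hidden_activations(mlp, X):
    """First-layer ReLU activations of a fitted sklearn MLPRegressor."""
    return np.maximum(0.0, X @ mlp.coefs_[0] + mlp.intercepts_[0])

def linear_cka(A, B):
    """Linear centered kernel alignment between two activation matrices."""
    A = A - A.mean(axis=0)
    B = B - B.mean(axis=0)
    num = np.linalg.norm(A.T @ B, "fro") ** 2
    den = np.linalg.norm(A.T @ A, "fro") * np.linalg.norm(B.T @ B, "fro")
    return num / den

# Two learning processes that differ only in their random initialisation.
reps = []
for seed in (1, 2):
    mlp = MLPRegressor(hidden_layer_sizes=(32,), activation="relu",
                       random_state=seed, max_iter=3000).fit(X, y)
    reps.append(hidden_activations(mlp, X))

# High CKA suggests the two runs converged on similar abstractions of the input.
print(f"CKA between independently trained hidden layers: {linear_cka(*reps):.3f}")
```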
Related pages:
Mechanistic interpretability: reverse-engineering neural network internals to detect deception and verify alignment.
FAR AI: AI safety research nonprofit founded in 2022 by Adam Gleave and Karl Berzins.
Deception detection: research and evaluation methods for identifying when AI models engage in strategic deception.
Sparse autoencoders: extracting interpretable features from neural network activations using sparsity constraints.
Deceptive alignment: the risk that AI systems appear aligned during training but pursue different goals when deployed, with expert probability estimates ranging from 5% to 90%.