Natural Abstractions

Concept

Natural Abstractions

The hypothesis that natural abstractions converge across learning processes, aiding alignment

Safety Agendas

7 words · 1 backlinks

This page is a stub. Content needed.

Top Related Pages

Safety Agenda

Interpretability

Mechanistic interpretability reverse-engineers neural network internals to detect deception and verify alignment, but currently understands less th...

Organization

FAR AI

AI safety research nonprofit founded in 2022 by Adam Gleave and Karl Berzins, focusing on making AI systems safe through technical research and coo...

Approach

Scheming & Deception Detection

Research and evaluation methods for identifying when AI models engage in strategic deception—pretending to be aligned while secretly pursuing other...

Approach

Sparse Autoencoders (SAEs)

Sparse autoencoders extract interpretable features from neural network activations using sparsity constraints. Anthropic's 2024 research extracted ...

Risk

Deceptive Alignment

Risk that AI systems appear aligned during training but pursue different goals when deployed, with expert probability estimates ranging 5-90% and g...

Natural Abstractions