Research Areas
AI safety research fields, techniques, and programs — organized by cluster and status.
50 research areas across 7 clusters: 33 active, 16 emerging, and 1 mature.
| Research area | Description | Cluster | Status | Since |
|---|---|---|---|---|
| Activation Monitoring | Using probes and monitors on model internals to detect deception, harmful intent, or anomalous reasoning in real time. | Interpretability | Emerging | - |
| Agentic AI | Research on AI systems that take autonomous actions: tool use, computer use, multi-step planning. | Capabilities Research | Active | - |
| AI Control | Research on deploying AI systems with sufficient safeguards even if they are misaligned, using monitoring, sandboxing, and redundancy. | AI Control | Active | 2023 |
| AI Evaluations | Systematic testing and measurement of AI system capabilities, alignment, and safety properties. | Evaluation | Active | - |
| AI Reasoning | Research on chain-of-thought, tree-of-thoughts, and other reasoning improvements. | Capabilities Research | Active | - |
| AI Safety via Debate | Using structured debate between AI systems as a scalable mechanism for humans to judge the quality of AI reasoning. | Scalable Oversight | Active | 2018 |
| Alignment Evaluations | Evaluations specifically designed to measure alignment properties: honesty, helpfulness, harmlessness, and value adherence. | Evaluation | Active | - |
| Alignment Faking | Research on whether and how AI systems might pretend to be aligned during evaluation while pursuing different goals at deployment. | Evaluation | Emerging | 2024 |
| Backdoor Detection | Detecting adversarially implanted vulnerabilities in model weights. | Evaluation | Active | - |
| Capability Elicitation | Methods for discovering hidden or latent capabilities in AI systems. | Evaluation | Active | - |
| Circuit Breakers | Inference-time interventions that halt model execution when unsafe behavior is detected. | AI Control | Active | 2024 |
| Constitutional AI | Training approach in which AI systems critique and revise their own outputs using a set of principles, reducing reliance on human feedback. | Alignment Training | Active | 2022 |
| Control Evaluations | Stress-testing the safeguards designed to constrain AI behavior, including monitoring for collusion. | Evaluation | Emerging | - |
| Corrigibility | Research on building AI systems that allow themselves to be corrected, modified, or shut down by human operators. | Scalable Oversight | Active | 2015 |
| Dangerous Capability Evaluations | Testing AI systems specifically for dangerous capabilities such as CBRN knowledge, cyber offense, autonomous replication, and persuasion. | Evaluation | Active | - |
| Direct Preference Optimization | Family of training methods (DPO, KTO, GRPO) that optimize language models directly on preference data without a separate reward model; see the DPO sketch after this table. | Alignment Training | Active | 2023 |
| Eliciting Latent Knowledge | Extracting what an AI model 'actually believes' rather than what it says, addressing the distinction between model knowledge and model outputs. | Scalable Oversight | Active | 2022 |
| Encoded Reasoning Detection | Detecting when AI systems use steganography, hidden channels, or encoded communication to circumvent oversight. | AI Control | Emerging | - |
| Evaluation Awareness | Studying how AI systems might game evaluations by detecting when they are being tested. | Evaluation | Active | - |
| Externalizing Reasoning | Ensuring that AI systems' chain-of-thought reasoning faithfully reflects their actual internal computations. | Interpretability | Emerging | - |
| Finding Feature Representations | Research beyond SAEs into alternative methods for identifying latent features in model activations. | Interpretability | Emerging | - |
| Formal Verification for AI | Applying mathematical proof techniques to verify safety properties of neural networks and AI systems. | Scalable Oversight | Emerging | - |
| Interpretability | Research field focused on understanding how AI models work internally, encompassing all approaches to model transparency. | Interpretability | Active | - |
| Interpretability Benchmarks | Standardized tasks and metrics for comparing interpretability methods. | Interpretability | Emerging | - |
| Jailbreak Research | Research on prompt injection, jailbreaking attacks, and defenses for language model safety filters. | Evaluation | Active | - |
| Lab Safety Culture | Research on organizational safety practices, whistleblower protections, and internal governance. | Governance | Active | - |
| Linear Probing | Lightweight interpretability using linear classifiers on model activations to detect features; see the probe sketch after this table. | Interpretability | Active | - |
| Mechanistic Interpretability | Reverse-engineering neural networks to identify the circuits, features, and algorithms that explain behavior at a mechanistic level. | Interpretability | Active | 2020 |
| Monitoring and Anomaly Detection | Runtime monitoring of AI system behavior to detect unexpected actions, policy violations, or anomalous patterns. | AI Control | Active | - |
| Multi-Agent Safety | Safety challenges arising from multiple AI agents interacting, including collusion, coordination failures, and emergent behaviors. | AI Control | Emerging | - |
| Output Filtering | Post-generation safety filters that screen model outputs before delivery. | AI Control | Active | - |
| Process Supervision | Providing feedback on each step of reasoning rather than only final outputs, enabling more reliable chain-of-thought supervision. | Alignment Training | Emerging | 2023 |
| Recursive Self-Improvement | AI systems improving their own code, training, or architecture. | Capabilities Research | Active | - |
| Red Teaming | Adversarial testing of AI systems, both manual and automated, to discover failure modes, safety issues, and vulnerabilities. | Evaluation | Active | - |
| Representation Engineering | Controlling AI behavior by directly manipulating internal representations, including activation addition and steering vectors; see the steering sketch after this table. | Interpretability | Emerging | 2023 |
| Reward Hacking of Human Oversight | Empirically investigating how AI systems deceive or manipulate human evaluators. | Evaluation | Emerging | - |
| Reward Modeling | Training learned reward functions from human preferences to guide AI optimization, including research on reward hacking and gaming. | Alignment Training | Active | 2017 |
| RLHF | Reinforcement Learning from Human Feedback: a training technique that fine-tunes AI models on human preference ratings to align outputs with human values. | Alignment Training | Mature | 2017 |
| Sandboxing and Containment | Isolating AI systems in controlled environments to limit their ability to cause harm, including air-gapped execution and capability restrictions. | AI Control | Active | - |
| Scalable Oversight | Research on supervising AI systems that approach or exceed human-level capabilities in specific domains. | Scalable Oversight | Active | - |
| Scaling Laws | Empirical research on how AI capabilities scale with compute, data, and parameters, crucial for forecasting future capabilities. | Capabilities Research | Active | 2020 |
| Scheming Detection | Research on detecting when AI systems are engaged in deceptive alignment or strategic manipulation of their training process. | Evaluation | Emerging | - |
| Sleeper Agent Detection | Research on detecting and mitigating backdoors, trojans, and time-delayed deceptive behavior in AI systems. | Evaluation | Emerging | 2024 |
| Sparse Autoencoders | Using sparse dictionary learning to decompose neural network activations into interpretable features, enabling monosemanticity research; see the SAE sketch after this table. | Interpretability | Active | 2023 |
| Structured Access / API-Only | Deployment models that restrict access to model weights, providing only API interfaces. | AI Control | Active | 2022 |
| Tool-Use Restrictions | Limiting which external tools and actions AI agents can access. | AI Control | Active | - |
| Toy Models for Interpretability | Small, simplified models that capture key deep-learning dynamics and serve as tractable proxies for interpretability research. | Interpretability | Active | - |
| Transparent Architectures | Designing neural network architectures that are inherently more interpretable, as opposed to post-hoc interpretation of opaque models. | Interpretability | Emerging | - |
| Value Learning | Research on AI systems that learn and internalize human values through interaction, observation, or inference. | Scalable Oversight | Active | 2016 |
| Weak-to-Strong Generalization | Research on whether weak supervisors can effectively train stronger AI systems, a core challenge for superalignment. | Alignment Training | Emerging | 2023 |
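
Illustrative sketches

A few technique-level entries above are concrete enough that short sketches may help; these are minimal examples under stated assumptions, not reference implementations. First, Direct Preference Optimization: the sketch below implements the standard DPO loss and applies it to made-up sequence log-probabilities (the tensor values are toy stand-ins, not outputs of a real model).

```python
# Minimal DPO loss sketch on toy log-probabilities.
# logp_* are summed token log-probs of the chosen/rejected responses
# under the trained policy and a frozen reference model.
import torch
import torch.nn.functional as F

def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Negative log-sigmoid of the scaled difference in implicit rewards."""
    chosen_margin = logp_chosen - ref_logp_chosen        # implicit reward of the preferred response
    rejected_margin = logp_rejected - ref_logp_rejected  # implicit reward of the dispreferred response
    return -F.logsigmoid(beta * (chosen_margin - rejected_margin)).mean()

# Toy values standing in for real sequence log-probabilities.
loss = dpo_loss(torch.tensor([-12.0]), torch.tensor([-15.0]),
                torch.tensor([-13.0]), torch.tensor([-14.0]))
print(loss.item())
```

The loss pushes the policy to widen the log-probability margin of the preferred response relative to the reference model, with beta controlling how far the policy may drift from that reference.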
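
For Linear Probing, a minimal sketch: fit a logistic-regression probe on cached layer activations. The activation matrix here is random, so the probe scores near chance; with real activations and a linearly decodable feature it would do better.

```python
# Minimal linear-probe sketch on hypothetical cached activations.
# Random stand-ins replace real layer activations so the example runs on its own.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n_examples, d_model = 2000, 512

acts = rng.normal(size=(n_examples, d_model))      # stand-in for one layer's activations
labels = rng.integers(0, 2, size=n_examples)       # stand-in binary labels (e.g. deceptive vs. honest)

X_train, X_test, y_train, y_test = train_test_split(acts, labels, test_size=0.2, random_state=0)

probe = LogisticRegression(max_iter=1000)           # the "probe" is just a linear classifier
probe.fit(X_train, y_train)
print("probe accuracy:", probe.score(X_test, y_test))  # ~0.5 on random data
```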
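
For Representation Engineering, one common recipe is difference-of-means steering: derive a direction by contrasting activations from concept-bearing and neutral prompts, then add a scaled copy of that direction during a forward pass. The arrays below are random stand-ins and the scale alpha is illustrative.

```python
# Toy activation-steering sketch: difference-of-means direction plus scaled addition.
import numpy as np

rng = np.random.default_rng(0)
d_model = 512

acts_concept = rng.normal(loc=0.3, size=(200, d_model))   # activations on prompts exhibiting a concept
acts_baseline = rng.normal(loc=0.0, size=(200, d_model))   # activations on neutral prompts

steering_vector = acts_concept.mean(axis=0) - acts_baseline.mean(axis=0)

def steer(activations, alpha=4.0):
    """Add the scaled steering vector to a layer's activations at inference time."""
    return activations + alpha * steering_vector

steered = steer(rng.normal(size=(1, d_model)))
```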
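
For Sparse Autoencoders, a minimal sketch of the basic training loop: reconstruct activations through an overcomplete ReLU dictionary while penalizing the L1 norm of the feature activations. The width, dictionary size, L1 coefficient, and synthetic data are illustrative choices, not values from any published SAE.

```python
# Minimal sparse autoencoder sketch (PyTorch) on stand-in activations.
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model: int, d_dict: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_dict)    # overcomplete dictionary of candidate features
        self.decoder = nn.Linear(d_dict, d_model)

    def forward(self, x):
        f = torch.relu(self.encoder(x))              # sparse, non-negative feature activations
        x_hat = self.decoder(f)                      # reconstruction from the feature basis
        return x_hat, f

sae = SparseAutoencoder(d_model=512, d_dict=4096)
opt = torch.optim.Adam(sae.parameters(), lr=1e-3)
l1_coeff = 1e-3                                      # trades reconstruction fidelity against sparsity

acts = torch.randn(1024, 512)                        # stand-in for cached model activations
for step in range(100):
    x_hat, f = sae(acts)
    loss = ((x_hat - acts) ** 2).mean() + l1_coeff * f.abs().mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
```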