Research Areas
AI safety research fields, techniques, and programs — organized by cluster and status.
50 research areas across 7 clusters: 33 active, 16 emerging, and 1 mature.
| Research area | Description | Cluster | Status | Since |
|---|---|---|---|---|
| Activation Monitoring | Using probes and monitors on model internals to detect deception, harmful intent, or anomalous reasoning in real time. | Interpretability | Emerging | - |
| Agentic AI | Research on AI systems that take autonomous actions: tool use, computer use, multi-step planning. | Capabilities Research | Active | - |
| AI Control | Research on deploying AI systems with sufficient safeguards even if they are misaligned, using monitoring, sandboxing, and redundancy. | AI Control | Active | 2023 |
| AI Evaluations | Systematic testing and measurement of AI system capabilities, alignment, and safety properties. | Evaluation | Active | - |
| AI Reasoning | Research on chain-of-thought, tree-of-thoughts, and other reasoning improvements. | Capabilities Research | Active | - |
| AI Safety via Debate | Using structured debate between AI systems as a scalable mechanism for humans to judge the quality of AI reasoning. | Scalable Oversight | Active | 2018 |
| Alignment Evaluations | Evaluations specifically designed to measure alignment properties: honesty, helpfulness, harmlessness, and value adherence. | Evaluation | Active | - |
| Alignment Faking | Research on whether and how AI systems might pretend to be aligned during evaluation while pursuing different goals at deployment. | Evaluation | Emerging | 2024 |
| Backdoor Detection | Detecting adversarially implanted vulnerabilities in model weights. | Evaluation | Active | - |
| Capability Elicitation | Methods for discovering hidden or latent capabilities in AI systems. | Evaluation | Active | - |
| Circuit Breakers | Inference-time interventions that halt model execution when unsafe behavior is detected. | AI Control | Active | 2024 |
| Constitutional AI | Training approach in which AI systems critique and revise their own outputs using a set of principles, reducing reliance on human feedback. | Alignment Training | Active | 2022 |
| Control Evaluations | Stress-testing the safeguards designed to constrain AI behavior, including monitoring for collusion. | Evaluation | Emerging | - |
| Corrigibility | Research on building AI systems that allow themselves to be corrected, modified, or shut down by human operators. | Scalable Oversight | Active | 2015 |
| Dangerous Capability Evaluations | Testing AI systems specifically for dangerous capabilities such as CBRN knowledge, cyber offense, autonomous replication, and persuasion. | Evaluation | Active | - |
| Direct Preference Optimization | Family of training methods (DPO, KTO, GRPO) that optimize language models directly on preference data without a separate reward model; see the DPO sketch after this table. | Alignment Training | Active | 2023 |
| Eliciting Latent Knowledge | Extracting what an AI model 'actually believes' rather than what it says, addressing the distinction between model knowledge and model outputs. | Scalable Oversight | Active | 2022 |
| Encoded Reasoning Detection | Detecting when AI systems use steganography, hidden channels, or encoded communication to circumvent oversight. | AI Control | Emerging | - |
| Evaluation Awareness | Studying how AI systems might game evaluations by detecting when they are being tested. | Evaluation | Active | - |
| Externalizing Reasoning | Ensuring that AI systems' chain-of-thought reasoning faithfully reflects their actual internal computations. | Interpretability | Emerging | - |
| Finding Feature Representations | Research beyond SAEs into alternative methods for identifying latent features in model activations. | Interpretability | Emerging | - |
| Formal Verification for AI | Applying mathematical proof techniques to verify safety properties of neural networks and AI systems. | Scalable Oversight | Emerging | - |
| Interpretability | Research field focused on understanding how AI models work internally, encompassing all approaches to model transparency. | Interpretability | Active | - |
| Interpretability Benchmarks | Standardized tasks and metrics for comparing interpretability methods. | Interpretability | Emerging | - |
| Jailbreak Research | Research on prompt injection, jailbreaking attacks, and defenses for language model safety filters. | Evaluation | Active | - |
| Lab Safety Culture | Research on organizational safety practices, whistleblower protections, and internal governance. | Governance | Active | - |
| Linear Probing | Lightweight interpretability using linear classifiers on model activations to detect features; see the probe sketch after this table. | Interpretability | Active | - |
| Mechanistic Interpretability | Reverse-engineering neural networks to identify the circuits, features, and algorithms that explain behavior at a mechanistic level. | Interpretability | Active | 2020 |
| Monitoring and Anomaly Detection | Runtime monitoring of AI system behavior to detect unexpected actions, policy violations, or anomalous patterns. | AI Control | Active | - |
| Multi-Agent Safety | Safety challenges arising from multiple AI agents interacting, including collusion, coordination failures, and emergent behaviors. | AI Control | Emerging | - |
| Output Filtering | Post-generation safety filters that screen model outputs before delivery. | AI Control | Active | - |
| Process Supervision | Providing feedback on each step of reasoning rather than only final outputs, enabling more reliable chain-of-thought supervision. | Alignment Training | Emerging | 2023 |
| Recursive Self-Improvement | AI systems improving their own code, training, or architecture. | Capabilities Research | Active | - |
| Red Teaming | Adversarial testing of AI systems, both manual and automated, to discover failure modes, safety issues, and vulnerabilities. | Evaluation | Active | - |
| Representation Engineering | Controlling AI behavior by directly manipulating internal representations, including activation addition and steering vectors; see the steering sketch after this table. | Interpretability | Emerging | 2023 |
| Reward Hacking of Human Oversight | Empirically investigating how AI systems deceive or manipulate human evaluators. | Evaluation | Emerging | - |
| Reward Modeling | Training learned reward functions from human preferences to guide AI optimization, including research on reward hacking and gaming. | Alignment Training | Active | 2017 |
| RLHF | Reinforcement Learning from Human Feedback: a training technique that fine-tunes AI models on human preference ratings to align outputs with human values. | Alignment Training | Mature | 2017 |
| Sandboxing and Containment | Isolating AI systems in controlled environments to limit their ability to cause harm, including air-gapped execution and capability restrictions. | AI Control | Active | - |
| Scalable Oversight | Research on supervising AI systems that approach or exceed human-level capabilities in specific domains. | Scalable Oversight | Active | - |
| Scaling Laws | Empirical research on how AI capabilities scale with compute, data, and parameters, crucial for forecasting future capabilities. | Capabilities Research | Active | 2020 |
| Scheming Detection | Research on detecting when AI systems are engaged in deceptive alignment or strategic manipulation of their training process. | Evaluation | Emerging | - |
| Sleeper Agent Detection | Research on detecting and mitigating backdoors, trojans, and time-delayed deceptive behavior in AI systems. | Evaluation | Emerging | 2024 |
| Sparse Autoencoders | Using sparse dictionary learning to decompose neural network activations into interpretable features, enabling monosemanticity research; see the SAE sketch after this table. | Interpretability | Active | 2023 |
| Structured Access / API-Only | Deployment models that restrict access to model weights, providing only API interfaces. | AI Control | Active | 2022 |
| Tool-Use Restrictions | Limiting which external tools and actions AI agents can access. | AI Control | Active | - |
| Toy Models for Interpretability | Small, simplified models that capture key deep-learning dynamics and serve as tractable proxies for interpretability research. | Interpretability | Active | - |
| Transparent Architectures | Designing neural network architectures that are inherently more interpretable, as opposed to post-hoc interpretation of opaque models. | Interpretability | Emerging | - |
| Value Learning | Research on AI systems that learn and internalize human values through interaction, observation, or inference. | Scalable Oversight | Active | 2016 |
| Weak-to-Strong Generalization | Research on whether weak supervisors can effectively train stronger AI systems, a core challenge for superalignment. | Alignment Training | Emerging | 2023 |
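
Illustrative sketches

A few technique-level entries above are concrete enough that short sketches may help; these are minimal examples under stated assumptions, not reference implementations. First, Direct Preference Optimization: the sketch below implements the standard DPO loss and applies it to made-up sequence log-probabilities (the tensor values are toy stand-ins, not outputs of a real model).

```python
# Minimal DPO loss sketch on toy log-probabilities.
# logp_* are summed token log-probs of the chosen/rejected responses
# under the trained policy and a frozen reference model.
import torch
import torch.nn.functional as F

def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Negative log-sigmoid of the scaled difference in implicit rewards."""
    chosen_margin = logp_chosen - ref_logp_chosen        # implicit reward of the preferred response
    rejected_margin = logp_rejected - ref_logp_rejected  # implicit reward of the dispreferred response
    return -F.logsigmoid(beta * (chosen_margin - rejected_margin)).mean()

# Toy values standing in for real sequence log-probabilities.
loss = dpo_loss(torch.tensor([-12.0]), torch.tensor([-15.0]),
                torch.tensor([-13.0]), torch.tensor([-14.0]))
print(loss.item())
```

The loss pushes the policy to widen the log-probability margin of the preferred response relative to the reference model, with beta controlling how far the policy may drift from that reference.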
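
For Linear Probing, a minimal sketch: fit a logistic-regression probe on cached layer activations. The activation matrix here is random, so the probe scores near chance; with real activations and a linearly decodable feature it would do better.

```python
# Minimal linear-probe sketch on hypothetical cached activations.
# Random stand-ins replace real layer activations so the example runs on its own.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n_examples, d_model = 2000, 512

acts = rng.normal(size=(n_examples, d_model))      # stand-in for one layer's activations
labels = rng.integers(0, 2, size=n_examples)       # stand-in binary labels (e.g. deceptive vs. honest)

X_train, X_test, y_train, y_test = train_test_split(acts, labels, test_size=0.2, random_state=0)

probe = LogisticRegression(max_iter=1000)           # the "probe" is just a linear classifier
probe.fit(X_train, y_train)
print("probe accuracy:", probe.score(X_test, y_test))  # ~0.5 on random data
```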
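
For Representation Engineering, one common recipe is difference-of-means steering: derive a direction by contrasting activations from concept-bearing and neutral prompts, then add a scaled copy of that direction during a forward pass. The arrays below are random stand-ins and the scale alpha is illustrative.

```python
# Toy activation-steering sketch: difference-of-means direction plus scaled addition.
import numpy as np

rng = np.random.default_rng(0)
d_model = 512

acts_concept = rng.normal(loc=0.3, size=(200, d_model))   # activations on prompts exhibiting a concept
acts_baseline = rng.normal(loc=0.0, size=(200, d_model))   # activations on neutral prompts

steering_vector = acts_concept.mean(axis=0) - acts_baseline.mean(axis=0)

def steer(activations, alpha=4.0):
    """Add the scaled steering vector to a layer's activations at inference time."""
    return activations + alpha * steering_vector

steered = steer(rng.normal(size=(1, d_model)))
```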
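
For Sparse Autoencoders, a minimal sketch of the basic training loop: reconstruct activations through an overcomplete ReLU dictionary while penalizing the L1 norm of the feature activations. The width, dictionary size, L1 coefficient, and synthetic data are illustrative choices, not values from any published SAE.

```python
# Minimal sparse autoencoder sketch (PyTorch) on stand-in activations.
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model: int, d_dict: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_dict)    # overcomplete dictionary of candidate features
        self.decoder = nn.Linear(d_dict, d_model)

    def forward(self, x):
        f = torch.relu(self.encoder(x))              # sparse, non-negative feature activations
        x_hat = self.decoder(f)                      # reconstruction from the feature basis
        return x_hat, f

sae = SparseAutoencoder(d_model=512, d_dict=4096)
opt = torch.optim.Adam(sae.parameters(), lr=1e-3)
l1_coeff = 1e-3                                      # trades reconstruction fidelity against sparsity

acts = torch.randn(1024, 512)                        # stand-in for cached model activations
for step in range(100):
    x_hat, f = sae(acts)
    loss = ((x_hat - acts) ** 2).mean() + l1_coeff * f.abs().mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
```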