Longterm Wiki

Research Areas

AI safety research fields, techniques, and programs — organized by cluster and status.

50 research areas across 7 clusters: 33 active, 16 emerging.

Activation Monitoring (Interpretability, emerging). Using probes and monitors on model internals to detect deception, harmful intent, or anomalous reasoning in real time.
Agentic AI (Capabilities Research, active). Research on AI systems that take autonomous actions: tool use, computer use, multi-step planning.
AI Control (AI Control, active, 2023). Research on deploying AI systems with sufficient safeguards even if they are misaligned, using monitoring, sandboxing, and redundancy.
AI Evaluations (Evaluation, active). Systematic testing and measurement of AI system capabilities, alignment, and safety properties.
AI Reasoning (Capabilities Research, active). Research on chain-of-thought, tree-of-thoughts, and other reasoning improvements.
AI Safety via Debate (Scalable Oversight, active, 2018). Using structured debate between AI systems as a scalable mechanism for humans to judge the quality of AI reasoning.
Alignment Evaluations (Evaluation, active). Evaluations specifically designed to measure alignment properties: honesty, helpfulness, harmlessness, and value adherence.
Alignment Faking (Evaluation, emerging, 2024). Research on whether and how AI systems might pretend to be aligned during evaluation while pursuing different goals at deployment.
Backdoor Detection (Evaluation, active). Detecting adversarially implanted vulnerabilities in model weights.
Capability Elicitation (Evaluation, active). Methods for discovering hidden or latent capabilities in AI systems.
Circuit Breakers (AI Control, active, 2024). Inference-time interventions that halt model execution when unsafe behavior is detected.
Constitutional AI (Alignment Training, active, 2022). Training approach where AI systems critique and revise their own outputs using a set of principles, reducing reliance on human feedback.
Control Evaluations (Evaluation, emerging). Stress-testing systems designed to constrain AI behavior, including monitoring for collusion.
Corrigibility (Scalable Oversight, active, 2015). Research on building AI systems that allow themselves to be corrected, modified, or shut down by human operators.
Dangerous Capability Evaluations (Evaluation, active). Testing AI systems specifically for dangerous capabilities like CBRN knowledge, cyber offense, autonomous replication, and persuasion.
Direct Preference Optimization (Alignment Training, active, 2023). Family of training methods (DPO, KTO, GRPO) that optimize language models directly on preference data without a separate reward model.
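
A minimal sketch of the DPO objective described in the entry above, assuming per-sequence log-probabilities from the policy and a frozen reference model have already been computed (the tensor values and beta are illustrative, not from any particular run):

    import torch
    import torch.nn.functional as F

    def dpo_loss(policy_chosen_logps, policy_rejected_logps,
                 ref_chosen_logps, ref_rejected_logps, beta=0.1):
        # Log-ratios of policy vs. frozen reference model on preferred and rejected responses.
        chosen_ratio = policy_chosen_logps - ref_chosen_logps
        rejected_ratio = policy_rejected_logps - ref_rejected_logps
        # DPO widens the margin between the two ratios, scaled by beta, with no reward model.
        return -F.logsigmoid(beta * (chosen_ratio - rejected_ratio)).mean()

    # Dummy per-sequence log-probabilities for a batch of two preference pairs.
    loss = dpo_loss(torch.tensor([-12.0, -9.5]), torch.tensor([-14.0, -11.0]),
                    torch.tensor([-12.5, -9.8]), torch.tensor([-13.0, -10.5]))
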
Eliciting Latent Knowledge (Scalable Oversight, active, 2022). Extracting what an AI model 'actually believes' rather than what it says, addressing the distinction between model knowledge and model outputs.
Encoded Reasoning Detection (AI Control, emerging). Detecting when AI systems use steganography, hidden channels, or encoded communication to circumvent oversight.
Evaluation Awareness (Evaluation, active). Studying how AI systems might game evaluations by detecting when they are being tested.
Externalizing Reasoning (Interpretability, emerging). Ensuring AI systems' chain-of-thought reasoning faithfully reflects their actual internal computations.
Finding Feature Representations (Interpretability, emerging). Research beyond SAEs into alternative methods for identifying latent features in model activations.
Formal Verification for AI (Scalable Oversight, emerging). Applying mathematical proof techniques to verify safety properties of neural networks and AI systems.
Interpretability (Interpretability, active). Research field focused on understanding how AI models work internally, encompassing all approaches to model transparency.
Interpretability Benchmarks (Interpretability, emerging). Standardized tasks and metrics for comparing interpretability methods.
Jailbreak Research (Evaluation, active). Research on prompt injection, jailbreaking attacks, and defenses for language model safety filters.
Lab Safety Culture (Governance, active). Research on organizational safety practices, whistleblower protections, and internal governance.
Linear Probing (Interpretability, active). Lightweight interpretability using linear classifiers on model activations to detect features.
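
A minimal sketch of a linear probe as described in the entry above, assuming activation vectors have already been cached per example (the arrays here are synthetic stand-ins, so the probe learns nothing meaningful):

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    # Synthetic stand-ins: 200 cached activation vectors (hidden size 512) with binary labels
    # marking whether the feature of interest was present in the corresponding input.
    rng = np.random.default_rng(0)
    activations = rng.normal(size=(200, 512))
    labels = rng.integers(0, 2, size=200)

    # The probe is just a linear classifier trained on frozen activations.
    probe = LogisticRegression(max_iter=1000).fit(activations[:150], labels[:150])
    print("held-out probe accuracy:", probe.score(activations[150:], labels[150:]))
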
Mechanistic Interpretability (Interpretability, active, 2020). Reverse-engineering neural networks to identify circuits, features, and algorithms that explain behavior at a mechanistic level.
Monitoring and Anomaly Detection (AI Control, active). Runtime monitoring of AI system behavior to detect unexpected actions, policy violations, or anomalous patterns.
Multi-Agent Safety (AI Control, emerging). Safety challenges arising from multiple AI agents interacting, including collusion, coordination failures, and emergent behaviors.
Output Filtering (AI Control, active). Post-generation safety filters that screen model outputs before delivery.
Process Supervision (Alignment Training, emerging, 2023). Providing feedback on each step of reasoning rather than just final outputs, enabling more reliable chain-of-thought supervision.
Recursive Self-Improvement (Capabilities Research, active). AI systems improving their own code, training, or architecture.
Red Teaming (Evaluation, active). Adversarial testing of AI systems to discover failure modes, safety issues, and vulnerabilities, both manual and automated.
Representation Engineering (Interpretability, emerging, 2023). Controlling AI behavior by directly manipulating internal representations, including activation addition and steering vectors.
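
A rough sketch of the steering-vector variant of representation engineering mentioned above; the layer index, strength, and the HuggingFace-style module path in the usage comment are illustrative assumptions, not a specific codebase:

    import torch

    def make_steering_hook(direction, strength=4.0):
        # Adds a fixed direction to the hidden states emitted by one transformer layer.
        def hook(module, inputs, output):
            hidden = output[0] if isinstance(output, tuple) else output
            hidden = hidden + strength * direction.to(hidden.dtype)
            return (hidden, *output[1:]) if isinstance(output, tuple) else hidden
        return hook

    # Usage sketch (assumes a decoder exposing model.model.layers[i], as in many HF models):
    # direction = mean_activations_concept_a - mean_activations_concept_b
    # handle = model.model.layers[12].register_forward_hook(make_steering_hook(direction))
    # ... generate text with the steered model, then handle.remove() ...
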
Reward Hacking of Human Oversight (Evaluation, emerging). Empirically investigating how AI systems deceive or manipulate human evaluators.
Reward Modeling (Alignment Training, active, 2017). Training learned reward functions from human preferences to guide AI optimization, including research on reward hacking and gaming.
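
The standard training signal here is a Bradley-Terry style pairwise loss over human comparisons; a minimal sketch, with the scalar scores standing in for outputs of a reward head that is assumed rather than shown:

    import torch
    import torch.nn.functional as F

    def reward_model_loss(chosen_rewards, rejected_rewards):
        # Bradley-Terry objective: the human-preferred response should score higher.
        return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()

    # Dummy scalar rewards for a batch of three human comparisons.
    loss = reward_model_loss(torch.tensor([1.2, 0.3, 2.1]), torch.tensor([0.4, 0.9, 1.0]))
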
RLHF (Alignment Training, mature, 2017). Reinforcement Learning from Human Feedback: a training technique that fine-tunes AI models using human preference ratings to align outputs with human values.
Sandboxing and Containment (AI Control, active). Isolating AI systems in controlled environments to limit their ability to cause harm, including air-gapped execution and capability restrictions.
Scalable Oversight (Scalable Oversight, active). Research on supervising AI systems that approach or exceed human-level capabilities in specific domains.
Scaling Laws (Capabilities Research, active, 2020). Empirical research on how AI capabilities scale with compute, data, and parameters, crucial for forecasting future capabilities.
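
A toy illustration of the functional form this line of work typically fits: loss modeled as an irreducible term plus power laws in parameter count and training tokens. The constants below are placeholders chosen for illustration, not a published fit:

    def predicted_loss(n_params, n_tokens, a=400.0, alpha=0.34, b=400.0, beta=0.28, e=1.7):
        # L(N, D) = E + A / N^alpha + B / D^beta, with placeholder constants.
        return e + a / n_params**alpha + b / n_tokens**beta

    print(predicted_loss(n_params=7e9, n_tokens=1.4e12))
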
Scheming Detection (Evaluation, emerging). Research on detecting when AI systems are engaged in deceptive alignment or strategic manipulation of their training process.
Sleeper Agent Detection (Evaluation, emerging, 2024). Research on detecting and mitigating backdoors, trojans, and time-delayed deceptive behavior in AI systems.
Sparse Autoencoders (Interpretability, active, 2023). Using sparse dictionary learning to decompose neural network activations into interpretable features, enabling monosemanticity research.
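
A minimal sketch of the sparse autoencoder setup described above; the dimensions, ReLU encoder, and L1 coefficient are illustrative assumptions rather than a specific published configuration:

    import torch
    import torch.nn as nn

    class SparseAutoencoder(nn.Module):
        # Decomposes d_model-dimensional activations into a wider dictionary of n_features
        # sparse features, trained on reconstruction error plus an L1 sparsity penalty.
        def __init__(self, d_model=512, n_features=4096):
            super().__init__()
            self.encoder = nn.Linear(d_model, n_features)
            self.decoder = nn.Linear(n_features, d_model)

        def forward(self, activations):
            features = torch.relu(self.encoder(activations))
            return self.decoder(features), features

    sae = SparseAutoencoder()
    acts = torch.randn(32, 512)  # stand-in batch of residual-stream activations
    recon, feats = sae(acts)
    loss = (recon - acts).pow(2).mean() + 1e-3 * feats.abs().mean()  # reconstruction + L1
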
Structured Access / API-Only (AI Control, active, 2022). Deployment models that restrict access to model weights, providing only API interfaces.
Tool-Use Restrictions (AI Control, active). Limiting which external tools and actions AI agents can access.
Toy Models for Interpretability (Interpretability, active). Small, simplified model proxies that capture key deep learning dynamics for interpretability research.
Transparent Architectures (Interpretability, emerging). Designing neural network architectures that are inherently more interpretable, as opposed to post-hoc interpretation of opaque models.
Value Learning (Scalable Oversight, active, 2016). Research on AI systems that learn and internalize human values through interaction, observation, or inference.
Weak-to-Strong Generalization (Alignment Training, emerging, 2023). Research on whether weak supervisors can effectively train stronger AI systems, a core challenge for superalignment.