arXiv

Preprint Server

Good (3)

Open-access preprint server for STEM fields

Resources: 344

Citing pages: 193

Tracked domains

Credibility Rating

3/5

Good (3)

Good quality. Reputable source with community review or editorial standards, but less rigorous than peer-reviewed venues.

Tracked Domains

arxiv.org

Resources (344)


Title	Type	Authors	Date	Summary
Risks from Learned Optimization	paper	Evan Hubinger, Chris van Merwijk +3	2019-06-05	S	17
Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training	paper	Evan Hubinger, Carson Denison +37	2024-01-10	S	11
Anthropic: "Discovering Sycophancy in Language Models"	paper	Mrinank Sharma, Meg Tong +17	2025-01-01	S	10
Debate as Scalable Oversight	paper	Geoffrey Irving, Paul Christiano +1	2018-05-02	S	9
Training Language Models to Follow Instructions with Human Feedback	paper	Long Ouyang, Jeff Wu +18	2022-03-04	S	8
Kaplan et al. (2020)	paper	Jared Kaplan, Sam McCandlish +8	2020-01-23	S	8
Concrete Problems in AI Safety	paper	Dario Amodei, Chris Olah +4	2016-06-21	S	8
AI Control Framework	paper	Ryan Greenblatt, Buck Shlegeris +2	2023-12-12	S	7
Gaming RLHF evaluation	paper	Richard Ngo, Lawrence Chan +1	2022-08-30	S	7
OpenAI's GPT-4	paper	OpenAI, Josh Achiam +279	2023-03-15	S	6
Hoffmann et al. (2022)	paper	Jordan Hoffmann, Sebastian Borgeaud +20	2022-03-29	S	6
Constitutional AI: Harmlessness from AI Feedback	paper	Yanuo Zhou	2025-01-01	S	6
Turner et al. formal results	paper	Alexander Matt Turner, Logan Smith +3	2019-12-03	S	6
AI Alignment: A Comprehensive Survey	paper	Jiaming Ji, Tianyi Qiu +24	2026-01-29	S	6
Kenton et al. (2021)	paper	Stephanie Lin, Jacob Hilton +1	2021-09-08	S	6
Hadfield-Menell et al. (2017)	paper	Dylan Hadfield-Menell, Anca Dragan +2	2016-11-24	S	5
Langosco et al. (2022)	paper	Lauro Langosco, Jack Koch +4	2021-05-28	S	5
Brown et al. (2020)	paper	Tom B. Brown, Benjamin Mann +29	2020-05-28	S	5
Emergent Abilities	paper	Jason Wei, Yi Tay +14	2022-06-15	S	5
Representation Engineering: A Top-Down Approach to AI Transparency	paper	Andy Zou, Long Phan +19	2023-10-02	S	5
Is Power-Seeking AI an Existential Risk?	paper	Joseph Carlsmith	2022-06-16	S	5
Chain-of-thought analysis	paper	Jason Wei, Xuezhi Wang +7	2022-01-28	S	5
Sparse Autoencoders	paper	Leonard Bereska, Efstratios Gavves	2024-04-22	S	5
Perez et al. (2022): "Sycophancy in LLMs"	paper	Ethan Perez, Sam Ringer +61	n/a	S	5
[2312.09390] Weak-to-Strong Generalization: Eliciting Strong Capabilities With Weak Supervision	paper	Collin Burns, Pavel Izmailov +10	2023-12-14	S	4


Citing Pages (193)

AI Accident Risk Cruxes
Adversarial Training
Agent Foundations
Agentic AI
AGI Development
AI Acceleration Tradeoff Model
AI-Assisted Alignment
AI Control
AI-Augmented Forecasting
AI-Powered Investigation
AI Timelines
AI Alignment
Alignment Evaluations
Alignment Robustness Trajectory Model
Anthropic
Authentication Collapse
Authentication Collapse Timeline Model
Biosecurity Interventions (Overview)
Bridgewater AIA Labs
Center for AI Safety (CAIS)
Capabilities-to-Safety Pipeline Model
Capability-Alignment Race Model
Capability Elicitation
AI Capability Threshold Model
Capability Unlearning / Removal
Carlsmith's Six-Premise Argument
The Case Against AI Existential Risk
The Case For AI Existential Risk
Center for Human-Compatible AI (CHAI)
China AI Regulatory Framework
Cooperative IRL (CIRL)
Autonomous Coding
Collective Intelligence / Coordination
X Community Notes
AI Compounding Risks Analysis Model
AI-Driven Concentration of Power
Connor Leahy
Constitutional AI
AI Content Authentication
Controlled Vocabulary for Longtermist Analysis
Cooperative AI
AI Governance Coordination Technologies
Corrigibility
Corrigibility Failure
Corrigibility Failure Pathways
AI Risk Critical Uncertainties Model
AI-Induced Cyber Psychosis
Autonomous Cyber Attack Timeline
Dan Hendrycks
Dangerous Capability Evaluations
Daniela Amodei
Dario Amodei
AI Safety via Debate
Deceptive Alignment
Deceptive Alignment Decomposition Model
Deep Learning Revolution Era
Deepfakes
Google DeepMind
AI Safety Defense in Depth Model
Dense Transformers
AI Distributional Shift
AI Doomer Worldview
Is EA Biosecurity Work Limited to Restricting LLM Biological Use?
Early Warnings Era
AI Policy Effectiveness
Eliciting Latent Knowledge (ELK)
Emergent Capabilities
Epistemic Collapse
AI-Era Epistemic Infrastructure
AI-Era Epistemic Security
Epistemic Sycophancy
Epistemic Virtue Evals
Epoch AI
AI Evaluations
AI Evaluation
Evan Hubinger
FAR AI
AI Flash Dynamics
Formal Verification (AI Safety)
AI-Powered Fraud
Goal Misgeneralization
Goal Misgeneralization Probability Model
Goal Misgeneralization Research
Governance-Focused Worldview
AI Governance and Policy
Hardware Mechanisms for International AI Agreements
Heavy Scaffolding / Agentic Systems
AI-Human Hybrid Systems
Ilya Sutskever
Instrumental Convergence
Instrumental Convergence Framework
Interpretability
Is Interpretability Sufficient for Safety?
AI Safety Intervention Effectiveness Matrix
AI Safety Intervention Portfolio
AI-Induced Irreversibility
Is AI Existential Risk Real?
Jan Leike
Frontier AI Labs (Overview)
Large Language Models
Large Language Models
AI-Driven Legal Evidence Crisis
Light Scaffolding
AI Value Lock-in
Long-Horizon Autonomous Tasks
Long-Timelines Technical Worldview
MAIM (Mutually Assured AI Malfunction)
Mesa-Optimization
Mesa-Optimization Risk Analysis
Meta AI (FAIR)
METR
Minimal Scaffolding
Machine Intelligence Research Institute (MIRI)
Third-Party Model Auditing
Model Organisms of Misalignment
Multi-Agent Safety
Multipolar Trap (AI Development)
Multipolar Trap Dynamics Model
NIST AI Risk Management Framework (AI RMF)
Open Source AI Safety
OpenAI
Optimistic Alignment Worldview
AI Output Filtering
Paul Christiano
Pause AI
Persuasion and Social Manipulation
Polymarket
Power-Seeking AI
Power-Seeking Emergence Conditions Model
Preference Optimization Methods
Probing / Linear Probes
Process Supervision
AI Proliferation
AI Proliferation Risk Model
Provable / Guaranteed Safe AI
AI Risk Public Education
Reasoning and Planning
Reducing Hallucinations in AI-Generated Wiki Content
Redwood Research
Refusal Training
Representation Engineering
AI Alignment Research Agendas
Reward Hacking
Reward Hacking Taxonomy and Severity Model
Reward Modeling
AI Risk Activation Timeline Model
AI Risk Interaction Network Model
RLHF
Safety-Capability Tradeoff Model
AI Safety Research Allocation Model
AI Safety Research Value Model
Sam Altman
AI Capability Sandbagging
Scalable Eval Approaches
Scalable Oversight
AI Scaling Laws
Scheming
Scheming & Deception Detection
Scheming Likelihood Assessment
Scientific Knowledge Corruption
Scientific Research Capabilities
SecureBio
SecureDNA
Self-Improvement and Recursive Enhancement
Seoul Declaration on AI Safety
Sharp Left Turn
Similar Projects to LongtermWiki: Research Report
Situational Awareness
Sleeper Agent Detection
Sleeper Agents: Training Deceptive LLMs
AI Safety Solution Cruxes
Sparse Autoencoders (SAEs)
State-Space Models / Mamba
AI Model Steganography
Structured Access / API-Only
Superintelligence
Sycophancy
Sycophancy Feedback Loop Model
AI Safety Technical Pathway Decomposition
Technical AI Safety Research
Compute Thresholds
Tool-Use Restrictions
Tool Use and Computer Use
Treacherous Turn
AI Trust Cascade Failure
Voluntary AI Safety Commitments
Weak-to-Strong Generalization
Why Alignment Might Be Easy
Why Alignment Might Be Hard
Wikipedia Views
AI Winner-Take-All Dynamics
X.com Platform Epistemics
Yoshua Bengio

Publication ID: arxiv