arXiv
Preprint Server · Good (3)
Open-access preprint server for STEM fields
Credibility Rating: 3/5 (Good)
Good quality. Reputable source with community review or editorial standards, but less rigorous than peer-reviewed venues.
Resources: 344
Citing pages: 192
Tracked domains: 1
Tracked Domains
arxiv.org
Resources (344)
| Title | Type | Authors | Date | Summary | |
|---|---|---|---|---|---|
| Risks from Learned Optimization | paper | Evan Hubinger, Chris van Merwijk +3 | 2019-06-05 | S | 17 |
| Anthropic: "Discovering Sycophancy in Language Models" | paper | Sharma, Mrinank, Tong, Meg +17 | 2025-01-01 | S | 10 |
| Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training | paper | Evan Hubinger, Carson Denison +37 | 2024-01-10 | S | 10 |
| Debate as Scalable Oversight | paper | Geoffrey Irving, Paul Christiano +1 | 2018-05-02 | S | 9 |
| Training Language Models to Follow Instructions with Human Feedback | paper | Long Ouyang, Jeff Wu +18 | 2022-03-04 | S | 8 |
| Kaplan et al. (2020) | paper | Jared Kaplan, Sam McCandlish +8 | 2020-01-23 | S | 8 |
| Concrete Problems in AI Safety | paper | Dario Amodei, Chris Olah +4 | 2016-06-21 | S | 8 |
| AI Control Framework | paper | Ryan Greenblatt, Buck Shlegeris +2 | 2023-12-12 | S | 7 |
| Gaming RLHF evaluation | paper | Richard Ngo, Lawrence Chan +1 | 2022-08-30 | S | 7 |
| OpenAI's GPT-4 | paper | OpenAI, Josh Achiam +279 | 2023-03-15 | S | 6 |
| Hoffmann et al. (2022) | paper | Jordan Hoffmann, Sebastian Borgeaud +20 | 2022-03-29 | S | 6 |
| Turner et al. formal results | paper | Alexander Matt Turner, Logan Smith +3 | 2019-12-03 | S | 6 |
| AI Alignment: A Comprehensive Survey | paper | Ji, Jiaming, Qiu, Tianyi +24 | 2026-01-29 | S | 6 |
| Kenton et al. (2021) | paper | Stephanie Lin, Jacob Hilton +1 | 2021-09-08 | S | 6 |
| Hadfield-Menell et al. (2017) | paper | Dylan Hadfield-Menell, Anca Dragan +2 | 2016-11-24 | S | 5 |
| Langosco et al. (2022) | paper | Lauro Langosco, Jack Koch +4 | 2021-05-28 | S | 5 |
| Brown et al. (2020) | paper | Tom B. Brown, Benjamin Mann +29 | 2020-05-28 | S | 5 |
| Emergent Abilities | paper | Jason Wei, Yi Tay +14 | 2022-06-15 | S | 5 |
| Representation Engineering: A Top-Down Approach to AI Transparency | paper | Andy Zou, Long Phan +19 | 2023-10-02 | S | 5 |
| Constitutional AI: Harmlessness from AI Feedback | paper | Yanuo Zhou | 2025-01-01 | S | 5 |
| Is Power-Seeking AI an Existential Risk? | paper | Joseph Carlsmith | 2022-06-16 | S | 5 |
| Chain-of-thought analysis | paper | Jason Wei, Xuezhi Wang +7 | 2022-01-28 | S | 5 |
| Sparse Autoencoders | paper | Leonard Bereska, Efstratios Gavves | 2024-04-22 | S | 5 |
| Perez et al. (2022): "Sycophancy in LLMs" | paper | Perez, Ethan, Ringer, Sam +61 | — | S | 5 |
| [2312.09390] Weak-to-Strong Generalization: Eliciting Strong Capabilities With Weak Supervision | paper | Collin Burns, Pavel Izmailov +10 | 2023-12-14 | S | 4 |
Page 1 of 14
Citing Pages (192)
AI Accident Risk Cruxes
Adversarial Training
Agent Foundations
Agentic AI
AGI Development
AI Acceleration Tradeoff Model
AI-Assisted Alignment
AI Control
AI-Augmented Forecasting
AI-Powered Investigation
AI Timelines
AI Alignment
Alignment Evaluations
Alignment Robustness Trajectory Model
Anthropic
Authentication Collapse
Authentication Collapse Timeline Model
Biosecurity Interventions (Overview)
Bridgewater AIA Labs
Center for AI Safety
Capabilities-to-Safety Pipeline Model
Capability-Alignment Race Model
Capability Elicitation
AI Capability Threshold Model
Capability Unlearning / Removal
Carlsmith's Six-Premise Argument
The Case Against AI Existential Risk
The Case For AI Existential Risk
Center for Human-Compatible AI
China AI Regulatory Framework
Cooperative IRL (CIRL)
Autonomous Coding
Collective Intelligence / Coordination
X Community Notes
AI Compounding Risks Analysis Model
AI-Driven Concentration of Power
Connor Leahy
Constitutional AI
AI Content Authentication
Controlled Vocabulary for Longtermist Analysis
Cooperative AI
AI Governance Coordination Technologies
Corrigibility
Corrigibility Failure
Corrigibility Failure Pathways
AI Risk Critical Uncertainties Model
AI-Induced Cyber Psychosis
Autonomous Cyber Attack Timeline
Dan Hendrycks
Dangerous Capability Evaluations
Daniela Amodei
Dario Amodei
AI Safety via Debate
Deceptive Alignment
Deceptive Alignment Decomposition Model
Deep Learning Revolution Era
Deepfakes
Google DeepMind
AI Safety Defense in Depth Model
Dense Transformers
AI Distributional Shift
AI Doomer Worldview
Is EA Biosecurity Work Limited to Restricting LLM Biological Use?
Early Warnings Era
AI Policy Effectiveness
Eliciting Latent Knowledge (ELK)
Emergent Capabilities
Epistemic Collapse
AI-Era Epistemic Infrastructure
AI-Era Epistemic Security
Epistemic Sycophancy
Epistemic Virtue Evals
Epoch AI
AI Evaluations
AI Evaluation
Evan Hubinger
FAR AI
AI Flash Dynamics
Formal Verification (AI Safety)
AI-Powered Fraud
Goal Misgeneralization
Goal Misgeneralization Probability Model
Goal Misgeneralization Research
Governance-Focused Worldview
AI Governance and Policy
Hardware Mechanisms for International AI Agreements
Heavy Scaffolding / Agentic Systems
AI-Human Hybrid Systems
Ilya Sutskever
Instrumental Convergence
Instrumental Convergence Framework
Interpretability
Is Interpretability Sufficient for Safety?
AI Safety Intervention Effectiveness Matrix
AI Safety Intervention Portfolio
AI-Induced Irreversibility
Is AI Existential Risk Real?
Jan Leike
Large Language Models
Large Language Models
AI-Driven Legal Evidence Crisis
Light Scaffolding
AI Value Lock-in
Long-Horizon Autonomous Tasks
Long-Timelines Technical Worldview
MAIM (Mutually Assured AI Malfunction)
Mesa-Optimization
Mesa-Optimization Risk Analysis
Meta AI (FAIR)
METR
Minimal Scaffolding
Machine Intelligence Research Institute
Third-Party Model Auditing
Model Organisms of Misalignment
Multi-Agent Safety
Multipolar Trap (AI Development)
Multipolar Trap Dynamics Model
NIST AI Risk Management Framework (AI RMF)
Open Source AI Safety
OpenAI
Optimistic Alignment Worldview
AI Output Filtering
Paul Christiano
Pause AI
Persuasion and Social Manipulation
Polymarket
Power-Seeking AI
Power-Seeking Emergence Conditions Model
Preference Optimization Methods
Probing / Linear Probes
Process Supervision
AI Proliferation
AI Proliferation Risk Model
Provable / Guaranteed Safe AI
AI Risk Public Education
Reasoning and Planning
Reducing Hallucinations in AI-Generated Wiki Content
Redwood Research
Refusal Training
Representation Engineering
AI Alignment Research Agendas
Reward Hacking
Reward Hacking Taxonomy and Severity Model
Reward Modeling
AI Risk Activation Timeline Model
AI Risk Interaction Network Model
RLHF
Safety-Capability Tradeoff Model
AI Safety Research Allocation Model
AI Safety Research Value Model
Sam Altman
AI Capability Sandbagging
Scalable Eval Approaches
Scalable Oversight
AI Scaling Laws
Scheming
Scheming & Deception Detection
Scheming Likelihood Assessment
Scientific Knowledge Corruption
Scientific Research Capabilities
SecureBio
SecureDNA
Self-Improvement and Recursive Enhancement
Seoul Declaration on AI Safety
Sharp Left Turn
Similar Projects to LongtermWiki: Research Report
Situational Awareness
Sleeper Agent Detection
Sleeper Agents: Training Deceptive LLMs
AI Safety Solution Cruxes
Sparse Autoencoders (SAEs)
State-Space Models / Mamba
AI Model Steganography
Structured Access / API-Only
Superintelligence
Sycophancy
Sycophancy Feedback Loop Model
AI Safety Technical Pathway Decomposition
Technical AI Safety Research
Compute Thresholds
Tool-Use Restrictions
Tool Use and Computer Use
Treacherous Turn
AI Trust Cascade Failure
Voluntary AI Safety Commitments
Weak-to-Strong Generalization
Why Alignment Might Be Easy
Why Alignment Might Be Hard
Wikipedia Views
AI Winner-Take-All Dynamics
X.com Platform Epistemics
Yoshua Bengio
Publication ID: arxiv