Longterm Wiki

arXiv

Preprint Server · Good (3)

Open-access preprint server for STEM fields

Credibility Rating

3/5 · Good (3)
Good quality. Reputable source with community review or editorial standards, but less rigorous than peer-reviewed venues.
344 resources
192 citing pages
1 tracked domain

Tracked Domains

arxiv.org

Resources (344)

Risks from Learned Optimization · paper · Evan Hubinger, Chris van Merwijk +3 · 2019-06-05 · S17
Anthropic: "Discovering Sycophancy in Language Models" · paper · Mrinank Sharma, Meg Tong +17 · 2025-01-01 · S10
Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training · paper · Evan Hubinger, Carson Denison +37 · 2024-01-10 · S10
Debate as Scalable Oversight · paper · Geoffrey Irving, Paul Christiano +1 · 2018-05-02 · S9
Training Language Models to Follow Instructions with Human Feedback · paper · Long Ouyang, Jeff Wu +18 · 2022-03-04 · S8
Kaplan et al. (2020) · paper · Jared Kaplan, Sam McCandlish +8 · 2020-01-23 · S8
Concrete Problems in AI Safety · paper · Dario Amodei, Chris Olah +4 · 2016-06-21 · S8
AI Control Framework · paper · Ryan Greenblatt, Buck Shlegeris +2 · 2023-12-12 · S7
Gaming RLHF evaluation · paper · Richard Ngo, Lawrence Chan +1 · 2022-08-30 · S7
OpenAI's GPT-4 · paper · OpenAI, Josh Achiam +279 · 2023-03-15 · S6
Hoffmann et al. (2022) · paper · Jordan Hoffmann, Sebastian Borgeaud +20 · 2022-03-29 · S6
Turner et al. formal results · paper · Alexander Matt Turner, Logan Smith +3 · 2019-12-03 · S6
AI Alignment: A Comprehensive Survey · paper · Jiaming Ji, Tianyi Qiu +24 · 2026-01-29 · S6
Kenton et al. (2021) · paper · Stephanie Lin, Jacob Hilton +1 · 2021-09-08 · S6
Hadfield-Menell et al. (2017) · paper · Dylan Hadfield-Menell, Anca Dragan +2 · 2016-11-24 · S5
Langosco et al. (2022) · paper · Lauro Langosco, Jack Koch +4 · 2021-05-28 · S5
Brown et al. (2020) · paper · Tom B. Brown, Benjamin Mann +29 · 2020-05-28 · S5
Emergent Abilities · paper · Jason Wei, Yi Tay +14 · 2022-06-15 · S5
Representation Engineering: A Top-Down Approach to AI Transparency · paper · Andy Zou, Long Phan +19 · 2023-10-02 · S5
Constitutional AI: Harmlessness from AI Feedback · paper · Yanuo Zhou · 2025-01-01 · S5
Is Power-Seeking AI an Existential Risk? · paper · Joseph Carlsmith · 2022-06-16 · S5
Chain-of-thought analysis · paper · Jason Wei, Xuezhi Wang +7 · 2022-01-28 · S5
Sparse Autoencoders · paper · Leonard Bereska, Efstratios Gavves · 2024-04-22 · S5
Perez et al. (2022): "Sycophancy in LLMs" · paper · Ethan Perez, Sam Ringer +61 · (no date) · S5
[2312.09390] Weak-to-Strong Generalization: Eliciting Strong Capabilities With Weak Supervision · paper · Collin Burns, Pavel Izmailov +10 · 2023-12-14 · S4
Page 1 of 14

Citing Pages (192)

AI Accident Risk Cruxes
Adversarial Training
Agent Foundations
Agentic AI
AGI Development
AI Acceleration Tradeoff Model
AI-Assisted Alignment
AI Control
AI-Augmented Forecasting
AI-Powered Investigation
AI Timelines
AI Alignment
Alignment Evaluations
Alignment Robustness Trajectory Model
Anthropic
Authentication Collapse
Authentication Collapse Timeline Model
Biosecurity Interventions (Overview)
Bridgewater AIA Labs
Center for AI Safety
Capabilities-to-Safety Pipeline Model
Capability-Alignment Race Model
Capability Elicitation
AI Capability Threshold Model
Capability Unlearning / Removal
Carlsmith's Six-Premise Argument
The Case Against AI Existential Risk
The Case For AI Existential Risk
Center for Human-Compatible AI
China AI Regulatory Framework
Cooperative IRL (CIRL)
Autonomous Coding
Collective Intelligence / Coordination
X Community Notes
AI Compounding Risks Analysis Model
AI-Driven Concentration of Power
Connor Leahy
Constitutional AI
AI Content Authentication
Controlled Vocabulary for Longtermist Analysis
Cooperative AI
AI Governance Coordination Technologies
Corrigibility
Corrigibility Failure
Corrigibility Failure Pathways
AI Risk Critical Uncertainties Model
AI-Induced Cyber Psychosis
Autonomous Cyber Attack Timeline
Dan Hendrycks
Dangerous Capability Evaluations
Daniela Amodei
Dario Amodei
AI Safety via Debate
Deceptive Alignment
Deceptive Alignment Decomposition Model
Deep Learning Revolution Era
Deepfakes
Google DeepMind
AI Safety Defense in Depth Model
Dense Transformers
AI Distributional Shift
AI Doomer Worldview
Is EA Biosecurity Work Limited to Restricting LLM Biological Use?
Early Warnings Era
AI Policy Effectiveness
Eliciting Latent Knowledge (ELK)
Emergent Capabilities
Epistemic Collapse
AI-Era Epistemic Infrastructure
AI-Era Epistemic Security
Epistemic Sycophancy
Epistemic Virtue Evals
Epoch AI
AI Evaluations
AI Evaluation
Evan Hubinger
FAR AI
AI Flash Dynamics
Formal Verification (AI Safety)
AI-Powered Fraud
Goal Misgeneralization
Goal Misgeneralization Probability Model
Goal Misgeneralization Research
Governance-Focused Worldview
AI Governance and Policy
Hardware Mechanisms for International AI Agreements
Heavy Scaffolding / Agentic Systems
AI-Human Hybrid Systems
Ilya Sutskever
Instrumental Convergence
Instrumental Convergence Framework
Interpretability
Is Interpretability Sufficient for Safety?
AI Safety Intervention Effectiveness Matrix
AI Safety Intervention Portfolio
AI-Induced Irreversibility
Is AI Existential Risk Real?
Jan Leike
Large Language Models
Large Language Models
AI-Driven Legal Evidence Crisis
Light Scaffolding
AI Value Lock-in
Long-Horizon Autonomous Tasks
Long-Timelines Technical Worldview
MAIM (Mutually Assured AI Malfunction)
Mesa-Optimization
Mesa-Optimization Risk Analysis
Meta AI (FAIR)
METR
Minimal Scaffolding
Machine Intelligence Research Institute
Third-Party Model Auditing
Model Organisms of Misalignment
Multi-Agent Safety
Multipolar Trap (AI Development)
Multipolar Trap Dynamics Model
NIST AI Risk Management Framework (AI RMF)
Open Source AI Safety
OpenAI
Optimistic Alignment Worldview
AI Output Filtering
Paul Christiano
Pause AI
Persuasion and Social Manipulation
Polymarket
Power-Seeking AI
Power-Seeking Emergence Conditions Model
Preference Optimization Methods
Probing / Linear Probes
Process Supervision
AI Proliferation
AI Proliferation Risk Model
Provable / Guaranteed Safe AI
AI Risk Public Education
Reasoning and Planning
Reducing Hallucinations in AI-Generated Wiki Content
Redwood Research
Refusal Training
Representation Engineering
AI Alignment Research Agendas
Reward Hacking
Reward Hacking Taxonomy and Severity Model
Reward Modeling
AI Risk Activation Timeline Model
AI Risk Interaction Network Model
RLHF
Safety-Capability Tradeoff Model
AI Safety Research Allocation Model
AI Safety Research Value Model
Sam Altman
AI Capability Sandbagging
Scalable Eval Approaches
Scalable Oversight
AI Scaling Laws
Scheming
Scheming & Deception Detection
Scheming Likelihood Assessment
Scientific Knowledge Corruption
Scientific Research Capabilities
SecureBio
SecureDNA
Self-Improvement and Recursive Enhancement
Seoul Declaration on AI Safety
Sharp Left Turn
Similar Projects to LongtermWiki: Research Report
Situational Awareness
Sleeper Agent Detection
Sleeper Agents: Training Deceptive LLMs
AI Safety Solution Cruxes
Sparse Autoencoders (SAEs)
State-Space Models / Mamba
AI Model Steganography
Structured Access / API-Only
Superintelligence
Sycophancy
Sycophancy Feedback Loop Model
AI Safety Technical Pathway Decomposition
Technical AI Safety Research
Compute Thresholds
Tool-Use Restrictions
Tool Use and Computer Use
Treacherous Turn
AI Trust Cascade Failure
Voluntary AI Safety Commitments
Weak-to-Strong Generalization
Why Alignment Might Be Easy
Why Alignment Might Be Hard
Wikipedia Views
AI Winner-Take-All Dynamics
X.com Platform Epistemics
Yoshua Bengio

Publication ID: arxiv