Longterm Wiki

Science of Deep Learning More Tractably Addresses the Sharp Left Turn Than Agent Foundations

blog

Author

NickGabs

Credibility Rating

3/5 (Good)

Good quality. Reputable source with community review or editorial standards, but less rigorous than peer-reviewed venues.

Rating inherited from publication venue: LessWrong

This post contributes to the debate over which research paradigms are best suited to handle discontinuous capability jumps, comparing the tractability-robustness tradeoffs of Agent Foundations versus the emerging Science of Deep Learning approach.

Forum Post Details

Karma
22
Comments
2
Forum
lesswrong
Forum Tags
AI

Metadata

Importance: 52/100 · blog post · commentary

Summary

This post argues that the Science of Deep Learning (SDL) is better positioned than Agent Foundations to address the 'sharp left turn' alignment threat, because SDL makes the tractable assumption that future AGI will be neural-network-based. By uncovering general principles of how neural networks work, SDL can achieve both robustness and tractability, whereas Agent Foundations' minimal assumptions make it hard to produce actionable insights.

Key Points

  • The 'sharp left turn' threat model posits that empirical alignment techniques may fail to generalize as AI capabilities rapidly advance.
  • Agent Foundations is robust but struggles with tractability because it makes minimal assumptions about the nature of future AI systems.
  • SDL assumes future AGI will be built on neural networks, making it more tractable while still being broadly applicable.
  • SDL seeks general principles about neural network behavior that could generalize even when specific fine-tuning or RLHF techniques do not.
  • The post frames SDL as a middle path between purely empirical alignment work and highly abstract agent-theoretic approaches.

Cited by 1 page

Page                 Type      Quality
Agent Foundations    Approach  59.0

Cached Content Preview

HTTP 200 · Fetched Mar 15, 2026 · 12 KB

 Science of Deep Learning more tractably addresses the Sharp Left Turn than Agent Foundations 

by NickGabs · 19th Sep 2023 · 7 min read

 Summary

Lots of agent foundations research is motivated by the idea that alignment techniques found by empirical trial-and-error will fail to generalize to future systems. While such threat models are plausible, agent foundations researchers have largely failed to make progress on addressing them because they tend to make very few assumptions about the gears-level properties of future AI systems. In contrast, the emerging field of "Science of Deep Learning" (SDL) assumes that the systems in question are neural networks, and aims to empirically uncover general principles about how they work. Assuming such principles exist and that neural nets will lead to AGI, insights from SDL will generalize to future systems even if specific alignment techniques found by trial-and-error do not. As a result, SDL can address the threat models that motivate agent foundations research while being much more tractable due to making more assumptions about future AI systems, though it cannot help in the construction of an ideal utility function for future AI systems to maximize.

Thanks to Alexandra Bates for discussion and feedback.

 Introduction

Generally speaking, the "agent foundations" line of research in alignment is motivated by a particular class of threat models: that alignment techniques generated through empirical trial and error, like RLHF or Constitutional AI, will suddenly break down as AI capabilities advance, leading to catastrophe. For example, Nate Soares has argued that AI systems will undergo a "sharp left turn," at which point the "shallow" motivational heuristics that caused them to be aligned at lower capability levels will cease to function. Eliezer Yudkowsky seems to endorse a similar threat model in AGI Ruin, arguing that rapid capabilities increases will cause models to go out of distribution and become misaligned.

I'm much more optimistic about finding robust alignment methods through trial and error than these researchers, but I'm not here to debate that. While I think sharp left turn style failure modes are relatively unlikely, they do not seem totally implausible. Moreover, because doom is much more likely in worlds where alignment techniques generated through trial and error fail to generalize, it makes sense to focus on those worlds when doing alignment research. Instead, I want to argue that the emerging field of the "science of deep learning" or SDL is more promising for dealing with these threat models than agent foundations.

 A Central Tradeoff in Alignment Research

AI alignment research agendas face a central tradeoff: robustness versus tractability.

... (truncated, 12 KB total)
Resource ID: d0d2b99c32eb6001 | Stable ID: YWRlY2NjZG