ProgressGym project (NeurIPS 2024)
Credibility Rating
Gold standard. Rigorous peer review, high editorial standards, and strong institutional reputation.
Rating inherited from publication venue: NeurIPS
Directly relevant to concerns about value lock-in and point-of-no-return scenarios in AI development; provides empirical tools for studying whether AI alignment methods can accommodate moral progress rather than freezing current human values.
Metadata
Summary
ProgressGym introduces a benchmark and framework for studying 'progress alignment'—ensuring AI systems can track and adapt to ongoing human moral progress rather than locking in current values. The project uses historical moral data spanning a millennium to train and evaluate models on their ability to learn from moral evolution over time, addressing risks of value lock-in at a premature point in humanity's ethical development.
Key Points
- Introduces the problem of 'progress alignment': aligning AI not just to current values but to humanity's ongoing capacity for moral growth and revision
- Provides a benchmark using historical moral and social data spanning ~1000 years to test whether models can learn from moral progress trajectories
- Addresses the risk that powerful AI systems could entrench present-day moral blind spots, creating a 'point of no return' for value lock-in
- Evaluates multiple alignment approaches on their ability to extrapolate moral progress rather than simply reflect current human judgments
- Released as a NeurIPS 2024 Datasets and Benchmarks Track paper, providing an open research resource for studying value dynamics
Cited by 1 page
| Page | Type | Quality |
|---|---|---|
| AI-Induced Irreversibility | Risk | 64.0 |
Cached Content Preview
ProgressGym: Alignment with a Millennium of
Moral Progress
Tianyi Qiu¹∗† · Yang Zhang¹∗ · Xuchuan Huang¹ · Jasmine Xinze Li² · Jiaming Ji¹,³ · Yaodong Yang¹,³‡
¹Peking University  ²Cornell University  ³Institute for AI, Peking University
Abstract
Frontier AI systems, including large language models (LLMs), hold increasing influence over the epistemology of human users. Such influence can reinforce prevailing societal values, potentially contributing to the lock-in of misguided moral beliefs and, consequently, the perpetuation of problematic moral practices on a broad scale. We introduce progress alignment as a technical solution to mitigate this imminent risk. Progress alignment algorithms learn to emulate the mechanics of human moral progress, thereby addressing the susceptibility of existing alignment methods to contemporary moral blindspots. To empower research in progress alignment, we introduce ProgressGym,[4] an experimental framework allowing the learning of moral progress mechanics from history, in order to facilitate future progress in real-world moral decisions. Leveraging 9 centuries of historical text and 18 historical LLMs,[5] ProgressGym enables codification of real-world progress alignment challenges into concrete benchmarks. Specifically, we introduce three core challenges: tracking evolving values (PG-Follow), preemptively anticipating moral progress (PG-Predict), and regulating the feedback loop between human and AI value shifts (PG-Coevolve). Alignment methods without a temporal dimension are inapplicable to these tasks. In response, we present lifelong and extrapolative algorithms as baseline methods of progress alignment, and build an open leaderboard[6] soliciting novel algorithms and challenges.
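The extrapolative baselines mentioned in the abstract can be illustrated with a minimal sketch. This is not the paper's actual PG-Predict method; it simply assumes hypothetical per-century "value vectors" (e.g., pooled embeddings of era-specific text) and predicts the next century's vector with a least-squares linear trend per dimension:

```python
# Illustrative sketch only: hypothetical value vectors, linear extrapolation.
import numpy as np

def extrapolate_values(history: np.ndarray) -> np.ndarray:
    """history: (T, d) array of value vectors for time steps 0..T-1.
    Returns a predicted vector for step T via a per-dimension linear fit."""
    T, _ = history.shape
    t = np.arange(T)
    # polyfit with 2-D y fits each column independently; returns (2, d):
    # row 0 holds slopes, row 1 holds intercepts.
    coeffs = np.polyfit(t, history, deg=1)
    return coeffs[0] * T + coeffs[1]  # evaluate the trend at t = T

# Toy example: two value dimensions drifting steadily over 9 "centuries".
hist = np.stack([np.linspace(0.1, 0.9, 9), np.linspace(0.8, 0.4, 9)], axis=1)
pred = extrapolate_values(hist)  # ≈ [1.0, 0.35]
```

A real progress-alignment baseline would operate on model parameters or preference data rather than raw trend lines, but the core idea is the same: exploit the temporal dimension instead of fitting only the most recent snapshot of values.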
1 Introduction
Due to their increasingly widespread deployment, frontier AI systems are exerting profound influences over human beliefs and values. For instance, large language models (LLMs) have recently assumed roles as personal assistants [1], romantic partners [2], Internet authors [3], and K-12 educators [4], roles of significant influence over human epistemology. Given studies demonstrating that interactions with opinionated LLMs markedly alter users' beliefs [5], it follows that the values represented in AI systems could be reinforced in human users on a societal scale [6].
∗ Equal technical contribution.
† Project lead.
‡ Corresponding author. Email: yaodong.yang@pku.edu.cn
[4] ProgressGym is open-source and available at https://github.com/PKU-Alignment/ProgressGym.
[5] Datasets and models are available as a Huggingface collection.
[6] Accessible at https://huggingface.co/spaces/PKU-Alignment/ProgressGym-LeaderBoard.
38th Conference on Neural Information Processing Systems (NeurIPS 2024) Track on Datasets and Benchmarks.
LLMs and other frontier AI systems are trained on massive amounts of human-generated data, including Internet text and images [7] and human preference annotations [8]. This data often reflects con
... (truncated, 98 KB total)