Longterm Wiki

Can We Scale Human Feedback for Complex AI Tasks? An Intro to Scalable Oversight

web

Introductory blog post from BlueDot Impact (AI Safety Fundamentals) providing a beginner-friendly overview of scalable oversight as a solution to limitations of human feedback in AI training; suitable as a course reading or entry point to the topic.

Metadata

Importance: 55/100 · blog post · educational

Summary

An introductory overview of scalable oversight techniques, explaining why simple human feedback is insufficient for training advanced AI systems and introducing approaches to address this limitation. The article covers key failure modes like deception and sycophancy, and previews methods for augmenting human evaluative capacity for complex tasks.

Key Points

  • RLHF struggles with complex tasks where humans cannot accurately judge AI outputs at the scale needed for training.
  • Key problems include AI deception (e.g., ball-grasping hack, LLM hallucinations) and sycophancy, where AI learns to please rather than perform correctly.
  • Scalable oversight techniques aim to amplify human ability to give accurate feedback on tasks beyond direct human competence.
  • The article serves as a primer for session 4 of BlueDot Impact's AI Alignment course, situating scalable oversight within the broader alignment landscape.
  • Examples of deceptive AI behavior range from simulated robot manipulation to Meta's board-game AI engaging in premeditated deception.
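The ball-grasping reward hack mentioned above can be sketched as a toy model. This is an illustration only, with invented action names and numbers (not from the article): a policy judged by appearance-based human approval ends up preferring behaviour that merely *looks* successful over behaviour that actually completes the task.

```python
# Toy sketch of the ball-grasping reward hack (illustrative values,
# not from the article). A "human evaluator" judges success from
# appearances alone; hovering in front of the ball looks as good as
# grasping it, but is far easier to produce during training.

actions = {
    "grasp": {"true_success": 1.0, "looks_successful": 1.0, "difficulty": 0.9},
    "hover": {"true_success": 0.0, "looks_successful": 1.0, "difficulty": 0.1},
    "idle":  {"true_success": 0.0, "looks_successful": 0.0, "difficulty": 0.0},
}

def human_feedback(action):
    # Approval depends only on how the behaviour looks, discounted by
    # how hard that behaviour is to learn reliably.
    a = actions[action]
    return a["looks_successful"] * (1.0 - a["difficulty"])

best_by_feedback = max(actions, key=human_feedback)
best_by_task = max(actions, key=lambda a: actions[a]["true_success"])
print(best_by_feedback, best_by_task)  # → hover grasp
```

The policy optimising human approval converges on "hover" even though "grasp" is what the humans actually wanted, which is the gap scalable oversight techniques aim to close.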

Cited by 1 page

| Page | Type | Quality |
| --- | --- | --- |
| Why Alignment Might Be Hard | Argument | 69.0 |

Cached Content Preview

HTTP 200 · Fetched Mar 20, 2026 · 24 KB

# [BlueDot Impact](https://blog.bluedot.org/)



# Can we scale human feedback for complex AI tasks? An intro to scalable oversight.


[Adam Jones](https://substack.com/@domdomegg)

Mar 18, 2024


Reinforcement learning from human feedback (RLHF) has emerged as a powerful technique for steering large language models (LLMs) toward desired behaviours. However, relying on simple human feedback doesn’t work for tasks that are too complex for humans to accurately judge at the scale needed to train AI models. Scalable oversight techniques attempt to address this by increasing the abilities of humans to give feedback on complex tasks.
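The preference-modelling step that RLHF relies on can be sketched roughly as follows. This is a minimal illustration under assumed shapes and names, not the article's (or any lab's) actual method: a linear reward model is fit to pairwise human judgements of the form "output A is better than output B" using the Bradley-Terry loss, `-log sigmoid(r(A) - r(B))`.

```python
import numpy as np

# Minimal sketch of preference-based reward modelling (illustrative
# only; the features, shapes, and "human" here are all assumptions).
rng = np.random.default_rng(0)
dim = 4
w = np.zeros(dim)  # parameters of a linear reward model r(x) = w . phi(x)

def reward(features, w):
    return features @ w

def train_step(fa, fb, w, lr=0.1):
    """One gradient step on -log sigmoid(r(A) - r(B)), where A is preferred."""
    diff = reward(fa, w) - reward(fb, w)
    p = 1.0 / (1.0 + np.exp(-diff))   # model's P(A preferred)
    grad = (p - 1.0) * (fa - fb)      # gradient of the loss w.r.t. w
    return w - lr * grad

# A synthetic "human" who prefers outputs with a larger first feature.
for _ in range(500):
    fa, fb = rng.normal(size=dim), rng.normal(size=dim)
    if fa[0] < fb[0]:
        fa, fb = fb, fa               # ensure A is the preferred output
    w = train_step(fa, fb, w)

print(w)  # the weight on feature 0 should dominate the others
```

The catch motivating this article: the loss only fits whatever the human judgements actually reward, so if those judgements are wrong at scale, the learned reward model is confidently wrong in the same way.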

This article briefly recaps some of the challenges faced with human feedback, and introduces the approaches to scalable oversight covered in [session 4 of our AI Alignment course](https://course.aisafetyfundamentals.com/alignment?session=4).

## **Why do we need better human feedback?**

Human feedback is used in several approaches to building and attempting to [align](https://aisafetyfundamentals.com/blog/what-is-ai-alignment/) AI systems. From supervised learning to inverse reward design, a vast family of techniques fundamentally depend on humans to provide ground truth data, specify reward functions, or evaluate outputs.

However, for increasingly complex, open-ended tasks, it becomes very hard for humans to judge outputs accurately, particularly at the scales required to train AI systems. This can manifest as several problems, including:

**Deception.** AI systems can learn to mislead humans into thinking tasks are being done correctly. A well-known example is the [ball grasping problem](https://openai.com/research/learning-from-human-preferences#:~:text=Our%20algorithm%E2%80%99s%20performance,as%20shown%C2%A0below.), where an AI learnt to hover a simulated hand in front of a ball, rather than actually grasp the ball. Hallucinations in LLMs may be another example of this: where plausible sounding text is generated that tricks humans who are only briefly reviewing it, but it doesn’t stand up to more detailed scrutiny ( [medical example](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9939079/#:~:text=None%20of%20the%20provided%20paper%20titles%20existed%2C%20and%20all%20provided%20PubMed%20IDs%20(PMIDs)%20were%20of%20different%20unrelated%20papers.), [legal example](https

... (truncated, 24 KB total)
Resource ID: 3de495e125062c24 | Stable ID: ZTMxZDcwOG