Longterm Wiki

Scalable Oversight in AI: Beyond Human Supervision

blog

Credibility Rating

2/5 (Mixed)

Mixed quality. Some useful content but inconsistent editorial standards. Claims should be verified.

Rating inherited from publication venue: Medium

A practitioner blog post offering an accessible, experience-grounded introduction to scalable oversight; useful as an illustrative example of why the problem matters but limited in technical depth compared to primary research sources.

Metadata

Importance: 38/100 · blog post · commentary

Summary

A practitioner's perspective on scalable oversight, grounded in firsthand experience at Amazon Alexa where ASR models surpassed human transcription accuracy, making traditional evaluation unreliable. The post defines scalable oversight and introduces approaches like Constitutional AI and RLAIF as potential solutions for evaluating and improving superhuman AI systems.

Key Points

  • At Amazon Alexa (~2022), ASR models achieved WER below 3%, surpassing human transcription accuracy and making human-labeled ground truth unreliable for further evaluation.
  • Scalable oversight is defined as systematic methodologies to monitor, evaluate, and improve AI systems that exceed human performance in specific domains.
  • Constitutional AI and RLAIF replace human preference labels with model-generated feedback, typically a judge model evaluating outputs against written principles, offering a path forward when human judgment is no longer a reliable benchmark (see the sketch after this list).
  • The challenge is not unique to speech recognition—similar inflection points are emerging across narrow AI tasks in various domains.
  • OpenAI's superalignment initiative and Anthropic's scalable oversight research represent institutional efforts to address this fundamental alignment challenge.
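
To make the AI-feedback idea in the third point concrete, here is a minimal, hypothetical sketch of a judge model choosing between two candidate responses under a short constitution. `call_judge_model`, the principles, and the prompt format are illustrative assumptions, not details from the post or from any specific library.

```python
# Minimal sketch of Constitutional AI / RLAIF-style preference collection:
# a judge model scores two responses against written principles, and the
# preferred response is kept as training signal for the weaker model.

CONSTITUTION = [
    "Prefer the response that is more accurate.",
    "Prefer the response that is more helpful and harmless.",
]

def call_judge_model(prompt: str) -> str:
    # Placeholder: in practice this would query a capable LLM and return "A" or "B".
    return "A"

def pick_preferred(question: str, response_a: str, response_b: str) -> str:
    principles = "\n".join(f"- {p}" for p in CONSTITUTION)
    prompt = (
        f"Principles:\n{principles}\n\n"
        f"Question: {question}\n"
        f"Response A: {response_a}\n"
        f"Response B: {response_b}\n"
        "Which response better follows the principles? Answer 'A' or 'B'."
    )
    verdict = call_judge_model(prompt).strip().upper()
    return response_a if verdict.startswith("A") else response_b

# The preferred responses form the preference dataset used for RLAIF-style fine-tuning.
preferred = pick_preferred("What causes tides?", "Gravity from the Moon and Sun.", "Wind.")
```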

Cited by 1 page

Page                 Type            Quality
Scalable Oversight   Research Area   68.0

Cached Content Preview

HTTP 200 · Fetched Mar 15, 2026 · 8 KB

 Scalable Oversight in AI: Beyond Human Supervision

 Deepak Babu Piskala · 5 min read · May 20, 2024


 [Image: OpenAI's superalignment]

 In recent years, the field of artificial intelligence has experienced unprecedented growth. Two driving forces underpin this advancement: the availability of colossal datasets, exemplified by the Llama 3 model trained on 15 trillion tokens, and the emergence of models reported to have trillions of parameters, such as GPT-4. Historically, human supervision has been pivotal in training and evaluating AI models. However, as these models achieve superhuman accuracy in specific domains, we confront a fundamental challenge: how do we evaluate and improve systems that outperform humans? Furthermore, how do we train models that exceed human accuracy?

 The Inflection Point: A Personal Account — ASR

 During my tenure from 2019 to 2024 as a senior scientist working on speech recognition for a leading voice assistant, Alexa, I witnessed remarkable advancements. The improvements in ASR technology can be summarized as three key developments: (i) the gradual replacement of traditional acoustic models with deep neural networks (DNNs), (ii) the shift from n-gram to subword neural language models, and (iii) the integration of acoustic and language models into a unified end-to-end model trained with a minimum word error rate (mWER) objective. Word error rates (WER) plummeted from around 15% to below 3% on test sets of interest during this period. However, sometime around 2022, we encountered an unexpected inflection point. Despite implementing significant upgrades to our speech models, our evaluation metrics began to regress. Metrics tracked during our Weekly Business Reviews (WBRs) and Monthly Business Reviews (MBRs) became erratic and inconsistent.
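
 For readers unfamiliar with the metric, here is a minimal sketch of how WER is computed as word-level edit distance normalized by reference length; the example sentences are illustrative and not from the post.

```python
# Word error rate (WER) = (substitutions + deletions + insertions) / reference length,
# computed here with a plain dynamic-programming Levenshtein distance over words.

def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[len(ref)][len(hyp)] / max(len(ref), 1)

print(wer("play the latest news", "play the late news"))  # 0.25: one substitution over four words
```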

 A particular incident highlighted this challenge. After rolling out a major update to our ASR system, our internal testing showed inexplicable metric fluctuations. We meticulously reviewed the updates and found no immediate issues. We started looking at disagreements between machine and human transcriptions and closely analyzing the sources of error. It was then that we realized our models had achieved such high accuracy that human transcriptions, traditionally our gold standard, were no longer reliable. The human transcribers' word error rate was comparable to, and at times exceeded, that of the best ASR model in production. This revelation underscored a critical dilemma: how do we further enhance and evaluate models that have surpassed human transcription capabilities? Our models had not just become noise-robust; they had also built a deep understanding of entities that even expert humans would struggle with.
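
 As a rough illustration of the disagreement analysis described above, the sketch below flags utterances where the human transcription and the ASR hypothesis differ so they can be re-adjudicated by experts; the record format and example utterances are assumptions for illustration, not data from the post.

```python
# Surface utterances where the production ASR output and the human transcription disagree.
# Either side may be the one in error, so mismatches are queued for expert re-adjudication.

utterances = [
    {"id": "utt-001", "human": "set a timer for ten minutes", "asr": "set a timer for ten minutes"},
    {"id": "utt-002", "human": "play songs by the weekend", "asr": "play songs by the weeknd"},
]

def needs_adjudication(record: dict) -> bool:
    # Any word-level mismatch is a candidate; in practice one might rank by per-utterance WER.
    return record["human"].lower().split() != record["asr"].lower().split()

for rec in utterances:
    if needs_adjudication(rec):
        print(f"{rec['id']}: human='{rec['human']}' vs asr='{rec['asr']}'")
```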

 I am sure there are similar stories emerging from narrow tasks in dif

... (truncated, 8 KB total)
Resource ID: 7302eddb7b84605e | Stable ID: MjMwNzRkMG