OpenAI WebGPT behavior
Credibility Rating
Good quality. Reputable source with community review or editorial standards, but less rigorous than peer-reviewed venues.
Rating inherited from publication venue: arXiv
OpenAI's paper on fine-tuning GPT-3 with reinforcement learning from human feedback for web-browsing question-answering, demonstrating practical approaches to AI alignment through imitation learning and human oversight.
Paper Details
Metadata
Abstract
We fine-tune GPT-3 to answer long-form questions using a text-based web-browsing environment, which allows the model to search and navigate the web. By setting up the task so that it can be performed by humans, we are able to train models on the task using imitation learning, and then optimize answer quality with human feedback. To make human evaluation of factual accuracy easier, models must collect references while browsing in support of their answers. We train and evaluate our models on ELI5, a dataset of questions asked by Reddit users. Our best model is obtained by fine-tuning GPT-3 using behavior cloning, and then performing rejection sampling against a reward model trained to predict human preferences. This model's answers are preferred by humans 56% of the time to those of our human demonstrators, and 69% of the time to the highest-voted answer from Reddit.
Summary
OpenAI fine-tuned GPT-3 to answer long-form questions by enabling it to search and browse the web in a text-based environment. The model was trained using imitation learning from human demonstrations on the ELI5 dataset, then optimized using reinforcement learning from human feedback. The approach requires models to collect references while browsing to support their answers and improve factual accuracy evaluation. The resulting model outperformed both human demonstrators (56% preference) and top Reddit answers (69% preference), demonstrating the effectiveness of combining behavior cloning with reward model-based optimization.
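The "rejection sampling against a reward model" step described above is best-of-n selection: sample several candidate answers and keep the one the reward model scores highest. A minimal sketch, with hypothetical `generate_answer` and `reward_model` callables standing in for the paper's actual models:

```python
def best_of_n(question, generate_answer, reward_model, n=4):
    """Best-of-n rejection sampling: draw n candidate answers and
    return the one the reward model prefers. `generate_answer` and
    `reward_model` are placeholders, not OpenAI's actual interfaces."""
    candidates = [generate_answer(question) for _ in range(n)]
    return max(candidates, key=lambda answer: reward_model(question, answer))
```

In the paper, larger n improves answers at the cost of proportionally more inference compute, since n full answers are generated per question.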
Cited by 1 page
| Page | Type | Quality |
|---|---|---|
| Large Language Models | Capability | 60.0 |
Cached Content Preview
# WebGPT: Browser-assisted question-answering with human feedback
Reiichiro Nakano\*, Jacob Hilton\*, Suchir Balaji\*, Jeff Wu, Long Ouyang, Christina Kim, Christopher Hesse, Shantanu Jain, Vineet Kosaraju, William Saunders, Xu Jiang, Karl Cobbe, Tyna Eloundou, Gretchen Krueger, Kevin Button, Matthew Knight, Benjamin Chess, John Schulman

OpenAI. \*Equal contribution, order randomized. Correspondence to: [reiichiro@openai.com](mailto:reiichiro@openai.com), [jhilton@openai.com](mailto:jhilton@openai.com), [suchir@openai.com](mailto:suchir@openai.com), [joschu@openai.com](mailto:joschu@openai.com)
###### Abstract
We fine-tune GPT-3 to answer long-form questions using a text-based web-browsing environment, which allows the model to search and navigate the web. By setting up the task so that it can be performed by humans, we are able to train models on the task using imitation learning, and then optimize answer quality with human feedback. To make human evaluation of factual accuracy easier, models must collect references while browsing in support of their answers. We train and evaluate our models on ELI5, a dataset of questions asked by Reddit users. Our best model is obtained by fine-tuning GPT-3 using behavior cloning, and then performing rejection sampling against a reward model trained to predict human preferences. This model’s answers are preferred by humans 56% of the time to those of our human demonstrators, and 69% of the time to the highest-voted answer from Reddit.
## 1 Introduction
A rising challenge in NLP is long-form question-answering (LFQA), in which a paragraph-length answer is generated in response to an open-ended question. LFQA systems have the potential to become one of the main ways people learn about the world, but currently lag behind human performance (Krishna et al., [2021](https://ar5iv.labs.arxiv.org/html/2112.09332#bib.bib18 "")). Existing work tends to focus on two core components of the task, information retrieval and synthesis.
In this work we leverage existing solutions to these components: we outsource document retrieval to the [Microsoft Bing Web Search API](https://www.microsoft.com/en-us/bing/apis/bing-web-search-api), and utilize unsupervised pre-training to achieve high-quality synthesis by fine-tuning GPT-3 (Brown et al., [2020](https://ar5iv.labs.arxiv.org/html/2112.09332#bib.bib4 "")). Instead of trying to improve these ingredients, we focus on combining them using more faithful training objectives. Following Stiennon et al. ( [2020](https://ar5iv.labs.arxiv.org/html/2112.09332#bib.bib29 "")), we use human feedback to directly optimize answer quality, allowing us to achieve performance competitive with humans.
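Outsourcing retrieval means the system only needs to issue queries and parse results. A minimal stdlib-only sketch of querying Bing's published v7.0 REST endpoint is below; the subscription key, query, and `extract_urls` helper are illustrative assumptions, not code from the paper:

```python
import json
import urllib.parse
import urllib.request


def extract_urls(search_json):
    """Pull result URLs out of a Bing v7.0 /search JSON response.
    Returns [] when the response carries no web results."""
    return [page["url"] for page in search_json.get("webPages", {}).get("value", [])]


def bing_search(query, api_key, count=5):
    """Hypothetical helper: call the Bing Web Search API and return result URLs.
    Requires a valid subscription key from Azure."""
    url = "https://api.bing.microsoft.com/v7.0/search?" + urllib.parse.urlencode(
        {"q": query, "count": count}
    )
    request = urllib.request.Request(
        url, headers={"Ocp-Apim-Subscription-Key": api_key}
    )
    with urllib.request.urlopen(request) as response:
        return extract_urls(json.load(response))
```

The model never sees raw HTML; WebGPT's environment renders pages to text, so a retrieval layer like this only supplies the candidate documents.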
We make two key contributions:
- We create a text-based web-browsing environment that a fine-tuned language model can interact w
... (truncated, 98 KB total)