OpenAI WebGPT behavior
Credibility Rating
Good quality. Reputable source with community review or editorial standards, but less rigorous than peer-reviewed venues.
Rating inherited from publication venue: arXiv
OpenAI's paper on fine-tuning GPT-3 with reinforcement learning from human feedback for web-browsing question-answering, demonstrating practical approaches to AI alignment through imitation learning and human oversight.
Paper Details
Metadata
Abstract
We fine-tune GPT-3 to answer long-form questions using a text-based web-browsing environment, which allows the model to search and navigate the web. By setting up the task so that it can be performed by humans, we are able to train models on the task using imitation learning, and then optimize answer quality with human feedback. To make human evaluation of factual accuracy easier, models must collect references while browsing in support of their answers. We train and evaluate our models on ELI5, a dataset of questions asked by Reddit users. Our best model is obtained by fine-tuning GPT-3 using behavior cloning, and then performing rejection sampling against a reward model trained to predict human preferences. This model's answers are preferred by humans 56% of the time to those of our human demonstrators, and 69% of the time to the highest-voted answer from Reddit.
Summary
OpenAI fine-tuned GPT-3 to answer long-form questions by enabling it to search and browse the web in a text-based environment. The model was trained using imitation learning from human demonstrations on the ELI5 dataset, then optimized using reinforcement learning from human feedback. The approach requires models to collect references while browsing to support their answers and improve factual accuracy evaluation. The resulting model outperformed both human demonstrators (56% preference) and top Reddit answers (69% preference), demonstrating the effectiveness of combining behavior cloning with reward model-based optimization.
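The "rejection sampling against a reward model" step described above is best-of-n selection: sample several candidate answers and keep the one the reward model scores highest. A minimal sketch, with hypothetical `generate_answer` and `reward_model` callables standing in for the paper's actual models:

```python
def best_of_n(question, generate_answer, reward_model, n=4):
    """Best-of-n rejection sampling: draw n candidate answers and
    return the one the reward model prefers. `generate_answer` and
    `reward_model` are placeholders, not OpenAI's actual interfaces."""
    candidates = [generate_answer(question) for _ in range(n)]
    return max(candidates, key=lambda answer: reward_model(question, answer))
```

In the paper, larger n improves answers at the cost of proportionally more inference compute, since n full answers are generated per question.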
Cited by 1 page
| Page | Type | Quality |
|---|---|---|
| Large Language Models | Capability | 60.0 |
Cached Content Preview
# WebGPT: Browser-assisted question-answering with human feedback
Reiichiro Nakano\*, Jacob Hilton\*, Suchir Balaji\*, Jeff Wu, Long Ouyang, Christina Kim, Christopher Hesse, Shantanu Jain, Vineet Kosaraju, William Saunders, Xu Jiang, Karl Cobbe, Tyna Eloundou, Gretchen Krueger, Kevin Button, Matthew Knight, Benjamin Chess, John Schulman

OpenAI. \*Equal contribution, order randomized. Correspondence to: [reiichiro@openai.com](mailto:reiichiro@openai.com), [jhilton@openai.com](mailto:jhilton@openai.com), [suchir@openai.com](mailto:suchir@openai.com), [joschu@openai.com](mailto:joschu@openai.com)
###### Abstract
We fine-tune GPT-3 to answer long-form questions using a text-based web-browsing environment, which allows the model to search and navigate the web. By setting up the task so that it can be performed by humans, we are able to train models on the task using imitation learning, and then optimize answer quality with human feedback. To make human evaluation of factual accuracy easier, models must collect references while browsing in support of their answers. We train and evaluate our models on ELI5, a dataset of questions asked by Reddit users. Our best model is obtained by fine-tuning GPT-3 using behavior cloning, and then performing rejection sampling against a reward model trained to predict human preferences. This model’s answers are preferred by humans 56% of the time to those of our human demonstrators, and 69% of the time to the highest-voted answer from Reddit.
## 1 Introduction
A rising challenge in NLP is long-form question-answering (LFQA), in which a paragraph-length answer is generated in response to an open-ended question. LFQA systems have the potential to become one of the main ways people learn about the world, but currently lag behind human performance (Krishna et al., [2021](https://ar5iv.labs.arxiv.org/html/2112.09332#bib.bib18 "")). Existing work tends to focus on two core components of the task, information retrieval and synthesis.
In this work we leverage existing solutions to these components: we outsource document retrieval to the [Microsoft Bing Web Search API](https://www.microsoft.com/en-us/bing/apis/bing-web-search-api), and utilize unsupervised pre-training to achieve high-quality synthesis by fine-tuning GPT-3 (Brown et al., [2020](https://ar5iv.labs.arxiv.org/html/2112.09332#bib.bib4 "")). Instead of trying to improve these ingredients, we focus on combining them using more faithful training objectives. Following Stiennon et al. ( [2020](https://ar5iv.labs.arxiv.org/html/2112.09332#bib.bib29 "")), we use human feedback to directly optimize answer quality, allowing us to achieve performance competitive with humans.
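Outsourcing retrieval means the system only needs to issue queries and parse results. A minimal stdlib-only sketch of querying Bing's published v7.0 REST endpoint is below; the subscription key, query, and `extract_urls` helper are illustrative assumptions, not code from the paper:

```python
import json
import urllib.parse
import urllib.request


def extract_urls(search_json):
    """Pull result URLs out of a Bing v7.0 /search JSON response.
    Returns [] when the response carries no web results."""
    return [page["url"] for page in search_json.get("webPages", {}).get("value", [])]


def bing_search(query, api_key, count=5):
    """Hypothetical helper: call the Bing Web Search API and return result URLs.
    Requires a valid subscription key from Azure."""
    url = "https://api.bing.microsoft.com/v7.0/search?" + urllib.parse.urlencode(
        {"q": query, "count": count}
    )
    request = urllib.request.Request(
        url, headers={"Ocp-Apim-Subscription-Key": api_key}
    )
    with urllib.request.urlopen(request) as response:
        return extract_urls(json.load(response))
```

The model never sees raw HTML; WebGPT's environment renders pages to text, so a retrieval layer like this only supplies the candidate documents.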
We make two key contributions:
- We create a text-based web-browsing environment that a fine-tuned language model can interact w
... (truncated, 98 KB total)