Longterm Wiki

GPT-4 technical report

web

Credibility Rating

High (4/5)

High quality. Established institution or organization with editorial oversight and accountability.

Rating inherited from publication venue: OpenAI

Foundational industry document for understanding frontier LLM capabilities and OpenAI's safety practices; the attached system card is particularly relevant for AI safety researchers assessing deployment risks of large multimodal models.

Metadata

Importance: 82/100 · organizational report · primary source

Summary

OpenAI's technical report introducing GPT-4, a large-scale multimodal model achieving human-level performance on professional benchmarks including the bar exam (top 10%). The report details scalable training infrastructure enabling performance prediction from small runs, post-training alignment improvements, and extensive safety analysis covering bias, disinformation, cybersecurity, and other risks.

Key Points

  • GPT-4 achieves top 10% bar exam performance, contrasting sharply with GPT-3.5's bottom 10%, illustrating rapid capability jumps across model generations.
  • Predictable scaling infrastructure allowed GPT-4's performance to be estimated from models trained with no more than 1/1,000th of its compute, enabling capability forecasting before the full-scale run.
  • Post-training alignment (RLHF) improved factuality and adherence to desired behavior, though hallucinations and context limitations remain.
  • Includes an extensive system card documenting risks: bias, disinformation, over-reliance, privacy, cybersecurity, and proliferation concerns.
  • Multimodal capability (image + text inputs) expands deployment surface and introduces novel safety challenges beyond text-only models.
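The predictable-scaling point above can be sketched numerically: fit a power law with an irreducible-loss term to losses from small training runs, then extrapolate to the full compute budget. Only the general idea comes from the report; the functional form shown, the constants, the compute fractions, and the use of SciPy's `curve_fit` are illustrative assumptions, not values or methods taken from the paper.

```python
# Sketch of "predictable scaling": fit L(C) = a * C^b + c to small runs,
# then predict the loss of the full run. All constants are made up for
# illustration; compute C is expressed as a fraction of the full budget.
import numpy as np
from scipy.optimize import curve_fit

def loss_law(C, a, b, c):
    # Power law in compute plus an irreducible-loss floor c.
    return a * np.power(C, b) + c

# Synthetic "small run" losses, each at <= 1/1,000th of the full budget.
small_compute = np.array([1e-6, 1e-5, 1e-4, 1e-3])
small_losses = loss_law(small_compute, 2.0, -0.1, 1.2)

# Fit the law to the small runs only, then extrapolate to C = 1 (full run).
(a_fit, b_fit, c_fit), _ = curve_fit(
    loss_law, small_compute, small_losses, p0=[1.0, -0.2, 1.0], maxfev=10000
)
pred_full = loss_law(1.0, a_fit, b_fit, c_fit)
```

With noiseless synthetic data the extrapolated `pred_full` recovers the true full-budget loss (here 2.0 + 1.2 = 3.2); in practice the fit would use noisy measured losses from real small runs.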

Cited by 2 pages

Page | Type | Quality
RLHF | Research Area | 63.0
Emergent Capabilities | Risk | 61.0

Cached Content Preview

HTTP 200 · Fetched Mar 15, 2026 · 98 KB
GPT-4 Technical Report
OpenAI∗
Abstract
We report the development of GPT-4, a large-scale, multimodal model which can
accept image and text inputs and produce text outputs. While less capable than
humans in many real-world scenarios, GPT-4 exhibits human-level performance
on various professional and academic benchmarks, including passing a simulated
bar exam with a score around the top 10% of test takers. GPT-4 is a Transformer-
based model pre-trained to predict the next token in a document. The post-training
alignment process results in improved performance on measures of factuality and
adherence to desired behavior. A core component of this project was developing
infrastructure and optimization methods that behave predictably across a wide
range of scales. This allowed us to accurately predict some aspects of GPT-4’s
performance based on models trained with no more than 1/1,000th the compute of
GPT-4.
1 Introduction
This technical report presents GPT-4, a large multimodal model capable of processing image and
text inputs and producing text outputs. Such models are an important area of study as they have the
potential to be used in a wide range of applications, such as dialogue systems, text summarization,
and machine translation. As such, they have been the subject of substantial interest and progress in
recent years [1–34].
One of the main goals of developing such models is to improve their ability to understand and generate
natural language text, particularly in more complex and nuanced scenarios. To test its capabilities
in such scenarios, GPT-4 was evaluated on a variety of exams originally designed for humans. In
these evaluations it performs quite well and often outscores the vast majority of human test takers.
For example, on a simulated bar exam, GPT-4 achieves a score that falls in the top 10% of test takers.
This contrasts with GPT-3.5, which scores in the bottom 10%.
On a suite of traditional NLP benchmarks, GPT-4 outperforms both previous large language models
and most state-of-the-art systems (which often have benchmark-specific training or hand-engineering).
On the MMLU benchmark [35, 36], an English-language suite of multiple-choice questions covering
57 subjects, GPT-4 not only outperforms existing models by a considerable margin in English, but
also demonstrates strong performance in other languages. On translated variants of MMLU, GPT-4
surpasses the English-language state-of-the-art in 24 of 26 languages considered. We discuss these
model capability results, as well as model safety improvements and results, in more detail in later
sections.
This report also discusses a key challenge of the project, developing deep learning infrastructure and
optimization methods that behave predictably across a wide range of scales. This allowed us to make
predictions about the expected performance of GPT-4 (based on small runs trained in similar ways)
that were tested against the final run to increase confidence in our training.
Despite i

... (truncated, 98 KB total)
Resource ID: 227c865a2154436e | Stable ID: OWVhYmZiOD