GPT-4 System Card
web
Credibility Rating
High quality. Established institution or organization with editorial oversight and accountability.
Rating inherited from publication venue: OpenAI
This is OpenAI's official safety documentation for GPT-4, widely referenced as an example of pre-deployment risk assessment practice and useful for understanding how frontier labs communicate safety measures to the public.
Metadata
Summary
OpenAI's system card for GPT-4 documents safety evaluations, risk assessments, and mitigation measures conducted prior to deployment. It covers dangerous capability evaluations, red-teaming findings, and the RLHF-based safety interventions applied to reduce harmful outputs. The document represents OpenAI's public accountability framework for responsible deployment of a frontier AI model.
Key Points
- Documents pre-deployment red-teaming and evaluations for dangerous capabilities including CBRN risks, cyberoffense, and influence operations.
- Describes RLHF and rule-based reward models (RBRMs) used to reduce harmful, biased, or policy-violating outputs.
- Identifies residual risks and limitations that mitigations could not fully address, providing transparency about known failure modes.
- Includes third-party expert evaluations (e.g., Alignment Research Center) assessing autonomous replication and power-seeking risks.
- Establishes a template for system cards as a deployment safety communication practice for frontier AI models.
Cited by 8 pages
| Page | Type | Quality |
|---|---|---|
| Alignment Research Center (ARC) | Organization | 57.0 |
| METR | Organization | 66.0 |
| OpenAI | Organization | 62.0 |
| AI Governance and Policy | Crux | 66.0 |
| Red Teaming | Research Area | 65.0 |
| Sandboxing / Containment | Approach | 91.0 |
| Emergent Capabilities | Risk | 61.0 |
| AI Proliferation | Risk | 60.0 |
Cached Content Preview
GPT-4 System Card
OpenAI
Abstract
Large language models (LLMs) are being deployed in many domains of our lives ranging
from browsing, to voice assistants, to coding assistance tools, and have potential for vast societal
impacts.\[1, 2, 3, 4, 5, 6, 7\] This system card analyzes GPT-4, the latest LLM in the GPT family
of models.\[8, 9, 10\] First, we highlight safety challenges presented by the model’s limitations
(e.g., producing convincing text that is subtly false) and capabilities (e.g., increased adeptness
at providing illicit advice, performance in dual-use capabilities, and risky emergent behaviors).
Second, we give a high-level overview of the safety processes OpenAI adopted to prepare GPT-4
for deployment. This spans our work across measurements, model-level changes, product- and
system-level interventions (such as monitoring and policies), and external expert engagement.
Finally, we demonstrate that while our mitigations and processes alter GPT-4’s behavior and
prevent certain kinds of misuses, they are limited and remain brittle in some cases. This points
to the need for anticipatory planning and governance.\[11\]
Content Warning: This document contains content that some may find disturbing or offensive,
including content that is sexual, hateful, or violent in nature.
* * *
1 Introduction
Large language models, also known as LLMs, have become an increasingly prevalent part of our
day-to-day lives, with their use extending to a wide range of domains including web browsing, voice
assistants, and coding assistance tools.\[1, 2, 3, 4\] These models have the potential to significantly
impact society in numerous ways.\[5, 6, 7\] This system card analyzes GPT-4, the latest large language
model in the GPT family of models.\[8, 9, 10\] Since it finished training in August of 2022, we have
been evaluating, adversarially testing, and iteratively improving the model and the system-level
mitigations around it. Our mitigations and processes alter GPT-4’s behavior and prevent certain
kinds of misuses, though they have limitations, pointing to the need for anticipatory planning and
governance\[11\] and further safety research. Our approach to deployment balances minimizing risk
from deployment, enabling positive use cases, and learning from deployment.
GPT models are often trained in two stages. First, they are trained, using a large dataset of text
from the Internet, to predict the next word. The models are then fine-tuned with additional data,
using an algorithm called reinforcement learning from human feedback (RLHF), to produce outputs
that are preferred by human labelers.\[10, 12, 13\] Training language models on large text datasets
has given rise to capabilities such as few-shot learning\[10\] and the ability to carry out a wide range
of natural language tasks spanning different domains, including question answering, arithmetic, and
classific
... (truncated, 98 KB total)
ebab6e05661645c5 | Stable ID: sid_4Wzl5NJbhi