GPT-4 System Card
Credibility Rating
High quality. Established institution or organization with editorial oversight and accountability.
Rating inherited from publication venue: OpenAI
This is OpenAI's official safety documentation for GPT-4, widely referenced as an example of pre-deployment risk assessment practice and useful for understanding how frontier labs communicate safety measures to the public.
Metadata
Summary
OpenAI's system card for GPT-4 documents safety evaluations, risk assessments, and mitigation measures conducted prior to deployment. It covers dangerous capability evaluations, red-teaming findings, and the RLHF-based safety interventions applied to reduce harmful outputs. The document represents OpenAI's public accountability framework for responsible deployment of a frontier AI model.
Key Points
- Documents pre-deployment red-teaming and evaluations for dangerous capabilities including CBRN risks, cyberoffense, and influence operations.
- Describes RLHF and rule-based reward models (RBRMs) used to reduce harmful, biased, or policy-violating outputs.
- Identifies residual risks and limitations that mitigations could not fully address, providing transparency about known failure modes.
- Includes third-party expert evaluations (e.g., Alignment Research Center) assessing autonomous replication and power-seeking risks.
- Establishes a template for system cards as a deployment safety communication practice for frontier AI models.
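The rule-based reward model idea mentioned above can be sketched with a toy reward function. This is an illustrative caricature only: the actual RBRMs described in the system card are GPT-4-based zero-shot classifiers prompted with rubrics, not hand-written predicates, and the rule values below are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Example:
    request_disallowed: bool  # does the prompt ask for disallowed content?
    response_refused: bool    # did the model refuse to comply?

def rbrm_reward(ex: Example) -> float:
    """Toy rule-based reward: encourage refusing disallowed requests,
    encourage answering allowed ones, and penalize over-refusal.
    (Hypothetical rubric values for illustration.)"""
    if ex.request_disallowed:
        return 1.0 if ex.response_refused else -1.0
    # Allowed request: answering is good, refusing is mildly penalized.
    return 1.0 if not ex.response_refused else -0.5
```

A signal like this would be combined with the human-preference reward during RLHF fine-tuning, so the model is pushed toward refusing disallowed content without refusing everything.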
Cited by 8 pages
| Page | Type | Quality |
|---|---|---|
| Alignment Research Center | Organization | 57.0 |
| METR | Organization | 66.0 |
| OpenAI | Organization | 62.0 |
| AI Governance and Policy | Crux | 66.0 |
| Red Teaming | Research Area | 65.0 |
| Sandboxing / Containment | Approach | 91.0 |
| Emergent Capabilities | Risk | 61.0 |
| AI Proliferation | Risk | 60.0 |
Cached Content Preview
# GPT-4 System Card
OpenAI
**Abstract**
Large language models (LLMs) are being deployed in many domains of our lives ranging from browsing, to voice assistants, to coding assistance tools, and have potential for vast societal impacts.\[1, 2, 3, 4, 5, 6, 7\] This system card analyzes GPT-4, the latest LLM in the GPT family of models.\[8, 9, 10\] First, we highlight safety challenges presented by the model’s limitations (e.g., producing convincing text that is subtly false) and capabilities (e.g., increased adeptness at providing illicit advice, performance in dual-use capabilities, and risky emergent behaviors). Second, we give a high-level overview of the safety processes OpenAI adopted to prepare GPT-4 for deployment. This spans our work across measurements, model-level changes, product- and system-level interventions (such as monitoring and policies), and external expert engagement. Finally, we demonstrate that while our mitigations and processes alter GPT-4’s behavior and prevent certain kinds of misuses, they are limited and remain brittle in some cases. This points to the need for anticipatory planning and governance.\[11\]
**Content Warning: This document contains content that some may find disturbing or offensive,** including content that is sexual, hateful, or violent in nature.
## 1 Introduction
Large language models, also known as LLMs, have become an increasingly prevalent part of our day-to-day lives, with their use extending to a wide range of domains including web browsing, voice assistants, and coding assistance tools.\[1, 2, 3, 4\] These models have the potential to significantly impact society in numerous ways.\[5, 6, 7\] This system card analyzes GPT-4, the latest large language model in the GPT family of models.\[8, 9, 10\] Since it finished training in August of 2022, we have been evaluating, adversarially testing, and iteratively improving the model and the system-level mitigations around it. Our mitigations and processes alter GPT-4’s behavior and prevent certain kinds of misuses, though they have limitations, pointing to the need for anticipatory planning and governance\[11\] and further safety research. Our approach to deployment balances minimizing risk from deployment, enabling positive use cases, and learning from deployment.

GPT models are often trained in two stages. First, they are trained, using a large dataset of text from the Internet, to predict the next word. The models are then fine-tuned with additional data, using an algorithm called reinforcement learning from human feedback (RLHF), to produce outputs that are preferred by human labelers.\[10, 12, 13\] Training language models on large text datasets has given rise to capabilities such as few-shot learning\[10\] and the ability to carry out a wide range of natural language tasks spanning different domains, including question answering, arithmetic, and classification. Fine-tuning has made these models more controllable and useful.
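The two-stage pipeline described above can be caricatured with a bigram model: stage 1 fits next-word statistics from raw text, and stage 2 reweights that distribution by a reward standing in for labeler preference. This is a toy illustration under invented data, not how RLHF (policy optimization against a learned reward model) is actually implemented.

```python
from collections import Counter, defaultdict

corpus = "the cat sat on the mat . the dog sat on the rug .".split()

# Stage 1: "pretraining" -- estimate next-word probabilities from raw text.
counts = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    counts[prev][nxt] += 1

def next_word_probs(prev):
    c = counts[prev]
    total = sum(c.values())
    return {w: n / total for w, n in c.items()}

# Stage 2: "fine-tuning" -- reweight the base distribution by a reward,
# here a hypothetical preference signal that disfavors the word "rug".
def reward(word):
    return 0.1 if word == "rug" else 1.0

def tuned_probs(prev):
    weighted = {w: p * reward(w) for w, p in next_word_probs(prev).items()}
    z = sum(weighted.values())
    return {w: v / z for w, v in weighted.items()}
```

After reweighting, the tuned model assigns less probability to the disfavored continuation while still summing to one, which is the essential shape of preference fine-tuning: the base next-token model supplies the candidates, and the reward reshapes which ones are chosen.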
**1.1 Overview of find
... (truncated, 98 KB total)