GPT-4 Research Announcement

web

Credibility Rating

4/5

High(4)

High quality. Established institution or organization with editorial oversight and accountability.

Rating inherited from publication venue: OpenAI

OpenAI's official announcement of GPT-4, a major capability milestone. Relevant to AI safety as it describes alignment efforts, adversarial testing, and predictive scaling methodology used to prepare for and mitigate risks from frontier models.

Metadata

Importance: 72/100 | press release | primary source

Summary

OpenAI announces GPT-4, a large multimodal model achieving human-level performance on various professional and academic benchmarks. The post highlights six months of iterative alignment work, adversarial testing, and improved training stability enabling predictive capability forecasting. OpenAI also open-sources its Evals framework for community-driven model evaluation.

Key Points

•GPT-4 passes a simulated bar exam with a score around the top 10% of test takers, versus around the bottom 10% for GPT-3.5.
•Six months of iterative alignment, drawing on lessons from adversarial testing and ChatGPT, yielded OpenAI's best results to date (though, by its own account, far from perfect) on factuality, steerability, and refusing to go outside of guardrails.
•The training run was, by OpenAI's account, unprecedentedly stable: GPT-4 was its first large model whose training performance could be accurately predicted ahead of time, an ability OpenAI views as critical for safety.
•GPT-4 outperforms existing LLMs on traditional ML benchmarks; in 24 of 26 languages tested on translated MMLU, it also exceeds the English-language performance of GPT-3.5 and other LLMs.
•OpenAI open-sourced the Evals framework to allow external reporting of model shortcomings and guide further safety improvements.
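
The idea behind predicting training performance ahead of time can be illustrated with a minimal sketch. It assumes that final loss follows a pure power law in training compute, fits that law to a few small runs, and extrapolates to a much larger run. This is a common simplification, not OpenAI's actual methodology; the function names and all data points are hypothetical and were generated from a known law purely for illustration.

```python
import math


def fit_power_law(compute, loss):
    """Least-squares fit of loss = a * compute**(-b), done in log-log space."""
    xs = [math.log(c) for c in compute]
    ys = [math.log(v) for v in loss]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # log(loss) = log(a) - b * log(compute)
    return math.exp(intercept), -slope


def predict_loss(a, b, compute):
    """Evaluate the fitted power law at a given compute budget."""
    return a * compute ** (-b)


# Hypothetical small training runs: compute in FLOPs, final loss.
# These numbers are synthetic (loss = 5.0 * C**-0.05), not real measurements.
small_compute = [1e18, 1e19, 1e20, 1e21]
small_loss = [5.0 * c ** -0.05 for c in small_compute]

a, b = fit_power_law(small_compute, small_loss)

# Extrapolate three orders of magnitude beyond the largest small run.
target_compute = 1e24
predicted = predict_loss(a, b, target_compute)
print(f"fitted a={a:.3f}, b={b:.3f}, predicted loss at 1e24 FLOPs: {predicted:.3f}")
```

With real measurements the fit would be noisy, and the quality of the extrapolation depends on how stable training is across scales, which is why a stable training run matters for this kind of forecasting.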

Cached Content Preview

HTTP 200 | Fetched Mar 30, 2026 | 20 KB

OpenAI | March 14, 2023 | Milestone
GPT‑4

We’ve created GPT‑4, the latest milestone in OpenAI’s effort in scaling up deep learning. GPT‑4 is a large multimodal model (accepting image and text inputs, emitting text outputs) that, while less capable than humans in many real-world scenarios, exhibits human-level performance on various professional and academic benchmarks. For example, it passes a simulated bar exam with a score around the top 10% of test takers; in contrast, GPT‑3.5’s score was around the bottom 10%.

We’ve spent 6 months iteratively aligning GPT‑4 using lessons from our adversarial testing program as well as ChatGPT, resulting in our best-ever results (though far from perfect) on factuality, steerability, and refusing to go outside of guardrails.

Over the past two years, we rebuilt our entire deep learning stack and, together with Azure, co-designed a supercomputer from the ground up for our workload. A year ago, we trained GPT‑3.5 as a first “test run” of the system. We found and fixed some bugs and improved our theoretical foundations. As a result, our GPT‑4 training run was (for us at least!) unprecedentedly stable, becoming our first large model whose training performance we were able to accurately predict ahead of time. As we continue to focus on reliable scaling, we aim to hone our methodology to help us predict and prepare for future capabilities increasingly far in advance, something we view as critical for safety.

We are releasing GPT‑4’s text input capability via ChatGPT and the API (with a waitlist). To prepare the image input capability for wider availability, we’re collaborating closely with a single partner to start.
We’re also open-sourcing OpenAI Evals, our framework for automated evaluation of AI model performance, to allow anyone to report shortcomings in our models to help guide further improvements.

Capabilities

In a casual conversation, the distinction between GPT‑3.5 and GPT‑4 can be subtle. The difference comes out when the complexity of the task reaches a sufficient threshold: GPT‑4 is more reliable, creative, and able to handle much more nuanced instructions than GPT‑3.5.

To understand the difference between the two models, we tested on a variety of benchmarks, including simulating exams that were originally designed for humans. We proceeded by using the most recent publicly-available tests (in the case of the Olympiads and AP free response questions) or by purchasing 2022–2023 editions of practice exams. We did no specific training for these exams. A minority of the problems in the exams were seen by the model during training, but we believe the results to be representat

... (truncated, 20 KB total)

Resource ID: afabd708073ea425 | Stable ID: sid_yvbv5n4Ny8