OpenAI's ChatGPT-o1 safety evaluation

web

OpenAI·cdn.openai.com/o1-system-card-20241205.pdf

Credibility Rating

4/5

High(4)

High quality. Established institution or organization with editorial oversight and accountability.

Rating inherited from publication venue: OpenAI

Official OpenAI safety documentation for the o1 reasoning model; essential reference for understanding how frontier labs evaluate and communicate safety properties of advanced reasoning systems before release.

Metadata

Importance: 72/100organizational reportprimary source

Summary

OpenAI's official system card for the o1 model series, documenting safety evaluations, red-teaming results, and risk assessments conducted prior to deployment. It covers performance on safety benchmarks, disallowed content testing, and assessments of o1's potential for misuse in areas such as CBRN threats and cybersecurity. The card also addresses the model's enhanced reasoning capabilities and associated safety considerations.

Key Points

•Documents safety evaluations across multiple risk categories including CBRN (chemical, biological, radiological, nuclear) threats and cyberoffense.
•Reports red-teaming results showing o1's stronger reasoning improves both capability and safety refusals, but also raises new misuse concerns.
•Assesses o1's performance on OpenAI's preparedness framework metrics, including autonomous replication and persuasion risks.
•Includes third-party evaluations from organizations like Apollo Research and Metr on agentic and autonomy-related safety dimensions.
•Describes mitigations and policy decisions made in response to evaluation findings before public deployment.

Cited by 1 page

Page	Type	Quality
AI Safety Technical Pathway Decomposition	Analysis	62.0

Cached Content Preview

HTTP 200Fetched Mar 20, 202698 KB

# OpenAI o1 System Card

OpenAI

December 5, 2024

# 1 Introduction

The o1 model series is trained with large-scale reinforcement learning to reason using chain of thought. These advanced reasoning capabilities provide new avenues for improving the safety and robustness of our models. In particular, our models can reason about our safety policies in context when responding to potentially unsafe prompts, through deliberative alignment\[1\]1. This leads to state-of-the-art performance on certain benchmarks for risks such as generating illicit advice, choosing stereotyped responses, and succumbing to known jailbreaks. Training models to incorporate a chain of thought before answering has the potential to unlock substantial benefits, while also increasing potential risks that stem from heightened intelligence. Our results underscore the need for building robust alignment methods, extensively stress-testing their efficacy, and maintaining meticulous risk management protocols. This report outlines the safety work carried out for the OpenAI o1 and OpenAI o1-mini models, including safety evaluations, external red teaming, and Preparedness Framework evaluations.

# 2 Model data and training

The o1 large language model family is trained with reinforcement learning to perform complex reasoning. o1 thinks before it answers—it can produce a long chain of thought before responding to the user. OpenAI o1 is the next model in this series (previously OpenAI o1-preview), while OpenAI o1-mini is a faster version of this model that is particularly effective at coding. Through training, the models learn to refine their thinking process, try different strategies, and recognize their mistakes. Reasoning allows o1 models to follow specific guidelines and model policies we’ve set, helping them act in line with our safety expectations. This means they are better at providing helpful answers and resisting attempts to bypass safety rules, to avoid producing unsafe or inappropriate content.

The two models were pre-trained on diverse datasets, including a mix of publicly available data, proprietary data accessed through partnerships, and custom datasets developed in-house, which collectively contribute to the models’ robust reasoning and conversational capabilities.

Select Public Data: Both models were trained on a variety of publicly available datasets, including web data and open-source datasets. Key components include reasoning data and scientific literature. This ensures that the models are well-versed in both general knowledge and technical topics, enhancing their ability to perform complex reasoning tasks.

Proprietary Data from Data Partnerships: To further enhance the capabilities of o1 and o1-mini, we formed partnerships to access high-value non-public datasets. These proprietary data sources include paywalled content, specialized archives, and other domain-specific datasets that provide deeper insights into industry-specific knowledge and use cases.

Data Filter

... (truncated, 98 KB total)

Resource ID: e17881f4b6c6a40f | Stable ID: sid_hR1p6yl9Dl