Longterm Wiki

Hello GPT-4o | OpenAI

web

Credibility Rating

4/5 (High)

High quality. Established institution or organization with editorial oversight and accountability.

Rating inherited from publication venue: OpenAI

Official OpenAI announcement of GPT-4o; relevant to AI safety researchers tracking frontier model capabilities, deployment practices, and how safety evaluations are communicated during major model releases.

Metadata

Importance: 55/100 · press release · primary source

Summary

OpenAI announces GPT-4o, a new flagship model capable of processing and generating text, audio, and images in an integrated, real-time manner. GPT-4o matches GPT-4 Turbo on text and code tasks while significantly improving vision and audio capabilities, and is faster and more efficient. It represents a step toward more natural human-computer interaction with end-to-end multimodal processing.

Key Points

  • GPT-4o is OpenAI's first omnimodal model, natively processing text, audio, and images rather than stitching together separate models.
  • It achieves real-time conversational audio responses (avg 320ms latency), much closer to human response times than previous voice pipelines.
  • GPT-4o matches GPT-4 Turbo performance on English text and code benchmarks while being faster and cheaper via API.
  • The model shows improved multilingual, vision, and audio understanding, with new capabilities like detecting emotion from voice and describing visual scenes.
  • OpenAI plans to roll out GPT-4o iteratively, with safety evaluations ongoing for the full audio/vision capabilities before wider release.

Cited by 1 page

| Page | Type | Quality |
| --- | --- | --- |
| AI-Driven Concentration of Power | Risk | 65.0 |

Cached Content Preview

HTTP 200 · Fetched Mar 20, 2026 · 12 KB
Hello GPT-4o | OpenAI

May 13, 2024

[Milestone](https://openai.com/research/index/milestone/)

# Hello GPT‑4o

We’re announcing GPT‑4o, our new flagship model that can reason across audio, vision, and text in real time.

[Contributions](https://openai.com/gpt-4o-contributions/) [Try on ChatGPT](https://chatgpt.com/?openaicom-did=40a7d666-ddd1-4348-a135-8bd5c8216fed&openaicom_referred=true) [GPT-4o System Card](https://openai.com/index/gpt-4o-system-card/)

All videos on this page are at 1x real time.

Demo video: Guessing May 13th’s announcement.

More Resources

[Try in Playground](https://platform.openai.com/playground?mode=chat&model=gpt-4o) [Rewatch live demos](https://openai.com/index/spring-update/)



GPT‑4o (“o” for “omni”) is a step towards much more natural human-computer interaction—it accepts as input any combination of text, audio, image, and video and generates any combination of text, audio, and image outputs. It can respond to audio inputs in as little as 232 milliseconds, with an average of 320 milliseconds, which is similar to [human response time⁠(opens in a new window)](https://www.pnas.org/doi/10.1073/pnas.0903616106) in a conversation. It matches GPT‑4 Turbo performance on text in English and code, with significant improvement on text in non-English languages, while also being much faster and 50% cheaper in the API. GPT‑4o is especially better at vision and audio understanding compared to existing models.
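To make the "any combination of text and image inputs" claim concrete from a developer's perspective, here is a minimal sketch using the OpenAI Python SDK's Chat Completions endpoint. The prompt and image URL are placeholders, and the exact parameters available at launch may have differed; this is an illustration, not a reproduction of OpenAI's own examples.

```python
# Minimal sketch: sending mixed text + image input to GPT-4o via the
# Chat Completions API (OpenAI Python SDK v1.x). The image URL is a
# placeholder; audio and video input handling are not shown here.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe what is happening in this scene."},
                {
                    "type": "image_url",
                    "image_url": {"url": "https://example.com/scene.jpg"},
                },
            ],
        }
    ],
)

print(response.choices[0].message.content)
```

Audio and video input/output are omitted above; per the announcement, those capabilities were planned for a more limited, iterative rollout while safety evaluations continued.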

## Model capabilities

Demo videos (titles only): Two GPT‑4os interacting and singing; Interview prep; Rock Paper Scissors; Sarcasm; Math with Sal and Imran Khan; Two GPT‑4os harmonizing; Point and learn Spanish; Meeting AI; Real-time translation; Lullaby; Talking faster; Happy Birthday; Dog; Dad jokes; GPT‑4o with Andy, from BeMyEyes in London; Customer service proof of concept.

Prior to GPT‑4o, you could use [Voice Mode⁠](https://openai.com/index/chatgpt-can-now-see-hear-and-speak/) to talk to ChatGPT with latencies of 2.8 seconds (GPT‑3.5) and 5.4 seconds (GPT‑4) on average. To achieve this, Voice Mode is a pipeline of three separate models: one simple model transcribes audio to text, GPT‑3.5 or GPT‑4 takes in text and outputs text, and a third simple model converts that text back to audio. This process means that the main source of intelligence, GPT‑4, loses a lot of information—it can’t directly observe tone, multiple speakers, or background noises, and it can’t output laughter, singing, or express emotion.
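As a rough illustration of that three-model pipeline, the following sketch stitches together separate transcription, text, and speech calls with the OpenAI Python SDK. The model names and call structure are assumptions chosen for illustration, not OpenAI's production Voice Mode implementation.

```python
# Rough sketch of the pre-GPT-4o Voice Mode pipeline described above:
# speech -> text -> text -> speech, using three separate models.
# Model names and SDK calls are illustrative (OpenAI Python SDK v1.x),
# not OpenAI's internal implementation.
from openai import OpenAI

client = OpenAI()

def voice_mode_turn(audio_path: str, output_path: str = "reply.mp3") -> str:
    # 1. A simple model transcribes the user's audio to text.
    with open(audio_path, "rb") as audio_file:
        transcript = client.audio.transcriptions.create(
            model="whisper-1", file=audio_file
        )

    # 2. The text model sees only the transcript, so tone, multiple
    #    speakers, and background sound never reach it.
    completion = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": transcript.text}],
    )
    reply_text = completion.choices[0].message.content

    # 3. A third model converts the reply text back to audio.
    speech = client.audio.speech.create(
        model="tts-1", voice="alloy", input=reply_text
    )
    speech.stream_to_file(output_path)
    return reply_text
```

Because only the transcript reaches the text model in step 2, everything carried by the audio itself (tone, multiple speakers, background noise) is discarded, which is exactly the limitation the end-to-end training described next is meant to remove.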

With GPT‑4o, we trained a single new model end-to-end across text, vision, and audio, meaning that all inputs and outputs are processed by the same neural network. Because GPT‑4o is our first model combining all of these modalities, we are still just scratching the surface of exploring what the model can do and its limitations.

## Explorations of capabilities

Select sample: Visual Narratives — Robot Writer’s Block · Visual Narratives — Sally the Mailwoman · Poster creation fo

... (truncated, 12 KB total)
Resource ID: ee605bab036068f0 | Stable ID: YTMxMDE5Mm