GPT-4 Architecture, Infrastructure, Training Dataset, Costs, Vision, MoE
Web · semianalysis.com/p/gpt-4-architecture-infrastructure
This industry analysis is useful for understanding the infrastructure and economic realities behind frontier AI models, relevant to discussions of compute governance, access inequality, and the technical trajectory of large language models.
Metadata
Importance: 55/100 · blog post · analysis
Summary
SemiAnalysis provides a detailed technical breakdown of GPT-4's architecture, including its use of Mixture of Experts (MoE), training infrastructure, dataset composition, and estimated costs. The analysis draws on leaked and inferred information to give unprecedented insight into the engineering choices behind one of the most capable AI systems. This resource is significant for understanding the compute and infrastructure requirements of frontier AI models.
Key Points
- GPT-4 reportedly uses a Mixture of Experts (MoE) architecture with ~16 experts, enabling high capability while managing inference costs (see the routing sketch after this list).
- Training costs are estimated in the tens to hundreds of millions of dollars, highlighting the capital intensity of frontier AI development.
- The model was trained on a large multimodal dataset, enabling vision capabilities alongside text understanding.
- Infrastructure details reveal heavy reliance on Microsoft Azure and custom hardware optimizations for training and deployment at scale.
- The concentration of compute and capital required reinforces concerns about power asymmetries between frontier AI labs and other actors.
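The MoE point is worth unpacking. Below is a minimal NumPy sketch of top-k expert routing, the mechanism that lets a model carry many experts' worth of parameters while activating only a few per token. The layer sizes, expert count, and `k` here are illustrative assumptions for the sketch, not GPT-4's reported configuration.

```python
# Minimal sketch of top-k Mixture-of-Experts routing.
# All sizes below are illustrative assumptions, not GPT-4's actual config.
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

class MoELayer:
    def __init__(self, d_model=64, d_ff=256, n_experts=16, k=2, seed=0):
        rng = np.random.default_rng(seed)
        self.k = k
        # Router: one score per expert for each token.
        self.w_gate = rng.normal(0, 0.02, (d_model, n_experts))
        # Each expert is an independent 2-layer MLP.
        self.w1 = rng.normal(0, 0.02, (n_experts, d_model, d_ff))
        self.w2 = rng.normal(0, 0.02, (n_experts, d_ff, d_model))

    def forward(self, x):  # x: (n_tokens, d_model)
        gates = softmax(x @ self.w_gate)                 # (n_tokens, n_experts)
        topk = np.argsort(gates, axis=-1)[:, -self.k:]   # k experts per token
        out = np.zeros_like(x)
        for t in range(x.shape[0]):
            # Only k experts run per token, so active compute is roughly
            # k/n_experts of a dense layer with the same total parameters.
            weights = gates[t, topk[t]]
            weights = weights / weights.sum()            # renormalize over top-k
            for w, e in zip(weights, topk[t]):
                h = np.maximum(x[t] @ self.w1[e], 0)     # ReLU MLP expert
                out[t] += w * (h @ self.w2[e])
        return out

layer = MoELayer()
tokens = np.random.default_rng(1).normal(size=(4, 64))
print(layer.forward(tokens).shape)  # (4, 64)
```

This is the tradeoff the key point gestures at: parameter count (and thus capability) grows with the number of experts, while per-token inference cost grows only with `k`.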
Cited by 2 pages
| Page | Type | Quality |
|---|---|---|
| AI-Driven Concentration of Power | Risk | 65.0 |
| AI Winner-Take-All Dynamics | Risk | 54.0 |
Cached Content Preview
HTTP 200 · Fetched Mar 20, 2026 · 11 KB
# GPT-4 Architecture, Infrastructure, Training Dataset, Costs, Vision, MoE
### Demystifying GPT-4: The engineering tradeoffs that led OpenAI to their architecture.
[Dylan Patel](https://substack.com/@semianalysis) and [Gerald Wong](https://substack.com/@geraldwong116502)
Jul 10, 2023 ∙ Paid
OpenAI is keeping the architecture of GPT-4 closed not because of some existential risk to humanity but because what they’ve built is replicable. In fact, we expect Google, Meta, Anthropic, Inflection, Character, Tencent, ByteDance, Baidu, and more to all have models as capable as GPT-4 if not more capable in the near term.
Don’t get us wrong, OpenAI has amazing engineering, and what they built is incredible, but the solution they arrived at is not magic. It is an elegant solution with many complex tradeoffs. Going big is only a portion of the battle. OpenAI’s most durable moat is that they have the most real-world usage, leading engineering talent, and can continue to race ahead of others with future models.
We have gathered a lot of information on GPT-4 from many sources, and today we want to share it. This includes model architecture, training infrastructure, inference infrastructure, parameter count, training dataset composition, token count, layer count, parallelism strategies, multi-modal vision adaptation, the thought process behind different engineering tradeoffs, unique implemented techniques, and how they alleviated some of their biggest bottlenecks related to inference of gigantic models.
The most interesting aspect of GPT-4 is understanding why they made certain architectural decisions.
Furthermore, we will be outlining the cost of training and inference for GPT-4 on A100s and how that scales with H100s for next-generation model architectures.
First, the problem statement: from GPT-3 to GPT-4, OpenAI wanted to scale 100x, but the elephant in the room is cost. [Dense transformer models will not scale further](https://www.semianalysis.com/p/the-ai-brick-wall-a-practical-limit). A dense transformer is the model architecture that OpenAI's GPT-3, Google's PaLM, Meta's LLaMA, TII's Falcon, MosaicML's MPT, and others use. We could easily name 50 companies training LLMs with this same architecture. It's a good one, but it's flawed for scaling.
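To see why cost caps dense scaling, a standard back-of-envelope estimate helps: training compute is roughly 6 × parameters × tokens FLOPs, which a GPU fleet must deliver at some realistic utilization. The sketch below uses illustrative numbers; the GPU throughput, utilization, and hourly price are assumptions for the exercise, not figures from the article.

```python
# Back-of-envelope training-cost estimate using the common
# FLOPs ≈ 6 * parameters * tokens rule of thumb.
# All hardware numbers below are illustrative assumptions.

def training_cost_usd(params, tokens,
                      gpu_flops=312e12,   # A100 BF16 peak, FLOP/s
                      utilization=0.35,   # assumed utilization for a large run
                      price_per_gpu_hour=2.0):
    total_flops = 6 * params * tokens
    gpu_seconds = total_flops / (gpu_flops * utilization)
    gpu_hours = gpu_seconds / 3600
    return gpu_hours * price_per_gpu_hour, gpu_hours

# Hypothetical example: a dense 1T-parameter model trained on 10T tokens.
cost, hours = training_cost_usd(params=1e12, tokens=10e12)
print(f"{hours:,.0f} GPU-hours, ~${cost:,.0f}")  # ~153M GPU-hours, ~$305M
```

At dense trillion-parameter scale, such estimates land in the hundreds of millions of dollars, which is exactly the pressure that pushes frontier architectures toward sparsity such as MoE.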
[See our discussion of training cost from before the GPT-4 announcement on the upcoming A
... (truncated, 11 KB total)
Resource ID: dfeb27439fd01d3e | Stable ID: NDAzOTRkOT