GPT-4 Architecture, Infrastructure, Training Dataset, Costs, Vision, MoE
Web · semianalysis.com/p/gpt-4-architecture-infrastructure
This industry analysis is useful for understanding the infrastructure and economic realities behind frontier AI models, relevant to discussions of compute governance, access inequality, and the technical trajectory of large language models.
Metadata
Importance: 55/100 · blog post · analysis
Summary
SemiAnalysis provides a detailed technical breakdown of GPT-4's architecture, including its use of Mixture of Experts (MoE), training infrastructure, dataset composition, and estimated costs. The analysis draws on leaked and inferred information to give unprecedented insight into the engineering choices behind one of the most capable AI systems. This resource is significant for understanding the compute and infrastructure requirements of frontier AI models.
Key Points
- GPT-4 reportedly uses a Mixture of Experts (MoE) architecture with ~16 experts, enabling high capability while managing inference costs (see the routing sketch after this list).
- Training costs are estimated in the tens to hundreds of millions of dollars, highlighting the capital intensity of frontier AI development.
- The model was trained on a large multimodal dataset, enabling vision capabilities alongside text understanding.
- Infrastructure details reveal heavy reliance on Microsoft Azure and custom hardware optimizations for training and deployment at scale.
- The concentration of compute and capital required reinforces concerns about power asymmetries between frontier AI labs and other actors.
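The MoE point is worth unpacking. Below is a minimal NumPy sketch of top-k expert routing, the mechanism that lets a model carry many experts' worth of parameters while activating only a few per token. The layer sizes, expert count, and `k` here are illustrative assumptions for the sketch, not GPT-4's reported configuration.

```python
# Minimal sketch of top-k Mixture-of-Experts routing.
# All sizes below are illustrative assumptions, not GPT-4's actual config.
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

class MoELayer:
    def __init__(self, d_model=64, d_ff=256, n_experts=16, k=2, seed=0):
        rng = np.random.default_rng(seed)
        self.k = k
        # Router: one score per expert for each token.
        self.w_gate = rng.normal(0, 0.02, (d_model, n_experts))
        # Each expert is an independent 2-layer MLP.
        self.w1 = rng.normal(0, 0.02, (n_experts, d_model, d_ff))
        self.w2 = rng.normal(0, 0.02, (n_experts, d_ff, d_model))

    def forward(self, x):  # x: (n_tokens, d_model)
        gates = softmax(x @ self.w_gate)                 # (n_tokens, n_experts)
        topk = np.argsort(gates, axis=-1)[:, -self.k:]   # k experts per token
        out = np.zeros_like(x)
        for t in range(x.shape[0]):
            # Only k experts run per token, so active compute is roughly
            # k/n_experts of a dense layer with the same total parameters.
            weights = gates[t, topk[t]]
            weights = weights / weights.sum()            # renormalize over top-k
            for w, e in zip(weights, topk[t]):
                h = np.maximum(x[t] @ self.w1[e], 0)     # ReLU MLP expert
                out[t] += w * (h @ self.w2[e])
        return out

layer = MoELayer()
tokens = np.random.default_rng(1).normal(size=(4, 64))
print(layer.forward(tokens).shape)  # (4, 64)
```

This is the tradeoff the key point gestures at: parameter count (and thus capability) grows with the number of experts, while per-token inference cost grows only with `k`.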
Cited by 2 pages
| Page | Type | Quality |
|---|---|---|
| AI-Driven Concentration of Power | Risk | 65.0 |
| AI Winner-Take-All Dynamics | Risk | 54.0 |
Cached Content Preview
HTTP 200 · Fetched Mar 20, 2026 · 11 KB
# GPT-4 Architecture, Infrastructure, Training Dataset, Costs, Vision, MoE
### Demystifying GPT-4: The engineering tradeoffs that led OpenAI to their architecture.
[Dylan Patel](https://substack.com/@semianalysis) and [Gerald Wong](https://substack.com/@geraldwong116502)
Jul 10, 2023 ∙ Paid
OpenAI is keeping the architecture of GPT-4 closed not because of some existential risk to humanity but because what they’ve built is replicable. In fact, we expect Google, Meta, Anthropic, Inflection, Character, Tencent, ByteDance, Baidu, and more to all have models as capable as GPT-4 if not more capable in the near term.
Don’t get us wrong, OpenAI has amazing engineering, and what they built is incredible, but the solution they arrived at is not magic. It is an elegant solution with many complex tradeoffs. Going big is only a portion of the battle. OpenAI’s most durable moat is that they have the most real-world usage, leading engineering talent, and can continue to race ahead of others with future models.
We have gathered a lot of information on GPT-4 from many sources, and today we want to share it. This includes model architecture, training infrastructure, inference infrastructure, parameter count, training dataset composition, token count, layer count, parallelism strategies, multi-modal vision adaptation, the thought process behind different engineering tradeoffs, unique implemented techniques, and how they alleviated some of their biggest bottlenecks related to inference of gigantic models.
The most interesting aspect of GPT-4 is understanding why they made certain architectural decisions.
Furthermore, we will be outlining the cost of training and inference for GPT-4 on A100s and how that scales with H100s for next-generation model architectures.
First, the problem statement: from GPT-3 to GPT-4, OpenAI wanted to scale 100x, but the elephant in the room is cost. [Dense transformer models will not scale further](https://www.semianalysis.com/p/the-ai-brick-wall-a-practical-limit). A dense transformer is the model architecture that OpenAI's GPT-3, Google's PaLM, Meta's LLaMA, TII's Falcon, MosaicML's MPT, and others use. We could easily name 50 companies training LLMs with this same architecture. It's a good one, but it's flawed for scaling.
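To see why cost caps dense scaling, a standard back-of-envelope estimate helps: training compute is roughly 6 × parameters × tokens FLOPs, which a GPU fleet must deliver at some realistic utilization. The sketch below uses illustrative numbers; the GPU throughput, utilization, and hourly price are assumptions for the exercise, not figures from the article.

```python
# Back-of-envelope training-cost estimate using the common
# FLOPs ≈ 6 * parameters * tokens rule of thumb.
# All hardware numbers below are illustrative assumptions.

def training_cost_usd(params, tokens,
                      gpu_flops=312e12,   # A100 BF16 peak, FLOP/s
                      utilization=0.35,   # assumed utilization for a large run
                      price_per_gpu_hour=2.0):
    total_flops = 6 * params * tokens
    gpu_seconds = total_flops / (gpu_flops * utilization)
    gpu_hours = gpu_seconds / 3600
    return gpu_hours * price_per_gpu_hour, gpu_hours

# Hypothetical example: a dense 1T-parameter model trained on 10T tokens.
cost, hours = training_cost_usd(params=1e12, tokens=10e12)
print(f"{hours:,.0f} GPU-hours, ~${cost:,.0f}")  # ~153M GPU-hours, ~$305M
```

At dense trillion-parameter scale, such estimates land in the hundreds of millions of dollars, which is exactly the pressure that pushes frontier architectures toward sparsity such as MoE.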
[See our discussion of training cost from before the GPT-4 announcement on the upcoming A
... (truncated, 11 KB total)
Resource ID: dfeb27439fd01d3e | Stable ID: NDAzOTRkOT