Sparse / MoE Transformers
sparse-moe (E500)
Path: /knowledge-base/intelligence-paradigms/sparse-moe/
Page Metadata
{
"id": "sparse-moe",
"numericId": null,
"path": "/knowledge-base/intelligence-paradigms/sparse-moe/",
"filePath": "knowledge-base/intelligence-paradigms/sparse-moe.mdx",
"title": "Sparse / MoE Transformers",
"quality": null,
"importance": 62,
"contentFormat": "article",
"tractability": null,
"neglectedness": null,
"uncertainty": null,
"causalLevel": null,
"lastUpdated": "2026-02-13",
"llmSummary": "MoE architectures activate only 3-18% of total parameters per token, achieving 2-7x compute savings while matching dense model performance (Mixtral 8x7B with 12.9B active matches Llama 2 70B). Safety implications remain uncertain - expert-level interpretability tools and routing behavior analysis are in early stages despite increasing adoption (Mixtral, DeepSeek-V3, rumored GPT-4).",
"structuredSummary": null,
"description": "Analysis of Mixture-of-Experts and sparse transformer architectures where only a subset of parameters activates per token. Covers Mixtral, Switch Transformer, and rumored GPT-4 architecture.",
"ratings": {
"novelty": 3.5,
"rigor": 5,
"actionability": 3,
"completeness": 6.5
},
"category": "intelligence-paradigms",
"subcategory": null,
"clusters": [
"ai-safety"
],
"metrics": {
"wordCount": 2652,
"tableCount": 19,
"diagramCount": 1,
"internalLinks": 0,
"externalLinks": 55,
"footnoteCount": 0,
"bulletRatio": 0.09,
"sectionCount": 31,
"hasOverview": true,
"structuralScore": 12
},
"suggestedQuality": 80,
"updateFrequency": 45,
"evergreen": true,
"wordCount": 2652,
"unconvertedLinks": [],
"unconvertedLinkCount": 0,
"convertedLinkCount": 0,
"backlinkCount": 0,
"redundancy": {
"maxSimilarity": 12,
"similarPages": [
{
"id": "dense-transformers",
"title": "Dense Transformers",
"path": "/knowledge-base/intelligence-paradigms/dense-transformers/",
"similarity": 12
},
{
"id": "ssm-mamba",
"title": "State-Space Models / Mamba",
"path": "/knowledge-base/intelligence-paradigms/ssm-mamba/",
"similarity": 11
},
{
"id": "large-language-models",
"title": "Large Language Models",
"path": "/knowledge-base/capabilities/large-language-models/",
"similarity": 10
},
{
"id": "intervention-effectiveness-matrix",
"title": "Intervention Effectiveness Matrix",
"path": "/knowledge-base/models/intervention-effectiveness-matrix/",
"similarity": 10
}
]
}
}
Entity Data
{
"id": "sparse-moe",
"type": "capability",
"title": "Sparse / MoE Transformers",
"description": "MoE architectures activate only 3-18% of total parameters per token, achieving 2-7x compute savings while matching dense model performance (Mixtral 8x7B with 12.9B active matches Llama 2 70B). Safety research is underdeveloped - no expert-level interpretability tools exist despite rapid adoption (Mi",
"tags": [],
"relatedEntries": [],
"sources": [],
"lastUpdated": "2026-02",
"customFields": []
}
Canonical Facts (0)
No facts for this entity
External Links
No external links
Backlinks (0)
No backlinks
Frontmatter
{
"title": "Sparse / MoE Transformers",
"description": "Analysis of Mixture-of-Experts and sparse transformer architectures where only a subset of parameters activates per token. Covers Mixtral, Switch Transformer, and rumored GPT-4 architecture.",
"sidebar": {
"label": "Sparse/MoE",
"order": 6
},
"lastEdited": "2026-02-13",
"importance": 62,
"update_frequency": 45,
"llmSummary": "MoE architectures activate only 3-18% of total parameters per token, achieving 2-7x compute savings while matching dense model performance (Mixtral 8x7B with 12.9B active matches Llama 2 70B). Safety implications remain uncertain - expert-level interpretability tools and routing behavior analysis are in early stages despite increasing adoption (Mixtral, DeepSeek-V3, rumored GPT-4).",
"ratings": {
"novelty": 3.5,
"rigor": 5,
"actionability": 3,
"completeness": 6.5
},
"clusters": [
"ai-safety"
],
"entityType": "intelligence-paradigm"
}
Raw MDX Source
---
title: "Sparse / MoE Transformers"
description: "Analysis of Mixture-of-Experts and sparse transformer architectures where only a subset of parameters activates per token. Covers Mixtral, Switch Transformer, and rumored GPT-4 architecture."
sidebar:
label: "Sparse/MoE"
order: 6
lastEdited: "2026-02-13"
importance: 62
update_frequency: 45
llmSummary: "MoE architectures activate only 3-18% of total parameters per token, achieving 2-7x compute savings while matching dense model performance (Mixtral 8x7B with 12.9B active matches Llama 2 70B). Safety implications remain uncertain - expert-level interpretability tools and routing behavior analysis are in early stages despite increasing adoption (Mixtral, DeepSeek-V3, rumored GPT-4)."
ratings:
novelty: 3.5
rigor: 5
actionability: 3
completeness: 6.5
clusters: ["ai-safety"]
entityType: intelligence-paradigm
---
import {Mermaid, EntityLink, DataExternalLinks} from '@components/wiki';
<DataExternalLinks pageId="sparse-moe" />
## Overview
Sparse and Mixture-of-Experts (MoE) architectures are transformer variants where **only a subset of parameters activates for each token**. Instead of every parameter contributing to every forward pass, a routing mechanism selects which "expert" sub-networks to use.
This provides **parameter efficiency gains**: a model can carry severalfold more total parameters (3.6x-18x in the systems listed below) at a similar per-token compute cost. For example, Mixtral 8x7B (46.7B total parameters, ~12.9B active) performs comparably to Llama 2 70B on standard benchmarks while requiring substantially fewer FLOPs per token.[^mixtral]
Unverified reports from 2023 suggest GPT-4 may use an MoE architecture with approximately 1.76T total parameters.[^gpt4-rumors] Multiple major labs released open-weight MoE models in 2023-2025, including Mistral AI, Databricks, DeepSeek, and Alibaba.
### Quick Assessment
| Dimension | Assessment | Evidence |
|-----------|------------|----------|
| **Adoption Rate** | 4+ major open-weight releases 2023-2025 | Mixtral (Dec 2023), DBRX (Mar 2024), DeepSeek-V2 (May 2024), Qwen3-MoE (Apr 2025); GPT-4 rumored MoE (unverified) |
| **Efficiency Gains** | 2-7x compute savings reported | [Switch Transformer: 7x pre-training speedup](https://arxiv.org/abs/2101.03961); [DBRX: 2x faster inference](https://www.databricks.com/blog/introducing-dbrx-new-state-art-open-llm) |
| **Parameter Scaling** | Up to 671B total parameters demonstrated | [DeepSeek-V3: 671B total/37B active](https://arxiv.org/html/2412.19437v1); [GLaM: 1.2T total](https://arxiv.org/abs/2112.06905) |
| **Quality Parity** | Matches dense models on benchmarks | [Mixtral 8x7B (46B total, 12.9B active) matches Llama 2 70B](https://arxiv.org/abs/2401.04088) across MMLU, HellaSwag, GSM8K |
| **Safety Research** | Limited: 2 dedicated analyses identified | [Routing behavior analysis (Expert Choice)](https://arxiv.org/abs/2202.09368); expert-level interpretability tooling still in early stages |
| **Open-Weight Availability** | 4 of 4 major models released with open weights | Mixtral and Qwen under Apache 2.0; DBRX and DeepSeek under their own open model licenses |
| **Hardware Support** | Specialized libraries available | [MegaBlocks library](https://arxiv.org/abs/2211.15841) enables dropless MoE; inference optimization libraries released |
## Key Links
| Source | Link |
|--------|------|
| Reference | [hyper.ai](https://hyper.ai/en/wiki/29102) |
| Wikipedia | [en.wikipedia.org](https://en.wikipedia.org/wiki/Mixture_of_experts) |
| arXiv | [arxiv.org](https://arxiv.org/abs/2101.03961) |
## Architecture
<Mermaid chart={`
flowchart TB
subgraph INPUT["Input Processing"]
token["Input Token Embedding"]
end
subgraph ROUTER["Gating/Router Network"]
gate["Softmax Router"]
scores["Expert Scores"]
topk["Top-k Selection"]
gate --> scores
scores --> topk
end
subgraph EXPERTS["Expert Feed-Forward Networks"]
exp1["Expert 1 (FFN)"]
exp2["Expert 2 (FFN)"]
exp3["Expert 3"]
exp4["Expert 4"]
expN["Expert N"]
end
subgraph OUTPUT["Output Combination"]
weights["Router Weights"]
combine["Weighted Sum"]
residual["+ Residual Connection"]
weights --> combine
combine --> residual
end
token --> gate
topk --> |"Selected (top-k)"| exp1
topk --> |"Selected (top-k)"| exp2
topk -.-> |"Not selected"| exp3
topk -.-> |"Not selected"| exp4
topk -.-> |"Not selected"| expN
exp1 --> combine
exp2 --> combine
token --> residual
residual --> final["Layer Output"]
style exp3 fill:#f5f5f5,stroke:#ccc
style exp4 fill:#f5f5f5,stroke:#ccc
style expN fill:#f5f5f5,stroke:#ccc
style exp1 fill:#d4edda,stroke:#28a745
style exp2 fill:#d4edda,stroke:#28a745
`} />
### Key Components
| Component | Function | Trainable |
|-----------|----------|-----------|
| **Router** | Decides which experts to use | Yes |
| **Experts** | Specialized FFN sub-networks | Yes |
| **Load balancer** | Ensures experts are used evenly | Auxiliary loss |
| **Combiner** | Merges expert outputs | Weighted by router |
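To make the router, expert, and combiner roles above concrete, here is a minimal PyTorch sketch of a top-k MoE layer. It illustrates the general pattern rather than any published implementation: the class and variable names are ours, and load balancing, capacity limits, and expert parallelism are all omitted.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKMoELayer(nn.Module):
    """Illustrative top-k Mixture-of-Experts layer (names and structure are simplified)."""

    def __init__(self, d_model: int, d_ff: int, num_experts: int = 8, top_k: int = 2):
        super().__init__()
        self.top_k = top_k
        # Router: a single linear map producing one score per expert.
        self.router = nn.Linear(d_model, num_experts, bias=False)
        # Experts: independent feed-forward sub-networks.
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(num_experts)
        ])

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        tokens = x.reshape(-1, x.shape[-1])               # (num_tokens, d_model)
        probs = F.softmax(self.router(tokens), dim=-1)    # (num_tokens, num_experts)
        topk_probs, topk_idx = probs.topk(self.top_k, dim=-1)
        # Renormalize so the selected experts' weights sum to 1 for each token.
        topk_probs = topk_probs / topk_probs.sum(dim=-1, keepdim=True)

        out = torch.zeros_like(tokens)
        for e, expert in enumerate(self.experts):
            # Tokens that routed one of their top-k slots to expert e.
            token_ids, slot = (topk_idx == e).nonzero(as_tuple=True)
            if token_ids.numel() == 0:
                continue
            weight = topk_probs[token_ids, slot].unsqueeze(-1)
            out.index_add_(0, token_ids, weight * expert(tokens[token_ids]))
        return out.reshape_as(x)

# Example: y = TopKMoELayer(d_model=512, d_ff=2048)(torch.randn(2, 16, 512))
```

In a full transformer block this module takes the place of the dense FFN; the residual connection and normalization live in the surrounding block, as in the diagram above.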
### Parameter Efficiency
| Model | Total Params | Active Params | Efficiency Ratio | Developer |
|-------|--------------|---------------|------------------|-----------|
| [Mixtral 8x7B](https://arxiv.org/abs/2401.04088) | 46.7B | 12.9B | 3.6x | Mistral AI |
| [Mixtral 8x22B](https://mistral.ai/news/mixtral-8x22b/) | 141B | 39B | 3.6x | Mistral AI |
| [DeepSeek-V3](https://arxiv.org/html/2412.19437v1) | 671B | 37B | 18x | DeepSeek |
| [DBRX](https://www.databricks.com/blog/introducing-dbrx-new-state-art-open-llm) | 132B | 36B | 3.7x | Databricks |
| [GLaM](https://arxiv.org/abs/2112.06905) | 1.2T | 96.6B | 12x | Google |
| [Switch-C](https://arxiv.org/abs/2101.03961) | 1.6T | 100B | 16x | Google |
| Qwen3-MoE | 235B | 22B | 10.7x | Alibaba |
| GPT-4 (unverified) | ≈1.76T (rumored) | ≈220B | ≈8x | OpenAI |
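The efficiency ratio in the table is simply total parameters divided by active parameters; a rough per-token inference cost can be estimated with the common ~2 FLOPs-per-active-parameter rule of thumb. A quick sketch using figures from the table (the heuristic is an approximation, not a measured number):

```python
# Figures (in billions of parameters) taken from the table above.
models = {
    "Mixtral 8x7B": (46.7, 12.9),
    "DeepSeek-V3": (671.0, 37.0),
    "GLaM": (1200.0, 96.6),
}

for name, (total_b, active_b) in models.items():
    ratio = total_b / active_b
    flops_per_token = 2 * active_b * 1e9   # ~2 FLOPs per active parameter (rule of thumb)
    print(f"{name}: {ratio:.1f}x sparsity, ~{flops_per_token:.1e} FLOPs per token")
```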
## Key Properties
| Property | Rating | Assessment |
|----------|--------|------------|
| **White-box Access** | LOW | Similar opacity to dense transformers, with additional routing complexity |
| **Trainability** | HIGH | Standard training with load balancing losses |
| **Predictability** | LOW | Routing adds layer of complexity to activation patterns |
| **Modularity** | MEDIUM | Expert boundaries exist but interact through routing |
| **Formal Verifiability** | LOW | Combinatorial explosion of expert combinations |
## Safety Implications
### Potential Safety Research Directions
| Research Area | Description |
|---------------|-------------|
| **Expert analysis** | Study what individual experts learn through activation patterns |
| **Efficiency enables testing** | Lower cost per capability level may enable more comprehensive safety evaluation |
| **Modular structure** | Expert boundaries may enable ablation studies or targeted modifications |
| **Specialization patterns** | Routing patterns may reveal model structure |
### Open Safety Questions
| Question | Current Status | Source |
|----------|----------------|--------|
| **Routing unpredictability** | Router selection patterns not fully characterized | Limited published analysis |
| **Combinatorial complexity** | Testing all expert combinations infeasible for 8+ expert models | No systematic methodology exists |
| **Emergent routing** | Unclear if routing patterns encode unexpected behaviors | [Early analysis in Expert Choice paper](https://arxiv.org/abs/2202.09368) |
| **Specialized capabilities** | Unknown if specific experts develop concerning capabilities in isolation | No dedicated research identified |
### Interpretability Comparison
| Aspect | Dense | MoE |
|--------|-------|-----|
| Overall opacity | HIGH | HIGH |
| Modular structure | NONE | SOME (expert boundaries) |
| Analysis tools | SOME | FEWER (as of 2025) |
| Activation patterns | Complex | Complex + routing layer |
## Current MoE Models Comparison
| Model | Developer | Release | Total Params | Active Params | Experts | Top-k | Context | Training Data |
|-------|-----------|---------|--------------|---------------|---------|-------|---------|---------------|
| **[Mixtral 8x7B](https://arxiv.org/abs/2401.04088)** | Mistral AI | Dec 2023 | 46.7B | 12.9B | 8 | 2 | 32K | Undisclosed |
| **[Mixtral 8x22B](https://mistral.ai/news/mixtral-8x22b/)** | Mistral AI | Apr 2024 | 141B | 39B | 8 | 2 | 64K | Undisclosed |
| **[DeepSeek-V2](https://arxiv.org/abs/2405.04434)** | DeepSeek | May 2024 | 236B | 21B | 160 | 6 | 128K | Undisclosed |
| **[DeepSeek-V3](https://arxiv.org/html/2412.19437v1)** | DeepSeek | Dec 2024 | 671B | 37B | 256 | 8 | 128K | 14.8T tokens |
| **[DBRX](https://www.databricks.com/blog/introducing-dbrx-new-state-art-open-llm)** | Databricks | Mar 2024 | 132B | 36B | 16 | 4 | 32K | 12T tokens |
| **Qwen3-MoE** | Alibaba | Apr 2025 | 235B | 22B | 128 | 8 | 32K | 36T tokens |
| **Qwen3-Next** | Alibaba | Sep 2025 | 80B | 3B | 512 | Variable | 128K | Undisclosed |
| **[GLaM](https://arxiv.org/abs/2112.06905)** | Google | Dec 2021 | 1.2T | 96.6B | 64 | 2 | Undisclosed | 1.6T tokens |
| **[Switch-C](https://arxiv.org/abs/2101.03961)** | Google | Jan 2021 | 1.6T | 100B | 2048 | 1 | Undisclosed | C4 dataset |
| **GPT-4** | OpenAI | Mar 2023 | ≈1.76T (unverified) | ≈220B | ≈16 | 2 | 8K/32K (launch) | Undisclosed |
**Performance benchmarks** (Mixtral 8x7B vs. dense models):[^mixtral]
- MMLU: 70.6% (vs. Llama 2 70B: 68.9%)
- HellaSwag: 86.7% (vs. Llama 2 70B: 87.3%)
- GSM8K: 74.4% (vs. Llama 2 70B: 56.8%)
- HumanEval: 40.2% (vs. Llama 2 70B: 29.9%)
## Technical Details
### Router Mechanisms
| Mechanism | Description | Trade-offs | Used By |
|-----------|-------------|------------|---------|
| **Top-k gating** | Select k highest-scoring experts per token | Simple implementation; may cause load imbalance; requires auxiliary loss | [Mixtral (k=2)](https://arxiv.org/abs/2401.04088), [DBRX (k=4)](https://www.databricks.com/blog/introducing-dbrx-new-state-art-open-llm) |
| **Expert choice** | Each expert selects its top-k tokens | Guaranteed load balance; variable experts per token | [Google research models](https://arxiv.org/abs/2202.09368) |
| **Soft routing** | Weighted combination of all experts | Fully differentiable; less sparse; higher compute | Early MoE research |
| **Auxiliary-loss-free** | Bias terms adjusted based on load monitoring | No auxiliary loss needed; adaptive balancing | [DeepSeek-V3](https://arxiv.org/html/2412.19437v1) |
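The contrast between token-choice (top-k) and expert-choice routing is easiest to see in code. Below is a hedged sketch of the expert-choice selection step in the spirit of Zhou et al. (2022); the function name and shapes are our own simplifications, and capacity handling in real systems is more involved.

```python
import torch
import torch.nn.functional as F

def expert_choice_routing(token_states: torch.Tensor, router_weight: torch.Tensor, capacity: int):
    """Each expert picks its top-`capacity` tokens, instead of tokens picking experts.

    token_states:  (num_tokens, d_model)
    router_weight: (d_model, num_experts)
    Returns, per expert, the chosen token indices and their gating weights.
    """
    scores = F.softmax(token_states @ router_weight, dim=-1)  # (num_tokens, num_experts)
    # Transpose so each expert ranks every token, then keep its top `capacity` tokens.
    gating, token_idx = scores.t().topk(capacity, dim=-1)     # both (num_experts, capacity)
    return token_idx, gating
```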
### Load Balancing
MoE training requires mechanisms to prevent expert collapse (where the model converges to using only a subset of experts):
**Auxiliary loss approach** (most common):
- Total Loss = Task Loss + α × Load Balance Loss
- Typical α values: 0.01-0.1
- Balance loss penalizes uneven expert utilization
**Without load balancing**: Empirical studies show models can collapse to using 10-20% of experts, reducing efficiency benefits.[^switch]
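As a concrete illustration of the formula above, here is a sketch of a Switch-Transformer-style balance loss: per expert, the fraction of dispatched tokens multiplied by the mean router probability, summed and scaled by the expert count. Variable names are ours and the exact formulation varies across papers.

```python
import torch
import torch.nn.functional as F

def load_balance_loss(router_logits: torch.Tensor, dispatched_expert: torch.Tensor) -> torch.Tensor:
    """Switch-style auxiliary balance loss (illustrative).

    router_logits:     (num_tokens, num_experts) raw router scores
    dispatched_expert: (num_tokens,) index of the expert each token was sent to
    """
    num_experts = router_logits.shape[-1]
    probs = F.softmax(router_logits, dim=-1)
    # f_i: fraction of tokens dispatched to expert i.
    dispatch_frac = F.one_hot(dispatched_expert, num_experts).float().mean(dim=0)
    # P_i: mean router probability mass assigned to expert i.
    mean_prob = probs.mean(dim=0)
    return num_experts * torch.sum(dispatch_frac * mean_prob)

# total_loss = task_loss + alpha * load_balance_loss(...)   # alpha typically 0.01-0.1
```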
**DeepSeek innovation**: Auxiliary-loss-free balancing adds learnable bias terms to routing scores, adjusted at each training step based on observed load. DeepSeek-V3 reports this avoids training instability associated with auxiliary losses.[^deepseek]
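A minimal sketch of the idea behind auxiliary-loss-free balancing, assuming a fixed step size and a sign-based update (the DeepSeek-V3 report should be consulted for the actual rule; the function and names here are illustrative only):

```python
import torch

def update_routing_bias(bias: torch.Tensor, tokens_per_expert: torch.Tensor, step: float = 1e-3) -> torch.Tensor:
    """Nudge a per-expert routing bias toward balanced load (sketch of the idea only).

    The bias is added to routing scores when selecting experts, but not when
    weighting their outputs, so it steers selection without an auxiliary loss term.
    """
    load = tokens_per_expert.float()
    overloaded = load > load.mean()
    # Decrease the bias of overloaded experts, increase it for underloaded ones.
    direction = torch.where(overloaded, torch.ones_like(bias), -torch.ones_like(bias))
    return bias - step * direction
```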
### Expert Capacity and Overflow
Each expert has a fixed capacity (the maximum number of tokens it can process per batch). When all of a token's selected experts are at capacity:
- **Token dropping**: Overflow tokens skip MoE layer, passed via residual connection (GShard approach)[^gshard]
- **Dropless MoE**: Dynamic allocation prevents dropping; used by DBRX via MegaBlocks library[^dbrx]
**Capacity factor**: Typical values of 1.0-2.0x mean experts can handle 100-200% of perfectly balanced load.
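Expert capacity is typically derived from the capacity factor roughly as follows (conventions differ slightly across GShard, Switch, and later systems; the numbers in the example are assumed):

```python
def expert_capacity(tokens_per_batch: int, num_experts: int, top_k: int, capacity_factor: float) -> int:
    """Token slots per expert; tokens beyond this are dropped (GShard) or rerouted (dropless MoE)."""
    return int(capacity_factor * top_k * tokens_per_batch / num_experts)

# Assumed example: 8192 tokens per batch, 8 experts, top-2 routing, capacity factor 1.25
print(expert_capacity(8192, 8, 2, 1.25))   # -> 2560 slots per expert
```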
## Limitations and Trade-offs
| Limitation | Description | Impact |
|------------|-------------|--------|
| **Memory requirements** | All experts must be loaded into memory despite sparse activation | Limits deployment to high-memory systems; increases serving costs |
| **Serving complexity** | Routing and expert selection add latency and implementation complexity | More difficult to optimize than dense models |
| **Training instability** | Load balancing and routing can cause training convergence issues | Requires careful hyperparameter tuning; auxiliary losses add complexity |
| **Fine-tuning challenges** | Adapting pre-trained MoE models requires expert-aware approaches | Standard fine-tuning may collapse to using subset of experts |
| **Memory bandwidth** | Moving expert weights can bottleneck despite compute savings | GPU memory bandwidth becomes limiting factor |
### When Dense Models May Be Preferable
| Scenario | Reason |
|----------|--------|
| **Latency-critical inference** | Routing overhead and expert loading add latency vs. dense models |
| **Memory-constrained deployment** | Must load all experts despite activating only subset |
| **Small model scale** | MoE overhead dominates at smaller parameter counts (less than 10B) |
| **Continuous batching** | Variable expert selection complicates batching optimization |
| **Edge deployment** | Memory and complexity constraints favor dense architectures |
## Research Landscape
### Foundational Papers
| Paper | Year | Key Contribution | Citation |
|-------|------|------------------|----------|
| [Outrageously Large Neural Networks](https://arxiv.org/abs/1701.06538) | 2017 | Introduced sparsely-gated MoE with 137B parameters; demonstrated capacity scaling with minimal efficiency loss | Shazeer et al. (Google), ICLR 2017 |
| [GShard](https://arxiv.org/abs/2006.16668) | 2020 | Scaled MoE to 600B parameters; auto-sharding for distributed training | Lepikhin et al. (Google) |
| [Switch Transformers](https://arxiv.org/abs/2101.03961) | 2021 | Simplified to single-expert routing; achieved 7x pre-training speedup; scaled to 1.6T parameters | Fedus, Zoph, Shazeer (Google), JMLR 2022 |
| [GLaM](https://arxiv.org/abs/2112.06905) | 2021 | 1.2T parameter model using 1/3 GPT-3 training energy; demonstrated energy efficiency at scale | Du et al. (Google), ICML 2022 |
| [Expert Choice Routing](https://arxiv.org/abs/2202.09368) | 2022 | Experts select tokens instead of vice versa; reported 2x faster convergence; intrinsic load balancing | Zhou et al. (Google), NeurIPS 2022 |
| [Mixtral of Experts](https://arxiv.org/abs/2401.04088) | 2024 | Open-weight MoE matching Llama 2 70B performance with 3.6x fewer active parameters | Jiang et al. (Mistral AI) |
| [DeepSeek-V3 Technical Report](https://arxiv.org/html/2412.19437v1) | 2024 | 671B/37B model with auxiliary-loss-free load balancing; 256 fine-grained experts | DeepSeek |
### Key Labs and Contributions
| Lab | Models | Focus | Notable Contribution |
|-----|--------|-------|---------------------|
| **Google** | Switch, GLaM, GShard | Architecture research | First trillion-parameter MoE; routing mechanism innovations |
| **Mistral AI** | Mixtral 8x7B, 8x22B | Open-weight deployment | Matched 70B dense model with 12.9B active params |
| **DeepSeek** | V2, V3 | Ultra-sparse MoE | 671B total with 37B active (5.5% activation rate) |
| **Databricks** | DBRX | Enterprise deployment | Fine-grained MoE with dropless token routing |
| **Alibaba** | Qwen3-MoE, Qwen3-Next | Parameter efficiency | 80B total with 3B active (3.7% activation rate) |
| **OpenAI** | GPT-4 (unverified) | Production deployment | Unverified reports suggest ≈1.76T params MoE architecture |
## Adoption Trends
### Factors Driving MoE Adoption
| Factor | Evidence | Impact |
|--------|----------|--------|
| **Training efficiency** | [GLaM achieved GPT-3 quality with 1/3 training energy](https://arxiv.org/abs/2112.06905); [DBRX training is 2x more FLOP-efficient than dense](https://www.databricks.com/blog/introducing-dbrx-new-state-art-open-llm) | Reduces training costs for equivalent capability |
| **Inference efficiency** | [DBRX: 2-3x higher throughput than 132B dense model](https://www.databricks.com/blog/introducing-dbrx-new-state-art-open-llm); [Mixtral serves at Llama 2 13B cost](https://arxiv.org/abs/2401.04088) | Enables larger-scale deployment |
| **Quality parity** | [Mixtral 8x7B matches/exceeds Llama 2 70B](https://arxiv.org/abs/2401.04088) on MMLU (+1.7%), GSM8K (+17.6%), HumanEval (+10.3%) | Demonstrates no capability penalty |
| **Parameter scaling** | Dense models face diminishing returns; MoE enables scaling to 671B+ parameters | Extends parameter scaling benefits |
| **Open ecosystem** | Mixtral, DBRX, DeepSeek, and Qwen all released with open weights under permissive licenses | Enables research and commercial fine-tuning |
### Timeline
| Period | Development | Context |
|--------|-------------|---------|
| 2017 | [Shazeer introduces modern MoE](https://arxiv.org/abs/1701.06538) | Demonstrates 137B parameter model viability |
| 2020-2021 | [Google scales to 1T+ parameters](https://arxiv.org/abs/2101.03961) | Switch Transformer, GLaM papers |
| 2023 | GPT-4 launches (rumored MoE) | Unverified reports suggest MoE architecture |
| 2023-2024 | Mixtral (Dec 2023), DBRX, DeepSeek-V2 released | Open-weight MoE ecosystem emerges |
| 2024-2025 | DeepSeek-V3 (Dec 2024), Qwen3-Next released | Ultra-sparse MoE demonstrations (3-6% activation) |
### Remaining Advantages of Dense Models
Some research suggests dense models retain advantages in specific scenarios:
| Advantage | Context |
|-----------|---------|
| **Inference latency** | Single forward pass without routing overhead |
| **Memory efficiency** | Every loaded parameter is used on every forward pass; no idle expert weights held in memory |
| **Simpler deployment** | Fewer moving parts in serving infrastructure |
| **Fine-tuning stability** | Standard approaches work without expert-specific considerations |
| **Small-scale models** | MoE overhead dominates at less than 10B parameters |
## Open Research Questions
### Key Uncertainties
| Question | Current Evidence | Source |
|----------|------------------|--------|
| Do MoE models exhibit different emergent behaviors than dense models? | Limited comparative studies | Routing complexity not fully characterized |
| Does routing create novel alignment challenges? | Unknown; router learns token-expert mappings | No systematic safety evaluation identified |
| Can expert specialization be meaningfully interpreted? | Some evidence experts specialize by domain/language | [Preliminary analysis in Expert Choice paper](https://arxiv.org/abs/2202.09368) |
| Will ultra-sparse routing (roughly 3-6% activation) remain stable at larger scale? | DeepSeek-V3, Qwen3-Next demonstrate stability at 671B/80B total | Long-term stability not yet evaluated |
| Does MoE training require modified safety evaluation approaches? | Not yet studied systematically | No published safety-specific methodology |
### Safety Research Directions
**Existing evaluation approaches that transfer:**
- Behavioral evaluations and capability assessments work similarly
- RLHF and other alignment training approaches remain applicable
- Red teaming and adversarial testing methodologies transfer
**Research gaps specific to MoE:**
- Expert-level interpretability: analysis tools for individual expert behavior
- Routing pattern analysis: characterizing when/why routing changes
- Combinatorial testing: approaches for covering expert combinations
- Expert ablation: feasibility of removing or modifying individual experts
**Exploratory research directions:**
| Approach | Description | Feasibility Assessment |
|----------|-------------|----------------------|
| Expert specialization analysis | Characterize what individual experts learn through activation patterns | Medium - requires large-scale activation logging |
| Selective expert ablation | Test if removing specific experts eliminates concerning behaviors | Unknown - may destabilize model |
| Routing intervention | Control which experts activate for safety-critical inputs | Possible - requires understanding routing mechanism |
| Expert-level alignment | Train specific experts for safety-related capabilities | Speculative - no published attempts |
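None of these directions has an established methodology. As one mechanical illustration of what "routing intervention" or "selective expert ablation" could look like, a mask on the router logits can make chosen experts unselectable; this is a speculative sketch, not a published safety technique, and its effect on model behavior is exactly the open question.

```python
import torch

def mask_expert_logits(router_logits: torch.Tensor, blocked_experts: list[int]) -> torch.Tensor:
    """Return router logits with the listed experts made unselectable (speculative illustration)."""
    masked = router_logits.clone()
    masked[:, blocked_experts] = float("-inf")   # these experts can never win top-k selection
    return masked
```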
## Sources and References
[^mixtral]: Jiang, A. et al. (2024). [Mixtral of Experts](https://arxiv.org/abs/2401.04088). Mistral AI. Open-weight MoE matching Llama 2 70B performance.
[^gpt4-rumors]: The Decoder (2023). [GPT-4 architecture, datasets, costs and more leaked](https://the-decoder.com/gpt-4-architecture-datasets-costs-and-more-leaked/). Summary of SemiAnalysis report on GPT-4's rumored MoE structure. Note: Unverified claims based on secondary sources.
[^switch]: Fedus, W., Zoph, B., Shazeer, N. (2021). [Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity](https://arxiv.org/abs/2101.03961). JMLR 2022. Simplified MoE with single-expert routing, achieving 7x speedup.
[^deepseek]: DeepSeek (2024). [DeepSeek-V3 Technical Report](https://arxiv.org/html/2412.19437v1). 671B parameter model with auxiliary-loss-free load balancing.
[^gshard]: Lepikhin, D. et al. (2020). [GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding](https://arxiv.org/abs/2006.16668). First 600B parameter MoE for machine translation.
[^dbrx]: Databricks (2024). [Introducing DBRX: A New State-of-the-Art Open LLM](https://www.databricks.com/blog/introducing-dbrx-new-state-art-open-llm). Fine-grained MoE with 16 experts and dropless routing.
### Additional Key Papers
- Shazeer, N. et al. (2017). [Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer](https://arxiv.org/abs/1701.06538). ICLR 2017. Revived MoE for deep learning, demonstrating capacity scaling.
- Du, N. et al. (2022). [GLaM: Efficient Scaling of Language Models with Mixture-of-Experts](https://arxiv.org/abs/2112.06905). ICML 2022. 1.2T parameter model trained with 1/3 of GPT-3's energy.
- Zhou, Y. et al. (2022). [Mixture-of-Experts with Expert Choice Routing](https://arxiv.org/abs/2202.09368). NeurIPS 2022. Novel routing where experts select tokens.
- Google Research (2022). [Mixture-of-Experts with Expert Choice Routing Blog Post](https://research.google/blog/mixture-of-experts-with-expert-choice-routing/). Accessible explanation of expert choice mechanism.
### Architecture Leaks and Analysis
- The Decoder (2023). [GPT-4 architecture, datasets, costs and more leaked](https://the-decoder.com/gpt-4-architecture-datasets-costs-and-more-leaked/). Summary of SemiAnalysis report on GPT-4's rumored MoE structure.
- KDnuggets (2023). [GPT-4: 8 Models in One; The Secret is Out](https://www.kdnuggets.com/2023/08/gpt4-8-models-one-secret.html). Analysis of GPT-4 MoE rumors.