Sparse / MoE Transformers
sparse-moe (E500)
Path: /knowledge-base/intelligence-paradigms/sparse-moe/
Page Metadata
{
"id": "sparse-moe",
"numericId": null,
"path": "/knowledge-base/intelligence-paradigms/sparse-moe/",
"filePath": "knowledge-base/intelligence-paradigms/sparse-moe.mdx",
"title": "Sparse / MoE Transformers",
"quality": null,
"importance": 62,
"contentFormat": "article",
"tractability": null,
"neglectedness": null,
"uncertainty": null,
"causalLevel": null,
"lastUpdated": "2026-02-13",
"llmSummary": "MoE architectures activate only 3-18% of total parameters per token, achieving 2-7x compute savings while matching dense model performance (Mixtral 8x7B with 12.9B active matches Llama 2 70B). Safety implications remain uncertain - expert-level interpretability tools and routing behavior analysis are in early stages despite increasing adoption (Mixtral, DeepSeek-V3, rumored GPT-4).",
"structuredSummary": null,
"description": "Analysis of Mixture-of-Experts and sparse transformer architectures where only a subset of parameters activates per token. Covers Mixtral, Switch Transformer, and rumored GPT-4 architecture.",
"ratings": {
"novelty": 3.5,
"rigor": 5,
"actionability": 3,
"completeness": 6.5
},
"category": "intelligence-paradigms",
"subcategory": null,
"clusters": [
"ai-safety"
],
"metrics": {
"wordCount": 2652,
"tableCount": 19,
"diagramCount": 1,
"internalLinks": 0,
"externalLinks": 55,
"footnoteCount": 0,
"bulletRatio": 0.09,
"sectionCount": 31,
"hasOverview": true,
"structuralScore": 12
},
"suggestedQuality": 80,
"updateFrequency": 45,
"evergreen": true,
"wordCount": 2652,
"unconvertedLinks": [],
"unconvertedLinkCount": 0,
"convertedLinkCount": 0,
"backlinkCount": 0,
"redundancy": {
"maxSimilarity": 12,
"similarPages": [
{
"id": "dense-transformers",
"title": "Dense Transformers",
"path": "/knowledge-base/intelligence-paradigms/dense-transformers/",
"similarity": 12
},
{
"id": "ssm-mamba",
"title": "State-Space Models / Mamba",
"path": "/knowledge-base/intelligence-paradigms/ssm-mamba/",
"similarity": 11
},
{
"id": "large-language-models",
"title": "Large Language Models",
"path": "/knowledge-base/capabilities/large-language-models/",
"similarity": 10
},
{
"id": "intervention-effectiveness-matrix",
"title": "Intervention Effectiveness Matrix",
"path": "/knowledge-base/models/intervention-effectiveness-matrix/",
"similarity": 10
}
]
}
}
Entity Data
{
"id": "sparse-moe",
"type": "capability",
"title": "Sparse / MoE Transformers",
"description": "MoE architectures activate only 3-18% of total parameters per token, achieving 2-7x compute savings while matching dense model performance (Mixtral 8x7B with 12.9B active matches Llama 2 70B). Safety research is underdeveloped - no expert-level interpretability tools exist despite rapid adoption (Mi",
"tags": [],
"relatedEntries": [],
"sources": [],
"lastUpdated": "2026-02",
"customFields": []
}
Canonical Facts (0)
No facts for this entity
External Links
No external links
Backlinks (0)
No backlinks
Frontmatter
{
"title": "Sparse / MoE Transformers",
"description": "Analysis of Mixture-of-Experts and sparse transformer architectures where only a subset of parameters activates per token. Covers Mixtral, Switch Transformer, and rumored GPT-4 architecture.",
"sidebar": {
"label": "Sparse/MoE",
"order": 6
},
"lastEdited": "2026-02-13",
"importance": 62,
"update_frequency": 45,
"llmSummary": "MoE architectures activate only 3-18% of total parameters per token, achieving 2-7x compute savings while matching dense model performance (Mixtral 8x7B with 12.9B active matches Llama 2 70B). Safety implications remain uncertain - expert-level interpretability tools and routing behavior analysis are in early stages despite increasing adoption (Mixtral, DeepSeek-V3, rumored GPT-4).",
"ratings": {
"novelty": 3.5,
"rigor": 5,
"actionability": 3,
"completeness": 6.5
},
"clusters": [
"ai-safety"
],
"entityType": "intelligence-paradigm"
}
Raw MDX Source
---
title: "Sparse / MoE Transformers"
description: "Analysis of Mixture-of-Experts and sparse transformer architectures where only a subset of parameters activates per token. Covers Mixtral, Switch Transformer, and rumored GPT-4 architecture."
sidebar:
label: "Sparse/MoE"
order: 6
lastEdited: "2026-02-13"
importance: 62
update_frequency: 45
llmSummary: "MoE architectures activate only 3-18% of total parameters per token, achieving 2-7x compute savings while matching dense model performance (Mixtral 8x7B with 12.9B active matches Llama 2 70B). Safety implications remain uncertain - expert-level interpretability tools and routing behavior analysis are in early stages despite increasing adoption (Mixtral, DeepSeek-V3, rumored GPT-4)."
ratings:
novelty: 3.5
rigor: 5
actionability: 3
completeness: 6.5
clusters: ["ai-safety"]
entityType: intelligence-paradigm
---
import {Mermaid, EntityLink, DataExternalLinks} from '@components/wiki';
<DataExternalLinks pageId="sparse-moe" />
## Overview
Sparse and Mixture-of-Experts (MoE) architectures are transformer variants where **only a subset of parameters activates for each token**. Instead of every parameter contributing to every forward pass, a routing mechanism selects which "expert" sub-networks to use.
This provides **parameter efficiency gains**: a model can carry severalfold more total parameters (3.6x-18x in the systems listed below) at a similar per-token compute cost. For example, Mixtral 8x7B (46.7B total parameters, ~12.9B active) performs comparably to Llama 2 70B on standard benchmarks while requiring substantially fewer FLOPs per token.[^mixtral]
Unverified reports from 2023 suggest GPT-4 may use an MoE architecture with approximately 1.76T total parameters.[^gpt4-rumors] Multiple major labs released open-weight MoE models in 2023-2025, including Mistral AI, Databricks, DeepSeek, and Alibaba.
### Quick Assessment
| Dimension | Assessment | Evidence |
|-----------|------------|----------|
| **Adoption Rate** | 4+ major open-weight releases 2023-2025 | Mixtral (Dec 2023), DBRX (Mar 2024), DeepSeek-V2 (May 2024), Qwen3-MoE (Apr 2025); GPT-4 rumored MoE (unverified) |
| **Efficiency Gains** | 2-7x compute savings reported | [Switch Transformer: 7x pre-training speedup](https://arxiv.org/abs/2101.03961); [DBRX: 2x faster inference](https://www.databricks.com/blog/introducing-dbrx-new-state-art-open-llm) |
| **Parameter Scaling** | Up to 671B total parameters demonstrated | [DeepSeek-V3: 671B total/37B active](https://arxiv.org/html/2412.19437v1); [GLaM: 1.2T total](https://arxiv.org/abs/2112.06905) |
| **Quality Parity** | Matches dense models on benchmarks | [Mixtral 8x7B (46B total, 12.9B active) matches Llama 2 70B](https://arxiv.org/abs/2401.04088) across MMLU, HellaSwag, GSM8K |
| **Safety Research** | Limited: 2 dedicated analyses identified | [Routing behavior analysis (Expert Choice)](https://arxiv.org/abs/2202.09368); expert-level interpretability tooling still in early stages |
| **Open-Weight Availability** | 4 of 4 major models released with open weights | Mixtral and Qwen under Apache 2.0; DBRX and DeepSeek under their own open model licenses |
| **Hardware Support** | Specialized libraries available | [MegaBlocks library](https://arxiv.org/abs/2211.15841) enables dropless MoE; inference optimization libraries released |
## Key Links
| Source | Link |
|--------|------|
| Reference | [hyper.ai](https://hyper.ai/en/wiki/29102) |
| Wikipedia | [en.wikipedia.org](https://en.wikipedia.org/wiki/Mixture_of_experts) |
| arXiv | [arxiv.org](https://arxiv.org/abs/2101.03961) |
## Architecture
<Mermaid chart={`
flowchart TB
subgraph INPUT["Input Processing"]
token["Input Token Embedding"]
end
subgraph ROUTER["Gating/Router Network"]
gate["Softmax Router"]
scores["Expert Scores"]
topk["Top-k Selection"]
gate --> scores
scores --> topk
end
subgraph EXPERTS["Expert Feed-Forward Networks"]
exp1["Expert 1 (FFN)"]
exp2["Expert 2 (FFN)"]
exp3["Expert 3"]
exp4["Expert 4"]
expN["Expert N"]
end
subgraph OUTPUT["Output Combination"]
weights["Router Weights"]
combine["Weighted Sum"]
residual["+ Residual Connection"]
weights --> combine
combine --> residual
end
token --> gate
topk --> |"Selected (top-k)"| exp1
topk --> |"Selected (top-k)"| exp2
topk -.-> |"Not selected"| exp3
topk -.-> |"Not selected"| exp4
topk -.-> |"Not selected"| expN
exp1 --> combine
exp2 --> combine
token --> residual
residual --> final["Layer Output"]
style exp3 fill:#f5f5f5,stroke:#ccc
style exp4 fill:#f5f5f5,stroke:#ccc
style expN fill:#f5f5f5,stroke:#ccc
style exp1 fill:#d4edda,stroke:#28a745
style exp2 fill:#d4edda,stroke:#28a745
`} />
### Key Components
| Component | Function | Trainable |
|-----------|----------|-----------|
| **Router** | Decides which experts to use | Yes |
| **Experts** | Specialized FFN sub-networks | Yes |
| **Load balancer** | Ensures experts are used evenly | Auxiliary loss |
| **Combiner** | Merges expert outputs | Weighted by router |
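To make the router, expert, and combiner roles above concrete, here is a minimal PyTorch sketch of a top-k MoE layer. It illustrates the general pattern rather than any published implementation: the class and variable names are ours, and load balancing, capacity limits, and expert parallelism are all omitted.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKMoELayer(nn.Module):
    """Illustrative top-k Mixture-of-Experts layer (names and structure are simplified)."""

    def __init__(self, d_model: int, d_ff: int, num_experts: int = 8, top_k: int = 2):
        super().__init__()
        self.top_k = top_k
        # Router: a single linear map producing one score per expert.
        self.router = nn.Linear(d_model, num_experts, bias=False)
        # Experts: independent feed-forward sub-networks.
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(num_experts)
        ])

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        tokens = x.reshape(-1, x.shape[-1])               # (num_tokens, d_model)
        probs = F.softmax(self.router(tokens), dim=-1)    # (num_tokens, num_experts)
        topk_probs, topk_idx = probs.topk(self.top_k, dim=-1)
        # Renormalize so the selected experts' weights sum to 1 for each token.
        topk_probs = topk_probs / topk_probs.sum(dim=-1, keepdim=True)

        out = torch.zeros_like(tokens)
        for e, expert in enumerate(self.experts):
            # Tokens that routed one of their top-k slots to expert e.
            token_ids, slot = (topk_idx == e).nonzero(as_tuple=True)
            if token_ids.numel() == 0:
                continue
            weight = topk_probs[token_ids, slot].unsqueeze(-1)
            out.index_add_(0, token_ids, weight * expert(tokens[token_ids]))
        return out.reshape_as(x)

# Example: y = TopKMoELayer(d_model=512, d_ff=2048)(torch.randn(2, 16, 512))
```

In a full transformer block this module takes the place of the dense FFN; the residual connection and normalization live in the surrounding block, as in the diagram above.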
### Parameter Efficiency
| Model | Total Params | Active Params | Efficiency Ratio | Developer |
|-------|--------------|---------------|------------------|-----------|
| [Mixtral 8x7B](https://arxiv.org/abs/2401.04088) | 46.7B | 12.9B | 3.6x | Mistral AI |
| [Mixtral 8x22B](https://mistral.ai/news/mixtral-8x22b/) | 141B | 39B | 3.6x | Mistral AI |
| [DeepSeek-V3](https://arxiv.org/html/2412.19437v1) | 671B | 37B | 18x | DeepSeek |
| [DBRX](https://www.databricks.com/blog/introducing-dbrx-new-state-art-open-llm) | 132B | 36B | 3.7x | Databricks |
| [GLaM](https://arxiv.org/abs/2112.06905) | 1.2T | 96.6B | 12x | Google |
| [Switch-C](https://arxiv.org/abs/2101.03961) | 1.6T | 100B | 16x | Google |
| Qwen3-MoE | 235B | 22B | 10.7x | Alibaba |
| GPT-4 (unverified) | ≈1.76T (rumored) | ≈220B | ≈8x | OpenAI |
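The efficiency ratio in the table is simply total parameters divided by active parameters; a rough per-token inference cost can be estimated with the common ~2 FLOPs-per-active-parameter rule of thumb. A quick sketch using figures from the table (the heuristic is an approximation, not a measured number):

```python
# Figures (in billions of parameters) taken from the table above.
models = {
    "Mixtral 8x7B": (46.7, 12.9),
    "DeepSeek-V3": (671.0, 37.0),
    "GLaM": (1200.0, 96.6),
}

for name, (total_b, active_b) in models.items():
    ratio = total_b / active_b
    flops_per_token = 2 * active_b * 1e9   # ~2 FLOPs per active parameter (rule of thumb)
    print(f"{name}: {ratio:.1f}x sparsity, ~{flops_per_token:.1e} FLOPs per token")
```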
## Key Properties
| Property | Rating | Assessment |
|----------|--------|------------|
| **White-box Access** | LOW | Similar opacity to dense transformers, with additional routing complexity |
| **Trainability** | HIGH | Standard training with load balancing losses |
| **Predictability** | LOW | Routing adds layer of complexity to activation patterns |
| **Modularity** | MEDIUM | Expert boundaries exist but interact through routing |
| **Formal Verifiability** | LOW | Combinatorial explosion of expert combinations |
## Safety Implications
### Potential Safety Research Directions
| Research Area | Description |
|---------------|-------------|
| **Expert analysis** | Study what individual experts learn through activation patterns |
| **Efficiency enables testing** | Lower cost per capability level may enable more comprehensive safety evaluation |
| **Modular structure** | Expert boundaries may enable ablation studies or targeted modifications |
| **Specialization patterns** | Routing patterns may reveal model structure |
### Open Safety Questions
| Question | Current Status | Source |
|----------|----------------|--------|
| **Routing unpredictability** | Router selection patterns not fully characterized | Limited published analysis |
| **Combinatorial complexity** | Testing all expert combinations infeasible for 8+ expert models | No systematic methodology exists |
| **Emergent routing** | Unclear if routing patterns encode unexpected behaviors | [Early analysis in Expert Choice paper](https://arxiv.org/abs/2202.09368) |
| **Specialized capabilities** | Unknown if specific experts develop concerning capabilities in isolation | No dedicated research identified |
### Interpretability Comparison
| Aspect | Dense | MoE |
|--------|-------|-----|
| Overall opacity | HIGH | HIGH |
| Modular structure | NONE | SOME (expert boundaries) |
| Analysis tools | SOME | FEWER (as of 2025) |
| Activation patterns | Complex | Complex + routing layer |
## Current MoE Models Comparison
| Model | Developer | Release | Total Params | Active Params | Experts | Top-k | Context | Training Data |
|-------|-----------|---------|--------------|---------------|---------|-------|---------|---------------|
| **[Mixtral 8x7B](https://arxiv.org/abs/2401.04088)** | Mistral AI | Dec 2023 | 46.7B | 12.9B | 8 | 2 | 32K | Undisclosed |
| **[Mixtral 8x22B](https://mistral.ai/news/mixtral-8x22b/)** | Mistral AI | Apr 2024 | 141B | 39B | 8 | 2 | 64K | Undisclosed |
| **[DeepSeek-V2](https://arxiv.org/abs/2405.04434)** | DeepSeek | May 2024 | 236B | 21B | 160 | 6 | 128K | Undisclosed |
| **[DeepSeek-V3](https://arxiv.org/html/2412.19437v1)** | DeepSeek | Dec 2024 | 671B | 37B | 256 | 8 | 128K | 14.8T tokens |
| **[DBRX](https://www.databricks.com/blog/introducing-dbrx-new-state-art-open-llm)** | Databricks | Mar 2024 | 132B | 36B | 16 | 4 | 32K | 12T tokens |
| **Qwen3-MoE** | Alibaba | Apr 2025 | 235B | 22B | 128 | 8 | 32K | 36T tokens |
| **Qwen3-Next** | Alibaba | Sep 2025 | 80B | 3B | 512 | Variable | 128K | Undisclosed |
| **[GLaM](https://arxiv.org/abs/2112.06905)** | Google | Dec 2021 | 1.2T | 96.6B | 64 | 2 | Undisclosed | 1.6T tokens |
| **[Switch-C](https://arxiv.org/abs/2101.03961)** | Google | Jan 2021 | 1.6T | 100B | 2048 | 1 | Undisclosed | C4 dataset |
| **GPT-4** | OpenAI | Mar 2023 | ≈1.76T (unverified) | ≈220B | ≈16 | 2 | 8K/32K (launch) | Undisclosed |
**Performance benchmarks** (Mixtral 8x7B vs. dense models):[^mixtral]
- MMLU: 70.6% (vs. Llama 2 70B: 68.9%)
- HellaSwag: 86.7% (vs. Llama 2 70B: 87.3%)
- GSM8K: 74.4% (vs. Llama 2 70B: 56.8%)
- HumanEval: 40.2% (vs. Llama 2 70B: 29.9%)
## Technical Details
### Router Mechanisms
| Mechanism | Description | Trade-offs | Used By |
|-----------|-------------|------------|---------|
| **Top-k gating** | Select k highest-scoring experts per token | Simple implementation; may cause load imbalance; requires auxiliary loss | [Mixtral (k=2)](https://arxiv.org/abs/2401.04088), [DBRX (k=4)](https://www.databricks.com/blog/introducing-dbrx-new-state-art-open-llm) |
| **Expert choice** | Each expert selects its top-k tokens | Guaranteed load balance; variable experts per token | [Google research models](https://arxiv.org/abs/2202.09368) |
| **Soft routing** | Weighted combination of all experts | Fully differentiable; less sparse; higher compute | Early MoE research |
| **Auxiliary-loss-free** | Bias terms adjusted based on load monitoring | No auxiliary loss needed; adaptive balancing | [DeepSeek-V3](https://arxiv.org/html/2412.19437v1) |
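The contrast between token-choice (top-k) and expert-choice routing is easiest to see in code. Below is a hedged sketch of the expert-choice selection step in the spirit of Zhou et al. (2022); the function name and shapes are our own simplifications, and capacity handling in real systems is more involved.

```python
import torch
import torch.nn.functional as F

def expert_choice_routing(token_states: torch.Tensor, router_weight: torch.Tensor, capacity: int):
    """Each expert picks its top-`capacity` tokens, instead of tokens picking experts.

    token_states:  (num_tokens, d_model)
    router_weight: (d_model, num_experts)
    Returns, per expert, the chosen token indices and their gating weights.
    """
    scores = F.softmax(token_states @ router_weight, dim=-1)  # (num_tokens, num_experts)
    # Transpose so each expert ranks every token, then keep its top `capacity` tokens.
    gating, token_idx = scores.t().topk(capacity, dim=-1)     # both (num_experts, capacity)
    return token_idx, gating
```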
### Load Balancing
MoE training requires mechanisms to prevent expert collapse (where the model converges to using only a subset of experts):
**Auxiliary loss approach** (most common):
- Total Loss = Task Loss + α × Load Balance Loss
- Typical α values: 0.01-0.1
- Balance loss penalizes uneven expert utilization
**Without load balancing**: Empirical studies show models can collapse to using 10-20% of experts, reducing efficiency benefits.[^switch]
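As a concrete illustration of the formula above, here is a sketch of a Switch-Transformer-style balance loss: per expert, the fraction of dispatched tokens multiplied by the mean router probability, summed and scaled by the expert count. Variable names are ours and the exact formulation varies across papers.

```python
import torch
import torch.nn.functional as F

def load_balance_loss(router_logits: torch.Tensor, dispatched_expert: torch.Tensor) -> torch.Tensor:
    """Switch-style auxiliary balance loss (illustrative).

    router_logits:     (num_tokens, num_experts) raw router scores
    dispatched_expert: (num_tokens,) index of the expert each token was sent to
    """
    num_experts = router_logits.shape[-1]
    probs = F.softmax(router_logits, dim=-1)
    # f_i: fraction of tokens dispatched to expert i.
    dispatch_frac = F.one_hot(dispatched_expert, num_experts).float().mean(dim=0)
    # P_i: mean router probability mass assigned to expert i.
    mean_prob = probs.mean(dim=0)
    return num_experts * torch.sum(dispatch_frac * mean_prob)

# total_loss = task_loss + alpha * load_balance_loss(...)   # alpha typically 0.01-0.1
```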
**DeepSeek innovation**: Auxiliary-loss-free balancing adds learnable bias terms to routing scores, adjusted at each training step based on observed load. DeepSeek-V3 reports this avoids training instability associated with auxiliary losses.[^deepseek]
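A minimal sketch of the idea behind auxiliary-loss-free balancing, assuming a fixed step size and a sign-based update (the DeepSeek-V3 report should be consulted for the actual rule; the function and names here are illustrative only):

```python
import torch

def update_routing_bias(bias: torch.Tensor, tokens_per_expert: torch.Tensor, step: float = 1e-3) -> torch.Tensor:
    """Nudge a per-expert routing bias toward balanced load (sketch of the idea only).

    The bias is added to routing scores when selecting experts, but not when
    weighting their outputs, so it steers selection without an auxiliary loss term.
    """
    load = tokens_per_expert.float()
    overloaded = load > load.mean()
    # Decrease the bias of overloaded experts, increase it for underloaded ones.
    direction = torch.where(overloaded, torch.ones_like(bias), -torch.ones_like(bias))
    return bias - step * direction
```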
### Expert Capacity and Overflow
Each expert has a fixed capacity (the maximum number of tokens it can process per batch). When all of a token's selected experts are at capacity:
- **Token dropping**: Overflow tokens skip MoE layer, passed via residual connection (GShard approach)[^gshard]
- **Dropless MoE**: Dynamic allocation prevents dropping; used by DBRX via MegaBlocks library[^dbrx]
**Capacity factor**: Typical values of 1.0-2.0x mean experts can handle 100-200% of perfectly balanced load.
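Expert capacity is typically derived from the capacity factor roughly as follows (conventions differ slightly across GShard, Switch, and later systems; the numbers in the example are assumed):

```python
def expert_capacity(tokens_per_batch: int, num_experts: int, top_k: int, capacity_factor: float) -> int:
    """Token slots per expert; tokens beyond this are dropped (GShard) or rerouted (dropless MoE)."""
    return int(capacity_factor * top_k * tokens_per_batch / num_experts)

# Assumed example: 8192 tokens per batch, 8 experts, top-2 routing, capacity factor 1.25
print(expert_capacity(8192, 8, 2, 1.25))   # -> 2560 slots per expert
```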
## Limitations and Trade-offs
| Limitation | Description | Impact |
|------------|-------------|--------|
| **Memory requirements** | All experts must be loaded into memory despite sparse activation | Limits deployment to high-memory systems; increases serving costs |
| **Serving complexity** | Routing and expert selection add latency and implementation complexity | More difficult to optimize than dense models |
| **Training instability** | Load balancing and routing can cause training convergence issues | Requires careful hyperparameter tuning; auxiliary losses add complexity |
| **Fine-tuning challenges** | Adapting pre-trained MoE models requires expert-aware approaches | Standard fine-tuning may collapse to using subset of experts |
| **Memory bandwidth** | Moving expert weights can bottleneck despite compute savings | GPU memory bandwidth becomes limiting factor |
### When Dense Models May Be Preferable
| Scenario | Reason |
|----------|--------|
| **Latency-critical inference** | Routing overhead and expert loading add latency vs. dense models |
| **Memory-constrained deployment** | Must load all experts despite activating only subset |
| **Small model scale** | MoE overhead dominates at smaller parameter counts (less than 10B) |
| **Continuous batching** | Variable expert selection complicates batching optimization |
| **Edge deployment** | Memory and complexity constraints favor dense architectures |
## Research Landscape
### Foundational Papers
| Paper | Year | Key Contribution | Citation |
|-------|------|------------------|----------|
| [Outrageously Large Neural Networks](https://arxiv.org/abs/1701.06538) | 2017 | Introduced sparsely-gated MoE with 137B parameters; demonstrated capacity scaling with minimal efficiency loss | Shazeer et al. (Google), ICLR 2017 |
| [GShard](https://arxiv.org/abs/2006.16668) | 2020 | Scaled MoE to 600B parameters; auto-sharding for distributed training | Lepikhin et al. (Google) |
| [Switch Transformers](https://arxiv.org/abs/2101.03961) | 2021 | Simplified to single-expert routing; achieved 7x pre-training speedup; scaled to 1.6T parameters | Fedus, Zoph, Shazeer (Google), JMLR 2022 |
| [GLaM](https://arxiv.org/abs/2112.06905) | 2021 | 1.2T parameter model using 1/3 GPT-3 training energy; demonstrated energy efficiency at scale | Du et al. (Google), ICML 2022 |
| [Expert Choice Routing](https://arxiv.org/abs/2202.09368) | 2022 | Experts select tokens instead of vice versa; reported 2x faster convergence; intrinsic load balancing | Zhou et al. (Google), NeurIPS 2022 |
| [Mixtral of Experts](https://arxiv.org/abs/2401.04088) | 2024 | Open-weight MoE matching Llama 2 70B performance with 3.6x fewer active parameters | Jiang et al. (Mistral AI) |
| [DeepSeek-V3 Technical Report](https://arxiv.org/html/2412.19437v1) | 2024 | 671B/37B model with auxiliary-loss-free load balancing; 256 fine-grained experts | DeepSeek |
### Key Labs and Contributions
| Lab | Models | Focus | Notable Contribution |
|-----|--------|-------|---------------------|
| **Google** | Switch, GLaM, GShard | Architecture research | First trillion-parameter MoE; routing mechanism innovations |
| **Mistral AI** | Mixtral 8x7B, 8x22B | Open-weight deployment | Matched 70B dense model with 12.9B active params |
| **DeepSeek** | V2, V3 | Ultra-sparse MoE | 671B total with 37B active (5.5% activation rate) |
| **Databricks** | DBRX | Enterprise deployment | Fine-grained MoE with dropless token routing |
| **Alibaba** | Qwen3-MoE, Qwen3-Next | Parameter efficiency | 80B total with 3B active (3.7% activation rate) |
| **OpenAI** | GPT-4 (unverified) | Production deployment | Unverified reports suggest ≈1.76T params MoE architecture |
## Adoption Trends
### Factors Driving MoE Adoption
| Factor | Evidence | Impact |
|--------|----------|--------|
| **Training efficiency** | [GLaM achieved GPT-3 quality with 1/3 training energy](https://arxiv.org/abs/2112.06905); [DBRX training is 2x more FLOP-efficient than dense](https://www.databricks.com/blog/introducing-dbrx-new-state-art-open-llm) | Reduces training costs for equivalent capability |
| **Inference efficiency** | [DBRX: 2-3x higher throughput than 132B dense model](https://www.databricks.com/blog/introducing-dbrx-new-state-art-open-llm); [Mixtral serves at Llama 2 13B cost](https://arxiv.org/abs/2401.04088) | Enables larger-scale deployment |
| **Quality parity** | [Mixtral 8x7B matches/exceeds Llama 2 70B](https://arxiv.org/abs/2401.04088) on MMLU (+1.7%), GSM8K (+17.6%), HumanEval (+10.3%) | Demonstrates no capability penalty |
| **Parameter scaling** | Dense models face diminishing returns; MoE enables scaling to 671B+ parameters | Extends parameter scaling benefits |
| **Open ecosystem** | Mixtral, DBRX, DeepSeek, and Qwen all released with open weights under permissive licenses | Enables research and commercial fine-tuning |
### Timeline
| Period | Development | Context |
|--------|-------------|---------|
| 2017 | [Shazeer introduces modern MoE](https://arxiv.org/abs/1701.06538) | Demonstrates 137B parameter model viability |
| 2020-2021 | [Google scales to 1T+ parameters](https://arxiv.org/abs/2101.03961) | Switch Transformer, GLaM papers |
| 2023 | GPT-4 launches (rumored MoE) | Unverified reports suggest MoE architecture |
| 2023-2024 | Mixtral (Dec 2023), DBRX, DeepSeek-V2 released | Open-weight MoE ecosystem emerges |
| 2024-2025 | DeepSeek-V3 (Dec 2024), Qwen3-Next released | Ultra-sparse MoE demonstrations (3-6% activation) |
### Remaining Advantages of Dense Models
Some research suggests dense models retain advantages in specific scenarios:
| Advantage | Context |
|-----------|---------|
| **Inference latency** | Single forward pass without routing overhead |
| **Memory efficiency** | Every loaded parameter is used on every forward pass; no idle expert weights held in memory |
| **Simpler deployment** | Fewer moving parts in serving infrastructure |
| **Fine-tuning stability** | Standard approaches work without expert-specific considerations |
| **Small-scale models** | MoE overhead dominates at less than 10B parameters |
## Open Research Questions
### Key Uncertainties
| Question | Current Evidence | Source |
|----------|------------------|--------|
| Do MoE models exhibit different emergent behaviors than dense models? | Limited comparative studies | Routing complexity not fully characterized |
| Does routing create novel alignment challenges? | Unknown; router learns token-expert mappings | No systematic safety evaluation identified |
| Can expert specialization be meaningfully interpreted? | Some evidence experts specialize by domain/language | [Preliminary analysis in Expert Choice paper](https://arxiv.org/abs/2202.09368) |
| Will ultra-sparse routing (roughly 3-6% activation) remain stable at larger scale? | DeepSeek-V3, Qwen3-Next demonstrate stability at 671B/80B total | Long-term stability not yet evaluated |
| Does MoE training require modified safety evaluation approaches? | Not yet studied systematically | No published safety-specific methodology |
### Safety Research Directions
**Existing evaluation approaches that transfer:**
- Behavioral evaluations and capability assessments work similarly
- RLHF and other alignment training approaches remain applicable
- Red teaming and adversarial testing methodologies transfer
**Research gaps specific to MoE:**
- Expert-level interpretability: analysis tools for individual expert behavior
- Routing pattern analysis: characterizing when/why routing changes
- Combinatorial testing: approaches for covering expert combinations
- Expert ablation: feasibility of removing or modifying individual experts
**Exploratory research directions:**
| Approach | Description | Feasibility Assessment |
|----------|-------------|----------------------|
| Expert specialization analysis | Characterize what individual experts learn through activation patterns | Medium - requires large-scale activation logging |
| Selective expert ablation | Test if removing specific experts eliminates concerning behaviors | Unknown - may destabilize model |
| Routing intervention | Control which experts activate for safety-critical inputs | Possible - requires understanding routing mechanism |
| Expert-level alignment | Train specific experts for safety-related capabilities | Speculative - no published attempts |
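None of these directions has an established methodology. As one mechanical illustration of what "routing intervention" or "selective expert ablation" could look like, a mask on the router logits can make chosen experts unselectable; this is a speculative sketch, not a published safety technique, and its effect on model behavior is exactly the open question.

```python
import torch

def mask_expert_logits(router_logits: torch.Tensor, blocked_experts: list[int]) -> torch.Tensor:
    """Return router logits with the listed experts made unselectable (speculative illustration)."""
    masked = router_logits.clone()
    masked[:, blocked_experts] = float("-inf")   # these experts can never win top-k selection
    return masked
```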
## Sources and References
[^mixtral]: Jiang, A. et al. (2024). [Mixtral of Experts](https://arxiv.org/abs/2401.04088). Mistral AI. Open-weight MoE matching Llama 2 70B performance.
[^gpt4-rumors]: The Decoder (2023). [GPT-4 architecture, datasets, costs and more leaked](https://the-decoder.com/gpt-4-architecture-datasets-costs-and-more-leaked/). Summary of SemiAnalysis report on GPT-4's rumored MoE structure. Note: Unverified claims based on secondary sources.
[^switch]: Fedus, W., Zoph, B., Shazeer, N. (2021). [Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity](https://arxiv.org/abs/2101.03961). JMLR 2022. Simplified MoE with single-expert routing, achieving 7x speedup.
[^deepseek]: DeepSeek (2024). [DeepSeek-V3 Technical Report](https://arxiv.org/html/2412.19437v1). 671B parameter model with auxiliary-loss-free load balancing.
[^gshard]: Lepikhin, D. et al. (2020). [GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding](https://arxiv.org/abs/2006.16668). First 600B parameter MoE for machine translation.
[^dbrx]: Databricks (2024). [Introducing DBRX: A New State-of-the-Art Open LLM](https://www.databricks.com/blog/introducing-dbrx-new-state-art-open-llm). Fine-grained MoE with 16 experts and dropless routing.
### Additional Key Papers
- Shazeer, N. et al. (2017). [Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer](https://arxiv.org/abs/1701.06538). ICLR 2017. Revived MoE for deep learning, demonstrating capacity scaling.
- Du, N. et al. (2022). [GLaM: Efficient Scaling of Language Models with Mixture-of-Experts](https://arxiv.org/abs/2112.06905). ICML 2022. 1.2T parameter model trained with 1/3 of GPT-3's energy.
- Zhou, Y. et al. (2022). [Mixture-of-Experts with Expert Choice Routing](https://arxiv.org/abs/2202.09368). NeurIPS 2022. Novel routing where experts select tokens.
- Google Research (2022). [Mixture-of-Experts with Expert Choice Routing Blog Post](https://research.google/blog/mixture-of-experts-with-expert-choice-routing/). Accessible explanation of expert choice mechanism.
### Architecture Leaks and Analysis
- The Decoder (2023). [GPT-4 architecture, datasets, costs and more leaked](https://the-decoder.com/gpt-4-architecture-datasets-costs-and-more-leaked/). Summary of SemiAnalysis report on GPT-4's rumored MoE structure.
- KDnuggets (2023). [GPT-4: 8 Models in One; The Secret is Out](https://www.kdnuggets.com/2023/08/gpt4-8-models-one-secret.html). Analysis of GPT-4 MoE rumors.