Sparse / MoE Transformers

MoE architectures activate only a fraction of their total parameters per token (roughly 3-28% across the models covered here), achieving 2-7x compute savings while matching dense-model performance (Mixtral 8x7B with 12.9B active parameters matches Llama 2 70B). Safety implications remain uncertain: expert-level interpretability tools and routing-behavior analysis are in early stages despite increasing adoption (Mixtral, DeepSeek-V3, rumored GPT-4).

Overview

Sparse and Mixture-of-Experts (MoE) architectures are transformer variants where only a subset of parameters activates for each token. Instead of every parameter contributing to every forward pass, a routing mechanism selects which "expert" sub-networks to use.

This yields parameter-efficiency gains: a model can carry several times more total parameters while keeping a similar compute cost per token. For example, Mixtral 8x7B (46.7B total parameters, ~12.9B active) performs comparably to Llama 2 70B on standard benchmarks while requiring substantially fewer FLOPs per token.1

Unverified reports from 2023 suggest GPT-4 may use an MoE architecture with approximately 1.76T total parameters.2 Multiple major labs released open-weight MoE models in 2023-2025, including Mistral AI, Databricks, DeepSeek, and Alibaba.

Quick Assessment

| Dimension | Assessment | Evidence |
|---|---|---|
| Adoption Rate | 4 major open-weight releases 2024-2025 | Mixtral (Dec 2023), DBRX (Mar 2024), DeepSeek-V2 (May 2024), Qwen3-MoE (Apr 2025); GPT-4 rumored MoE (unverified) |
| Efficiency Gains | 2-7x compute savings reported | Switch Transformer: 7x pre-training speedup; DBRX: 2x faster inference |
| Parameter Scaling | Up to 671B total parameters demonstrated | DeepSeek-V3: 671B total / 37B active; GLaM: 1.2T total |
| Quality Parity | Matches dense models on benchmarks | Mixtral 8x7B (46B total, 12.9B active) matches Llama 2 70B across MMLU, HellaSwag, GSM8K |
| Safety Research | 2 dedicated papers identified | Expert-level interpretability tools; routing behavior analysis in progress |
| Open-Weight Availability | 4 of 4 major models Apache 2.0 | Mixtral, DBRX, DeepSeek, Qwen all open-weight |
| Hardware Support | Specialized libraries available | MegaBlocks library enables dropless MoE; inference optimization libraries released |

Key Links

| Source | Link |
|---|---|
| Official Website | hyper.ai |
| Wikipedia | en.wikipedia.org |
| arXiv | arxiv.org |

Architecture

Key Components

| Component | Function | Trainable |
|---|---|---|
| Router | Decides which experts to use | Yes |
| Experts | Specialized FFN sub-networks | Yes |
| Load balancer | Ensures experts are used evenly | Auxiliary loss |
| Combiner | Merges expert outputs | Weighted by router |
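
The sketch below shows how these components fit together in a single MoE layer, assuming top-2 token-choice routing (a minimal PyTorch sketch; class and dimension names are illustrative, and load balancing and capacity limits are omitted):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MoELayer(nn.Module):
    """Minimal MoE layer sketch: router -> top-k expert FFNs -> weighted combine."""

    def __init__(self, d_model: int, d_ff: int, num_experts: int = 8, top_k: int = 2):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(d_model, num_experts, bias=False)  # trainable router
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(num_experts)
        ])

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (num_tokens, d_model); real implementations flatten (batch, seq) first
        logits = self.router(x)                               # (tokens, experts)
        weights, indices = logits.topk(self.top_k, dim=-1)    # choose top-k experts per token
        weights = F.softmax(weights, dim=-1)                  # renormalize the selected scores
        out = torch.zeros_like(x)
        for slot in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = indices[:, slot] == e                  # tokens sent to expert e in this slot
                if mask.any():
                    out[mask] += weights[mask, slot].unsqueeze(-1) * expert(x[mask])
        return out

# Route 4 tokens of width 16 through 8 experts with top-2 gating
layer = MoELayer(d_model=16, d_ff=64)
print(layer(torch.randn(4, 16)).shape)  # torch.Size([4, 16])
```

The per-expert Python loop is written for readability; production systems instead group tokens by expert and dispatch them to batched or block-sparse kernels (the approach taken by libraries such as MegaBlocks).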

Parameter Efficiency

| Model | Total Params | Active Params | Efficiency Ratio | Developer |
|---|---|---|---|---|
| Mixtral 8x7B | 46.7B | 12.9B | 3.6x | Mistral AI |
| Mixtral 8x22B | 141B | 39B | 3.6x | Mistral AI |
| DeepSeek-V3 | 671B | 37B | 18x | DeepSeek |
| DBRX | 132B | 36B | 3.7x | Databricks |
| GLaM | 1.2T | 96.6B | 12x | Google |
| Switch-C | 1.6T | 100B | 16x | Google |
| Qwen3-MoE | 235B | 22B | 10.7x | Alibaba |
| GPT-4 (unverified) | ≈1.76T (rumored) | ≈220B | ≈8x | OpenAI |
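
The efficiency ratio above is simply total parameters divided by active parameters, and the activation fraction is its inverse; active parameters exceed total divided by number of experts because attention, embedding, and other shared weights run for every token. A quick check against the table's numbers (values in billions):

```python
def moe_efficiency(total_params_b: float, active_params_b: float) -> tuple[float, float]:
    """Return (efficiency ratio, activation fraction) for an MoE model."""
    ratio = total_params_b / active_params_b
    return ratio, 1.0 / ratio

for name, total, active in [
    ("Mixtral 8x7B", 46.7, 12.9),
    ("DeepSeek-V3", 671.0, 37.0),
    ("Qwen3-Next", 80.0, 3.0),
]:
    ratio, fraction = moe_efficiency(total, active)
    print(f"{name}: {ratio:.1f}x ratio, {fraction:.1%} of parameters active per token")
# Mixtral 8x7B: 3.6x, ~28% active; DeepSeek-V3: 18.1x, ~5.5%; Qwen3-Next: 26.7x, ~3.8%
```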

Key Properties

| Property | Rating | Assessment |
|---|---|---|
| White-box Access | LOW | Similar opacity to dense transformers, with additional routing complexity |
| Trainability | HIGH | Standard training with load-balancing losses |
| Predictability | LOW | Routing adds a layer of complexity to activation patterns |
| Modularity | MEDIUM | Expert boundaries exist but interact through routing |
| Formal Verifiability | LOW | Combinatorial explosion of expert combinations |

Safety Implications

Potential Safety Research Directions

| Research Area | Description |
|---|---|
| Expert analysis | Study what individual experts learn through activation patterns |
| Efficiency enables testing | Lower cost per capability level may enable more comprehensive safety evaluation |
| Modular structure | Expert boundaries may enable ablation studies or targeted modifications |
| Specialization patterns | Routing patterns may reveal model structure |
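
As an illustration of the first row, expert-specialization analysis can start from simply counting which experts fire for which kinds of input; the sketch below assumes routing decisions have already been logged per input category (all names are illustrative):

```python
import torch

def expert_usage_by_category(routing_records: dict[str, torch.Tensor],
                             num_experts: int) -> dict[str, torch.Tensor]:
    """For each input category, the fraction of routed tokens handled by each expert.

    routing_records maps a category label (e.g. "code", "biology") to a 1-D tensor
    of expert indices chosen for that category's tokens.
    """
    usage = {}
    for category, indices in routing_records.items():
        counts = torch.bincount(indices, minlength=num_experts).float()
        usage[category] = counts / counts.sum()
    return usage

# Toy example: if "code" tokens concentrate on experts 1 and 5 while "prose"
# spreads evenly, that is weak evidence of domain specialization.
records = {
    "code":  torch.tensor([1, 5, 1, 5, 1, 2]),
    "prose": torch.tensor([0, 3, 6, 7, 2, 4]),
}
print(expert_usage_by_category(records, num_experts=8))
```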

Open Safety Questions

| Question | Current Status | Source |
|---|---|---|
| Routing unpredictability | Router selection patterns not fully characterized | Limited published analysis |
| Combinatorial complexity | Testing all expert combinations infeasible for 8+ expert models | No systematic methodology exists |
| Emergent routing | Unclear if routing patterns encode unexpected behaviors | Early analysis in Expert Choice paper |
| Specialized capabilities | Unknown if specific experts develop concerning capabilities in isolation | No dedicated research identified |

Interpretability Comparison

| Aspect | Dense | MoE |
|---|---|---|
| Overall opacity | HIGH | HIGH |
| Modular structure | NONE | SOME (expert boundaries) |
| Analysis tools | SOME | FEWER (as of 2025) |
| Activation patterns | Complex | Complex + routing layer |

Current MoE Models Comparison

| Model | Developer | Release | Total Params | Active Params | Experts | Top-k | Context | Training Data |
|---|---|---|---|---|---|---|---|---|
| Mixtral 8x7B | Mistral AI | Dec 2023 | 46.7B | 12.9B | 8 | 2 | 32K | Undisclosed |
| Mixtral 8x22B | Mistral AI | Apr 2024 | 141B | 39B | 8 | 2 | 64K | Undisclosed |
| DeepSeek-V2 | DeepSeek | May 2024 | 236B | 21B | 160 | 6 | 128K | Undisclosed |
| DeepSeek-V3 | DeepSeek | Dec 2024 | 671B | 37B | 256 | 8 | 128K | 14.8T tokens |
| DBRX | Databricks | Mar 2024 | 132B | 36B | 16 | 4 | 32K | 12T tokens |
| Qwen3-MoE | Alibaba | Apr 2025 | 235B | 22B | 128 | 8 | 32K | 36T tokens |
| Qwen3-Next | Alibaba | Sep 2025 | 80B | 3B | 512 | Variable | 128K | Undisclosed |
| GLaM | Google | Dec 2021 | 1.2T | 96.6B | 64 | 2 | Undisclosed | 1.6T tokens |
| Switch-C | Google | Jan 2021 | 1.6T | 100B | 2048 | 1 | Undisclosed | C4 dataset |
| GPT-4 | OpenAI | Mar 2023 | ≈1.76T (unverified) | ≈220B | ≈16 | 2 | 128K | Undisclosed |

Performance benchmarks (Mixtral 8x7B vs. dense models):1

  • MMLU: 70.6% (vs. Llama 2 70B: 68.9%)
  • HellaSwag: 86.7% (vs. Llama 2 70B: 87.3%)
  • GSM8K: 74.4% (vs. Llama 2 70B: 56.8%)
  • HumanEval: 40.2% (vs. Llama 2 70B: 29.9%)

Technical Details

Router Mechanisms

| Mechanism | Description | Trade-offs | Used By |
|---|---|---|---|
| Top-k gating | Select k highest-scoring experts per token | Simple implementation; may cause load imbalance; requires auxiliary loss | Mixtral (k=2), DBRX (k=4) |
| Expert choice | Each expert selects its top-k tokens | Guaranteed load balance; variable experts per token | Google research models |
| Soft routing | Weighted combination of all experts | Fully differentiable; less sparse; higher compute | Early MoE research |
| Auxiliary-loss-free | Bias terms adjusted based on load monitoring | No auxiliary loss needed; adaptive balancing | DeepSeek-V3 |
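
As a rough illustration of how expert choice inverts the usual selection, the sketch below lets each expert pick its own top-`capacity` tokens, which balances load by construction at the cost of some tokens being served by several experts and others by none (function and variable names are illustrative; the published method includes details omitted here):

```python
import torch
import torch.nn.functional as F

def expert_choice_route(logits: torch.Tensor, capacity: int):
    """Expert-choice sketch: experts pick tokens, instead of tokens picking experts.

    logits: (num_tokens, num_experts) raw router scores.
    Returns, per expert, the indices of its chosen tokens and their gating weights.
    """
    scores = F.softmax(logits, dim=-1)                 # per-token distribution over experts
    weights, token_idx = scores.topk(capacity, dim=0)  # each expert (column) takes its top tokens
    return token_idx.T, weights.T                      # both (num_experts, capacity)

# 8 tokens, 4 experts, 2 tokens per expert: every expert processes exactly 2 tokens.
logits = torch.randn(8, 4)
token_idx, gate = expert_choice_route(logits, capacity=2)
print(token_idx)  # e.g. tensor([[5, 0], [2, 7], [1, 3], [6, 4]])
```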

Load Balancing

MoE training requires mechanisms to prevent expert collapse (where the model converges to using only a subset of experts):

Auxiliary loss approach (most common; sketched in code after the list below):

  • Total Loss = Task Loss + α × Load Balance Loss
  • Typical α values: 0.01-0.1
  • Balance loss penalizes uneven expert utilization
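
A minimal sketch of a Switch-Transformer-style balance term, assuming top-1 routing for simplicity (the function name is illustrative). The term is minimized, at value 1.0, when both the token counts and the router's probability mass are spread uniformly across experts:

```python
import torch
import torch.nn.functional as F

def load_balance_loss(router_logits: torch.Tensor,
                      expert_indices: torch.Tensor,
                      num_experts: int) -> torch.Tensor:
    """num_experts * sum_i (fraction of tokens sent to expert i) * (mean router prob of expert i)."""
    probs = F.softmax(router_logits, dim=-1)                                   # (tokens, experts)
    frac_tokens = F.one_hot(expert_indices, num_experts).float().mean(dim=0)   # f_i
    mean_probs = probs.mean(dim=0)                                             # P_i
    return num_experts * torch.sum(frac_tokens * mean_probs)

# total_loss = task_loss + alpha * load_balance_loss(...), with alpha around 0.01-0.1
logits = torch.randn(32, 8)
print(load_balance_loss(logits, logits.argmax(dim=-1), num_experts=8))
```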

Without load balancing: Empirical studies show models can collapse to using 10-20% of experts, reducing efficiency benefits.3

DeepSeek innovation: Auxiliary-loss-free balancing adds learnable bias terms to routing scores, adjusted at each training step based on observed load. DeepSeek-V3 reports this avoids training instability associated with auxiliary losses.4
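
A rough sketch of that bias-adjustment idea, assuming a fixed update step and that the bias only affects which experts are selected, not the gating weights (names and the exact update rule are simplified here):

```python
import torch

def update_routing_bias(bias: torch.Tensor,
                        tokens_per_expert: torch.Tensor,
                        step: float = 1e-3) -> torch.Tensor:
    """Push the selection bias down for overloaded experts and up for underloaded ones."""
    target = tokens_per_expert.float().mean()                    # ideal load per expert
    return bias + step * torch.sign(target - tokens_per_expert.float())

# In the routing step, the bias is added only to the scores used for top-k selection:
#   selection_scores = router_scores + bias      # decides which experts fire
#   gate_weights are computed from the raw router_scores, ignoring the bias
bias = update_routing_bias(torch.zeros(8), torch.tensor([10, 2, 7, 1, 9, 3, 6, 2]))
print(bias)  # overloaded experts get a negative bias, underloaded ones a positive bias
```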

Expert Capacity and Overflow

Each expert has a fixed capacity (the maximum number of tokens it can process per batch). When a token's selected experts are already full:

  • Token dropping: Overflow tokens skip MoE layer, passed via residual connection (GShard approach)5
  • Dropless MoE: Dynamic allocation prevents dropping; used by DBRX via MegaBlocks library6

Capacity factor: Typical values of 1.0-2.0x mean experts can handle 100-200% of perfectly balanced load.
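
A sketch of the usual capacity arithmetic, assuming a GShard/Switch-style formula (the function name is illustrative):

```python
import math

def expert_capacity(tokens_per_batch: int, num_experts: int,
                    top_k: int, capacity_factor: float = 1.25) -> int:
    """Maximum token assignments each expert accepts per batch."""
    # Perfectly balanced load is tokens * top_k / num_experts assignments per expert;
    # the capacity factor adds headroom for imbalance before overflow handling kicks in.
    return math.ceil(capacity_factor * tokens_per_batch * top_k / num_experts)

# 4096 tokens, 8 experts, top-2 routing: balanced load is 1024 assignments per expert,
# so a 1.25 capacity factor leaves room for 1280 before tokens are dropped or re-packed.
print(expert_capacity(4096, num_experts=8, top_k=2))  # 1280
```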

Limitations and Trade-offs

| Limitation | Description | Impact |
|---|---|---|
| Memory requirements | All experts must be loaded into memory despite sparse activation | Limits deployment to high-memory systems; increases serving costs |
| Serving complexity | Routing and expert selection add latency and implementation complexity | More difficult to optimize than dense models |
| Training instability | Load balancing and routing can cause training convergence issues | Requires careful hyperparameter tuning; auxiliary losses add complexity |
| Fine-tuning challenges | Adapting pre-trained MoE models requires expert-aware approaches | Standard fine-tuning may collapse to using a subset of experts |
| Memory bandwidth | Moving expert weights can bottleneck despite compute savings | GPU memory bandwidth becomes the limiting factor |

When Dense Models May Be Preferable

| Scenario | Reason |
|---|---|
| Latency-critical inference | Routing overhead and expert loading add latency vs. dense models |
| Memory-constrained deployment | Must load all experts despite activating only a subset |
| Small model scale | MoE overhead dominates at smaller parameter counts (less than 10B) |
| Continuous batching | Variable expert selection complicates batching optimization |
| Edge deployment | Memory and complexity constraints favor dense architectures |

Research Landscape

Foundational Papers

| Paper | Year | Key Contribution | Citation |
|---|---|---|---|
| Outrageously Large Neural Networks | 2017 | Introduced sparsely-gated MoE with 137B parameters; demonstrated capacity scaling with minimal efficiency loss | Shazeer et al. (Google), ICLR 2017 |
| GShard | 2020 | Scaled MoE to 600B parameters; auto-sharding for distributed training | Lepikhin et al. (Google) |
| Switch Transformers | 2021 | Simplified to single-expert routing; achieved 7x pre-training speedup; scaled to 1.6T parameters | Fedus, Zoph, Shazeer (Google), JMLR 2022 |
| GLaM | 2021 | 1.2T parameter model using 1/3 of GPT-3's training energy; demonstrated energy efficiency at scale | Du et al. (Google), ICML 2022 |
| Expert Choice Routing | 2022 | Experts select tokens instead of vice versa; reported 2x faster convergence; intrinsic load balancing | Zhou et al. (Google), NeurIPS 2022 |
| Mixtral of Experts | 2024 | Open-weight MoE matching Llama 2 70B performance with 3.6x fewer active parameters | Jiang et al. (Mistral AI) |
| DeepSeek-V3 Technical Report | 2024 | 671B/37B model with auxiliary-loss-free load balancing; 256 fine-grained experts | DeepSeek |

Key Labs and Contributions

| Lab | Models | Focus | Notable Contribution |
|---|---|---|---|
| Google | Switch, GLaM, GShard | Architecture research | First trillion-parameter MoE; routing mechanism innovations |
| Mistral AI | Mixtral 8x7B, 8x22B | Open-weight deployment | Matched 70B dense model with 12.9B active params |
| DeepSeek | V2, V3 | Ultra-sparse MoE | 671B total with 37B active (5.5% activation rate) |
| Databricks | DBRX | Enterprise deployment | Fine-grained MoE with dropless token routing |
| Alibaba | Qwen3-MoE, Qwen3-Next | Parameter efficiency | 80B total with 3B active (3.7% activation rate) |
| OpenAI | GPT-4 (unverified) | Production deployment | Unverified reports suggest ≈1.76T-parameter MoE architecture |

Adoption Trends

Factors Driving MoE Adoption

| Factor | Evidence | Impact |
|---|---|---|
| Training efficiency | GLaM achieved GPT-3 quality with 1/3 of the training energy; DBRX training is 2x more FLOP-efficient than dense | Reduces training costs for equivalent capability |
| Inference efficiency | DBRX: 2-3x higher throughput than a 132B dense model; Mixtral serves at Llama 2 13B cost | Enables larger-scale deployment |
| Quality parity | Mixtral 8x7B matches/exceeds Llama 2 70B on MMLU (+1.7%), GSM8K (+17.6%), HumanEval (+10.3%) | Demonstrates no capability penalty |
| Parameter scaling | Dense models face diminishing returns; MoE enables scaling to 671B+ parameters | Extends parameter scaling benefits |
| Open ecosystem | Mixtral, DBRX, DeepSeek, Qwen all Apache 2.0 licensed | Enables research and commercial fine-tuning |

Timeline

| Period | Development | Context |
|---|---|---|
| 2017 | Shazeer introduces modern MoE | Demonstrates 137B parameter model viability |
| 2020-2021 | Google scales to 1T+ parameters | Switch Transformer, GLaM papers |
| 2023 | GPT-4 launches (rumored MoE) | Unverified reports suggest MoE architecture |
| 2024 | Mixtral, DBRX, DeepSeek-V2 released | Open-weight MoE ecosystem emerges |
| 2025 | DeepSeek-V3, Qwen3-Next released | Ultra-sparse MoE demonstrations (3-5% activation) |

Remaining Advantages of Dense Models

Some research suggests dense models retain advantages in specific scenarios:

| Advantage | Context |
|---|---|
| Inference latency | Single forward pass without routing overhead |
| Memory efficiency | Only active parameters need loading |
| Simpler deployment | Fewer moving parts in serving infrastructure |
| Fine-tuning stability | Standard approaches work without expert-specific considerations |
| Small-scale models | MoE overhead dominates at less than 10B parameters |

Open Research Questions

Key Uncertainties

| Question | Current Evidence | Source |
|---|---|---|
| Do MoE models exhibit different emergent behaviors than dense models? | Limited comparative studies | Routing complexity not fully characterized |
| Does routing create novel alignment challenges? | Unknown; the router learns token-expert mappings | No systematic safety evaluation identified |
| Can expert specialization be meaningfully interpreted? | Some evidence experts specialize by domain/language | Preliminary analysis in Expert Choice paper |
| Will ultra-sparse models (less than 5% activation) remain stable at larger scale? | DeepSeek-V3 and Qwen3-Next demonstrate stability at 671B/80B | Long-term stability not yet evaluated |
| Does MoE training require modified safety evaluation approaches? | Not yet studied systematically | No published safety-specific methodology |

Safety Research Directions

Existing evaluation approaches that transfer:

  • Behavioral evaluations and capability assessments work similarly
  • RLHF and other alignment training approaches remain applicable
  • Red teaming and adversarial testing methodologies transfer

Research gaps specific to MoE:

  • Expert-level interpretability: analysis tools for individual expert behavior
  • Routing pattern analysis: characterizing when/why routing changes
  • Combinatorial testing: approaches for covering expert combinations
  • Expert ablation: feasibility of removing or modifying individual experts

Exploratory research directions:

| Approach | Description | Feasibility Assessment |
|---|---|---|
| Expert specialization analysis | Characterize what individual experts learn through activation patterns | Medium - requires large-scale activation logging |
| Selective expert ablation | Test if removing specific experts eliminates concerning behaviors | Unknown - may destabilize model |
| Routing intervention | Control which experts activate for safety-critical inputs | Possible - requires understanding routing mechanism |
| Expert-level alignment | Train specific experts for safety-related capabilities | Speculative - no published attempts |
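
As a purely illustrative example of what a routing intervention could look like, the sketch below masks the routing scores of designated experts so that top-k selection can never pick them; whether such an intervention preserves model quality is exactly the open question noted in the table above:

```python
import torch

def mask_experts(router_logits: torch.Tensor, blocked_experts: list[int]) -> torch.Tensor:
    """Make the listed experts unselectable by setting their routing scores to -inf."""
    masked = router_logits.clone()
    masked[:, blocked_experts] = float("-inf")   # top-k selection will never pick these columns
    return masked

# If expert 3 were implicated in an unwanted behavior, its tokens would be
# re-routed to the remaining experts instead.
logits = torch.randn(4, 8)
print(mask_experts(logits, blocked_experts=[3]).topk(2, dim=-1).indices)
```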

Sources and References

Footnotes

  1. Jiang, A. et al. (2024). Mixtral of Experts. Mistral AI. Open-weight MoE matching Llama 2 70B performance.

  2. The Decoder (2023). GPT-4 architecture, datasets, costs and more leaked. Summary of SemiAnalysis report on GPT-4's rumored MoE structure. Note: Unverified claims based on secondary sources.

  3. Fedus, W., Zoph, B., Shazeer, N. (2021). Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity. JMLR 2022. Simplified MoE with single-expert routing, achieving 7x speedup.

  4. DeepSeek (2024). DeepSeek-V3 Technical Report. 671B parameter model with auxiliary-loss-free load balancing.

  5. Lepikhin, D. et al. (2020). GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding. First 600B parameter MoE for machine translation.

  6. Databricks (2024). Introducing DBRX: A New State-of-the-Art Open LLM. Fine-grained MoE with 16 experts and dropless routing.