Commit 5c31754 (parent e397f32)

docs: add parallelism strategy selection guide with MoE-specific layouts

Add docs/skills/knowledge split for parallelism strategy selection, based
on real recipe layouts from DeepSeek-V2/V3, Qwen3, Kimi-K2, etc. Key
insight: MoE models size TP by active params (often TP=1-2) and use EP as
the primary scaling dimension, unlike dense models. Verified TP=2+PP=2+SP
on cluster with llama32_1b_pretrain_config.

Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>
Made-with: Cursor

3 files changed: +357 -0 lines changed

docs/parallelisms.md

Lines changed: 52 additions & 0 deletions
@@ -435,6 +435,53 @@ For example, with 32 GPUs total and the configuration above:
- `context_parallel_size = 2`
- `data_parallel_size = 32 / (2 × 4 × 2) = 2`

## Strategy Selection Guide

Choosing the right combination depends on model size, hardware topology,
and sequence length.

### Dense Models by Size

| Model size | GPUs | Recommended starting point |
|---|---|---|
| < 1B | 1-8 | DP only |
| 1-10B | 8-16 | TP=2-4 + DP |
| 10-70B | 16-64 | TP=4-8 + PP=2-4 + DP |
| 70-175B | 64-256 | TP=8 + PP=4-8 + DP |
| 175-500B | 256-1024 | TP=8 + PP=8-16 + CP=2 + DP |
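Read as code, the table above is a simple threshold lookup. A minimal sketch, assuming nothing beyond the table itself; the function name and the single value picked from each recommended range are illustrative, not part of Megatron Bridge:

```python
# Hypothetical helper: map a dense model's size (billions of parameters)
# to a starting layout from the table above, picking one value per range.
def dense_starting_layout(params_b: float) -> dict:
    if params_b < 1:
        return {"tp": 1, "pp": 1, "cp": 1}   # DP only
    if params_b < 10:
        return {"tp": 4, "pp": 1, "cp": 1}   # TP=2-4 + DP
    if params_b < 70:
        return {"tp": 8, "pp": 4, "cp": 1}   # TP=4-8 + PP=2-4 + DP
    if params_b < 175:
        return {"tp": 8, "pp": 8, "cp": 1}   # TP=8 + PP=4-8 + DP
    return {"tp": 8, "pp": 16, "cp": 2}      # TP=8 + PP=8-16 + CP=2 + DP

print(dense_starting_layout(0.5))  # {'tp': 1, 'pp': 1, 'cp': 1}
```

DP then fills whatever factor of the world size remains after TP, PP, and CP are fixed.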
### MoE Models

MoE models differ fundamentally from dense models: only a fraction of
parameters are active per token, so TP can often stay at 1 or 2. EP is
the primary scaling dimension.

| Total / active params | Typical layout |
|---|---|
| < 20B | EP only (TP=1, PP=1) |
| 20-100B | TP=1-2 + PP=2-4 + EP=8-16 |
| 100-500B | TP=2-4 + PP=8-16 + EP=8-32 |
| 500B+ | TP=2 + PP=16 + EP=32-64 |

### By Hardware Topology

- **Single node with NVLink**: maximize TP within the node (up to TP=8).
- **Multiple nodes with InfiniBand**: keep TP within a node, use PP across nodes.
- **Limited network (Ethernet)**: minimize TP, prefer PP for cross-node scaling.

### By Sequence Length

| Sequence length | Recommendation |
|---|---|
| < 2K | standard TP + PP + DP |
| 2K-8K | add SP (`sequence_parallel=True`) |
| 8K-32K | add CP=2 |
| 32K+ | add CP=4-8, consider hierarchical CP |

For operational details on configuring combined parallelism, troubleshooting
layouts, and memory estimation, see the
[parallelism strategies skill](../skills/perf-techniques/parallelism-strategies.md).

## Configuration Guidelines

### Memory Optimization
@@ -458,6 +505,11 @@ For example, with 32 GPUs total and the configuration above:
- **Token dropping** requires `alltoall` or `alltoall_seq` token dispatcher
- All parallelism strategies can be combined, but total parallelism must divide evenly into the world size

## Related Artifacts

- **Operational skill**: [skills/perf-techniques/parallelism-strategies.md](../skills/perf-techniques/parallelism-strategies.md) — enablement, pitfalls, memory estimation, verification
- **Knowledge card**: [knowledge/techniques/parallelism_strategies.yaml](../knowledge/techniques/parallelism_strategies.yaml) — structured metadata and validation status

## Resources
- [Megatron Core Developer Guide](https://docs.nvidia.com/megatron-core/developer-guide/latest/)
knowledge/techniques/parallelism_strategies.yaml

Lines changed: 72 additions & 0 deletions
@@ -0,0 +1,72 @@
title: parallelism_strategies
validated_on: "2026-03-15"
summary: >
  Megatron Bridge supports DP, TP, PP, SP, CP, and EP parallelism strategies
  which can be combined for models from sub-1B to 500B+ parameters. The right
  combination depends on model size, hardware topology, and sequence length.
validation_status:
  dp_ddp_distributed_optimizer:
    - code_verified
  tp_config_and_runtime:
    - code_verified
  pp_interleaved_schedule:
    - code_verified
  sp_activation_partitioning:
    - code_verified
  cp_context_parallel:
    - code_verified
  ep_expert_parallel:
    - code_verified
  combined_parallelism_init:
    - code_verified
  sizing_heuristics:
    - doc_only
feature_meaning:
  data_parallel: >
    Replicate model across GPUs, split data batches, synchronize gradients.
  tensor_parallel: >
    Split individual layer tensors across GPUs within a node.
  pipeline_parallel: >
    Assign consecutive layer groups to different GPUs, process microbatches
    in a pipeline.
  sequence_parallel: >
    Partition activations along the sequence dimension within TP groups to
    reduce activation memory.
  context_parallel: >
    Split long sequences across GPUs using ring attention or similar
    communication patterns.
  expert_parallel: >
    Distribute MoE experts across GPUs, only applies to expert layers.
recommended_path:
  dense_under_1b: DP only
  dense_1b_to_10b: TP=2-4 + DP
  dense_10b_to_70b: TP=4-8 + PP=2-4 + DP
  dense_70b_to_175b: TP=8 + PP=4-8 + DP
  dense_175b_plus: TP=8 + PP=8-16 + CP=2 + DP
  moe_under_20b: EP only (TP=1, PP=1)
  moe_20b_to_100b: TP=1-2 + PP=2-4 + EP=8-16
  moe_100b_to_500b: TP=2-4 + PP=8-16 + EP=8-32
  moe_500b_plus: TP=2 + PP=16 + EP=32-64
known_constraints:
  - TP should stay within a single NVLink domain for performance.
  - SP requires tensor_model_parallel_size > 1.
  - CP requires seq_length divisible by 2 * context_parallel_size.
  - EP requires num_moe_experts > 0 and expert_model_parallel_size divides num_moe_experts.
  - PP interleaved schedule requires virtual_pipeline_model_parallel_size > 1.
  - Total parallelism dimensions must divide evenly into world_size.
known_limitations:
  - Model-size-to-parallelism mapping is a heuristic, not a benchmark-proven table.
  - Not every parallelism combination has the same level of in-repo functional test coverage.
  - Memory estimates assume standard Adam optimizer and FP16/BF16 parameters.
evidence:
  - docs/parallelisms.md
  - docs/performance-guide.md
  - docs/training/communication-overlap.md
  - docs/training/hybrid-context-parallel.md
  - src/megatron/bridge/training/initialize.py
  - src/megatron/bridge/training/config.py
  - src/megatron/bridge/models/common/unimodal.py
follow_up_validation:
  - Add a checked-in combined parallelism functional smoke test for TP+PP+CP.
  - Add benchmark-backed sizing guidance for at least one model family.
  - Add an explicit EP+TP+PP functional smoke test for MoE models.
skills/perf-techniques/parallelism-strategies.md

Lines changed: 233 additions & 0 deletions
@@ -0,0 +1,233 @@
---
name: parallelism-strategies
description: Operational guide for choosing and combining parallelism strategies in Megatron Bridge, including sizing rules, hardware topology mapping, and combined parallelism configuration.
---

# Parallelism Strategy Selection Skill

For stable background on each parallelism type, see:

- `docs/parallelisms.md`
- `knowledge/techniques/parallelism_strategies.yaml`

## Decision by Model Size

### Dense models

| Model size | GPUs | Recommended starting point |
|---|---|---|
| < 1B | 1-8 | DP only |
| 1-10B | 8-16 | TP=2-4 + DP |
| 10-70B | 16-64 | TP=4-8 + PP=2-4 + DP |
| 70-175B | 64-256 | TP=8 + PP=4-8 + DP |
| 175-500B | 256-1024 | TP=8 + PP=8-16 + CP=2 + DP |

### MoE models

MoE parallelism differs from dense-model parallelism. Because only a
fraction of the parameters is active per token, TP can often stay at 1
or 2 — the active parameter shard already fits on a single GPU. EP is
the primary scaling dimension, with PP handling cross-node layer
distribution.

| Model (total / active) | TP | PP | EP | Notes |
|---|---|---|---|---|
| OLMoE 7B / 1B | 1 | 1 | 8 | EP only, fits single node |
| Moonlight 16B / 3B | 2 | 1 | 8 | small TP for shared layers |
| DeepSeek-V2 236B / 21B | 1 | 4 | 32 | no TP at all |
| GLM-4.5 Air 106B / 12B | 1 | 4 | 8 | no TP at all |
| Qwen3 30B-A3B | 4 | 2 | 4 | |
| GLM-4.5 355B / 32B | 2 | 8 | 16 | |
| Qwen3 235B-A22B | 4 | 16 | 8 | CP=2 for pretrain |
| DeepSeek-V3 671B / 37B | 2 | 16 | 64 | TP=2, not 8 |
| Kimi-K2 1T | 2 | 16 | 32 | |

Key patterns:

- TP is sized by **active** params, not total params. A 671B MoE with
  37B active needs far less TP than a 70B dense model.
- EP scales with expert count. Common choices: EP = num_experts or
  num_experts / experts_per_gpu.
- PP handles depth. Large MoE models use PP=8-16 across nodes.
- ETP (expert tensor parallelism) is rarely used; Llama 4 is an
  exception (ETP=4).

These are starting points, not hard rules. Always profile the first
iteration to verify memory and communication.

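The EP sizing pattern above can be sketched as a one-liner. The helper and its `experts_per_gpu` knob are illustrative assumptions, not Megatron Bridge parameters:

```python
def expert_parallel_size(num_experts: int, experts_per_gpu: int = 1) -> int:
    """EP = num_experts / experts_per_gpu, per the key pattern above."""
    if num_experts % experts_per_gpu != 0:
        raise ValueError("experts_per_gpu must divide num_experts")
    return num_experts // experts_per_gpu

print(expert_parallel_size(64, experts_per_gpu=2))  # 32
```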
## Decision by Hardware Topology

Single node with NVLink:

```python
cfg.model.tensor_model_parallel_size = 8
```

Multiple nodes with InfiniBand:

```python
cfg.model.tensor_model_parallel_size = 8
cfg.model.pipeline_model_parallel_size = N  # N = pipeline stages across nodes
```

Limited network (Ethernet):

```python
cfg.model.tensor_model_parallel_size = 4
cfg.model.pipeline_model_parallel_size = M  # M = cross-node pipeline depth
```

The stable rule is: keep TP within a single NVLink domain. Use PP or DP
for cross-node scaling. TP across nodes is almost always a performance
loss.

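The in-node rule can also be expressed as a tiny helper that splits a total model-parallel degree so TP never leaves the NVLink domain. A sketch under that single assumption; the function name is made up:

```python
def place_model_parallel(model_parallel: int, gpus_per_node: int) -> tuple:
    """Split a model-parallel degree into (tp, pp) with TP capped at one node."""
    tp = min(model_parallel, gpus_per_node)
    if model_parallel % tp != 0:
        raise ValueError("model_parallel must factor through the node size")
    return tp, model_parallel // tp

print(place_model_parallel(32, gpus_per_node=8))  # (8, 4): TP in-node, PP across nodes
```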
## Decision by Sequence Length

| Sequence length | Recommendation |
|---|---|
| < 2K | standard TP + PP + DP |
| 2K-8K | add SP (`sequence_parallel=True`) |
| 8K-32K | add CP=2 |
| 32K+ | add CP=4-8, consider `a2a+p2p` for large CP |

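The thresholds above as a helper. This is illustrative only; the exact CP value inside the 4-8 range is a judgment call to settle by profiling:

```python
def context_parallel_for(seq_length: int) -> int:
    if seq_length < 8 * 1024:
        return 1   # < 8K: TP/PP/DP (plus SP from 2K up) is enough
    if seq_length < 32 * 1024:
        return 2   # 8K-32K
    return 8       # 32K+: CP=4-8; profile to choose within the range

print(context_parallel_for(128 * 1024))  # 8
```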
## Combined Parallelism Enablement

3D parallelism (TP + PP + DP):

```python
cfg.model.tensor_model_parallel_size = 4
cfg.model.pipeline_model_parallel_size = 4
cfg.model.sequence_parallel = True
```

4D parallelism (TP + PP + CP + DP):

```python
cfg.model.tensor_model_parallel_size = 8
cfg.model.pipeline_model_parallel_size = 8
cfg.model.context_parallel_size = 2
cfg.model.sequence_parallel = True
```

MoE with EP + PP (e.g. DeepSeek-V2 236B on 128 GPUs):

```python
cfg.model.tensor_model_parallel_size = 1
cfg.model.pipeline_model_parallel_size = 4
cfg.model.expert_model_parallel_size = 32
cfg.model.sequence_parallel = False
```

MoE with small TP + PP + EP (e.g. DeepSeek-V3 671B on 256 GPUs):

```python
cfg.model.tensor_model_parallel_size = 2
cfg.model.pipeline_model_parallel_size = 16
cfg.model.expert_model_parallel_size = 64
cfg.model.sequence_parallel = True
```

DP size is always implicit:

```
data_parallel_size = world_size / (TP * PP * CP)
```

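A sketch of that implicit calculation, including the divisibility check the framework enforces; the helper name is illustrative. The example numbers are the 32-GPU layout from `docs/parallelisms.md` (TP=2, PP=4, CP=2):

```python
def data_parallel_size(world_size: int, tp: int, pp: int, cp: int = 1) -> int:
    model_parallel = tp * pp * cp
    if world_size % model_parallel != 0:
        raise ValueError(
            f"TP*PP*CP={model_parallel} must divide world_size={world_size}"
        )
    return world_size // model_parallel

print(data_parallel_size(32, tp=2, pp=4, cp=2))  # 2
```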
## Memory Estimation

Without parallelism (70B model, FP16):

```
parameters:       140 GB
gradients:        140 GB
optimizer states: 280 GB (Adam)
activations:       48 GB (batch=1, seq=4K)
total:            608 GB
```

With TP=4, PP=4, DP=4 (64 GPUs):

```
parameters:        8.75 GB per GPU
gradients:         8.75 GB per GPU
optimizer states: 17.50 GB per GPU
activations:       3.00 GB per GPU
total:            ~38 GB per GPU
```

## Code Anchors
158+
159+
Parallelism dimensions set in model provider:
160+
161+
```66:81:docs/parallelisms.md
162+
model_config = GPTModelProvider(
163+
tensor_model_parallel_size=2,
164+
# ... other model parameters
165+
)
166+
```
167+
168+
DP size calculation:
169+
170+
```424:436:docs/parallelisms.md
171+
data_parallel_size = world_size / (tensor_model_parallel_size × pipeline_model_parallel_size × context_parallel_size)
172+
```
173+
174+
Bridge initialization wires parallelism into process groups:
175+
176+
```618:628:src/megatron/bridge/training/initialize.py
177+
parallel_state.initialize_model_parallel(
178+
tensor_model_parallel_size=model_config.tensor_model_parallel_size,
179+
pipeline_model_parallel_size=model_config.pipeline_model_parallel_size,
180+
...
181+
context_parallel_size=model_config.context_parallel_size,
182+
hierarchical_context_parallel_sizes=model_config.hierarchical_context_parallel_sizes,
183+
expert_model_parallel_size=model_config.expert_model_parallel_size,
184+
...
185+
)
186+
```
187+
188+
## Pitfalls
189+
190+
1. TP across nodes destroys throughput. Always keep TP within a single
191+
NVLink domain.
192+
193+
2. PP without interleaving has large pipeline bubbles. Use
194+
`virtual_pipeline_model_parallel_size` when possible.
195+
196+
3. SP requires `tensor_model_parallel_size > 1`. Enabling SP alone
197+
without TP is a config error.
198+
199+
4. CP requires `seq_length % (2 * context_parallel_size) == 0`.
200+
201+
5. EP is only for MoE models. Setting `expert_model_parallel_size` on a
202+
dense model is a no-op or error.
203+
204+
6. The model-size-to-parallelism table above is a starting heuristic.
205+
Always profile the first iteration to check memory and communication.
206+
207+
7. `CUDA_DEVICE_MAX_CONNECTIONS` and related env vars interact with
208+
overlap settings. See `skills/perf-techniques/tp-dp-comm-overlap.md`.
209+
210+
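Pitfalls 3-5 and the divisibility rule are mechanical enough to pre-check before launching. A sketch; the function and its argument names are illustrative, not a real Megatron Bridge API:

```python
def check_layout(world_size, seq_length, tp=1, pp=1, cp=1, ep=1,
                 sequence_parallel=False, num_moe_experts=0):
    """Return a list of config errors per the pitfalls above (empty = OK)."""
    errors = []
    if sequence_parallel and tp <= 1:
        errors.append("SP requires tensor_model_parallel_size > 1")
    if cp > 1 and seq_length % (2 * cp) != 0:
        errors.append("seq_length must be divisible by 2 * context_parallel_size")
    if ep > 1 and num_moe_experts == 0:
        errors.append("EP requires a MoE model (num_moe_experts > 0)")
    elif ep > 1 and num_moe_experts % ep != 0:
        errors.append("expert_model_parallel_size must divide num_moe_experts")
    if world_size % (tp * pp * cp) != 0:
        errors.append("TP * PP * CP must divide evenly into world_size")
    return errors

print(check_layout(32, 4096, tp=2, pp=4, cp=2))  # []
```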
## Verification

Quick sanity check that combined parallelism initializes correctly, using
the smallest available recipe with overridden parallelism:

```bash
CUDA_VISIBLE_DEVICES=0,1,2,3 uv run python -m torch.distributed.run --nproc_per_node=4 \
  scripts/training/run_recipe.py \
  --recipe llama32_1b_pretrain_config \
  model.tensor_model_parallel_size=2 \
  model.pipeline_model_parallel_size=2 \
  model.sequence_parallel=True \
  train.train_iters=3 train.global_batch_size=8 train.micro_batch_size=1 \
  scheduler.lr_warmup_iters=0 \
  validation.eval_iters=0 validation.eval_interval=0 \
  checkpoint.save_interval=0 \
  logger.log_interval=1
```

Success criteria:

- exit code 0
- finite loss at iteration 3 (e.g. `lm loss: 1.003808E+01`)
- log shows a TP=2 PP=2 DP=1 layout with 4 ranks
