---
name: parallelism-strategies
description: Operational guide for choosing and combining parallelism strategies in Megatron Bridge, including sizing rules, hardware topology mapping, and combined parallelism configuration.
---

# Parallelism Strategy Selection Skill

For stable background on each parallelism type, see:

- `docs/parallelisms.md`
- `knowledge/techniques/parallelism_strategies.yaml`

## Decision by Model Size

### Dense models

| Model size | GPUs | Recommended starting point |
|---|---|---|
| < 1B | 1-8 | DP only |
| 1-10B | 8-16 | TP=2-4 + DP |
| 10-70B | 16-64 | TP=4-8 + PP=2-4 + DP |
| 70-175B | 64-256 | TP=8 + PP=4-8 + DP |
| 175-500B | 256-1024 | TP=8 + PP=8-16 + CP=2 + DP |
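
As a rough sketch, the dense table can be encoded as a lookup helper
(`pick_dense_layout` is hypothetical, not a Bridge API; DP always fills
the remaining GPUs):

```python
def pick_dense_layout(params_b: float) -> dict:
    """Return a starting (TP, PP[, CP]) point from the dense-model table."""
    if params_b < 1:
        return {"tp": 1, "pp": 1}        # DP only
    if params_b < 10:
        return {"tp": 2, "pp": 1}        # low end of TP=2-4 + DP
    if params_b < 70:
        return {"tp": 4, "pp": 2}        # low end of TP=4-8 + PP=2-4
    if params_b < 175:
        return {"tp": 8, "pp": 4}        # low end of TP=8 + PP=4-8
    return {"tp": 8, "pp": 8, "cp": 2}   # TP=8 + PP=8-16 + CP=2

print(pick_dense_layout(70))  # {'tp': 8, 'pp': 4}
```

The row boundaries are soft; a 70B model can reasonably start from
either adjacent row.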

### MoE models

MoE parallelism differs from dense models. Because only a fraction of
parameters are active per token, TP can often stay at 1 or 2, since the
active parameter shard already fits on a single GPU. EP is the primary
scaling dimension, with PP handling cross-node layer distribution.

| Model (total / active) | TP | PP | EP | Notes |
|---|---|---|---|---|
| OLMoE 7B / 1B | 1 | 1 | 8 | EP only, fits single node |
| Moonlight 16B / 3B | 2 | 1 | 8 | small TP for shared layers |
| DeepSeek-V2 236B / 21B | 1 | 4 | 32 | no TP at all |
| GLM-4.5 Air 106B / 12B | 1 | 4 | 8 | no TP at all |
| Qwen3 30B-A3B | 4 | 2 | 4 | |
| GLM-4.5 355B / 32B | 2 | 8 | 16 | |
| Qwen3 235B-A22B | 4 | 16 | 8 | CP=2 for pretrain |
| DeepSeek-V3 671B / 37B | 2 | 16 | 64 | TP=2, not 8 |
| Kimi-K2 1T | 2 | 16 | 32 | |

Key patterns:

- TP is sized by **active** params, not total params. A 671B MoE with
  37B active needs far less TP than a 70B dense model.
- EP scales with expert count. Common: EP = num_experts or
  num_experts / experts_per_gpu.
- PP handles depth. Large MoE models use PP=8-16 across nodes.
- ETP (expert tensor parallelism) is rarely used. Llama 4 is an
  exception (ETP=4).

These are starting points, not hard rules. Always profile the first
iteration to verify memory and communication.
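
The EP sizing pattern can be sketched as (`expert_parallel_size` is a
hypothetical helper, not a Bridge API):

```python
def expert_parallel_size(num_experts: int, experts_per_gpu: int = 1) -> int:
    """EP = num_experts / experts_per_gpu, per the pattern above."""
    if num_experts % experts_per_gpu != 0:
        raise ValueError("experts_per_gpu must divide num_experts")
    return num_experts // experts_per_gpu

# DeepSeek-V2 routes over 160 experts; 5 experts per GPU gives the
# EP=32 used in the table above.
print(expert_parallel_size(160, 5))  # 32
```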

## Decision by Hardware Topology

Single node with NVLink:

```python
cfg.model.tensor_model_parallel_size = 8
```

Multiple nodes with InfiniBand:

```python
cfg.model.tensor_model_parallel_size = 8
cfg.model.pipeline_model_parallel_size = N
```

Limited network (Ethernet):

```python
cfg.model.tensor_model_parallel_size = 4
cfg.model.pipeline_model_parallel_size = M
```

The stable rule is: keep TP within a single NVLink domain. Use PP or DP
for cross-node scaling. TP across nodes is almost always a performance
loss.
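
The rule reduces to a simple cap (`cap_tp_to_nvlink` is a hypothetical
helper; it assumes the NVLink domain is one 8-GPU node):

```python
def cap_tp_to_nvlink(desired_tp: int, gpus_per_node: int = 8) -> int:
    """Never let TP span nodes; push the excess into PP or DP instead."""
    return min(desired_tp, gpus_per_node)

print(cap_tp_to_nvlink(16, 8))  # 8
```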

## Decision by Sequence Length

| Sequence length | Recommendation |
|---|---|
| < 2K | standard TP + PP + DP |
| 2K-8K | add SP (`sequence_parallel=True`) |
| 8K-32K | add CP=2 |
| 32K+ | add CP=4-8, consider `a2a+p2p` for large CP |
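
For example, a 32K+ run following the last row might start from the
overrides below (a sketch; assumes the Megatron-Core `cp_comm_type`
field is exposed on the model config):

```python
cfg.model.tensor_model_parallel_size = 8   # SP rides on TP
cfg.model.sequence_parallel = True
cfg.model.context_parallel_size = 4
cfg.model.cp_comm_type = "a2a+p2p"         # hybrid comm for large CP
```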

## Combined Parallelism Enablement

3D parallelism (TP + PP + DP):

```python
cfg.model.tensor_model_parallel_size = 4
cfg.model.pipeline_model_parallel_size = 4
cfg.model.sequence_parallel = True
```

4D parallelism (TP + PP + CP + DP):

```python
cfg.model.tensor_model_parallel_size = 8
cfg.model.pipeline_model_parallel_size = 8
cfg.model.context_parallel_size = 2
cfg.model.sequence_parallel = True
```

MoE with EP + PP (e.g. DeepSeek-V2 236B on 128 GPUs):

```python
cfg.model.tensor_model_parallel_size = 1
cfg.model.pipeline_model_parallel_size = 4
cfg.model.expert_model_parallel_size = 32
cfg.model.sequence_parallel = False
```

MoE with small TP + PP + EP (e.g. DeepSeek-V3 671B; note that EP=64
requires a DP dimension large enough to host the expert groups, so this
layout only works at large world sizes):

```python
cfg.model.tensor_model_parallel_size = 2
cfg.model.pipeline_model_parallel_size = 16
cfg.model.expert_model_parallel_size = 64
cfg.model.sequence_parallel = True
```

DP size is always implicit:

```
data_parallel_size = world_size / (TP * PP * CP)
```
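
A minimal sketch of that bookkeeping (`data_parallel_size` here is a
hypothetical helper, not a Bridge API):

```python
def data_parallel_size(world_size: int, tp: int = 1, pp: int = 1, cp: int = 1) -> int:
    """DP is whatever remains after TP, PP, and CP claim their ranks."""
    model_parallel = tp * pp * cp
    if world_size % model_parallel != 0:
        raise ValueError(
            f"world_size {world_size} is not divisible by TP*PP*CP = {model_parallel}"
        )
    return world_size // model_parallel

# The DeepSeek-V2 example above: 128 GPUs with TP=1, PP=4 leaves DP=32.
print(data_parallel_size(128, tp=1, pp=4))  # 32
```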

## Memory Estimation

Without parallelism (70B model, FP16):

```
parameters: 140 GB
gradients: 140 GB
optimizer states: 280 GB (Adam)
activations: 48 GB (batch=1, seq=4K)
total: 608 GB
```

With TP=4, PP=4, DP=4 (64 GPUs):

```
parameters: 8.75 GB per GPU
gradients: 8.75 GB per GPU
optimizer states: 17.50 GB per GPU
activations: 3.00 GB per GPU
total: ~38 GB per GPU
```
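
The same arithmetic as a helper (a sketch; it assumes every component
shards evenly across the TP*PP grid, which matches the numbers above):

```python
def per_gpu_memory_gb(params_gb: float, grads_gb: float, optim_gb: float,
                      acts_gb: float, tp: int = 1, pp: int = 1) -> float:
    """Divide each memory component across the TP*PP shards."""
    shards = tp * pp
    return (params_gb + grads_gb + optim_gb + acts_gb) / shards

# 70B FP16: 608 GB total across TP=4 x PP=4 -> 38 GB per GPU.
print(per_gpu_memory_gb(140, 140, 280, 48, tp=4, pp=4))  # 38.0
```

Real activation memory also depends on microbatch count and
recomputation settings, so treat this as a lower bound.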

## Code Anchors

Parallelism dimensions set in model provider:

```66:81:docs/parallelisms.md
model_config = GPTModelProvider(
    tensor_model_parallel_size=2,
    # ... other model parameters
)
```

DP size calculation:

```424:436:docs/parallelisms.md
data_parallel_size = world_size / (tensor_model_parallel_size × pipeline_model_parallel_size × context_parallel_size)
```

Bridge initialization wires parallelism into process groups:

```618:628:src/megatron/bridge/training/initialize.py
parallel_state.initialize_model_parallel(
    tensor_model_parallel_size=model_config.tensor_model_parallel_size,
    pipeline_model_parallel_size=model_config.pipeline_model_parallel_size,
    ...
    context_parallel_size=model_config.context_parallel_size,
    hierarchical_context_parallel_sizes=model_config.hierarchical_context_parallel_sizes,
    expert_model_parallel_size=model_config.expert_model_parallel_size,
    ...
)
```

## Pitfalls

1. TP across nodes destroys throughput. Always keep TP within a single
   NVLink domain.

2. PP without interleaving has large pipeline bubbles. Use
   `virtual_pipeline_model_parallel_size` when possible.

3. SP requires `tensor_model_parallel_size > 1`. Enabling SP alone
   without TP is a config error.

4. CP requires `seq_length % (2 * context_parallel_size) == 0`.

5. EP is only for MoE models. Setting `expert_model_parallel_size` on a
   dense model is a no-op or error.

6. The model-size-to-parallelism table above is a starting heuristic.
   Always profile the first iteration to check memory and communication.

7. `CUDA_DEVICE_MAX_CONNECTIONS` and related env vars interact with
   overlap settings. See `skills/perf-techniques/tp-dp-comm-overlap.md`.
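
Pitfalls 3 and 4 are mechanical enough to check before launch (a sketch;
`validate_parallelism` is a hypothetical helper, not a Bridge API):

```python
def validate_parallelism(tp: int, cp: int, seq_length: int,
                         sequence_parallel: bool) -> None:
    """Fail fast on the SP and CP constraints listed above."""
    if sequence_parallel and tp <= 1:
        raise ValueError("sequence_parallel=True requires tensor_model_parallel_size > 1")
    if cp > 1 and seq_length % (2 * cp) != 0:
        raise ValueError(f"seq_length {seq_length} must be divisible by 2*CP = {2 * cp}")

validate_parallelism(tp=8, cp=2, seq_length=8192, sequence_parallel=True)  # ok
```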

## Verification

Quick sanity check that combined parallelism initializes correctly using
the smallest available recipe with overridden parallelism:

```bash
CUDA_VISIBLE_DEVICES=0,1,2,3 uv run python -m torch.distributed.run --nproc_per_node=4 \
  scripts/training/run_recipe.py \
  --recipe llama32_1b_pretrain_config \
  model.tensor_model_parallel_size=2 \
  model.pipeline_model_parallel_size=2 \
  model.sequence_parallel=True \
  train.train_iters=3 train.global_batch_size=8 train.micro_batch_size=1 \
  scheduler.lr_warmup_iters=0 \
  validation.eval_iters=0 validation.eval_interval=0 \
  checkpoint.save_interval=0 \
  logger.log_interval=1
```

Success criteria:

- exit code 0
- finite loss at iteration 3 (e.g. `lm loss: 1.003808E+01`)
- log shows TP=2 PP=2 DP=1 layout with 4 ranks