2 changes: 2 additions & 0 deletions docs/index.md
@@ -44,6 +44,8 @@ training/checkpointing.md
training/megatron-fsdp.md
training/resiliency.md
training/mixed-precision.md
training/cuda-graphs.md
training/hybrid-context-parallel.md
training/communication-overlap.md
training/attention-optimizations.md
training/activation-recomputation.md
52 changes: 52 additions & 0 deletions docs/parallelisms.md
@@ -435,6 +435,53 @@ For example, with 32 GPUs total and the configuration above:
- `context_parallel_size = 2`
- `data_parallel_size = 32 / (2 × 4 × 2) = 2`
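
The divisibility constraint above can be sketched as a quick pre-launch check. This is an illustrative helper, not a Megatron-Bridge API; the function and argument names are assumptions:

```python
def data_parallel_size(world_size: int, tp: int, pp: int, cp: int) -> int:
    """Derive the data-parallel size from the remaining parallelism dimensions."""
    model_parallel = tp * pp * cp
    if world_size % model_parallel != 0:
        raise ValueError(
            f"world_size={world_size} is not divisible by TP*PP*CP={model_parallel}"
        )
    return world_size // model_parallel

# The example above: 32 GPUs with TP=4, PP=2, CP=2
print(data_parallel_size(32, tp=4, pp=2, cp=2))  # → 2
```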

## Strategy Selection Guide

Choosing the right combination depends on model size, hardware topology,
and sequence length.

### Dense Models by Size

| Model size | GPUs | Recommended starting point |
|---|---|---|
| < 1B | 1-8 | DP only |
| 1-10B | 8-16 | TP=2-4 + DP |
| 10-70B | 16-64 | TP=4-8 + PP=2-4 + DP |
| 70-175B | 64-256 | TP=8 + PP=4-8 + DP |
| 175-500B | 256-1024 | TP=8 + PP=8-16 + CP=2 + DP |
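
The table can be read as a simple threshold lookup. The sketch below is illustrative only; the thresholds are taken from the table, but the function name and returned keys are assumptions, not a library API:

```python
def dense_starting_point(num_params_b: float) -> dict:
    """Map a dense model size (billions of parameters) to a starting layout."""
    if num_params_b < 1:
        return {"tp": 1, "pp": 1}        # DP only
    if num_params_b < 10:
        return {"tp": 4, "pp": 1}        # TP=2-4 + DP
    if num_params_b < 70:
        return {"tp": 8, "pp": 2}        # TP=4-8 + PP=2-4 + DP
    if num_params_b < 175:
        return {"tp": 8, "pp": 4}        # TP=8 + PP=4-8 + DP
    return {"tp": 8, "pp": 8, "cp": 2}   # TP=8 + PP=8-16 + CP=2 + DP
```

These are starting points; the data-parallel size then falls out of the world size divided by the product of the other dimensions.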

### MoE Models

MoE models differ fundamentally from dense models: only a fraction of the
parameters is active per token, so tensor parallelism (TP) can often stay at
1 or 2. Expert parallelism (EP) is the primary scaling dimension.

| Total / active params | Typical layout |
|---|---|
| < 20B | EP only (TP=1, PP=1) |
| 20-100B | TP=1-2 + PP=2-4 + EP=8-16 |
| 100-500B | TP=2-4 + PP=8-16 + EP=8-32 |
| 500B+ | TP=2 + PP=16 + EP=32-64 |
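
One constraint worth checking when picking an EP size: the number of experts must be evenly divisible by the expert-parallel size. A minimal sketch of that check (the helper name is hypothetical, not a Megatron API):

```python
def experts_per_rank(num_experts: int, ep: int) -> int:
    """Return how many experts each expert-parallel rank hosts."""
    if num_experts % ep != 0:
        raise ValueError(
            f"num_experts={num_experts} must be divisible by EP={ep}"
        )
    return num_experts // ep

# e.g. a 64-expert model with EP=16 places 4 experts on each rank
print(experts_per_rank(64, 16))  # → 4
```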

### By Hardware Topology

- **Single node with NVLink**: maximize TP within the node (up to TP=8).
- **Multiple nodes with InfiniBand**: keep TP within a node, use PP across nodes.
- **Limited network (Ethernet)**: minimize TP, prefer PP for cross-node scaling.

### By Sequence Length

| Sequence length | Recommendation |
|---|---|
| < 2K | standard TP + PP + DP |
| 2K-8K | add SP (`sequence_parallel=True`) |
| 8K-32K | add CP=2 |
| 32K+ | add CP=4-8, consider hierarchical CP |
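
The sequence-length table can likewise be encoded as a lookup. The option names `sequence_parallel` and `context_parallel_size` match the configuration fields used elsewhere in this document; the helper itself is an illustrative sketch, not a provided API:

```python
def long_sequence_hint(seq_len: int) -> dict:
    """Suggest sequence/context-parallel settings from the table above."""
    if seq_len < 2048:
        return {}  # standard TP + PP + DP
    if seq_len <= 8192:
        return {"sequence_parallel": True}
    if seq_len <= 32768:
        return {"sequence_parallel": True, "context_parallel_size": 2}
    # 32K+: CP=4-8; consider hierarchical CP
    return {"sequence_parallel": True, "context_parallel_size": 4}
```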

For operational details on configuring combined parallelism, troubleshooting
layouts, and memory estimation, see the
[parallelism strategies skill](../skills/perf-techniques/parallelism-strategies/SKILL.md).

Bug: This links to ../skills/perf-techniques/parallelism-strategies.md, but the actual file is skills/perf-techniques/parallelism-strategies/SKILL.md. The link will 404.

Suggested change
[parallelism strategies skill](../skills/perf-techniques/parallelism-strategies/SKILL.md).

## Configuration Guidelines

### Memory Optimization
@@ -458,6 +505,11 @@ For example, with 32 GPUs total and the configuration above:
- **Token dropping** requires `alltoall` or `alltoall_seq` token dispatcher
- All parallelism strategies can be combined, but total parallelism must divide evenly into the world size

## Related Artifacts

- **Operational skill**: [skills/perf-techniques/parallelism-strategies/SKILL.md](../skills/perf-techniques/parallelism-strategies/SKILL.md) — enablement, pitfalls, memory estimation, verification
- **Knowledge card**: [skills/perf-techniques/parallelism-strategies/card.yaml](../skills/perf-techniques/parallelism-strategies/card.yaml) — structured metadata and validation status

Comment on lines +508 to +512
⚠️ Potential issue | 🟡 Minor

Fix broken cross-reference to knowledge card.

The pipeline failure indicates the path ../knowledge/techniques/parallelism_strategies.yaml does not exist. Based on the README at knowledge/techniques/README.md, cards have been relocated to skills/perf-techniques/<technique>/card.yaml. Update the reference to point to the correct location.

🔧 Proposed fix
 ## Related Artifacts
 
 - **Operational skill**: [skills/perf-techniques/parallelism-strategies.md](../skills/perf-techniques/parallelism-strategies.md) — enablement, pitfalls, memory estimation, verification
-- **Knowledge card**: [knowledge/techniques/parallelism_strategies.yaml](../knowledge/techniques/parallelism_strategies.yaml) — structured metadata and validation status
+- **Knowledge card**: [skills/perf-techniques/parallelism-strategies/card.yaml](../skills/perf-techniques/parallelism-strategies/card.yaml) — structured metadata and validation status
🧰 Tools
🪛 GitHub Actions: Build docs

[warning] 511-511: Sphinx: cross-reference target not found: '../knowledge/techniques/parallelism_strategies.yaml' [myst.xref_missing]


## Resources

- [Megatron Core Developer Guide](https://docs.nvidia.com/megatron-core/developer-guide/latest/)
4 changes: 2 additions & 2 deletions docs/performance-guide.md
@@ -299,7 +299,7 @@ Additionally, because CP shards activations, it also partitions optimizer states

> 1. Megatron-Bridge supports graph capture, significantly reducing host overhead. CUDA Graph is applicable only to LLMs with a static tensor shape across training steps. For example, it supports fixed-size packed sequences but does not handle sequences with varying lengths at each step. Also, MoE models with token-dropless propagation have limited CUDA graph support, restricted to the dense modules only.
> 2. CUDA graph requires additional memory for static buffer management, typically adding a few gigabytes for static buffers, while models with PP size > 1 may consume over 10GB. We are actively working to reduce this memory overhead.
> 3. `TransformerConfig.enable_cuda_graph=true`
> 3. See [CUDA Graphs](training/cuda-graphs.md) for configuration details (`cuda_graph_impl`, `cuda_graph_scope`).

5. Bind CPU memory for GPU processes

@@ -677,7 +677,7 @@ python -u /home/dpsk_a2a/deepep/tests/test_internode.py
- `TransformerConfig.cpu_offloading_num_layers`
- `TransformerConfig.cpu_offloading_weights`
- `GPTProvider.cross_entropy_loss_fusion`
- `TransformerConfig.enable_cuda_graph`
- `TransformerConfig.cuda_graph_impl` / `cuda_graph_scope` (see [CUDA Graphs](training/cuda-graphs.md))
- `MixedPrecisionConfig.fp8_param_gather`
- `GPTProvider.gradient_accumulation_fusion`
- `TransformerConfig.masked_softmax_fusion`
2 changes: 2 additions & 0 deletions docs/training/README.md
@@ -41,6 +41,7 @@ This directory contains comprehensive documentation for training and customizing
| **[Optimizer & Scheduler](optimizer-scheduler.md)** | Optimizer and learning rate scheduler configuration | Setting up optimization |
| **[Mixed Precision](mixed-precision.md)** | Mixed precision training for memory efficiency | Reducing memory usage |
| **[Communication Overlap](communication-overlap.md)** | Overlapping communication with computation | Optimizing distributed training |
| **[Hybrid Context Parallel](hybrid-context-parallel.md)** | Hierarchical `a2a+p2p` context parallel guidance | Advanced long-sequence scaling |
| **[Attention Optimizations](attention-optimizations.md)** | Optimizing attention mechanisms | Improving training speed |
| **[Activation Recomputation](activation-recomputation.md)** | Gradient checkpointing strategies | Reducing memory footprint |
| **[CPU Offloading](cpu-offloading.md)** | Offloading to CPU for memory management | Working with limited GPU memory |
@@ -59,6 +60,7 @@ This directory contains comprehensive documentation for training and customizing
|----------|---------|--------------|
| **[PEFT](peft.md)** | Parameter-Efficient Fine-Tuning (LoRA, etc.) | Fine-tuning with limited resources |
| **[Packed Sequences](packed-sequences.md)** | Sequence packing for efficiency | Optimizing data loading |
| **[Megatron FSDP](megatron-fsdp.md)** | Stable overview of Megatron FSDP | Choosing an FSDP path |
| **[Distillation](distillation.md)** | Knowledge distillation techniques | Transferring knowledge between models |
| **[Checkpointing](checkpointing.md)** | Checkpoint saving, loading, and resuming | Managing training state |
| **[Callbacks](callbacks.md)** | Inject custom logic into training loop | Custom logging, metrics, third-party integrations |