
Commit e397f32

docs: separate technique guidance into docs, skills, and knowledge

Clarify which training guidance is stable documentation versus operational enablement so technique knowledge can be maintained without duplicated docs.

Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>
Made-with: Cursor

1 parent a235e03 · commit e397f32
17 files changed: +1214 −628 lines changed
docs/training/README.md

Lines changed: 2 additions & 0 deletions

```diff
@@ -41,6 +41,7 @@ This directory contains comprehensive documentation for training and customizing
 | **[Optimizer & Scheduler](optimizer-scheduler.md)** | Optimizer and learning rate scheduler configuration | Setting up optimization |
 | **[Mixed Precision](mixed-precision.md)** | Mixed precision training for memory efficiency | Reducing memory usage |
 | **[Communication Overlap](communication-overlap.md)** | Overlapping communication with computation | Optimizing distributed training |
+| **[Hybrid Context Parallel](hybrid-context-parallel.md)** | Hierarchical `a2a+p2p` context parallel guidance | Advanced long-sequence scaling |
 | **[Attention Optimizations](attention-optimizations.md)** | Optimizing attention mechanisms | Improving training speed |
 | **[Activation Recomputation](activation-recomputation.md)** | Gradient checkpointing strategies | Reducing memory footprint |
 | **[CPU Offloading](cpu-offloading.md)** | Offloading to CPU for memory management | Working with limited GPU memory |
@@ -59,6 +60,7 @@ This directory contains comprehensive documentation for training and customizing
 |----------|---------|--------------|
 | **[PEFT](peft.md)** | Parameter-Efficient Fine-Tuning (LoRA, etc.) | Fine-tuning with limited resources |
 | **[Packed Sequences](packed-sequences.md)** | Sequence packing for efficiency | Optimizing data loading |
+| **[Megatron FSDP](megatron-fsdp.md)** | Stable overview of Megatron FSDP | Choosing an FSDP path |
 | **[Distillation](distillation.md)** | Knowledge distillation techniques | Transferring knowledge between models |
 | **[Checkpointing](checkpointing.md)** | Checkpoint saving, loading, and resuming | Managing training state |
 | **[Callbacks](callbacks.md)** | Inject custom logic into training loop | Custom logging, metrics, third-party integrations |
```
docs/training/communication-overlap.md

Lines changed: 75 additions & 203 deletions

Large diffs are not rendered by default.

docs/training/hybrid-context-parallel.md

Lines changed: 64 additions & 0 deletions (new file)
# Hybrid / Hierarchical Context Parallel

This page covers the stable Bridge-facing meaning of hierarchical context parallelism, especially the `a2a+p2p` transport path and `hierarchical_context_parallel_sizes`.

For operational setup, code anchors, and verification commands, see `skills/perf-techniques/hybrid-context-parallel.md`.
## What It Is

In upstream Megatron-Core, `cp_comm_type="a2a+p2p"` plus `hierarchical_context_parallel_sizes` enables a hierarchical context-parallel transport path. This is the Bridge-relevant form of hierarchical context parallelism.

It is important to separate that from the upstream boolean `hybrid_context_parallel`, which is a different feature for balancing packed or variable-length workloads. The two concepts should not be treated as interchangeable.
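As a minimal illustration of the distinction above, the following sketch uses plain Python dicts standing in for the real Megatron-Core/Bridge config objects; only the key names mirror the upstream options named here, everything else is hypothetical:

```python
# Hierarchical context parallelism in the Bridge-relevant sense: a transport
# choice, expressed through two settings that travel together.
hierarchical_cp = {
    "cp_comm_type": "a2a+p2p",  # hierarchical transport path
    # Example shape only: inner all-to-all groups of 2, outer p2p groups of 4.
    "hierarchical_context_parallel_sizes": [2, 4],
}

# The upstream boolean `hybrid_context_parallel` is a different feature
# (balancing packed / variable-length workloads), not a transport selector.
workload_balancing = {"hybrid_context_parallel": True}

# The two configurations touch disjoint keys; treating them as interchangeable
# would conflate a transport path with a load-balancing mechanism.
assert set(hierarchical_cp) & set(workload_balancing) == set()
```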
## When to Use It

Hierarchical context parallelism is relevant when:

- plain context parallelism is already required
- larger CP sizes make flat `p2p` less attractive
- you specifically want the hierarchical `a2a+p2p` transport path

It should be treated as an advanced feature rather than a default recommendation.
## Stable Bridge Limitation

The most important Bridge-specific limitation is that hierarchical context parallelism is currently supported only on the MPU initialization path.

In practice, that means:

- `dist.use_decentralized_pg=False` is the supported Bridge path
- the decentralized process-group path should not be assumed to materialize HCP groups
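A pre-flight guard capturing this limitation could look like the sketch below. The `use_decentralized_pg` name comes from the Bridge option above; the helper itself is hypothetical, not part of any Bridge API:

```python
def check_hcp_supported(use_decentralized_pg: bool,
                        hierarchical_context_parallel_sizes) -> None:
    """Hypothetical pre-flight check: Bridge only materializes HCP groups
    on the MPU initialization path today."""
    if hierarchical_context_parallel_sizes and use_decentralized_pg:
        raise ValueError(
            "Hierarchical context parallelism requires the MPU init path: "
            "set dist.use_decentralized_pg=False"
        )
```

Failing fast like this is preferable to assuming the decentralized process-group path will silently do the right thing.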
## Stable Constraints

The durable constraints are:

- `hierarchical_context_parallel_sizes` must multiply out to `context_parallel_size`
- the usual CP sequence-length divisibility rules still apply
- Transformer Engine version support matters for `a2a+p2p`
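The first two constraints can be sketched as a hypothetical validation helper. The product rule comes straight from the list above; the divisibility check assumes the common Megatron-Core causal load-balancing rule (sequence length divisible by `2 * cp_size`), so treat the exact factor as an assumption to verify against your Megatron-Core version:

```python
import math

def validate_hcp_config(context_parallel_size: int,
                        hierarchical_sizes: list[int],
                        seq_length: int) -> None:
    """Hypothetical sketch of the durable HCP constraints."""
    # The hierarchical group sizes must multiply out to the overall CP size.
    if math.prod(hierarchical_sizes) != context_parallel_size:
        raise ValueError(
            f"prod({hierarchical_sizes}) != "
            f"context_parallel_size={context_parallel_size}"
        )
    # Assumed CP divisibility rule: with causal load balancing each rank
    # holds two sequence chunks, so seq_length must divide by 2 * cp_size.
    if seq_length % (2 * context_parallel_size) != 0:
        raise ValueError(
            f"seq_length={seq_length} not divisible by "
            f"{2 * context_parallel_size}"
        )
```

For example, `context_parallel_size=8` with `hierarchical_sizes=[2, 4]` passes, while `[2, 2]` fails the product check.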
## Recommendation Level

Use hierarchical context parallelism in Bridge only when you intentionally want that transport path and are prepared to validate execution-path details. It is not yet the kind of feature that should be presented as universally safe across all Bridge initialization modes.
## Related Docs

- `docs/performance-guide.md`
- `docs/training/communication-overlap.md`
- `skills/perf-techniques/hybrid-context-parallel.md`
- `knowledge/techniques/hybrid_context_parallel.yaml`
