diff --git a/.gitignore b/.gitignore
index a6d79ddb5b..40cd346861 100644
--- a/.gitignore
+++ b/.gitignore
@@ -91,6 +91,7 @@ instance/
 # Sphinx documentation
 docs/_build
 docs/apidocs
+docs/skills
 
 # PyBuilder
 target/
diff --git a/docs/conf.py b/docs/conf.py
index 5452109236..23b5f156dd 100644
--- a/docs/conf.py
+++ b/docs/conf.py
@@ -18,9 +18,19 @@
 # https://www.sphinx-doc.org/en/master/usage/configuration.html
 
 import os
+import shutil
 import sys
 
 
+_SKILLS_SRC = os.path.join(os.path.dirname(__file__), os.pardir, "skills")
+_SKILLS_DST = os.path.join(os.path.dirname(__file__), "skills")
+
+if os.path.isdir(_SKILLS_SRC):
+    if os.path.exists(_SKILLS_DST):
+        shutil.rmtree(_SKILLS_DST)
+    shutil.copytree(_SKILLS_SRC, _SKILLS_DST)
+
+
 # -- Project information -----------------------------------------------------
 # https://www.sphinx-doc.org/en/master/usage/configuration.html#project-information
 
@@ -43,7 +53,16 @@
 ]
 
 templates_path = ["_templates"]
-exclude_patterns = ["_build", "Thumbs.db", ".DS_Store"]
+exclude_patterns = [
+    "_build",
+    "Thumbs.db",
+    ".DS_Store",
+    "skills/perf-techniques/README.md",
+]
+
+suppress_warnings = [
+    "misc.highlighting_failure",  # skills use Cursor-specific code reference syntax
+]
 
 # -- Options for MyST Parser (Markdown) --------------------------------------
 # MyST Parser settings
diff --git a/docs/index.md b/docs/index.md
index ce46e8c346..e8208eb314 100644
--- a/docs/index.md
+++ b/docs/index.md
@@ -44,6 +44,8 @@ training/checkpointing.md
 training/megatron-fsdp.md
 training/resiliency.md
 training/mixed-precision.md
+training/cuda-graphs.md
+training/hybrid-context-parallel.md
 training/communication-overlap.md
 training/attention-optimizations.md
 training/activation-recomputation.md
@@ -82,6 +84,13 @@ releases/changelog.md
 releases/known-issues.md
 ```
 
+```{toctree}
+:caption: Agent Skills
+:hidden:
+
+skills-index
+```
+
 ```{toctree}
 :caption: Directory Readme Files
 :hidden:
diff --git a/docs/parallelisms.md b/docs/parallelisms.md
index e3fa7dea79..ac601d37c0 100644
--- a/docs/parallelisms.md
+++ b/docs/parallelisms.md
@@ -435,6 +435,53 @@ For example, with 32 GPUs total and the configuration above:
 - `context_parallel_size = 2`
 - `data_parallel_size = 32 / (2 × 4 × 2) = 2`
 
+## Strategy Selection Guide
+
+Choosing the right combination depends on model size, hardware topology,
+and sequence length.
+
+### Dense Models by Size
+
+| Model size | GPUs | Recommended starting point |
+|---|---|---|
+| < 1B | 1-8 | DP only |
+| 1-10B | 8-16 | TP=2-4 + DP |
+| 10-70B | 16-64 | TP=4-8 + PP=2-4 + DP |
+| 70-175B | 64-256 | TP=8 + PP=4-8 + DP |
+| 175-500B | 256-1024 | TP=8 + PP=8-16 + CP=2 + DP |
+
+### MoE Models
+
+MoE models differ fundamentally from dense models: only a fraction of
+parameters are active per token, so TP can often stay at 1 or 2. EP is
+the primary scaling dimension.
+
+| Total / active params | Typical layout |
+|---|---|
+| < 20B | EP only (TP=1, PP=1) |
+| 20-100B | TP=1-2 + PP=2-4 + EP=8-16 |
+| 100-500B | TP=2-4 + PP=8-16 + EP=8-32 |
+| 500B+ | TP=2 + PP=16 + EP=32-64 |
+
+### By Hardware Topology
+
+- **Single node with NVLink**: maximize TP within the node (up to TP=8).
+- **Multiple nodes with InfiniBand**: keep TP within a node, use PP across nodes.
+- **Limited network (Ethernet)**: minimize TP, prefer PP for cross-node scaling.
+
+### By Sequence Length
+
+| Sequence length | Recommendation |
+|---|---|
+| < 2K | standard TP + PP + DP |
+| 2K-8K | add SP (`sequence_parallel=True`) |
+| 8K-32K | add CP=2 |
+| 32K+ | add CP=4-8, consider hierarchical CP |
+
+For operational details on configuring combined parallelism, troubleshooting
+layouts, and memory estimation, see the
+[parallelism strategies skill](skills/perf-techniques/parallelism-strategies/SKILL.md).
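The `data_parallel_size` arithmetic shown at the top of this hunk (32 GPUs, model-parallel product 2 × 4 × 2) can be sketched as a small check. This is a minimal illustration, not Megatron Bridge API: `data_parallel_size` and its argument names are hypothetical, and which of TP or PP takes the value 4 is an assumption, since only the product matters here.

```python
def data_parallel_size(world_size: int, tp: int, pp: int, cp: int) -> int:
    """Return the DP size implied by a world size and model-parallel degrees.

    Hypothetical helper for illustration; not part of Megatron Bridge.
    """
    model_parallel = tp * pp * cp
    if world_size % model_parallel != 0:
        # The model-parallel product must divide the world size evenly.
        raise ValueError(
            f"world_size={world_size} is not divisible by "
            f"tp*pp*cp={model_parallel}"
        )
    return world_size // model_parallel


# 32 GPUs with a model-parallel product of 2 * 4 * 2 = 16 leaves DP = 2.
dp = data_parallel_size(world_size=32, tp=2, pp=4, cp=2)
```

Running a check like this before launch catches layouts from the tables above that do not divide the available GPU count.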
+
 ## Configuration Guidelines
 
 ### Memory Optimization
@@ -458,6 +505,11 @@ For example, with 32 GPUs total and the configuration above:
 - **Token dropping** requires `alltoall` or `alltoall_seq` token dispatcher
 - All parallelism strategies can be combined, but total parallelism must divide evenly into the world size
 
+## Related Artifacts
+
+- **Operational skill**: [skills/perf-techniques/parallelism-strategies/SKILL.md](skills/perf-techniques/parallelism-strategies/SKILL.md) — enablement, pitfalls, memory estimation, verification
+- **Knowledge card**: [skills/perf-techniques/parallelism-strategies/card.yaml](skills/perf-techniques/parallelism-strategies/card.yaml) — structured metadata and validation status
+
 ## Resources
 
 - [Megatron Core Developer Guide](https://docs.nvidia.com/megatron-core/developer-guide/latest/)
diff --git a/docs/performance-guide.md b/docs/performance-guide.md
index cf3a62ba65..94254382b0 100644
--- a/docs/performance-guide.md
+++ b/docs/performance-guide.md
@@ -299,7 +299,7 @@ Additionally, because CP shards activations, it also partitions optimizer states
 
    > 1. Megatron-Bridge supports graph capture, significantly reducing host overhead. CUDA Graph is applicable only to LLMs with a static tensor shape across training steps. For example, it supports fixed-size packed sequences but does not handle sequences with varying lengths at each step. Also, MoE models with token-dropless propagation have limited CUDA graph support, restricted to the dense modules only.
    > 2. CUDA graph requires additional memory for static buffer management, typically adding a few gigabytes for static buffers, while models with PP size > 1 may consume over 10GB. We are actively working to reduce this memory overhead.
-   > 3. `TransformerConfig.enable_cuda_graph=true`
+   > 3. See [CUDA Graphs](training/cuda-graphs.md) for configuration details (`cuda_graph_impl`, `cuda_graph_scope`).
 
 5. Bind CPU memory for GPU processes
 
@@ -677,7 +677,7 @@ python -u /home/dpsk_a2a/deepep/tests/test_internode.py
 - `TransformerConfig.cpu_offloading_num_layers`
 - `TransformerConfig.cpu_offloading_weights`
 - `GPTProvider.cross_entropy_loss_fusion`
-- `TransformerConfig.enable_cuda_graph`
+- `TransformerConfig.cuda_graph_impl` / `cuda_graph_scope` (see [CUDA Graphs](training/cuda-graphs.md))
 - `MixedPrecisionConfig.fp8_param_gather`
 - `GPTProvider.gradient_accumulation_fusion`
 - `TransformerConfig.masked_softmax_fusion`
diff --git a/docs/skills-index.md b/docs/skills-index.md
new file mode 100644
index 0000000000..4cad929006
--- /dev/null
+++ b/docs/skills-index.md
@@ -0,0 +1,28 @@
+# Agent Skills Reference
+
+Operational guides and validated knowledge cards for Megatron Bridge.
+
+Each skill contains enablement snippets, code anchors, constraints, pitfalls,
+and verification steps.
+
+```{toctree}
+:caption: Performance Techniques
+:maxdepth: 1
+
+skills/perf-techniques/parallelism-strategies/SKILL
+skills/perf-techniques/cuda-graphs/SKILL
+skills/perf-techniques/tp-dp-comm-overlap/SKILL
+skills/perf-techniques/megatron-fsdp/SKILL
+skills/perf-techniques/packed-sequences-long-context/SKILL
+skills/perf-techniques/sequence-packing/SKILL
+skills/perf-techniques/hybrid-context-parallel/SKILL
+skills/perf-techniques/expert-parallel-overlap/SKILL
+skills/perf-techniques/moe-comm-overlap/SKILL
+```
+
+```{toctree}
+:caption: Resiliency
+:maxdepth: 1
+
+skills/resiliency/SKILL
+```
diff --git a/docs/training/README.md b/docs/training/README.md
index e8e086ba8c..5eda8095a5 100644
--- a/docs/training/README.md
+++ b/docs/training/README.md
@@ -41,6 +41,7 @@ This directory contains comprehensive documentation for training and customizing
 | **[Optimizer & Scheduler](optimizer-scheduler.md)** | Optimizer and learning rate scheduler configuration | Setting up optimization |
 | **[Mixed Precision](mixed-precision.md)** | Mixed precision training for memory efficiency | Reducing memory usage |
 | **[Communication Overlap](communication-overlap.md)** | Overlapping communication with computation | Optimizing distributed training |
+| **[Hybrid Context Parallel](hybrid-context-parallel.md)** | Hierarchical `a2a+p2p` context parallel guidance | Advanced long-sequence scaling |
 | **[Attention Optimizations](attention-optimizations.md)** | Optimizing attention mechanisms | Improving training speed |
 | **[Activation Recomputation](activation-recomputation.md)** | Gradient checkpointing strategies | Reducing memory footprint |
 | **[CPU Offloading](cpu-offloading.md)** | Offloading to CPU for memory management | Working with limited GPU memory |
@@ -59,6 +60,7 @@ This directory contains comprehensive documentation for training and customizing
 |----------|---------|--------------|
 | **[PEFT](peft.md)** | Parameter-Efficient Fine-Tuning (LoRA, etc.) | Fine-tuning with limited resources |
 | **[Packed Sequences](packed-sequences.md)** | Sequence packing for efficiency | Optimizing data loading |
+| **[Megatron FSDP](megatron-fsdp.md)** | Stable overview of Megatron FSDP | Choosing an FSDP path |
 | **[Distillation](distillation.md)** | Knowledge distillation techniques | Transferring knowledge between models |
 | **[Checkpointing](checkpointing.md)** | Checkpoint saving, loading, and resuming | Managing training state |
 | **[Callbacks](callbacks.md)** | Inject custom logic into training loop | Custom logging, metrics, third-party integrations |
diff --git a/docs/training/communication-overlap.md b/docs/training/communication-overlap.md
index 198d9cd347..5009c05370 100644
--- a/docs/training/communication-overlap.md
+++ b/docs/training/communication-overlap.md
@@ -1,232 +1,111 @@
 # Communication Overlap
 
-Megatron Bridge supports overlapping communication with computation in distributed training to improve performance and throughput. This optimization technique reduces the impact of inter-GPU communication overhead by executing communication operations concurrently with computational operations whenever possible.
+Communication overlap reduces exposed communication cost in distributed training
+by overlapping collectives or point-to-point transfers with useful compute.
+Megatron Bridge supports overlap across several parallelism dimensions, but the
+available behavior is not identical for every mode.
 
-Communication overlap is managed through the {py:class}`bridge.training.comm_overlap.CommOverlapConfig` class and can be applied to different types of parallelism: tensor parallelism (TP), pipeline parallelism (PP), data parallelism (DP), and context parallelism (CP).
+This page is the stable overview for what communication overlap is, when to use
+it, and which constraints are durable. For operational setup, code anchors, and
+verification commands, see:
 
-## Data-parallel Communication Overlap
+- [skills/perf-techniques/tp-dp-comm-overlap/SKILL.md](../skills/perf-techniques/tp-dp-comm-overlap/SKILL.md)
+- [skills/perf-techniques/expert-parallel-overlap/SKILL.md](../skills/perf-techniques/expert-parallel-overlap/SKILL.md)
 
-Megatron Bridge supports the overlap of data-parallel (DP) communications with computations in LLM training. The framework features a Distributed Optimizer that distributes optimizer states and high-precision master parameters across GPUs. This introduces two types of data-parallel communications: reduce-scatter of gradients and all-gather of updated parameters.
+## What It Is
 
-The DP communication is chunked by the granularity of a Transformer layer and overlaps each communication chunk with computation. This overlap method exposes only one DP communication chunk ensuring efficient large-scale LLM training. When training with pipeline parallelism, the granularity of DP communication becomes the Transformer layers per virtual pipeline stage.
 
+In Bridge, communication overlap spans several related subfeatures:
 
-### Configuration
+- data-parallel overlap for gradient reduce-scatter and parameter all-gather
+- tensor-parallel overlap for TP communication under GEMM work
+- pipeline-parallel overlap for PP send and receive behavior
+- context-parallel overlap built into context-parallel execution paths
+- MoE expert-parallel overlap for expert token dispatch communication
 
-DP communication overlap settings can be inspected in Megatron Core via the `DistributedDataParallelConfig` class. DP gradient reduce-scatter and parameter all-gather overlaps are enabled when setting `overlap_grad_reduce=True` and `overlap_param_gather=True`, respectively. The precision of gradient reduce-scatter is controlled by `grad_reduce_in_fp32`. When `grad_reduce_in_fp32=False`, gradients are reduced in bf16, leading to improved performance in large-scale training compared to the default fp32 precision. When training in fp8 computing precision, setting `fp8_param_gather=True` conducts the parameter all-gather in fp8, reducing the all-gather overhead by half.
+These are related performance techniques, but they do not share the same gates,
+defaults, or operational risks.
 
-Data parallel communication overlap settings are controlled through the distributed data parallel and communication overlap configurations.
+## When to Use It
 
-```{note}
-Data-parallel overlap relies on attributes such as `grad_reduce_in_fp32` and `fp8_param_gather`. When a mixed-precision recipe (for example `bf16_mixed`, `fp16_mixed`, `bf16_with_fp8_delayed_scaling_mixed`, etc.) is provided, those attributes are sourced from the recipe stored in the `MixedPrecisionConfig`. Set the desired values inside the mixed-precision configuration rather than overriding them directly on the optimizer or DDP configs. This ensures the communication overlap settings and the selected precision recipe remain consistent.
-```
+Communication overlap is a good fit when:
 
-For example:
+- the model already needs TP, DP, PP, CP, or EP for scale
+- communication is a meaningful part of step time
+- correctness is already established and you are tuning for throughput
 
-```python
-from megatron.bridge.training.config import ConfigContainer, OptimizerConfig
-from megatron.bridge.training.comm_overlap import CommOverlapConfig
-from megatron.bridge.training.mixed_precision import get_mixed_precision_config
+It is less appropriate when:
 
-# Configure communication overlap
-comm_overlap_config = CommOverlapConfig(
-    tp_comm_overlap=False,  # Tensor parallel overlap
-    overlap_grad_reduce=True,  # Gradient reduce-scatter overlap
-    overlap_param_gather=True,  # Parameter all-gather overlap
-    overlap_param_gather_with_optimizer_step=False,  # Advanced optimization
-    bucket_size=128 * 1024 * 1024,  # 128MB bucket size
-)
+- you are still bringing up a new training path and want minimal moving parts
+- the feature combination is branch-sensitive or weakly validated
+- launch-time environment tuning is likely to conflict with another technique
 
-# Configure distributed optimizer
-optimizer_config = OptimizerConfig(
-    optimizer="adam",
-    lr=3e-4,
-    use_distributed_optimizer=True,  # Required for DP overlap
-    # ... other optimizer parameters
-)
+## Stable Per-Mode Guidance
 
-# Mixed precision configuration controls overlap-related attributes
-mixed_precision_config = get_mixed_precision_config("bf16_mixed")
-mixed_precision_config.grad_reduce_in_fp32 = False  # Use bf16 for gradient reduction
-mixed_precision_config.fp8_param_gather = False
+### Data Parallel
 
-config = ConfigContainer(
-    comm_overlap=comm_overlap_config,
-    optimizer=optimizer_config,
-    mixed_precision=mixed_precision_config,
-    # ... other config parameters
-)
-```
+DP overlap is tied to the distributed-optimizer path. It is the natural overlap
+mechanism for sharded optimizer-state training and should be reasoned about
+together with distributed optimizer behavior rather than as an isolated knob.
 
-Key data parallel overlap options:
+### Tensor Parallel
 
-- `overlap_grad_reduce`: Overlaps gradient reduce-scatter with computation (default: True)
-- `overlap_param_gather`: Overlaps parameter all-gather with computation (default: True)
-- `overlap_param_gather_with_optimizer_step`: Advanced optimization for pipeline parallelism
-- `bucket_size`: Controls the granularity of communication chunking (default: 128MB)
-- `grad_reduce_in_fp32`: Controls gradient reduction precision (False for bf16, True for fp32)
-- `fp8_param_gather`: Enables fp8 parameter all-gather for reduced communication overhead
+TP overlap is conceptually tied to sequence parallelism. If sequence
+parallelism is not available or not enabled, TP overlap should not be assumed to
+remain active.
 
-## Tensor-parallel Communication Overlap
+### Pipeline Parallel
 
-Tensor parallelism, used with sequence-parallel activation sharding (`sequence_parallel=True`), introduces activation (gradient) all-gather and reduce-scatter operations. Megatron Bridge provides various options to overlap the tensor-parallel (TP) communications with computation.
+PP overlap is not a blanket property of all pipeline-parallel training. In
+practice, interleaved pipeline schedules are the most important positive case.
 
-![Tensor-parallel Communication Overlap](images/tp_comm_overlap.png)
-*Figure: Tensor-parallel communication overlap showing bulk and pipelined overlap strategies.*
+### Context Parallel
 
-The TP communication without direct computation dependency are overlapped with the computation in bulk (the linear layer and TP communication pairs in the yellow boxes). The bulk TP communication is enabled by default. The other TP communications with direct computation dependency are overlapped in pipelined fashion (the linear layer and TP communication pairs in the red boxes).
+CP overlap is part of Bridge's context-parallel execution model rather than a
+separate standalone technique page. For hierarchical or `a2a+p2p` CP guidance,
+see `docs/training/hybrid-context-parallel.md`.
 
-In the pipelined overlap, the activation (gradient) tensor all-gather is replaced with multiple steps of input P2P ring exchanges, and reduce-scatter is replaced with multiple steps of GEMM output P2P ring exchanges followed by a reduction of the received outputs.
+### MoE Expert Parallel
 
-### Configuration
+MoE expert-parallel overlap hides the cost of token dispatch/combine all-to-all
+communication by overlapping it with expert FFN compute. Optionally, delayed
+expert weight-gradient computation (`moe_delay_wgrad_compute`) provides
+additional overlap.
 
-```python
-from megatron.bridge.training.comm_overlap import (
-    CommOverlapConfig,
-    TransformerLayerTPOverlapCfg,
-    userbuffers_bf16_h100_h8192_tp4_mbs1_seqlen8192
-)
+MoE overlap should be treated separately from generic TP, DP, and PP overlap.
+Its constraints depend on dispatcher choice (`alltoall` or `flex`), expert
+parallelism degree, precision (BF16/FP16), and runtime support. When pipeline
+parallelism is used, virtual pipeline parallelism is required for the overlap
+scheduling to interleave correctly.
 
-# Configure tensor parallel overlap
-comm_overlap_config = CommOverlapConfig(
-    tp_comm_overlap=True,  # Enable TP communication overlap
-    tp_comm_overlap_cfg=userbuffers_bf16_h100_h8192_tp4_mbs1_seqlen8192,  # Predefined config
-    tp_comm_bootstrap_backend="nccl",  # Communication backend
-)
-```
+## Stable Constraints and Caveats
 
-Requirements for TP communication overlap:
-- `tensor_model_parallel_size >= 2`
-- `sequence_parallel=True`
-- Appropriate hardware configuration
+The most durable caveats are:
 
-### Advanced Configuration
+1. Not all overlap modes are auto-enabled in the same situations.
+2. Some overlap-related precision settings are owned by mixed-precision config,
+   not by standalone overlap tuning alone.
+3. Launch-time environment settings are part of the technique in practice,
+   especially for TP, CP, and MoE overlap paths.
+4. Recipe defaults are often conservative; feature existence does not imply that
+   every public recipe enables the corresponding overlap path.
 
-For most use cases, setting `tp_comm_overlap=True` with `tp_comm_overlap_cfg=None` (the default) will automatically configure appropriate overlap settings. For advanced users requiring custom optimization, Megatron Bridge includes predefined configurations optimized for specific hardware and model combinations. These configurations are available in the `comm_overlap` module but require expert knowledge to use effectively.
-
-## Pipeline-parallel Communication Overlap
-
-Pipeline parallelism introduces P2P activation (gradient) sends and receives between pipeline-parallel (PP) GPUs. The PP communication frequency increases when increasing the virtual-pipeline-parallel size because the number of Transformer layers executed per micro-batch decreases.
-
-![Pipeline-parallel Communication Overlap](images/pp_comm_overlap.png)
-*Figure: Pipeline-parallel communication overlap in 1F1B pipelining phase.*
-
-Megatron Bridge supports the overlap of PP communications with non-dependent computations in the 1F1B stage (the body of pipelining, where 1 forward and 1 backward micro-batch executions are interleaved). The PP communications in pipeline fill and flush stages are still exposed.
-
-### Configuration
-
-```python
-comm_overlap_config = CommOverlapConfig(
-    tp_comm_overlap=False,
-    overlap_p2p_comm=True,  # Enable PP communication overlap
-    batch_p2p_comm=False,  # Use separate send/receive kernels
-)
-```
-
-PP communication overlap settings:
-- `overlap_p2p_comm`: Enables overlap of P2P communications (default: auto-configured)
-- `batch_p2p_comm`: Uses batched vs separate kernels (default: auto-configured based on virtual PP)
-
-The overlap is automatically enabled when:
-- `pipeline_model_parallel_size > 1`
-- `virtual_pipeline_model_parallel_size > 1` (for optimal performance)
-
-## Context-parallel Communication Overlap
-
-Context parallelism partitions activations (gradients) on all layers in the sequence domain. This introduces all-gather and reduce-scatter of activations (gradients) in self-attention forward- and back-propagations.
-
-Megatron Bridge hides the context-parallel (CP) communications under the self-attention computation. Like the TP communication overlaps, the CP communications are chunked then pipeline-overlapped with the self-attention computation, where the all-gather and the reduce-scatter of activations (gradients) are replaced with P2P ring exchanges of data.
-
-### Automatic Configuration
-
-The CP communication overlap is automatically enabled when context parallelism is used (`context_parallel_size > 1`). No additional configuration is required as the overlap is built into the context parallelism implementation.
-
-## MoE Expert Parallel Communication Overlap
-
-For Mixture of Experts (MoE) models, Megatron Bridge supports overlapping expert parallel all-to-all communications with computation.
-
-### Configuration
-
-```python
-comm_overlap_config = CommOverlapConfig(
-    tp_comm_overlap=False,
-    overlap_moe_expert_parallel_comm=True,  # Enable MoE EP overlap
-    delay_wgrad_compute=True,  # Advanced MoE optimization
-)
-```
-
-Requirements for MoE expert parallel overlap:
-- `expert_model_parallel_size > 1`
-- `num_moe_experts > 1`
-- `moe_token_dispatcher_type` in ["alltoall", "flex"]
-- BF16 or FP16 precision
-- PyTorch >= 2.6.0
-- Specific recomputation settings
-
-## Complete Configuration Example
-
-Here's a comprehensive example combining multiple communication overlap strategies:
-
-```python
-from megatron.bridge.training.config import ConfigContainer, OptimizerConfig
-from megatron.bridge.training.comm_overlap import (
-    CommOverlapConfig,
-    userbuffers_bf16_h100_h8192_tp4_mbs1_seqlen8192
-)
-from megatron.bridge.models import GPTModelProvider
-
-# Model configuration with parallelism
-model_config = GPTModelProvider(
-    # Parallelism settings
-    tensor_model_parallel_size=4,
-    pipeline_model_parallel_size=2,
-    virtual_pipeline_model_parallel_size=2,
-    context_parallel_size=2,
-    sequence_parallel=True,
-
-    # Model parameters
-    hidden_size=8192,
-    num_layers=32,
-    # ... other model parameters
-)
-
-# Communication overlap configuration
-comm_overlap_config = CommOverlapConfig(
-    # Tensor parallel overlap
-    tp_comm_overlap=True,
-    tp_comm_overlap_cfg=userbuffers_bf16_h100_h8192_tp4_mbs1_seqlen8192,
-
-    # Pipeline parallel overlap
-    overlap_p2p_comm=True,
-    batch_p2p_comm=False,
-
-    # Data parallel overlap
-    overlap_grad_reduce=True,
-    overlap_param_gather=True,
-    bucket_size=128 * 1024 * 1024,
-)
-
-# Optimizer with distributed settings
-optimizer_config = OptimizerConfig(
-    optimizer="adam",
-    lr=3e-4,
-    use_distributed_optimizer=True,
-)
-
-# Complete configuration
-config = ConfigContainer(
-    model=model_config,
-    comm_overlap=comm_overlap_config,
-    optimizer=optimizer_config,
-)
-```
-
-
-## API Reference
-
-For detailed API documentation, see:
-- {py:class}`bridge.training.comm_overlap.CommOverlapConfig` - Main configuration class
-- {py:class}`bridge.training.comm_overlap.TransformerLayerTPOverlapCfg` - Tensor parallel overlap configuration
-- {py:class}`bridge.training.comm_overlap.BulkOverlapCfg` - Bulk overlap configuration
-- {py:class}`bridge.training.comm_overlap.PipelineOverlapCfg` - Pipeline overlap configuration
-- {py:class}`bridge.training.comm_overlap.RingExchangeOverlapCfg` - Ring exchange overlap configuration
-megatron-core/developer-guide/latest/api-guide/tensor_parallel.html) - Underlying implementation details
+## Recommendation Level
+
+Treat communication overlap as a tuning layer on top of a working distributed
+configuration, not as the first knob to reach for when basic correctness is
+still uncertain.
+
+For most teams, the right order is:
+
+1. establish a correct distributed configuration
+2. choose the necessary parallelism strategy
+3. enable or tune overlap for the specific communication bottleneck
+
+## Related Docs
+
+- [docs/performance-guide.md](../performance-guide.md)
+- [docs/training/hybrid-context-parallel.md](hybrid-context-parallel.md)
+- [skills/perf-techniques/tp-dp-comm-overlap/SKILL.md](../skills/perf-techniques/tp-dp-comm-overlap/SKILL.md)
+- [skills/perf-techniques/expert-parallel-overlap/SKILL.md](../skills/perf-techniques/expert-parallel-overlap/SKILL.md)
+- [skills/perf-techniques/moe-comm-overlap/SKILL.md](../skills/perf-techniques/moe-comm-overlap/SKILL.md)
+- [skills/perf-techniques/moe-comm-overlap/card.yaml](../skills/perf-techniques/moe-comm-overlap/card.yaml)
diff --git a/docs/training/cuda-graphs.md b/docs/training/cuda-graphs.md
new file mode 100644
index 0000000000..0890bae21c
--- /dev/null
+++ b/docs/training/cuda-graphs.md
@@ -0,0 +1,93 @@
+# CUDA Graphs
+
+CUDA graphs capture a sequence of GPU operations once and replay them with
+minimal host overhead, eliminating repeated kernel-launch and driver costs on
+every training step. Megatron Bridge supports two capture implementations and
+fine-grained scope selection to balance performance gain against memory cost.
+
+This page is the stable overview for what CUDA graphs are, when to use them,
+and which constraints are durable. For operational setup, code anchors, and
+verification commands, see [skills/perf-techniques/cuda-graphs/SKILL.md](../skills/perf-techniques/cuda-graphs/SKILL.md).
+
+## What It Is
+
+CUDA graphs work by recording a sequence of GPU operations (kernels, memory
+copies, etc.) into a graph during a capture phase, then replaying that graph
+on subsequent steps. This eliminates per-step host-side overhead such as
+kernel launch latency and driver API calls.
+
+In Bridge, there are two capture implementations:
+
+| `cuda_graph_impl` | Mechanism | Scope support |
+|---|---|---|
+| `"local"` | MCore `CudaGraphManager` / `FullCudaGraphWrapper` | `full_iteration` (whole fwd+bwd) |
+| `"transformer_engine"` | TE `make_graphed_callables()` per layer | `attn`, `mlp`, `moe`, `moe_router`, `moe_preprocess`, `mamba` |
+| `"none"` (default) | Disabled | — |
+
+## When to Use It
+
+CUDA graphs are most effective when:
+
+- **Tensor shapes are static** across training steps (fixed sequence length,
+  fixed micro-batch size). Variable-length sequences break graph replay
+  assumptions.
+- **Host overhead is significant** relative to GPU compute — smaller models
+  or high step rates benefit most.
+- **Memory budget allows it** — graph capture allocates static buffers,
+  typically adding a few GB. Models with `PP > 1` can consume over 10 GB
+  of additional memory.
+
+### Local full-iteration graphs
+
+Captures the entire forward-backward pass as one graph. Provides the highest
+host-overhead reduction but requires disabling NaN checks and has the largest
+memory footprint.
+
+### Transformer Engine scoped graphs
+
+Captures individual layer components (attention, MLP, MoE router, etc.)
+through TE. More flexible, works with MoE models where only dense modules
+can be graphed, and supports selective scope combinations.
+
+## Configuration
+
+```python
+cfg.model.cuda_graph_impl = "transformer_engine"  # or "local"
+cfg.model.cuda_graph_scope = ["attn", "moe_router"]  # scope list
+cfg.model.cuda_graph_warmup_steps = 3  # warmup before capture
+cfg.rng.te_rng_tracker = True  # required
+```
+
+### Key constraints
+
+- `cfg.rng.te_rng_tracker` must be `True` when `cuda_graph_impl != "none"`.
+- `full_iteration` scope requires `cuda_graph_impl = "local"` and
+  `rerun_state_machine.check_for_nan_in_loss = False`.
+- MoE models with token-dropless routing have limited graph support
+  (dense modules only).
+- `cuda_graph_impl = "none"` automatically clears `cuda_graph_scope`.
+
+## MoE Considerations
+
+MoE models often cannot graph the full expert dispatch path due to dynamic
+token routing. Common practice:
+
+- Graph `moe_router` and `moe_preprocess` (static portions).
+- Add `attn` scope for the dense attention blocks.
+- Leave expert dispatch in eager mode.
+
+Do not combine `moe` scope with `moe_router` scope — they are mutually
+exclusive.
+
+## Memory Impact
+
+CUDA graphs allocate static buffers that persist for the duration of training.
+Expect a few GB of additional memory. With `PP > 1`, memory overhead can
+exceed 10 GB due to pipeline-stage buffering. Plan activation memory
+accordingly.
+
+## Related Docs
+
+- [docs/performance-guide.md](../performance-guide.md)
+- [docs/training/communication-overlap.md](communication-overlap.md)
+- [skills/perf-techniques/cuda-graphs/SKILL.md](../skills/perf-techniques/cuda-graphs/SKILL.md)
diff --git a/docs/training/hybrid-context-parallel.md b/docs/training/hybrid-context-parallel.md
new file mode 100644
index 0000000000..04561ae67a
--- /dev/null
+++ b/docs/training/hybrid-context-parallel.md
@@ -0,0 +1,113 @@
+# Hybrid / Hierarchical Context Parallel
+
+This page covers the stable Bridge-facing meaning of hierarchical context
+parallelism, especially the `a2a+p2p` transport path and
+`hierarchical_context_parallel_sizes`.
+
+For operational setup, code anchors, and verification commands, see
+[skills/perf-techniques/hybrid-context-parallel/SKILL.md](../skills/perf-techniques/hybrid-context-parallel/SKILL.md).
+
+## What It Is
+
+Context parallelism (CP) splits the input sequence across GPUs so each rank
+processes a chunk. The GPUs must communicate KV data during attention. There are
+several CP communication backends:
+
+| `cp_comm_type` | Mechanism | Async / Overlap | Constraint |
+|---|---|---|---|
+| `"p2p"` | Ring-exchange of KV chunks | Yes | None |
+| `"all_gather"` | All-gather full KV before attention | No | None |
+| `"a2a"` | All-to-all: scatter heads, gather full sequence (Ulysses-style) | N/A | **CP <= num_kv_heads** |
+| `"a2a+p2p"` | Hierarchical: a2a within inner group, p2p across outer group | Partial (p2p part) | Requires `hierarchical_context_parallel_sizes` |
+
+**HCP (`a2a+p2p`)** exists to scale CP beyond the KV head count by combining
+a2a (fast, head-parallel) on intra-node links with p2p (async,
+sequence-parallel) on inter-node links.
+
+It is important to separate this from the upstream boolean
+`hybrid_context_parallel`, which is a different feature for balancing packed or
+variable-length workloads. The two concepts should not be treated as
+interchangeable.
+
+### Why a2a is limited by KV heads
+
+a2a transposes the parallelism dimension: each rank trades its sequence chunk
+for a subset of attention heads. After the all-to-all, every rank has the
+**full sequence** but only `heads / CP` heads. This means:
+
+- `heads / CP` must be a positive integer.
+- The bottleneck is KV heads (not Q heads), because in GQA the KV heads are the
+  indivisible unit.
+- If the model has 8 KV heads, pure a2a supports at most CP=8.
+
+HCP breaks this limit by applying a2a only within a sub-group small enough to
+fit within the KV head count.
+
+## When to Use It
+
+**Use HCP when ALL of these are true:**
+
+1. You need CP larger than `num_kv_heads / TP` (pure a2a won't fit).
+2. You cannot (or don't want to) increase TP to shrink CP.
+3. Your cluster has a clear bandwidth hierarchy (e.g., NVLink intra-node >> IB
+   inter-node).
+
+**Prefer pure `a2a` when:**
+
+- You can adjust TP so that `CP <= num_kv_heads / TP`. This is simpler, avoids
+  the p2p overhead, and often yields the same throughput with better memory
+  headroom.
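The head-count limit described above can be written as a quick feasibility check. This is a sketch under the constraints stated on this page (`CP <= num_kv_heads / TP`, with a whole number of KV heads per rank); `a2a_cp_fits` is a hypothetical helper, not a Megatron Bridge or Transformer Engine function.

```python
def a2a_cp_fits(num_kv_heads: int, tp: int, cp: int) -> bool:
    """Check whether pure a2a context parallelism can split the KV heads.

    Hypothetical helper: assumes TP shards the KV heads first, then a2a
    splits each TP rank's heads across the CP group.
    """
    if num_kv_heads % tp != 0:
        return False
    heads_per_tp_rank = num_kv_heads // tp
    # Each CP rank must end up with a positive whole number of KV heads.
    return cp <= heads_per_tp_rank and heads_per_tp_rank % cp == 0


fits_cp8 = a2a_cp_fits(num_kv_heads=8, tp=1, cp=8)      # at the limit
fits_cp16 = a2a_cp_fits(num_kv_heads=8, tp=1, cp=16)    # exceeds the KV head count
fits_tp2_cp4 = a2a_cp_fits(num_kv_heads=8, tp=2, cp=4)  # smaller CP after raising TP
```

When the check fails, the alternatives described on this page are a smaller CP, pure `p2p`, or hierarchical `a2a+p2p`.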
+ +**Prefer pure `p2p` when:** + +- You have very few KV heads or want maximum CP flexibility. +- Your workload can hide the p2p latency behind compute (long sequences help). + +### Decision example + +Model: 8 KV heads. Cluster: 4 nodes x 8 GPUs. Goal: train 128K sequences. + +| Option | TP | CP | `cp_comm_type` | Notes | +|---|---|---|---|---| +| A | 1 | 16 | `a2a+p2p` with `[8,2]` | a2a intra-node (8 GPUs), p2p across 2 node-groups | +| B | 2 | 4 | `a2a` | CP=4 <= 8 KV heads / TP=2. Simpler. Often same throughput. | +| C | 1 | 16 | `p2p` | Works but no a2a bandwidth benefit intra-node | + +In practice, **option B is usually preferred** -- benchmarks showed identical +throughput to option A with more memory headroom. + +HCP should be treated as an advanced feature rather than a default recommendation. + +## Stable Bridge Limitation + +The most important Bridge-specific limitation is that hierarchical context +parallelism is currently supported only on the MPU initialization path. + +In practice, that means: + +- `dist.use_decentralized_pg=False` is the supported Bridge path +- the decentralized process-group path should not be assumed to materialize HCP + groups + +## Stable Constraints + +The durable constraints are: + +- the product of `hierarchical_context_parallel_sizes` must equal + `context_parallel_size` +- the usual CP sequence-length divisibility rules still apply +- Transformer Engine version support matters for `a2a+p2p` + +## Recommendation Level + +Use hierarchical context parallelism in Bridge only when you intentionally want +that transport path and are prepared to validate execution-path details. It is +not yet the kind of feature that should be presented as universally safe across +all Bridge initialization modes.
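
The size constraint on `hierarchical_context_parallel_sizes` is easy to check before launching a job. The following is a minimal sketch, not Bridge's own validation: `hcp_size_problems` is a hypothetical name, and the ordering (first entry is the inner a2a group, remaining entries are p2p levels) is an assumption read off the `[8,2]` decision example above.

```python
import math


def hcp_size_problems(context_parallel_size, hierarchical_sizes, num_kv_heads, tp=1):
    """Return a list of human-readable problems; an empty list means the layout looks sane.

    Hypothetical helper. Assumes hierarchical_sizes[0] is the inner a2a group
    and that KV heads are split evenly across TP ranks.
    """
    if not hierarchical_sizes:
        return ["hierarchical_context_parallel_sizes is empty"]

    problems = []
    product = math.prod(hierarchical_sizes)
    if product != context_parallel_size:
        problems.append(
            f"product of {list(hierarchical_sizes)} is {product}, "
            f"expected context_parallel_size={context_parallel_size}"
        )

    # The a2a part still obeys the KV-head limit, but only within the inner group.
    inner = hierarchical_sizes[0]
    heads_per_tp_rank = num_kv_heads // tp
    if inner > heads_per_tp_rank or heads_per_tp_rank % inner != 0:
        problems.append(
            f"inner a2a group of {inner} does not evenly divide "
            f"{heads_per_tp_rank} KV heads per TP rank"
        )
    return problems


if __name__ == "__main__":
    print(hcp_size_problems(16, [8, 2], num_kv_heads=8))  # option A from the table
    print(hcp_size_problems(16, [4, 2], num_kv_heads=8))  # product mismatch
```

The check covers only the arithmetic; it says nothing about whether the installed Transformer Engine version supports `a2a+p2p` or whether the groups are materialized on a given initialization path.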
+ +## Related Docs + +- [docs/performance-guide.md](../performance-guide.md) +- [docs/training/communication-overlap.md](communication-overlap.md) +- [skills/perf-techniques/hybrid-context-parallel/SKILL.md](../skills/perf-techniques/hybrid-context-parallel/SKILL.md) +- [skills/perf-techniques/hybrid-context-parallel/card.yaml](../skills/perf-techniques/hybrid-context-parallel/card.yaml) diff --git a/docs/training/megatron-fsdp.md b/docs/training/megatron-fsdp.md index 4c13ccbba1..58d6a79d68 100644 --- a/docs/training/megatron-fsdp.md +++ b/docs/training/megatron-fsdp.md @@ -1,277 +1,116 @@ # Megatron FSDP -Megatron Fully Sharded Data Parallel (FSDP) is a memory-optimized data parallelism strategy that shards model parameters, gradients, and optimizer states across GPUs. This approach provides significant memory savings compared to standard Distributed Data Parallel (DDP), enabling training of larger models or using larger batch sizes on the same hardware. +Megatron FSDP is the practical fully sharded data parallel path in Megatron +Bridge today. It shards parameters, gradients, and optimizer state across data +parallel ranks, which can reduce model-state memory substantially compared with +plain Distributed Data Parallel (DDP) or the distributed optimizer path. -## Overview +This page is the stable overview for what Megatron FSDP is, when to use it, and +what constraints matter. For operational enablement, code anchors, and +verification commands, see [skills/perf-techniques/megatron-fsdp/SKILL.md](../skills/perf-techniques/megatron-fsdp/SKILL.md). 
-FSDP reduces memory consumption by: -- **Sharding model parameters** across data parallel ranks instead of replicating them -- **Sharding optimizer states** across data parallel ranks -- **Sharding gradients** across data parallel ranks -- **Gathering parameters on-demand** during forward and backward passes +## What It Is -This strategy is particularly effective for large models where parameter and optimizer state memory dominates GPU memory usage. +Megatron FSDP is the Megatron-Core custom FSDP implementation exposed in Bridge +through `use_megatron_fsdp`. -## Comparison with Other Strategies +Compared with other data-parallel strategies: | Feature | DDP | Distributed Optimizer | Megatron FSDP | -|---------|-----|----------------------|---------------| -| **Parameter Storage** | Replicated | Replicated | Sharded | -| **Optimizer States** | Replicated | Sharded | Sharded | -| **Gradient Communication** | All-reduce | Reduce-scatter | Reduce-scatter | -| **Parameter Communication** | None | All-gather (after update) | All-gather (on-demand) | -| **Memory Efficiency** | Baseline | High | Highest | -| **Communication Overhead** | Low | Medium | Medium-High | - -**When to use each strategy:** -- **DDP**: Default choice for most training scenarios with sufficient memory -- **Distributed Optimizer**: Good balance of memory savings and performance -- **Megatron FSDP**: Maximum memory savings when training very large models - -## Configuration - -### Enable Megatron FSDP - -To enable Megatron FSDP, set `use_megatron_fsdp=True` in both the {py:class}`bridge.training.config.DistributedInitConfig` and {py:class}`bridge.training.config.DistributedDataParallelConfig`: - -```python -from megatron.bridge.training.config import ( - ConfigContainer, - DistributedInitConfig, - DistributedDataParallelConfig, - CheckpointConfig, -) - -# Enable Megatron FSDP -dist_config = DistributedInitConfig( - use_megatron_fsdp=True, -) - -ddp_config = DistributedDataParallelConfig( - 
use_megatron_fsdp=True, -) - -# Required checkpoint format -checkpoint_config = CheckpointConfig( - ckpt_format="fsdp_dtensor", - save="/path/to/checkpoints", -) - -config = ConfigContainer( - dist=dist_config, - ddp=ddp_config, - checkpoint=checkpoint_config, - # ... other config parameters -) -``` - -### Checkpoint Format Requirement - -**Important**: Megatron FSDP requires the `fsdp_dtensor` checkpoint format. This format is specifically designed to handle the sharded parameter layout used by FSDP. - -```python -checkpoint_config = CheckpointConfig( - ckpt_format="fsdp_dtensor", # Required for Megatron FSDP - save="/path/to/checkpoints", - load="/path/to/checkpoints", # Optional: resume from checkpoint -) -``` - -Attempting to use other checkpoint formats (`torch_dist`, `zarr`) with Megatron FSDP will result in a configuration error. - -## Compatibility and Limitations - -### Compatible With -- **Tensor Parallelism**: Can be combined with TP for additional memory savings -- **Pipeline Parallelism**: Can be combined with PP for multi-stage model training -- **Context Parallelism**: Can be combined with CP for long sequence training -- **Expert Parallelism**: Can be combined with EP for MoE models -- **Mixed Precision**: Supports BF16 and FP16 training -- **Distributed Checkpointing**: Uses `fsdp_dtensor` format - -### Not Compatible With -- **`use_tp_pp_dp_mapping`**: This alternative rank initialization order conflicts with FSDP's sharding strategy -- **`reuse_grad_buf_for_mxfp8_param_ag`**: Gradient buffer reuse optimizations are disabled with FSDP -- **Legacy checkpoint formats**: Must use `fsdp_dtensor` format - -### Automatic Configuration Adjustments - -When Megatron FSDP is enabled, the following settings are automatically adjusted: -- `ddp.average_in_collective` is set to `False` (FSDP handles gradient synchronization differently) -- `model.gradient_accumulation_fusion` is set to `False` (not compatible with FSDP) -- 
`ddp.reuse_grad_buf_for_mxfp8_param_ag` is set to `False` (disabled for FSDP) -- `optimizer.reuse_grad_buf_for_mxfp8_param_ag` is set to `False` (disabled for FSDP) - -## Complete Configuration Example - -Here's a complete example showing how to configure training with Megatron FSDP: - -```python -from megatron.bridge.models import GPTModelProvider -from megatron.bridge.training.config import ( - ConfigContainer, - DistributedInitConfig, - DistributedDataParallelConfig, - OptimizerConfig, - SchedulerConfig, - TrainingConfig, - CheckpointConfig, -) - -# Model configuration with tensor parallelism -model_config = GPTModelProvider( - num_layers=32, - hidden_size=4096, - num_attention_heads=32, - seq_length=2048, - tensor_model_parallel_size=2, # Optional: combine with TP - # ... other model parameters -) - -# Enable Megatron FSDP -dist_config = DistributedInitConfig( - use_megatron_fsdp=True, -) - -ddp_config = DistributedDataParallelConfig( - use_megatron_fsdp=True, -) - -# Optimizer configuration -optimizer_config = OptimizerConfig( - optimizer="adam", - lr=3e-4, - weight_decay=0.1, - adam_beta1=0.9, - adam_beta2=0.95, - clip_grad=1.0, -) - -# Scheduler configuration -scheduler_config = SchedulerConfig( - lr_decay_style="cosine", - lr_warmup_iters=1000, -) - -# Training configuration -train_config = TrainingConfig( - micro_batch_size=2, - global_batch_size=32, - train_iters=10000, -) - -# Checkpoint configuration with required format -checkpoint_config = CheckpointConfig( - save="/path/to/checkpoints", - save_interval=1000, - ckpt_format="fsdp_dtensor", # Required for FSDP -) - -# Create complete configuration -config = ConfigContainer( - model=model_config, - dist=dist_config, - ddp=ddp_config, - optimizer=optimizer_config, - scheduler=scheduler_config, - train=train_config, - checkpoint=checkpoint_config, - # ... other config parameters -) -``` - -## Migration from DDP - -To migrate from standard DDP to Megatron FSDP: - -1. 
**Enable FSDP** in both `dist` and `ddp` configurations: - ```python - dist_config.use_megatron_fsdp = True - ddp_config.use_megatron_fsdp = True - ``` - -2. **Change checkpoint format** to `fsdp_dtensor`: - ```python - checkpoint_config.ckpt_format = "fsdp_dtensor" - ``` - -3. **Remove incompatible settings**: - - Remove `use_tp_pp_dp_mapping=True` if set - - Remove `reuse_grad_buf_for_mxfp8_param_ag=True` if set - -4. **Start training** - automatic configuration adjustments will be applied - -**Note**: Checkpoints saved with DDP cannot be directly loaded with FSDP due to different parameter layouts. You'll need to restart training or convert checkpoints using the appropriate conversion tools. - -## Torch FSDP2 Alternative - -Megatron Bridge also supports an alternative FSDP implementation through PyTorch's FSDP2: - -```python -dist_config = DistributedInitConfig( - use_torch_fsdp2=True, # Use PyTorch FSDP2 instead -) -``` - -**Important**: `use_megatron_fsdp` and `use_torch_fsdp2` are mutually exclusive - you can only enable one at a time. - -**Limitations of Torch FSDP2**: -- Not currently compatible with Pipeline Parallelism -- Still in experimental stage with potential bugs -- Does not require `fsdp_dtensor` checkpoint format - -## Performance Considerations - -### Memory Savings -- **Parameters**: Reduced by factor of data parallel size -- **Optimizer States**: Reduced by factor of data parallel size -- **Gradients**: Reduced by factor of data parallel size -- **Activations**: Not affected by FSDP (use activation checkpointing separately) - -### Communication Overhead -- **Parameter All-Gather**: Additional communication during forward and backward passes -- **Gradient Reduce-Scatter**: Similar to distributed optimizer -- **Network Sensitivity**: Performance depends on inter-GPU bandwidth - -### Optimization Tips -1. **Combine with Tensor Parallelism**: Reduces memory further and improves compute efficiency -2. 
**Use Larger Batch Sizes**: Take advantage of freed memory for better throughput -3. **High-Bandwidth Interconnects**: FSDP benefits from fast inter-GPU communication (NVLink, InfiniBand) -4. **Enable Mixed Precision**: Reduces communication volume and memory footprint - -## Troubleshooting - -### Configuration Errors - -**Error: "use_tp_pp_dp_mapping is not supported with Megatron FSDP"** -- Remove `use_tp_pp_dp_mapping=True` from your configuration -- FSDP requires standard rank initialization order - -**Error: "Megatron FSDP only supports fsdp_dtensor checkpoint format"** -- Set `checkpoint.ckpt_format="fsdp_dtensor"` in your configuration -- Other formats are not compatible with FSDP's sharded layout - -**Error: "Using use_megatron_fsdp and use_torch_fsdp2 at the same time is not supported"** -- Choose one FSDP implementation: either Megatron FSDP or Torch FSDP2 -- Do not enable both simultaneously - -### Performance Issues - -**Slow Training with FSDP** -- Check inter-GPU bandwidth (use `nvidia-smi topo -m`) -- Ensure NVLink or high-speed interconnects are available -- Consider using larger micro-batch sizes to amortize communication -- Profile with `nsys` or PyTorch profiler to identify bottlenecks - -**Out of Memory with FSDP Enabled** -- Verify FSDP is correctly enabled in both `dist` and `ddp` configs -- Check that `fsdp_dtensor` checkpoint format is being used -- Reduce micro-batch size or model size -- Enable activation checkpointing for additional memory savings +|---|---|---|---| +| Parameter Storage | Replicated | Replicated | Sharded | +| Optimizer States | Replicated | Sharded | Sharded | +| Gradient Communication | All-reduce | Reduce-scatter | Reduce-scatter | +| Parameter Communication | None | All-gather (after update) | All-gather (on-demand) | +| Memory Efficiency | Baseline | High | Highest | +| Communication Overhead | Low | Medium | Medium-High | -## Resources +The practical consequence is that Megatron FSDP is most useful when model-state 
+memory, rather than activation memory, is the main bottleneck. -- {doc}`checkpointing` - Checkpoint saving and loading with FSDP -- {doc}`../parallelisms` - Understanding data and model parallelism strategies -- {doc}`config-container-overview` - Complete configuration reference -- [Megatron Core Developer Guide](https://docs.nvidia.com/megatron-core/developer-guide/latest/) +## When to Use It + +Megatron FSDP is a good fit when all of the following are true: + +- the model is too large for plain DDP or distributed optimizer +- you want the strongest currently supported FSDP path in Bridge +- you are willing to trade more communication for lower memory +- you can adopt the required FSDP checkpoint format + +Prefer another path when: + +- DDP already fits comfortably and simplicity matters most +- distributed optimizer gives enough memory relief without fully sharding +- you are evaluating PyTorch FSDP2 for production use on this branch + +## Stable Requirements + +Megatron FSDP in Bridge requires: + +- `use_megatron_fsdp` to be enabled +- checkpoint format `fsdp_dtensor` +- standard rank initialization order + +The `fsdp_dtensor` format uses PyTorch DTensor and +`torch.distributed.checkpoint` (DCP) to store sharded parameters and optimizer +state. It is **not interchangeable** with `torch_dist` or `zarr` checkpoints — +you cannot load an `fsdp_dtensor` checkpoint into a non-FSDP run or vice versa. + +`fsdp_dtensor` is compatible with 5D parallelism (TP + PP + DP + CP + EP). +Because DCP stores DTensor placement metadata, checkpoints saved under one +parallelism layout can be loaded under a different layout (e.g., change TP or PP +size between runs) — DCP handles the shard remapping automatically. The one +unsupported combination is `use_tp_pp_dp_mapping=True`, which uses an +alternative rank-initialization order that conflicts with FSDP sharding. 
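
The requirements described in this section can be summarized as a small pre-flight check. This is a sketch under stated assumptions: it works on a flat dictionary with hypothetical keys rather than on Bridge's real config objects, and `megatron_fsdp_problems` is not a Bridge function. Only the setting names and the `fsdp_dtensor` value come from this page.

```python
def megatron_fsdp_problems(cfg: dict) -> list:
    """Return a list of problems with a Megatron FSDP setup; empty means none found.

    Hypothetical helper over a flat dict; the real Bridge config nests these
    settings in separate config objects.
    """
    problems = []
    if not cfg.get("use_megatron_fsdp", False):
        return problems  # Megatron FSDP is off, so none of its rules apply.

    if cfg.get("use_torch_fsdp2", False):
        problems.append("use_megatron_fsdp and use_torch_fsdp2 are mutually exclusive")
    if cfg.get("ckpt_format") != "fsdp_dtensor":
        problems.append("ckpt_format must be 'fsdp_dtensor' for Megatron FSDP")
    if cfg.get("use_tp_pp_dp_mapping", False):
        problems.append("use_tp_pp_dp_mapping is not supported with Megatron FSDP")
    return problems


if __name__ == "__main__":
    good = {"use_megatron_fsdp": True, "ckpt_format": "fsdp_dtensor"}
    bad = {"use_megatron_fsdp": True, "ckpt_format": "torch_dist", "use_tp_pp_dp_mapping": True}
    print(megatron_fsdp_problems(good))
    print(megatron_fsdp_problems(bad))
```

A check like this is most useful for recipes that expose `use_megatron_fsdp` but leave the checkpoint format at a non-FSDP default.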
+ +Important stable constraints: + +- `use_megatron_fsdp` and `use_torch_fsdp2` are mutually exclusive +- `use_tp_pp_dp_mapping` is not supported with Megatron FSDP +- legacy checkpoint formats such as `torch_dist` and `zarr` are not valid for + Megatron FSDP save/load + +When Megatron FSDP is enabled, Bridge also adjusts some settings +automatically, including disabling `average_in_collective` and several +buffer-reuse optimizations that do not match the FSDP path. + +## Compatibility and Caveats + +At the configuration level, Megatron FSDP is intended to work with: + +- tensor parallelism +- pipeline parallelism +- context parallelism +- expert parallelism +- BF16 or FP16 mixed precision + +However, not every combination has the same level of in-repo validation or +performance evidence. Treat broad compatibility as code-supported first, not as +fully benchmark-proven for every combination. + +Two practical caveats matter most: + +1. Public recipes may expose `use_megatron_fsdp` while still defaulting to a + non-FSDP checkpoint format. The checkpoint requirement is stable and + mandatory even when recipe ergonomics lag behind. +2. FSDP reduces model-state memory, not activation memory. For long-sequence or + activation-bound workloads, other techniques such as context parallelism, + activation recomputation, or CPU offloading may still be needed. + +## Torch FSDP2 Status + +Megatron Bridge also exposes a PyTorch FSDP2 path via `use_torch_fsdp2`, but +that path should still be treated as experimental on this branch. 
+ +The stable recommendation today is: + +- use Megatron FSDP if you need an FSDP path in Bridge +- do not treat FSDP2 as interchangeable with Megatron FSDP + +## Related Docs + +- [docs/training/checkpointing.md](checkpointing.md) +- [docs/training/cpu-offloading.md](cpu-offloading.md) +- [docs/performance-guide.md](../performance-guide.md) +- [skills/perf-techniques/megatron-fsdp/SKILL.md](../skills/perf-techniques/megatron-fsdp/SKILL.md) +- [skills/perf-techniques/megatron-fsdp/card.yaml](../skills/perf-techniques/megatron-fsdp/card.yaml) diff --git a/docs/training/packed-sequences.md b/docs/training/packed-sequences.md index ab741881c9..1d57247b1d 100644 --- a/docs/training/packed-sequences.md +++ b/docs/training/packed-sequences.md @@ -1,184 +1,96 @@ # Packed Sequences -This guide explains how to use packed sequences in Megatron Bridge for efficient supervised fine-tuning (SFT) and parameter-efficient fine-tuning (PEFT). +Packed sequences are a fine-tuning technique that reduces padding waste by +concatenating multiple examples into one pack while preserving sequence +boundaries for attention. In Megatron Bridge, this is primarily a supervised +fine-tuning and PEFT optimization rather than a general pretraining feature. -## Overview +This page is the stable overview for what packed sequences are, when to use +them, and which constraints are durable. For operational setup, code anchors, +and verification commands, see [skills/perf-techniques/sequence-packing/SKILL.md](../skills/perf-techniques/sequence-packing/SKILL.md). -When fine-tuning large language models, GPU under-utilization often occurs due to inefficient input data structure. This inefficiency arises because many fine-tuning datasets have a skewed distribution of sequence lengths, with many short sequences and a few long ones, following [Zipf's Law](https://en.wikipedia.org/wiki/Zipf%27s_law). 
Since transformer models require fixed-length inputs, shorter sequences must be padded with many padding tokens. +## What It Is -This leads to two main inefficiencies: - -- Computation performed on the pad tokens is eventually masked out, resulting in wasted GPU computation. -- Micro batch size is often limited by the batch which contains longer sequences, so that most other micro batches have under-utilized GPU memory. - -Packed sequences is a training technique where multiple training sequences (examples) are concatenated into one long sequence (pack). This technique greatly reduces the number of padding tokens, allowing more meaningful tokens to be processed in each micro batch. As a result, it maximizes both GPU compute and GPU memory utilization. - -**Note:** Sequence packing is primarily beneficial for fine-tuning workloads. Megatron-style pretraining datasets (using `IndexedDataset` and `GPTDataset`) already concatenate documents during sampling to fill sequences to the target length, eliminating padding tokens without requiring the boundary-aware packing infrastructure described here. For supervised fine-tuning, however, naive concatenation is insufficient—each training example must be treated individually to preserve data quality. - -The conventional solution is to build a custom attention mask (specifically, a block triangular mask) to mask out attention values between sequences. However, this increases the complexity of attention from $\sum_i {s_i}^2$ to $\Big({\sum_i {s_i}}\Big)^2$, where $s_i$ is the length of the $i$th subsequence. In practice, the conventional solution puts a limit on the packed sequence size. - -Instead, Megatron Bridge provides a highly optimized version of sequence packing which makes use of variable-length attention kernels in FlashAttention and TransformerEngine. Instead of providing a custom attention mask, information about sequence boundaries is passed in with the `cu_seqlens` variable (short for cumulative sequence length). 
With this approach, attention values between sequences are never calculated, so the complexity of attention remains at $\sum_i {s_i}^2$. This allows the packed sequence size to increase to arbitrary lengths without affecting the memory complexity, so that GPU memory can be fully utilized. - -The packed sequence implementation automatically creates {py:class}`bridge.data.datasets.sft.GPTSFTPackedDataset` instances when `.npy` files are detected, providing optimized data loading and batching for packed sequences. - -## Using Packed Sequences - -### Prepare the Dataset - -In Megatron Bridge, the packed dataset is automatically prepared before training using the {py:func}`bridge.data.datasets.packed_sequence.prepare_packed_sequence_data` function, eliminating the need for any additional preprocessing steps. - -### Configure Packed Sequences - -Packed sequences are configured through the {py:class}`bridge.training.config.FinetuningDatasetConfig` by specifying `packed_sequence_specs`: - -```python -from megatron.bridge.training.config import ConfigContainer, FinetuningDatasetConfig -from megatron.bridge.data.datasets.packed_sequence import PackedSequenceSpecs - -config = ConfigContainer( - # ... other configurations - dataset=FinetuningDatasetConfig( - dataset_root="/path/to/your/dataset", - seq_length=2048, - packed_sequence_specs=PackedSequenceSpecs( - packed_sequence_size=2048, - tokenizer_model_name="your_tokenizer_name", - ), - ), - # ... other configurations -) -``` - -### PackedSequenceSpecs Configuration - -The {py:class}`bridge.data.datasets.packed_sequence.PackedSequenceSpecs` class provides the following configuration options: - -| Parameter | Type | Default | Description | -|-----------|------|---------|-------------| -| `packed_sequence_size` | `int` | `-1` | If positive, enables sequence packing with the specified pack size. If ≤ 0, sequence packing is disabled. 
| -| `tokenizer_model_name` | `str` | `None` | Tokenizer model name for tracking, since different tokenizers produce different packed datasets. | -| `packed_train_data_path` | `str` | `None` | Custom path for packed training dataset file (`.npy` format). | -| `packed_val_data_path` | `str` | `None` | Custom path for packed validation dataset file (`.npy` format). | -| `packed_metadata_path` | `str` | `None` | Custom path for packing metadata file (`.jsonl` format). | -| `pad_seq_to_mult` | `int \| None` | `None` | Pad each sample to a multiple of this value when generating packed datasets (e.g., set to `2 * context_parallel_size` for THD CP). | -| `pad_cu_seqlens` | `bool` | `False` | Whether to pad `cu_seqlens` to constant size, required for CUDA graphs. | - -### Batch Size Considerations - -When using packed sequences, you must adjust your batch sizes: - -1. **Micro batch size must be set to 1**: This constraint arises because samples in a micro batch are no longer stacked; they are now concatenated during the data preparation step. Consequently, micro batch size becomes irrelevant when using packed sequences. - -2. **Global batch size must be adjusted**: Since each pack now contains multiple sequences, the global batch size needs to be reduced by the average number of sequences per pack `n` where `n = num_sequences_in_dataset / num_packs` (equivalently, `n = packed_sequence_size / average_seq_len`). This ensures that each gradient iteration sees, on average, the same number of tokens. The value of `n` is printed out during the data preparation step. You may need to run training once, obtain the value of `n` from the logs, then run your training script again with the updated global batch size. 
- -### Full Configuration Example - -```python -from megatron.bridge.training.config import ( - ConfigContainer, TrainingConfig, CheckpointConfig, SchedulerConfig -) -from megatron.bridge.training.config import FinetuningDatasetConfig -from megatron.bridge.data.datasets.packed_sequence import PackedSequenceSpecs -from megatron.bridge.peft.lora import LoRA -from megatron.core.optimizer import OptimizerConfig - -config = ConfigContainer( - model=model_provider, - train=TrainingConfig( - train_iters=1000, - global_batch_size=32, # Reduced from original due to packing - micro_batch_size=1, # Required for packed sequences - eval_interval=100, - ), - optimizer=OptimizerConfig( - optimizer="adam", - lr=1e-4, - weight_decay=0.01, - bf16=True, - use_distributed_optimizer=True, - ), - scheduler=SchedulerConfig( - lr_decay_style="cosine", - lr_warmup_iters=100, - lr_decay_iters=1000, - ), - dataset=FinetuningDatasetConfig( - dataset_root="/path/to/dataset", - seq_length=2048, - packed_sequence_specs=PackedSequenceSpecs( - packed_sequence_size=2048, - tokenizer_model_name="llama2_tokenizer", - ), - ), - checkpoint=CheckpointConfig( - pretrained_checkpoint="/path/to/pretrained/model", - save="/path/to/checkpoints", - save_interval=200, - ), - peft=LoRA( - target_modules=["linear_qkv", "linear_proj", "linear_fc1", "linear_fc2"], - dim=16, - alpha=32, - dropout=0.1, - ), - # ... 
other configurations -) -``` - -## File Organization - -When using packed sequences, the {py:class}`bridge.data.builders.finetuning_dataset.FinetuningDatasetBuilder` automatically organizes files in your dataset directory: - -``` -dataset_root/ -├── training.jsonl # Original training data -├── validation.jsonl # Original validation data -└── packed/ - └── {tokenizer_name}/ - ├── training_{packed_size}.npy # Packed training data - ├── validation_{packed_size}.npy # Packed validation data - └── {packed_size}_metadata.jsonl # Packing metadata -``` - -The tokenizer name and packed sequence size are automatically incorporated into the file paths to avoid conflicts when using different configurations. - -## Advanced Configuration - -### Custom File Paths - -You can specify custom paths for packed data files: - -```python -packed_sequence_specs = PackedSequenceSpecs( - packed_sequence_size=4096, - tokenizer_model_name="custom_tokenizer", - packed_train_data_path="/custom/path/training_packed.npy", - packed_val_data_path="/custom/path/validation_packed.npy", - packed_metadata_path="/custom/path/metadata.jsonl", -) -``` - -### CUDA Graphs Support - -For CUDA graphs compatibility, enable `pad_cu_seqlens`: - -```python -packed_sequence_specs = PackedSequenceSpecs( - packed_sequence_size=2048, - pad_cu_seqlens=True, # Required for CUDA graphs - tokenizer_model_name="your_tokenizer", -) -``` - -When `pad_cu_seqlens=True`, you must also set `pad_to_max_length=True` in your dataset configuration. 
- -## API Reference - -For detailed API documentation, see: - -- {py:class}`bridge.training.config.FinetuningDatasetConfig` - Main dataset configuration class -- {py:class}`bridge.data.datasets.packed_sequence.PackedSequenceSpecs` - Packed sequence configuration -- {py:func}`bridge.data.datasets.packed_sequence.prepare_packed_sequence_data` - Data preparation function -- {py:class}`bridge.data.datasets.sft.GPTSFTPackedDataset` - Packed sequence dataset implementation -- {py:class}`bridge.data.builders.finetuning_dataset.FinetuningDatasetBuilder` - Dataset builder with packing support -- {py:func}`bridge.training.gpt_step.get_packed_seq_params` - Packed sequence parameter extraction for training +Fine-tuning datasets often contain examples with highly variable lengths. When +those examples are batched conventionally, many tokens in each batch are just +padding. Packed sequences reduce that waste by building longer packs from +multiple examples and carrying boundary metadata into the attention path. + +In Bridge today, there are two distinct packing paths plus long-context +enablement through context parallelism: + +| Path | Use case | Key config | +|---|---|---| +| Offline packed SFT | Text-only finetuning | `packed_sequence_specs` | +| VLM in-batch packing | VLM finetuning | `pack_sequences_in_batch=True` | +| Long-context (CP) | Pretrain / finetune at 16K-128K+ | `context_parallel_size > 1` | + +These are related but they are not the same knob. Offline packed SFT and VLM +in-batch packing solve padding waste; long-context training primarily addresses +activation memory and communication tradeoffs at larger sequence lengths. 
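
To make "boundary metadata" concrete, the sketch below packs example lengths with a simple greedy first-fit rule and derives cumulative sequence lengths (`cu_seqlens`) for each pack. It is illustrative only: first-fit is an assumed strategy chosen for brevity, not necessarily the algorithm Bridge uses, and both function names are hypothetical.

```python
from itertools import accumulate


def pack_first_fit(lengths, pack_size):
    """Greedy first-fit packing: put each example into the first pack with room.

    Returns a list of packs, each a list of example lengths.
    """
    packs = []
    for n in lengths:
        if n > pack_size:
            raise ValueError(f"example of length {n} exceeds pack_size={pack_size}")
        for pack in packs:
            if sum(pack) + n <= pack_size:
                pack.append(n)
                break
        else:
            # No existing pack had room, so start a new one.
            packs.append([n])
    return packs


def cu_seqlens(pack):
    """Cumulative boundaries for one pack: example i spans [cu[i], cu[i + 1])."""
    return [0, *accumulate(pack)]


if __name__ == "__main__":
    packs = pack_first_fit([5, 3, 4, 2, 6], pack_size=8)
    for pack in packs:
        print(pack, cu_seqlens(pack))
```

Attention kernels that accept variable-length inputs use these boundaries to keep tokens from attending across examples, which is why packing can cut padding without mixing examples together.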
+ +## When to Use It + +Packed sequences are a good fit when all of the following are true: + +- you are doing SFT, PEFT, or VLM finetuning (both packing paths are + supported; see the path table above) +- your examples have variable lengths and padding waste is significant +- you can tolerate the micro-batch constraints of packed training + +Packed sequences are usually not the right answer when: + +- you are doing standard Megatron-style pretraining, which already concatenates + documents during sampling +- you want long-context training in general, where context parallelism is often + the main technique +- your model family or recipe explicitly opts out of packed-sequence support + +## Stable Constraints + +The durable constraints for packed sequences in Bridge are: + +- packed SFT requires `micro_batch_size == 1` +- when context parallelism is used, sequence length must satisfy the standard + CP divisibility constraints +- for fine-tuning with CP enabled, per-token loss behavior and reduction + settings matter +- CUDA-graph-friendly packed metadata requires additional padding constraints + +Model-family support is not universal. Some families and recipe paths explicitly +opt out of packed sequences or related packing modes. + +## Relationship to Long-Sequence Training + +Packed sequences and long-sequence training are often mentioned together because +both affect sequence layout and memory behavior, but they solve different +problems: + +- packed sequences mainly reduce padding waste in fine-tuning datasets +- long-sequence training mainly addresses activation memory and communication + tradeoffs at larger sequence lengths + +For long-sequence training guidance, see: + +- `docs/performance-guide.md` +- `docs/training/hybrid-context-parallel.md` + +## Practical Caveats + +The most stable caveats to remember are: + +1. Packed-sequence support is recipe- and model-family-specific. +2.
Fine-tuning sequence packing should not be assumed to work with every other + training feature. +3. Packed sequences improve efficiency primarily by reducing padding waste, not + by replacing long-context parallelism or memory-planning techniques. + +## Related Docs + +- [docs/training/multi-token-prediction.md](multi-token-prediction.md) +- [docs/performance-guide.md](../performance-guide.md) +- [docs/training/hybrid-context-parallel.md](hybrid-context-parallel.md) +- [skills/perf-techniques/sequence-packing/SKILL.md](../skills/perf-techniques/sequence-packing/SKILL.md) +- [skills/perf-techniques/sequence-packing/card.yaml](../skills/perf-techniques/sequence-packing/card.yaml) +- [skills/perf-techniques/packed-sequences-long-context/SKILL.md](../skills/perf-techniques/packed-sequences-long-context/SKILL.md) +- [skills/perf-techniques/packed-sequences-long-context/card.yaml](../skills/perf-techniques/packed-sequences-long-context/card.yaml) diff --git a/docs/training/resiliency.md b/docs/training/resiliency.md index 1796532ecc..0b67c10965 100644 --- a/docs/training/resiliency.md +++ b/docs/training/resiliency.md @@ -1,779 +1,192 @@ # Resiliency -Megatron Bridge incorporates resilient training features from the [NVIDIA Resiliency Extension](https://github.com/NVIDIA/nvidia-resiliency-ext). This extension provides fault-tolerant capabilities that help minimize downtime due to failures and interruptions during training. +Megatron Bridge incorporates resilient training features from the +[NVIDIA Resiliency Extension](https://github.com/NVIDIA/nvidia-resiliency-ext). +This extension provides fault-tolerant capabilities that help minimize downtime +due to failures and interruptions during training. -## Fault Tolerance: In Job Restart +This page is the stable overview for what each resiliency feature is, when to +use it, and which constraints are durable. 
For operational setup, config knobs, +parameter tables, code anchors, and verification commands, see [skills/resiliency/SKILL.md](../skills/resiliency/SKILL.md). -The fault tolerance feature can detect hangs during training and automatically restart a workload due to a hang or error. This is particularly useful when training on unreliable hardware, at very large scale, or when transient faults are common. +## What It Is -### Key Features +| Feature | Purpose | Maturity | Cluster | +|---|---|---|---| +| Fault tolerance | Hang detection + automatic job restart | Production | Slurm only | +| NVRx straggler detection | Identify slow GPUs | Production | Any | +| Preemption | Graceful shutdown before time limit | Production | Slurm only | +| Async checkpoint save | Non-blocking checkpoint writes | Production | Any | +| Local checkpointing | Fast local save with replication | Production | Any | +| Re-run state machine | NaN / spiky loss attribution | Experimental | Any | +| In-process restart | Restart within the same process | Experimental | Any | -- **Hang Detection**: Monitors training progress and detects when ranks become unresponsive. -- **Automatic Restart**: Automatically restarts training from the last checkpoint when faults are detected. -- **Section-based Monitoring**: Uses different timeout thresholds for setup, training steps, and checkpointing operations. -- **Timeout Calculation**: Can automatically calculate optimal timeouts based on observed training behavior. -- **Multi-level Restart Logic**: Supports both in-job restarts and new job launches on failure. +## Fault Tolerance -### Prerequisites +The fault tolerance feature detects hangs during training and automatically +restarts the workload. It uses section-based monitoring with different timeout +thresholds for setup, training steps, and checkpointing operations. -> **Warning**: This feature is currently only supported on Slurm-based clusters. 
+### When to Use It -Before using fault tolerance features, ensure the following: +Fault tolerance is a good fit when: -1. **Slurm Environment**: The system must be running on a Slurm-based cluster. -2. **Checkpoint Configuration**: A valid directory for saving checkpoints must be properly configured. +- training on unreliable hardware or at very large scale +- transient faults (network glitches, GPU errors) are common +- you want automatic recovery without manual intervention -### Usage Options +### Stable Constraints -Megatron Bridge provides two ways to enable fault tolerance: +- Requires Slurm and `ft_launcher` (not `torchrun`) +- Checkpoint directory must be configured and accessible +- Uses `nvidia-resiliency-ext` RankMonitorClient +- Not compatible with NSys profiling -#### Option 1: NeMo Run Plugin - -If you're using NeMo Run, the {py:class}`bridge.recipes.run_plugins.FaultTolerancePlugin` provides the simplest integration: - -```python -from megatron.bridge.recipes.run_plugins import FaultTolerancePlugin -import nemo_run as run - -# Configure your task -task = run.Script(...) 
- -# Add fault tolerance plugin -run_plugins = [ - FaultTolerancePlugin( - enable_ft_package=True, - calc_ft_timeouts=True, - num_in_job_restarts=3, - num_job_retries_on_failure=2, - initial_rank_heartbeat_timeout=1800, - rank_heartbeat_timeout=300, - ) -] - -# Run with fault tolerance -run.run(task, plugins=run_plugins, executor=executor) -``` - -#### Option 2: Direct Configuration - -If you’re a user who wants more direct control, you can configure fault tolerance manually: - -```python -from megatron.bridge.training.config import FaultToleranceConfig - -# Configure fault tolerance in your config -config.ft = FaultToleranceConfig( - enable_ft_package=True, - calc_ft_timeouts=True, - # Optional: simulate faults for testing - simulate_fault=False, - simulated_fault_type="random", -) -``` - -When directly using the configuration, you must launch your training script using the `ft_launcher` tool: - -```bash -ft_launcher \ - --rdzv_backend=c10d --rdzv_endpoint=${MASTER_ADDR}:${MASTER_PORT} \ - --nnodes=${NUM_NODES} --nproc-per-node=${NUM_GPUS_PER_NODE} \ - --ft-rank_section_timeouts=setup:600,step:180,checkpointing:420 \ - --ft-rank_out_of_section_timeout=300 \ - your_training_script.py -``` - -> **Note**: For local testing or non-Slurm environments, you must set the `GROUP_RANK` environment variable before launching `ft_launcher`: -> ```bash -> export GROUP_RANK=0 # For single-node runs -> ft_launcher ... -> ``` -> This is required because `ft_launcher` uses `use_infra_group_rank=True` by default, which expects either `SLURM_PROCID` or `GROUP_RANK` to be set. 
- -### Configuration Options - -The fault tolerance system can be configured through {py:class}`bridge.training.config.FaultToleranceConfig`: - -| Parameter | Type | Default | Description | -|-----------|------|---------|-------------| -| `enable_ft_package` | `bool` | `False` | Enable the fault tolerance package | -| `calc_ft_timeouts` | `bool` | `False` | Automatically compute optimal timeouts | -| `simulate_fault` | `bool` | `False` | Enable fault simulation for testing | -| `simulated_fault_type` | `str` | `"random"` | Type of fault to simulate: `"rank_hung"`, `"rank_killed"`, or `"random"` | -| `simulated_fault_rank` | `int` | `None` | Specific rank to simulate fault on (random if not specified) | -| `simulated_fault_base_delay` | `int` | `0` | Base delay before simulating fault | - -### Plugin Configuration Options - -When using the {py:class}`bridge.recipes.run_plugins.FaultTolerancePlugin`, additional options are available: - -| Parameter | Type | Default | Description | -|-----------|------|---------|-------------| -| `num_in_job_restarts` | `int` | `3` | Maximum number of restarts within the same job | -| `num_job_retries_on_failure` | `int` | `2` | Maximum number of new job launches on failure | -| `initial_rank_heartbeat_timeout` | `int` | `1800` | Timeout for initial heartbeat (seconds) | -| `rank_heartbeat_timeout` | `int` | `300` | Timeout for subsequent heartbeats (seconds) | - -### What to Expect - -When fault tolerance is enabled and a hang or fault is detected, you should see log messages similar to: - -``` -[WARNING] [RankMonitorServer:34] Did not get subsequent heartbeat. Waited 171.92 seconds. -[WARNING] [RankMonitorServer:58] Did not get subsequent heartbeat. Waited 171.92 seconds. -FT: Simulating fault: rank_killed; rank to fail: 2 -torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 453152 closing signal SIGTERM -``` - -The system will then automatically restart training from the most recent checkpoint. 
- -### How It Works - -The fault tolerance system integrates with Megatron Bridge's training pipeline through several key points: - -1. **Setup Phase**: Initializes fault tolerance monitoring before training begins. -2. **Training Steps**: Wraps each training iteration with timeout monitoring. -3. **Evaluation Steps**: Monitors evaluation iterations separately. -4. **Checkpointing**: Tracks checkpoint saving operations with dedicated timeouts. -5. **State Persistence**: Saves timeout calculations to `ft_state.json` for future runs. - -The system uses a section-based approach with different timeout thresholds: -- **Setup Section**: Covers initialization and checkpoint loading. -- **Step Section**: Monitors individual training/evaluation iterations. -- **Checkpointing Section**: Tracks checkpoint saving operations. -- **Out-of-Section**: Handles time between sections. - -### Best Practices - -1. **Enable Automatic Timeout Calculation**: Set `calc_ft_timeouts=True` to let the system learn optimal timeouts from your workload. -2. **Conservative Restart Limits**: Use reasonable limits for `num_in_job_restarts` and `num_job_retries_on_failure` to avoid infinite restart loops. -3. **Monitor Logs**: Watch for fault tolerance messages to understand when and why restarts occur. -4. **Test with Simulation**: Use the fault simulation features to test your fault tolerance setup before production runs. -5. **Checkpoint Frequency**: Ensure regular checkpointing to minimize lost work during restarts. - -### Limitations - -- Currently only supported on Slurm-based clusters. -- Not compatible with NSys profiling (the plugin will automatically disable nsys if enabled). -- Checkpoint save directory must be configured and accessible. +The system supports both in-job restarts (within the same Slurm allocation) and +new job launches on failure, with configurable limits for each. 
## Straggler Detection -The straggler detection feature identifies slow-performing ranks and can optionally terminate training if performance falls below specified thresholds. This helps ensure efficient training by detecting and mitigating the impact of underperforming nodes. - -### Key Features - -- **Performance Monitoring**: Tracks individual and relative GPU performance scores. -- **Automatic Detection**: Identifies stragglers based on configurable thresholds. -- **Detailed Reporting**: Provides comprehensive performance reports with best/worst performing ranks. -- **Optional Termination**: Can automatically stop training when stragglers are detected. -- **Flexible Configuration**: Supports various reporting intervals and threshold settings. - -### Configuration - -Enable straggler detection through the {py:class}`bridge.training.config.NVRxStragglerDetectionConfig`: - -```python -from megatron.bridge.training.config import NVRxStragglerDetectionConfig - -# Configure straggler detection in your config -config.nvrx_straggler = NVRxStragglerDetectionConfig( - enabled=True, - report_time_interval=300.0, # Report every 5 minutes - calc_relative_gpu_perf=True, - calc_individual_gpu_perf=True, - num_gpu_perf_scores_to_print=5, - gpu_relative_perf_threshold=0.7, - gpu_individual_perf_threshold=0.7, - stop_if_detected=False, # Set to True to stop training on detection - enable_logging=True, -) -``` - -### Configuration Options - -| Parameter | Type | Default | Description | -|-----------|------|---------|-------------| -| `enabled` | `bool` | `False` | Enable NVRx straggler detection | -| `report_time_interval` | `float` | `300.0` | Interval in seconds between straggler checks | -| `calc_relative_gpu_perf` | `bool` | `True` | Calculate relative GPU performance scores | -| `calc_individual_gpu_perf` | `bool` | `True` | Calculate individual GPU performance scores | -| `num_gpu_perf_scores_to_print` | `int` | `5` | Number of best/worst scores to print (0 disables 
periodic printing) | -| `gpu_relative_perf_threshold` | `float` | `0.7` | Threshold for relative performance (0.0-1.0) | -| `gpu_individual_perf_threshold` | `float` | `0.7` | Threshold for individual performance (0.0-1.0) | -| `stop_if_detected` | `bool` | `False` | Terminate training if stragglers are detected (saves checkpoint before exiting) | -| `enable_logging` | `bool` | `True` | Log GPU performance scores as structured data | -| `profiling_interval` | `int` | `1` | Profiling interval for the detector | -| `logger_name` | `str` | `"megatron.bridge.NVRxStragglerDetection"` | Logger name for messages | - -### Expected Output - -When straggler detection is enabled, you'll see performance reports in the training logs similar to: - -``` -GPU relative performance: - Worst performing 5/512 ranks: - Rank=76 Node=h100-001-253-012 Score=0.94 - Rank=13 Node=h100-001-010-003 Score=0.94 - Rank=45 Node=h100-001-172-026 Score=0.94 - Rank=433 Node=h100-004-141-026 Score=0.95 - Rank=308 Node=h100-003-263-012 Score=0.95 - Best performing 5/512 ranks: - Rank=432 Node=h100-004-141-026 Score=0.99 - Rank=376 Node=h100-004-005-003 Score=0.98 - Rank=487 Node=h100-004-255-026 Score=0.98 - Rank=369 Node=h100-004-004-033 Score=0.98 - Rank=361 Node=h100-004-004-023 Score=0.98 - -GPU individual performance: - Worst performing 5/512 ranks: - Rank=76 Node=h100-001-253-012 Score=0.98 - Rank=162 Node=h100-002-042-026 Score=0.98 - Rank=79 Node=h100-001-253-012 Score=0.98 - Rank=357 Node=h100-004-004-013 Score=0.98 - Rank=85 Node=h100-001-253-026 Score=0.98 - Best performing 5/512 ranks: - Rank=297 Node=h100-003-095-026 Score=1.00 - Rank=123 Node=h100-001-273-026 Score=1.00 - Rank=21 Node=h100-001-010-013 Score=1.00 - Rank=389 Node=h100-004-074-012 Score=1.00 - Rank=489 Node=h100-004-269-026 Score=1.00 - - Straggler report processing time: 0.042 sec. 
-``` - -If stragglers are detected and thresholds are exceeded, you'll see warnings like: - -``` -STRAGGLER DETECTION WARNING: Some GPUs have worse relative performance. Affected ranks: [76, 13, 45] -STRAGGLER DETECTION WARNING: Some GPUs performance dropped. Affected ranks: [162, 79, 357] -``` - -### Performance Scores - -The system calculates two types of performance scores: - -1. **Relative Performance**: Compares each rank's performance relative to other ranks in the same training run. -2. **Individual Performance**: Tracks each rank's performance over time to detect degradation. - -Scores range from 0.0 to 1.0, where: -- **1.0**: Best possible performance -- **0.7** (default threshold): Below this indicates a potential straggler -- **Lower values**: Indicate worse performance - -### How It Works - -The straggler detection system: - -1. **Initialization**: Sets up the NVRx detector during training setup. -2. **Monitoring**: Wraps the training step function to monitor execution time. -3. **Periodic Reporting**: Generates performance reports at specified intervals. -4. **Straggler Identification**: Compares performance scores against thresholds. -5. **Action**: Optionally saves a checkpoint and terminates training if stragglers are detected. - -### Best Practices - -1. **Appropriate Intervals**: Set `report_time_interval` based on your training characteristics. -2. **Threshold Tuning**: Adjust thresholds based on your hardware and expected performance variability. -3. **Gradual Rollout**: Start with `stop_if_detected=False` to observe performance patterns before enabling automatic termination. -4. **Monitor Logs**: Regularly check straggler reports to identify persistent hardware issues. -5. **Performance Impact**: The overhead is minimal, but you can adjust `profiling_interval` if needed. 
- -### Integration with Training - -The straggler detection integrates directly with the training loop: - -- Automatically initializes when {py:class}`bridge.training.resiliency.NVRxStragglerDetectionManager` is configured. -- Monitors training steps without affecting the training logic. -- Provides exit conditions that the training loop respects. -- Safely shuts down when training completes. - -## Preemption - -Training foundation models can take several hours or even days to complete. In some cases, training jobs must be halted preemptively due to cluster time limits, higher priority jobs, or other reasons. - -Megatron Bridge provides functionality to gracefully perform preemptive shutdown of training. This feature listens for user-specified signals and saves a checkpoint before exiting when the signal is received. - -### Key Features - -- **Signal-based Shutdown**: Listens for signals (default: SIGTERM) during training. -- **Graceful Exit**: Saves checkpoint before terminating to preserve training progress. -- **Distributed Coordination**: Ensures all ranks receive and handle the signal properly. -- **Flexible Configuration**: Supports different signals and timing configurations. - -### Usage Options - -Megatron Bridge provides two ways to enable preemption handling: - -#### Option 1: NeMo Run Plugin (Recommended) - -> **Warning**: This plugin is currently only supported on Slurm-based clusters. +NVRx straggler detection monitors GPU performance across ranks and identifies +slow-performing nodes. It calculates both relative and individual performance +scores, and can optionally terminate training if performance falls below +configurable thresholds. 
-If you're using NeMo Run, the {py:class}`bridge.recipes.run_plugins.PreemptionPlugin` provides the simplest integration: +### When to Use It -```python -from megatron.bridge.recipes.run_plugins import PreemptionPlugin -import nemo_run as run +Straggler detection is useful when: -# Configure your task -task = run.Script(...) +- training at scale where one slow node degrades overall throughput +- you want visibility into per-rank GPU performance +- you need to identify persistent hardware issues -# Add preemption plugin -run_plugins = [ - PreemptionPlugin( - preempt_time=60, # Send signal 60 seconds before time limit - enable_exit_handler=True, - enable_exit_handler_for_data_loader=False, - ) -] +### Stable Constraints -# Run with preemption support -run.run(task, plugins=run_plugins, executor=executor) -``` +- Requires `nvidia-resiliency-ext` +- Overhead is minimal but can be tuned via `profiling_interval` +- Does **not** stop training by default; `stop_if_detected` must be + explicitly set to `True` for automatic termination -#### Option 2: Direct Configuration - -Configure preemption handling directly in your training configuration: +## Preemption -```python -from megatron.bridge.training.config import TrainingConfig -import signal +Preemption handling provides graceful shutdown when a training job receives a +termination signal (default: SIGTERM). It saves a checkpoint before exiting to +preserve training progress. -# Configure preemption in training config -config.train = TrainingConfig( - exit_signal_handler=True, - exit_signal=signal.SIGTERM, # Signal to listen for - exit_signal_handler_for_dataloader=False, - # ... 
other training config options -) -``` +### When to Use It -### Configuration Options +Preemption is important when: -#### PreemptionPlugin Options +- running on shared clusters with job time limits +- higher-priority jobs may preempt your allocation +- you want to minimize lost work on job termination -| Parameter | Type | Default | Description | -|-----------|------|---------|-------------| -| `preempt_time` | `int` | `60` | Time in seconds before job limit to send preemption signal | -| `enable_exit_handler` | `bool` | `True` | Enable the exit signal handler in training | -| `enable_exit_handler_for_data_loader` | `bool` | `False` | Enable signal handler for dataloader workers | +### Stable Constraints -#### Training Configuration Options +- The `PreemptionPlugin` is Slurm-specific +- Direct configuration via `exit_signal_handler` works on any cluster +- Signal detection happens at the end of each training step -| Parameter | Type | Default | Description | -|-----------|------|---------|-------------| -| `exit_signal_handler` | `bool` | `False` | Enable signal handler for graceful shutdown | -| `exit_signal` | `int` | `signal.SIGTERM` | Signal to listen for (default: SIGTERM) | -| `exit_signal_handler_for_dataloader` | `bool` | `False` | Enable signal handler for dataloader workers | +## Async Checkpoint Save -### Expected Behavior +Async checkpoint save overlaps checkpoint I/O with training compute using +persistent background workers. Training continues immediately after scheduling +the save rather than blocking until the write completes. -When a preemption signal is received, you'll see log messages similar to: +### When to Use It -``` -Received signal 15, initiating graceful stop -Signal handler installed for 15 -exiting program after receiving SIGTERM. -``` +Async save is valuable when: -The system will: -1. **Detect the signal** at the end of the current training step. -2. **Save a checkpoint** to preserve training progress. -3. 
**Log the shutdown reason** for debugging purposes. -4. **Exit gracefully** with proper cleanup. +- checkpoint save time is a significant fraction of step time +- you are using `torch_dist` checkpoint format -### How It Works +### Stable Constraints -The preemption system operates through several components: +- Requires `ckpt_format="torch_dist"` +- Other formats (zarr, fsdp_dtensor) do not support async save +- The persistent checkpoint worker must be enabled -1. **Signal Handler Installation**: Sets up a distributed signal handler using {py:class}`bridge.training.resiliency.DistributedSignalHandler`. -2. **Signal Detection**: Checks for received signals at the end of each training step. -3. **Distributed Coordination**: Uses all-gather to ensure all ranks are aware of the signal. -4. **Checkpoint Saving**: Automatically saves a checkpoint before exiting. -5. **Graceful Shutdown**: Properly cleans up resources and exits. +## Local Checkpointing -### Signal Handling Details +Local checkpointing saves checkpoint data to node-local storage first, then +replicates across a configurable number of nodes. This avoids the latency of +writing to shared network storage during the critical path. -The `DistributedSignalHandler` class provides: -- **Cross-rank coordination**: Ensures all ranks handle the signal consistently. -- **Original handler preservation**: Restores original signal handlers on exit. -- **Flexible signal support**: Can handle different signal types (SIGTERM, SIGINT, etc.). +### When to Use It -### Integration with Slurm +Local checkpointing is useful when: -When using Slurm, the system automatically: -- **Receives SIGTERM** when approaching job time limits. -- **Coordinates across nodes** to ensure consistent shutdown. -- **Saves progress** before the job is forcibly terminated. 
+- shared-storage checkpoint writes are the bottleneck in your checkpoint interval +- you want faster recovery from node failures without depending on network filesystem availability +- training at scale where network-storage contention is common -### Best Practices +### Stable Constraints -1. **Use Appropriate Timing**: Set `preempt_time` to allow sufficient time for checkpoint saving. -2. **Monitor Logs**: Watch for preemption messages to understand shutdown patterns. -3. **Test Signal Handling**: Verify preemption works correctly in your environment. -4. **Regular Checkpointing**: Ensure regular checkpoint intervals to minimize potential data loss. -5. **Resource Cleanup**: The system handles cleanup automatically, but monitor for any resource leaks. +- Node-local storage must have sufficient capacity for at least one checkpoint +- Replication degree must be configured to survive the expected failure rate +- Requires compatible checkpoint format (see [skills/resiliency/SKILL.md](../skills/resiliency/SKILL.md)) ## Re-run State Machine -The re-run state machine is an experimental feature that helps with attribution of unexpected results such as NaN values, spiky loss, or other computational anomalies. It works by re-running computations to determine whether issues are transient errors, persistent hardware faults, or actually correct results. - -> **Disclaimer**: This is an experimental alpha-level feature for result attribution. Nodes flagged by this system should be subjected to standard diagnostic test suites for confirmation. - -### Key Features - -- **Automatic Re-run Logic**: Detects unexpected results and automatically re-runs computations to verify reproducibility. -- **Error Attribution**: Classifies issues as transient errors, persistent errors, or correct results. -- **Multi-stage Validation**: Uses in-place re-runs and checkpoint-based re-runs on different hardware. -- **Determinism Tracking**: Can report statistics on computational non-determinism. 
-- **State Management**: Handles RNG state and data iterator state for reproducible re-runs. - -### How It Works - -The re-run state machine operates through several stages: - -1. **Initial Run**: Executes the training step normally, validating results. -2. **First Re-run (In-place)**: If validation fails, re-runs on the same GPU to check reproducibility. -3. **Second Re-run (Different GPU)**: If the issue is reproducible, saves checkpoint and re-runs on different hardware. -4. **Attribution**: Determines if the issue is a transient error, persistent error, or correct result. - -### Configuration - -Configure the re-run state machine through {py:class}`bridge.training.config.RerunStateMachineConfig`: - -```python -from megatron.bridge.training.config import RerunStateMachineConfig +The re-run state machine is an experimental feature for attributing unexpected +results (NaN loss, spiky loss) to transient errors, persistent hardware faults, +or correct-but-unexpected results. It works by re-running computations on the +same and different GPUs. 
-# Configure re-run state machine in your config -config.rerun_state_machine = RerunStateMachineConfig( - rerun_mode="validate_results", # or "report_determinism_stats" or "disabled" - check_for_nan_in_loss=True, - check_for_spiky_loss=False, - spiky_loss_factor=10.0, # Adjust for your model architecture - error_injection_rate=0, # For testing only - error_injection_type="transient_error", -) -``` +### When to Use It -### Configuration Options +Consider the re-run state machine when: -| Parameter | Type | Default | Description | -|-----------|------|---------|-------------| -| `rerun_mode` | `str` | `"disabled"` | Operating mode: `"disabled"`, `"validate_results"`, or `"report_determinism_stats"` | -| `check_for_nan_in_loss` | `bool` | `True` | Check for NaN values in loss | -| `check_for_spiky_loss` | `bool` | `False` | Check for unexpectedly large loss values | -| `spiky_loss_factor` | `float` | `10.0` | Factor for spiky loss detection. Loss is flagged if it exceeds this multiple of max observed loss. Larger models may need higher values (e.g., 15-20 for 70B+). | -| `error_injection_rate` | `int` | `0` | Rate for injecting test errors (testing only) | -| `error_injection_type` | `str` | `"transient_error"` | Type of error to inject for testing | +- you need automated NaN detection and attribution +- you want to distinguish hardware faults from training instability -### Operating Modes +### Stable Constraints -#### 1. Disabled Mode (`disabled`) -- **Purpose**: No result validation or re-run logic. -- **Behavior**: Training proceeds normally without any result checking. -- **Use Case**: When re-run overhead is not acceptable or validation is not needed. - -#### 2. Report Stats Mode (`report_determinism_stats`) -- **Purpose**: Collect statistics on computational determinism. -- **Behavior**: Re-runs every step once to measure variability. -- **Output**: Reports on computational non-determinism without stopping training. - -#### 3. 
Validate Results Mode (`validate_results`) -- **Purpose**: Full validation with re-runs and hardware fault attribution. -- **Behavior**: Re-runs computations when unexpected results are detected. -- **Exit Conditions**: May exit with specific codes for checkpointing or validation failure. - -### Integration with Training - -The re-run state machine integrates at the training step level: - -```python -# In train_step function -rerun_state_machine = get_rerun_state_machine() -while rerun_state_machine.should_run_forward_backward(data_iterator): - # Execute forward-backward pass - loss_dict = forward_backward_func(...) - - # Validate results (automatically handled in forward_step) - # check_for_nan_in_loss and check_for_spiky_loss are passed to loss function - -should_checkpoint, should_exit, exit_code = rerun_state_machine.should_checkpoint_and_exit() -if should_checkpoint: - save_checkpoint(...) -if should_exit: - sys.exit(exit_code) -``` - -### Exit Codes - -The re-run state machine uses specific exit codes to control job behavior: - -- **Exit Code 16** (`EXIT_CODE_RESUME_TO_DISAMBIGUATE`): Job should be restarted from checkpoint to re-run on different hardware. -- **Exit Code 17** (`EXIT_CODE_FAILED_ON_RESULT_VALIDATION`): Job failed validation and should not continue. - -### Expected Behavior - -#### Validation Success -When validation passes, training continues normally with no additional overhead. - -#### Transient Error Detection -``` -Unexpected result tensor(nan) on rank 0 at iteration #150 invocation #1 (message='loss is NaN') -First rerun: unexpected result is not reproducible within the tolerance -Possible transient error! -``` - -#### Persistent Error Detection -``` -First rerun: unexpected result is reproducible within the tolerance -Need to rerun on a different GPU to verify correctness -Second rerun: unexpected result is not reproducible on a different GPU, therefore was likely incorrect -Possible persistent error! 
-``` - -#### Correct Result (False Positive) -``` -Second rerun: unexpected result is reproducible on a different GPU, therefore it was likely correct -Correct result (but possible Application error) -``` - -### Result Attribution Categories - -1. **Transient Error**: Result not reproducible on same GPU - likely temporary hardware glitch. -2. **Persistent Error**: Result reproducible on same GPU but different on other GPU - likely hardware fault. -3. **Correct Result**: Result reproducible across different GPUs - likely correct but unexpected. - -### Data Iterator Integration - -The system uses `RerunDataIterator` to handle data replay: -- **State Saving**: Captures data iterator state for reproducible re-runs. -- **Replay Capability**: Can rewind and replay the same data batches. -- **Checkpoint Support**: Saves/restores iterator state across job restarts. +- Alpha-level feature; full integration is limited +- Three modes: `disabled`, `validate_results`, `report_determinism_stats` +- Uses specific exit codes (16, 17) to control job behavior ## In-Process Restart -> **Warning**: This is a highly experimental feature and is subject to change in backwards incompatible ways without notice. - -The in-process restart mechanism provides automatic fault recovery by restarting the training function within the same operating system process when failures occur. Unlike traditional scheduler-level restarts, in-process restart eliminates the overhead of launching new jobs, starting containers, initializing Python interpreters, and creating new CUDA contexts. - -> **Note**: In-process restart is not suitable for all types of failures. Hardware-level failures such as switch failures, network partitions, or multiple node failures that render nodes inaccessible cannot be recovered through in-process restart alone. For comprehensive fault tolerance, it is recommended to combine in-process restart with the fault tolerance system (in-job restarts) described earlier in this document. 
This layered approach provides both fast recovery for software faults and robust handling of hardware-level failures. - -For comprehensive information about this functionality, refer to the [NVIDIA Resiliency Extension In-Process Restart documentation](https://nvidia.github.io/nvidia-resiliency-ext/inprocess/index.html). - -### Key Features - -- **In-Process Recovery**: Restarts training within the same process, avoiding container and interpreter restart overhead. -- **Automatic Fault Detection**: Detects unhandled Python exceptions, deadlocks, and livelocks across all distributed ranks. -- **Coordinated Restart**: Ensures all healthy ranks restart simultaneously when any rank encounters a fault. -- **Timeout Mechanisms**: Provides both soft and hard timeouts to detect and recover from hangs. -- **Rank Reassignment**: Supports excluding unhealthy ranks and utilizing warm reserve workers. -- **State Reuse**: Enables reuse of process-group-independent objects across restart attempts to minimize latency. -- **Granular Control**: Supports both node-level and rank-level restart granularity. -- **Health Checks**: Performs GPU health validation and optionally tracks fault counts. - -### Prerequisites - -Before using in-process restart, ensure the following requirements are met: - -1. **PyTorch Version**: PyTorch v2.5.1 or higher is required. -2. **NCCL Version**: NCCL v2.26.2 or higher is required. -3. **Checkpoint Configuration**: A valid checkpoint directory must be configured for state recovery. -4. **GIL-Released Operations**: All operations that wait on NCCL kernels or synchronize with GPU must release the Python Global Interpreter Lock (GIL). - -> **Important**: If operations hold the GIL during a fault, graceful restart cannot proceed, and affected ranks will be forcibly terminated. 
- -### Configuration - -Configure in-process restart through {py:class}`bridge.training.config.InProcessRestartConfig`: - -```python -from megatron.bridge.training.config import InProcessRestartConfig - -# Configure in-process restart in your config -config.inprocess_restart = InProcessRestartConfig( - enabled=True, - active_world_size=None, # Defaults to WORLD_SIZE, set lower to use warm reserves - granularity="node", # or "rank" for rank-level restart - max_iterations=None, # No limit on restart attempts - soft_timeout=60.0, # Timeout for detecting GIL-released hangs - hard_timeout=90.0, # Timeout for forcibly terminating hung ranks - heartbeat_interval=30.0, - heartbeat_timeout=60.0, - monitor_thread_interval=1.0, - monitor_process_interval=1.0, - progress_watchdog_interval=1.0, - barrier_timeout=120.0, - completion_timeout=120.0, - last_call_wait=1.0, - termination_grace_time=1.0, - empty_cuda_cache=True, - max_rank_faults=None, # No limit on rank faults - monitor_process_logdir=None, # Disable monitor process logging -) -``` - -### Configuration Options - -| Parameter | Type | Default | Description | -|-----------|------|---------|-------------| -| `enabled` | `bool` | `False` | Enable in-process restart mechanism | -| `active_world_size` | `int` | `None` | Number of ranks initially executing workload (remaining ranks are warm reserves) | -| `granularity` | `str` | `"node"` | Restart granularity: `"node"` or `"rank"` | -| `max_iterations` | `int` | `None` | Maximum number of restart iterations (None = unlimited) | -| `soft_timeout` | `float` | `60.0` | Soft progress timeout in seconds (for detecting GIL-released hangs) | -| `hard_timeout` | `float` | `90.0` | Hard progress timeout in seconds (for forcibly terminating hung ranks) | -| `heartbeat_interval` | `float` | `30.0` | Interval in seconds for heartbeat monitoring | -| `heartbeat_timeout` | `float` | `60.0` | Timeout in seconds for detecting missing rank heartbeats | -| `monitor_thread_interval` | `float` 
| `1.0` | Monitoring interval in seconds for monitoring thread | -| `monitor_process_interval` | `float` | `1.0` | Monitoring interval in seconds for monitoring process | -| `progress_watchdog_interval` | `float` | `1.0` | Interval in seconds for automatic progress watchdog updates | -| `barrier_timeout` | `float` | `120.0` | Timeout in seconds for internal distributed barriers | -| `completion_timeout` | `float` | `120.0` | Timeout in seconds for completion barrier on all ranks | -| `last_call_wait` | `float` | `1.0` | Time interval in seconds for other ranks to report concurrent failures | -| `termination_grace_time` | `float` | `1.0` | Interval in seconds between SIGTERM and SIGKILL on hard timeout | -| `empty_cuda_cache` | `bool` | `True` | Empty CUDA cache during restart finalization | -| `max_rank_faults` | `int` | `None` | Maximum number of rank faults allowed before terminating (None = unlimited) | -| `monitor_process_logdir` | `str` | `None` | Directory for monitor process log files (None = disabled) | - -### Slurm Configuration Requirements - -> **Warning**: Running in-process restart through NeMo-Run's Slurm Executor is **not currently supported**. - -If you need to use in-process restart with Slurm, you must launch your jobs directly using `srun` with the proper configuration. Refer to the [NVIDIA Resiliency Extension Slurm configuration guide](https://nvidia.github.io/nvidia-resiliency-ext/inprocess/usage_guide.html#running-with-slurm) for detailed instructions on: - -- Setting `--kill-on-bad-exit=0` to prevent Slurm from terminating the entire job on rank failures -- Using the `wait_daemon.py` utility for proper monitoring process cleanup -- Configuring SLURM PMI for compatibility - -#### Monitor Process Log Files - -When `monitor_process_logdir` is configured, the system automatically generates monitor process log files for rank 0 only. 
The log file path must be coordinated between your Python configuration and the `wait_daemon.py` script used in your Slurm launch command. - -The system creates log files with the following naming convention: - -``` -monitor_{SLURM_JOB_ID}_{hostname}_{SLURM_PROCID}_{SLURM_LOCALID}.log -``` - -Where: -- `SLURM_JOB_ID`: The Slurm job ID from the `SLURM_JOB_ID` environment variable -- `hostname`: The hostname of the node where rank 0 is running -- `SLURM_PROCID`: The global rank from the `SLURM_PROCID` environment variable -- `SLURM_LOCALID`: The local rank on the node from the `SLURM_LOCALID` environment variable - -**Python Configuration:** - -```python -config.inprocess_restart = InProcessRestartConfig( - enabled=True, - monitor_process_logdir="/scratch/logs/monitor", # Provide directory only -) -``` - -**Corresponding Slurm Launch Command:** - -You must pass the same log file path pattern to `wait_daemon.py` in your sbatch script. The path should include `{rank}` as a placeholder that will be substituted with the actual rank: - -```bash -srun --kill-on-bad-exit=0 \ - python -m nvidia_resiliency_ext.inprocess.wait_daemon \ - --monitor-process-logfile=/scratch/logs/monitor/monitor_${SLURM_JOB_ID}_$(hostname)_\${SLURM_PROCID}_\${SLURM_LOCALID}.log \ - -- \ - python your_training_script.py -``` - -> **Important**: The monitor process log file path must match between your Python configuration (`monitor_process_logdir`) and the `wait_daemon.py` command-line argument. This coordination ensures that `wait_daemon.py` can properly monitor and wait for the monitor process to complete its cleanup before exiting. - -### Integration in Megatron Bridge - -The in-process restart system integrates with Megatron Bridge's training pipeline through several mechanisms: - -#### 1. 
Function Wrapping - -The `pretrain()` function detects when in-process restart is enabled and wraps the internal `_pretrain()` function with the restart mechanism: - -```python -if config.inprocess_restart and config.inprocess_restart.enabled: - from megatron.bridge.training.inprocess_restart import maybe_wrap_for_inprocess_restart - - wrapped_pretrain, store = maybe_wrap_for_inprocess_restart( - _pretrain, config.inprocess_restart, state - ) - wrapped_pretrain(state, forward_step_func, store=store) -``` - -#### 2. Coordination Store - -A dedicated `TCPStore` is created for coordination between ranks during restart operations: -- Uses `MASTER_PORT + 1` to avoid conflicts with PyTorch distributed -- Enables rank-to-rank communication for fault detection and recovery -- Supports prefix-based isolation for each restart attempt - -#### 3. State Management - -During restart, the system performs comprehensive cleanup: - -- **PyTorch State**: Destroys distributed process groups via `torch.distributed.destroy_process_group()` -- **Megatron State**: Cleans up global state through `destroy_global_state()` -- **Training State**: Resets the `GlobalState` object for fresh initialization -- **CUDA State**: Optionally empties CUDA cache to free memory -- **Async Workers**: Aborts persistent async checkpoint worker processes - -#### 4. Restart Flow - -When a fault occurs on any rank: - -1. **Fault Detection**: The wrapper detects the exception, timeout, or missing heartbeat -2. **Distributed Abort**: All ranks are notified and begin coordinated shutdown -3. **State Cleanup**: Each rank cleans up PyTorch, Megatron, and CUDA state -4. **Health Check**: GPU health is validated on each rank -5. **Rank Reassignment**: Unhealthy ranks are excluded, reserves may be activated -6. **Barrier Synchronization**: All healthy ranks wait at a distributed barrier -7. **Function Restart**: The wrapped function restarts on all healthy ranks simultaneously - -#### 5. 
Restart Iterations - -The `CallWrapper` tracks restart iterations and provides this information to the wrapped function: -- Iteration 0: Initial execution -- Iteration 1+: Restart attempts after faults -- Used to create isolated `PrefixStore` instances per restart attempt - -### Environment Configuration - -#### Required Environment Variables - -Set these environment variables to optimize in-process restart behavior: +In-process restart provides automatic fault recovery by restarting the training +function within the same OS process. This avoids the overhead of launching new +jobs, starting containers, and creating new CUDA contexts. -```bash -# Suppress c10d TCPStore wait timeout warnings -export TORCH_CPP_LOG_LEVEL=error +### When to Use It -# Prevent PyTorch NCCL Watchdog from forcibly terminating on NCCL/CUDA errors -export TORCH_NCCL_RETHROW_CUDA_ERRORS=0 +In-process restart is appropriate when: -# Disable NVLS support in NCCL (required for in-process restart) -export NCCL_NVLS_ENABLE=0 -``` +- software faults (exceptions, deadlocks) are more common than hardware faults +- restart latency matters and you want to avoid full job relaunch +- you can accept the experimental status and compatibility constraints -#### PyTorch NCCL Watchdog Timeout +### Stable Constraints -Configure the PyTorch NCCL watchdog timeout to be longer than the `hard_timeout`: +- Requires PyTorch >= 2.5.1 and NCCL >= 2.26.2 +- Not compatible with NeMo-Run or Slurm preemption plugins +- Requires specific environment variables (`NCCL_NVLS_ENABLE=0`, etc.) +- The PyTorch NCCL watchdog timeout must exceed `hard_timeout` +- Supports both node-level and rank-level restart granularity -```python -import torch.distributed as dist -from datetime import timedelta +In-process restart is not suitable for hardware-level failures such as switch +failures or network partitions. For comprehensive fault tolerance, combine it +with job-level fault tolerance. 
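
The watchdog and environment constraints above can be sketched as a small pre-flight check. This is a minimal illustration, not part of Megatron Bridge or the resiliency extension: `REQUIRED_ENV`, `missing_env`, and `watchdog_timeout` are hypothetical names, the three variables are the ones the previous revision of this page listed, and the 60-second margin is an assumption borrowed from that revision's example.

```python
from datetime import timedelta
from typing import Dict, Mapping

# Environment variables listed in the earlier revision of this page.
REQUIRED_ENV: Dict[str, str] = {
    "TORCH_CPP_LOG_LEVEL": "error",
    "TORCH_NCCL_RETHROW_CUDA_ERRORS": "0",
    "NCCL_NVLS_ENABLE": "0",
}


def missing_env(env: Mapping[str, str]) -> Dict[str, str]:
    """Return the variables that are unset or set to a different value."""
    return {k: v for k, v in REQUIRED_ENV.items() if env.get(k) != v}


def watchdog_timeout(hard_timeout: float, margin: float = 60.0) -> timedelta:
    """Build a process-group timeout strictly longer than hard_timeout.

    The margin is an illustrative assumption; the only stated requirement
    is that the NCCL watchdog timeout exceeds hard_timeout.
    """
    if margin <= 0:
        raise ValueError("margin must be positive so the timeout exceeds hard_timeout")
    return timedelta(seconds=hard_timeout + margin)
```

In a real job, `missing_env(os.environ)` would be called before launch, and the returned `timedelta` would be passed as the `timeout` argument of `torch.distributed.init_process_group`.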
-# When initializing the distributed backend -dist.init_process_group( - backend='nccl', - timeout=timedelta(seconds=config.inprocess_restart.hard_timeout + 60) -) -``` +## Practical Caveats -### Known Issues +1. No single resiliency feature covers all failure modes. The recommended + approach is to layer features (e.g., fault tolerance + straggler detection + + async checkpoint). +2. Not all recipes enable resiliency features by default. Check and enable + explicitly. +3. Two straggler detectors exist in the codebase (NVRx and legacy MCore). + Use the NVRx version; do not enable both. -Refer to the [NVIDIA Resiliency Extension Known Issues](https://nvidia.github.io/nvidia-resiliency-ext/inprocess/usage_guide.html#known-issues) for the most up-to-date list of limitations and workarounds related to: +## Related Docs -- PyTorch distributed limitations -- NCCL collective termination -- CUDA context handling -- Checkpoint worker cleanup +- [docs/training/checkpointing.md](checkpointing.md) +- [docs/performance-guide.md](../performance-guide.md) +- [skills/resiliency/SKILL.md](../skills/resiliency/SKILL.md) +- [skills/resiliency/card.yaml](../skills/resiliency/card.yaml) +- [NVIDIA Resiliency Extension](https://github.com/NVIDIA/nvidia-resiliency-ext) +- [In-Process Restart Guide](https://nvidia.github.io/nvidia-resiliency-ext/inprocess/index.html) diff --git a/skills/perf-techniques/README.md b/skills/perf-techniques/README.md new file mode 100644 index 0000000000..006bd1db2a --- /dev/null +++ b/skills/perf-techniques/README.md @@ -0,0 +1,13 @@ +# Performance Technique Skills + +This directory stores operational guides for performance and parallelism +techniques. + +Each technique lives in its own subfolder with two files: + +- `SKILL.md` — operational guide (enablement, code anchors, pitfalls, + verification) +- `card.yaml` — machine-readable structured metadata (validation status, + constraints, evidence) + +Stable human-facing docs live in `docs/training/*.md`. 
diff --git a/skills/perf-techniques/cuda-graphs/SKILL.md b/skills/perf-techniques/cuda-graphs/SKILL.md new file mode 100644 index 0000000000..4848f4d97f --- /dev/null +++ b/skills/perf-techniques/cuda-graphs/SKILL.md @@ -0,0 +1,272 @@ +--- +name: cuda-graphs +description: Validate and use CUDA graph capture in Megatron Bridge, including local full-iteration graphs and Transformer Engine scoped graphs for attention, MLP, and MoE modules. +--- + +# CUDA Graphs + +Stable docs: `docs/training/cuda-graphs.md` +Card: `card.yaml` (co-located) + +## What It Is + +CUDA graphs capture GPU operations once and replay them with minimal +host-driver overhead. Bridge supports two implementations: + +| `cuda_graph_impl` | Mechanism | Scope support | +|---|---|---| +| `"local"` | MCore `FullCudaGraphWrapper` wrapping entire fwd+bwd | `full_iteration` | +| `"transformer_engine"` | TE `make_graphed_callables()` per layer | `attn`, `mlp`, `moe`, `moe_router`, `moe_preprocess`, `mamba` | + +## Enablement + +### Local full-iteration graph + +```python +cfg.model.cuda_graph_impl = "local" +cfg.model.cuda_graph_scope = ["full_iteration"] +cfg.model.cuda_graph_warmup_steps = 3 +cfg.model.use_te_rng_tracker = True +cfg.rng.te_rng_tracker = True +cfg.rerun_state_machine.check_for_nan_in_loss = False +cfg.ddp.check_for_nan_in_grad = False +``` + +### TE scoped graph (dense model) + +```python +cfg.model.cuda_graph_impl = "transformer_engine" +cfg.model.cuda_graph_scope = ["attn"] # or ["attn", "mlp"] +cfg.model.cuda_graph_warmup_steps = 3 +cfg.model.use_te_rng_tracker = True +cfg.rng.te_rng_tracker = True +``` + +### TE scoped graph (MoE model) + +```python +cfg.model.cuda_graph_impl = "transformer_engine" +cfg.model.cuda_graph_scope = ["attn", "moe_router", "moe_preprocess"] +cfg.model.cuda_graph_warmup_steps = 3 +cfg.model.use_te_rng_tracker = True +cfg.rng.te_rng_tracker = True +``` + +### Performance harness CLI + +```bash +python scripts/performance/run_performance_workload.py \ + 
--cuda_graph_impl transformer_engine \ + --cuda_graph_scope attn moe_router moe_preprocess \ + ... +``` + +Valid CLI values live in `scripts/performance/argument_parser.py`: +- `VALID_CUDA_GRAPH_IMPLS`: `["none", "local", "transformer_engine"]` +- `VALID_CUDA_GRAPH_SCOPES`: `["full_iteration", "attn", "mlp", "moe", "moe_router", "moe_preprocess", "mamba"]` + +### Required constraints + +- `use_te_rng_tracker = True` (enforced in `gpt_provider.py`) +- `full_iteration` scope only with `cuda_graph_impl = "local"` +- `full_iteration` scope requires `check_for_nan_in_loss = False` +- Do not combine `moe` scope and `moe_router` scope +- Tensor shapes must be static (fixed seq_length, fixed micro_batch_size) +- MoE token-dropless routing limits graphable scope to dense modules +- With `PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True`, set + `NCCL_GRAPH_REGISTER=0` (MCore enforces for local impl on arch < sm_100; + TE impl asserts unconditionally) +- CPU offloading is incompatible with CUDA graphs +- `moe_preprocess` scope requires `moe_router` scope to also be set + +## Code Anchors + +### Bridge config and validation + +```1524:1531:src/megatron/bridge/training/config.py + # CUDA graph scope validation: check_for_nan_in_loss must be disabled with full_iteration graph + if self.model.cuda_graph_impl == "local" and CudaGraphScope.full_iteration in self.model.cuda_graph_scope: + assert not self.rerun_state_machine.check_for_nan_in_loss, ( + "check_for_nan_in_loss must be disabled when using full_iteration CUDA graph. " + "Set rerun_state_machine.check_for_nan_in_loss=False." + ) + if self.model.cuda_graph_impl == "none": + self.model.cuda_graph_scope = [] +``` + +### TE RNG tracker requirement + +```213:216:src/megatron/bridge/models/gpt_provider.py + if self.cuda_graph_impl != "none": + assert getattr(self, "use_te_rng_tracker", False), ( + "Transformer engine's RNG tracker is required for cudagraphs, it can be " + "enabled with use_te_rng_tracker=True'." 
+``` + +### Graph creation and capture in training loop + +```231:255:src/megatron/bridge/training/train.py + # Capture CUDA Graphs. + cuda_graph_helper = None + if model_config.cuda_graph_impl == "transformer_engine": + cuda_graph_helper = TECudaGraphHelper(...) + # ... + if config.model.cuda_graph_impl == "local" and CudaGraphScope.full_iteration in config.model.cuda_graph_scope: + forward_backward_func = FullCudaGraphWrapper( + forward_backward_func, cuda_graph_warmup_steps=config.model.cuda_graph_warmup_steps + ) +``` + +### TE graph capture after warmup + +```338:350:src/megatron/bridge/training/train.py + # Capture CUDA Graphs after warmup. + if ( + model_config.cuda_graph_impl == "transformer_engine" + and cuda_graph_helper is not None + and not cuda_graph_helper.graphs_created() + and global_state.train_state.step - start_iteration == model_config.cuda_graph_warmup_steps + ): + if model_config.cuda_graph_warmup_steps > 0 and should_toggle_forward_pre_hook: + disable_forward_pre_hook(model, param_sync=False) + cuda_graph_helper.create_cudagraphs() + if model_config.cuda_graph_warmup_steps > 0 and should_toggle_forward_pre_hook: + enable_forward_pre_hook(model) + cuda_graph_helper.cuda_graph_set_manual_hooks() +``` + +### RNG initialization + +```199:206:src/megatron/bridge/training/initialize.py + _set_random_seed( + rng_config.seed, + rng_config.data_parallel_random_init, + rng_config.te_rng_tracker, + rng_config.inference_rng_tracker, + use_cudagraphable_rng=(model_config.cuda_graph_impl != "none"), + pg_collection=pg_collection, + ) +``` + +### Delayed wgrad + CUDA graph interaction + +```522:555:src/megatron/bridge/training/comm_overlap.py + cuda_graph_scope = getattr(model_cfg, "cuda_graph_scope", []) or [] + # ... scope parsing ... + if wgrad_in_graph_scope: + assert is_te_min_version("2.12.0"), ... + assert model_cfg.gradient_accumulation_fusion, ... + if attn_scope_enabled: + assert not model_cfg.add_bias_linear and not model_cfg.add_qkv_bias, ... 
+``` + +### Perf harness override helper + +```102:124:scripts/performance/utils/overrides.py +def _set_cuda_graph_overrides( + recipe, cuda_graph_impl=None, cuda_graph_scope=None +): + # Sets impl, scope, and auto-enables te_rng_tracker +``` + +### Graph cleanup + +```1414:1441:src/megatron/bridge/training/train.py +def _delete_cuda_graphs(cuda_graph_helper): + # Deletes FullCudaGraphWrapper and TE graph objects to free NCCL buffers +``` + +### MCore classes (in 3rdparty/Megatron-LM) + +- `CudaGraphManager`: `megatron/core/transformer/cuda_graphs.py` +- `TECudaGraphHelper`: `megatron/core/transformer/cuda_graphs.py` +- `FullCudaGraphWrapper`: `megatron/core/full_cuda_graph.py` +- `CudaGraphScope` enum: `megatron/core/transformer/enums.py` + +### Positive recipe anchors + +- `scripts/performance/configs/deepseek/deepseek_workload_base_configs.py` +- `scripts/performance/configs/qwen/qwen3_workload_base_configs.py` +- `scripts/performance/configs/gpt_oss/gpt_oss_workload_base_configs.py` + +### Tests + +| File | Coverage | +|---|---| +| `tests/unit_tests/training/test_config.py` | `full_iteration` NaN-check constraint | +| `tests/unit_tests/training/test_comm_overlap.py` | `delay_wgrad` + CUDA graph interaction | +| `tests/unit_tests/models/test_gpt_full_te_layer_autocast_spec.py` | TE autocast with CUDA graphs | +| `tests/functional_tests/recipes/test_llama_recipes_pretrain_cuda_graphs.py` | End-to-end local and TE graph smoke tests | +| `tests/unit_tests/recipes/kimi/test_kimi_k2.py` | TE + CUDA graph recipe config | +| `tests/unit_tests/recipes/gpt/test_gpt3_175b.py` | TE + CUDA graph recipe config | +| `tests/unit_tests/recipes/qwen_vl/test_qwen25_vl_recipes.py` | VLM CUDA graph settings | + +## Pitfalls + +1. **TE RNG tracker is mandatory**: Setting `cuda_graph_impl` without + `use_te_rng_tracker=True` and `rng.te_rng_tracker=True` will assert + in the provider. + +2. 
**`full_iteration` requires NaN checks disabled**: The entire fwd+bwd is + captured, so loss-NaN checking cannot inspect intermediate values. + +3. **MoE scope restrictions**: `moe` scope and `moe_router` scope are + mutually exclusive. Token-dropless MoE can only graph `moe_router` and + `moe_preprocess`, not the full expert dispatch. + +4. **Memory overhead**: CUDA graphs pin all intermediate buffers for the + graph's lifetime (no memory reuse). TE scoped graphs add a few GB; + full-iteration graphs can increase peak memory by 1.5–2×. `PP > 1` + compounds overhead since each stage holds its own graph. + +5. **Delayed wgrad interaction**: When `delay_wgrad_compute=True` and + attention or MoE router is in `cuda_graph_scope`, additional constraints + apply: TE >= 2.12.0, `gradient_accumulation_fusion=True`, and no + attention bias. + +6. **Variable-length sequences break graphs**: Sequence lengths must be + constant across steps. Use padded packed sequences if packing is needed. + +7. **Graph cleanup is required**: CUDA graph objects hold NCCL buffer + references. Bridge handles this in `_delete_cuda_graphs()` at the end + of training, but early exits must call it explicitly. + +8. **Older GPU architectures**: On GPUs with compute capability < 10.0 + (pre-Blackwell), set `NCCL_GRAPH_REGISTER=0` when using + `PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True`. Enforced in MCore + `CudaGraphManager` (cuda_graphs.py:1428) and `TECudaGraphHelper` + (cuda_graphs.py:1697). The TE impl asserts unconditionally regardless + of arch. + +9. **CPU offloading incompatible**: CUDA graphs cannot be used with CPU + offloading. Enforced in MCore `transformer_config.py:1907`. + +10. **MoE recompute + moe_router scope**: MoE recompute is not supported + with `moe_router` CUDA graph scope when using `cuda_graph_impl = + "transformer_engine"`. Enforced in MCore `transformer_config.py:1977`. 
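
Several of the constraints and pitfalls above are plain rules over `cuda_graph_impl` and `cuda_graph_scope`, so they can be checked before a job is launched. The sketch below restates a subset of them in plain Python. `check_cuda_graph_config` is a hypothetical helper, not a Bridge or MCore API; the real enforcement lives in the code anchors listed earlier and covers more cases (CPU offloading, `NCCL_GRAPH_REGISTER`, recompute, delayed wgrad).

```python
from typing import List


def check_cuda_graph_config(
    impl: str,
    scope: List[str],
    use_te_rng_tracker: bool,
    check_for_nan_in_loss: bool,
) -> List[str]:
    """Return human-readable constraint violations (empty list if none).

    Hypothetical helper: mirrors only the scope/impl rules from this skill.
    """
    problems: List[str] = []
    if impl == "none":
        return problems  # graphs disabled, nothing to validate

    if not use_te_rng_tracker:
        problems.append("use_te_rng_tracker must be True when CUDA graphs are enabled")

    if "full_iteration" in scope:
        if impl != "local":
            problems.append("full_iteration scope requires cuda_graph_impl='local'")
        if check_for_nan_in_loss:
            problems.append("full_iteration scope requires check_for_nan_in_loss=False")

    if "moe" in scope and "moe_router" in scope:
        problems.append("moe and moe_router scopes are mutually exclusive")

    if "moe_preprocess" in scope and "moe_router" not in scope:
        problems.append("moe_preprocess scope requires moe_router scope")

    return problems
```

For example, `check_cuda_graph_config("transformer_engine", ["moe_preprocess"], True, False)` reports the missing `moe_router` scope, while the MoE enablement snippet shown earlier passes with an empty list.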
+ +## Verification + +### Unit tests + +```bash +uv run python -m pytest \ + tests/unit_tests/training/test_config.py -k "cuda_graph" \ + tests/unit_tests/training/test_comm_overlap.py -k "cuda_graph" \ + tests/unit_tests/models/test_gpt_full_te_layer_autocast_spec.py -k "cuda_graph" -q +``` + +### Functional smoke test (requires GPU) + +```bash +uv run python -m pytest \ + tests/functional_tests/recipes/test_llama_recipes_pretrain_cuda_graphs.py -q +``` + +### Success criteria + +- Unit tests pass, covering config validation for both `local` and + `transformer_engine` implementations. +- Functional test completes training steps with both CUDA graph + implementations. +- No NCCL errors or illegal memory access in logs. diff --git a/skills/perf-techniques/cuda-graphs/card.yaml b/skills/perf-techniques/cuda-graphs/card.yaml new file mode 100644 index 0000000000..d2e0a5b476 --- /dev/null +++ b/skills/perf-techniques/cuda-graphs/card.yaml @@ -0,0 +1,74 @@ +title: cuda_graphs +validated_on: "2026-03-17" +summary: > + Megatron Bridge supports CUDA graph capture through two implementations: + local full-iteration graphs (MCore FullCudaGraphWrapper) and Transformer + Engine scoped graphs (TECudaGraphHelper) for fine-grained capture of + attention, MLP, and MoE router modules. Requires static tensor shapes + and TE RNG tracker. 
+validation_status: + config_validation: + - code_verified # config.py:1524-1531 + te_rng_tracker_requirement: + - code_verified # gpt_provider.py:213-217 + full_iteration_nan_check_constraint: + - code_verified # config.py:1525-1529 + te_scoped_graph_capture: + - code_verified # train.py:231-255, 338-350 + delayed_wgrad_cuda_graph_interaction: + - code_verified # comm_overlap.py:522-555 + moe_moe_router_mutual_exclusion: + - code_verified # MCore transformer_config.py:1928-1931 + cpu_offloading_incompatible: + - code_verified # MCore transformer_config.py:1907-1908 + nccl_graph_register_env: + - code_verified # MCore cuda_graphs.py:1428-1435, 1697-1703 + graph_cleanup: + - code_verified # train.py:1414-1443 + perf_harness_override: + - code_verified # overrides.py:102-124 + test_functions_exist: + - code_verified # grep confirmed test function names in test files + end_to_end_functional_smoke: + - unclear # test file exists but tests were not executed + memory_overhead_numbers: + - doc_only # from performance-guide.md prose, not measured +feature_meaning: + cuda_graph_impl: > + Which graph capture backend to use: local (MCore), transformer_engine (TE), or none. + cuda_graph_scope: > + Which modules to capture: full_iteration, attn, mlp, moe, moe_router, moe_preprocess, mamba. + cuda_graph_warmup_steps: > + Number of eager warmup steps before graph capture begins (default 3). +recommended_path: + model.cuda_graph_impl: transformer_engine_for_scoped_or_local_for_full + model.cuda_graph_scope: attn_plus_moe_router_moe_preprocess_for_moe_models + rng.te_rng_tracker: true + model.use_te_rng_tracker: true +known_constraints: + - use_te_rng_tracker must be True when cuda_graph_impl is not none (gpt_provider.py:213). + - full_iteration scope only works with cuda_graph_impl local (MCore transformer_config.py:1704). + - full_iteration requires check_for_nan_in_loss False (config.py:1525). 
+ - moe scope and moe_router scope are mutually exclusive (MCore transformer_config.py:1928). + - moe_preprocess scope requires moe_router scope (MCore transformer_config.py:1934). + - CPU offloading incompatible with CUDA graphs (MCore transformer_config.py:1907). + - With expandable_segments, NCCL_GRAPH_REGISTER=0 required (MCore cuda_graphs.py:1428, 1697). + - MoE recompute unsupported with moe_router scope + TE impl (MCore transformer_config.py:1977). +known_limitations: + - Most public recipes default to cuda_graph_impl none. + - Memory overhead of a few GB for static buffers; 10+ GB with PP > 1 (from docs, not measured). + - Tensor shapes must be static; variable-length sequences break graph replay (from docs). +evidence: + - docs/training/cuda-graphs.md + - docs/performance-guide.md + - src/megatron/bridge/training/config.py + - src/megatron/bridge/training/train.py + - src/megatron/bridge/models/gpt_provider.py + - src/megatron/bridge/training/initialize.py + - src/megatron/bridge/training/comm_overlap.py + - scripts/performance/utils/overrides.py + - tests/unit_tests/training/test_config.py + - tests/functional_tests/recipes/test_llama_recipes_pretrain_cuda_graphs.py +follow_up_validation: + - Benchmark throughput improvement for representative dense and MoE configs. + - Validate memory overhead numbers on multi-PP configs. diff --git a/skills/perf-techniques/expert-parallel-overlap/SKILL.md b/skills/perf-techniques/expert-parallel-overlap/SKILL.md new file mode 100644 index 0000000000..2b8ec55e8a --- /dev/null +++ b/skills/perf-techniques/expert-parallel-overlap/SKILL.md @@ -0,0 +1,111 @@ +--- +name: expert-parallel-overlap +description: Operational guide for enabling MoE expert-parallel communication overlap in Megatron-Bridge, including config knobs, code anchors, pitfalls, and verification. 
+--- + +# MoE Expert-Parallel Overlap Skill + +For stable background and recommendation level, see: + +- `docs/training/communication-overlap.md` + +## Enablement + +Minimal Bridge override with plain `alltoall`: + +```python +cfg.comm_overlap.overlap_moe_expert_parallel_comm = True +cfg.comm_overlap.delay_wgrad_compute = False + +cfg.model.expert_model_parallel_size = 8 +cfg.model.num_moe_experts = 64 +cfg.model.moe_token_dispatcher_type = "alltoall" +cfg.model.moe_shared_expert_overlap = False +cfg.model.bf16 = True +cfg.model.fp16 = False +``` + +Minimal Bridge override with DeepEP or HybridEP: + +```python +from megatron.bridge.training.flex_dispatcher_backend import apply_flex_dispatcher_backend + +cfg.comm_overlap.overlap_moe_expert_parallel_comm = True +cfg.comm_overlap.delay_wgrad_compute = True +cfg.model.moe_shared_expert_overlap = False + +apply_flex_dispatcher_backend(cfg.model, moe_flex_dispatcher_backend="deepep") +# or: apply_flex_dispatcher_backend(cfg.model, moe_flex_dispatcher_backend="hybridep") +``` + +Required constraints: + +- `expert_model_parallel_size > 1` +- `num_moe_experts > 1` +- `moe_token_dispatcher_type in {"alltoall", "flex"}` +- `moe_shared_expert_overlap = False` +- base precision is BF16 or FP16 +- PyTorch `>= 2.6.0` +- if `PP > 1`, set `virtual_pipeline_model_parallel_size` + +## Code Anchors + +Bridge overlap validation: + +```463:520:src/megatron/bridge/training/comm_overlap.py +if self.user_comm_overlap_cfg.overlap_moe_expert_parallel_comm is True: + assert model_cfg.expert_model_parallel_size > 1, ... + assert model_cfg.num_moe_experts > 1, ... + assert model_cfg.moe_token_dispatcher_type in ["alltoall", "flex"], ... + assert model_cfg.bf16 or model_cfg.fp16, ... + assert is_torch_min_version("2.6.0"), ... +... 
+assert ( + model_cfg.overlap_moe_expert_parallel_comm + or self.user_comm_overlap_cfg.overlap_moe_expert_parallel_comm +), "overlap_moe_expert_parallel_comm is required for delay_wgrad_compute" +``` + +Flex-dispatcher activation: + +```27:69:src/megatron/bridge/training/flex_dispatcher_backend.py +def apply_flex_dispatcher_backend(...): + ... + model_config.moe_token_dispatcher_type = "flex" + model_config.moe_flex_dispatcher_backend = moe_flex_dispatcher_backend + model_config.moe_shared_expert_overlap = False +``` + +Perf harness overlap enablement: + +```148:155:scripts/performance/utils/overrides.py +if moe_a2a_overlap: + recipe.comm_overlap.overlap_moe_expert_parallel_comm = True + recipe.comm_overlap.delay_wgrad_compute = True + recipe.model.moe_shared_expert_overlap = False +``` + +## Pitfalls + +1. `moe_flex_dispatcher_backend` is metadata unless the recipe also calls `apply_flex_dispatcher_backend(...)`. +2. `delay_wgrad_compute` is stricter than plain overlap and requires overlap first. +3. CUDA graph plus delayed wgrad needs extra TE and graph-scope constraints. +4. MoE overlap and shared-expert overlap are mutually exclusive. +5. If `PP > 1`, virtual pipeline parallelism is required for MoE overlap. +6. TP/CP overlap tuning can conflict with DeepEP or HybridEP launch tuning. 
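
The required constraints and pitfalls above can be summarized as a stand-alone check. This is a hedged sketch only: `MoEOverlapSettings` and `check_moe_overlap` are hypothetical names that flatten the relevant `comm_overlap` and `model` fields into one dataclass for illustration. The PyTorch version requirement and the recompute and MTP constraints are deliberately left out; the authoritative checks are the assertions in `comm_overlap.py`.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class MoEOverlapSettings:
    """Hypothetical flat view of the fields relevant to MoE overlap."""

    overlap_moe_expert_parallel_comm: bool = False
    delay_wgrad_compute: bool = False
    expert_model_parallel_size: int = 1
    num_moe_experts: int = 1
    moe_token_dispatcher_type: str = "alltoall"
    moe_shared_expert_overlap: bool = False
    bf16: bool = True
    fp16: bool = False
    pipeline_model_parallel_size: int = 1
    virtual_pipeline_model_parallel_size: Optional[int] = None


def check_moe_overlap(s: MoEOverlapSettings) -> List[str]:
    """Return constraint violations for the overlap path (empty if none)."""
    problems: List[str] = []
    if s.delay_wgrad_compute and not s.overlap_moe_expert_parallel_comm:
        problems.append("delay_wgrad_compute requires overlap_moe_expert_parallel_comm")
    if not s.overlap_moe_expert_parallel_comm:
        return problems  # remaining rules only apply when overlap is on

    if s.expert_model_parallel_size <= 1:
        problems.append("expert_model_parallel_size must be > 1")
    if s.num_moe_experts <= 1:
        problems.append("num_moe_experts must be > 1")
    if s.moe_token_dispatcher_type not in ("alltoall", "flex"):
        problems.append("moe_token_dispatcher_type must be 'alltoall' or 'flex'")
    if s.moe_shared_expert_overlap:
        problems.append("moe_shared_expert_overlap must be False")
    if not (s.bf16 or s.fp16):
        problems.append("base precision must be BF16 or FP16")
    if s.pipeline_model_parallel_size > 1 and s.virtual_pipeline_model_parallel_size is None:
        problems.append("PP > 1 requires virtual_pipeline_model_parallel_size")
    return problems
```

A settings object matching the plain `alltoall` enablement snippet (EP=8, 64 experts, BF16) yields no violations, whereas enabling `delay_wgrad_compute` without overlap is flagged on its own.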
+ +## Verification + +Run the existing unit coverage for Bridge MoE overlap validation and DeepEP or HybridEP helper logic: + +```bash +uv run python -m pytest \ + tests/unit_tests/training/test_comm_overlap.py \ + tests/unit_tests/training/test_deepep.py -q +``` + +Success criteria: + +- Pytest reports both targeted files passing with zero failures +- `test_comm_overlap.py` covers MoE overlap and delayed-wgrad validation +- `test_deepep.py` covers DeepEP or HybridEP helper activation and GPU gating diff --git a/skills/perf-techniques/expert-parallel-overlap/card.yaml b/skills/perf-techniques/expert-parallel-overlap/card.yaml new file mode 100644 index 0000000000..629cf93934 --- /dev/null +++ b/skills/perf-techniques/expert-parallel-overlap/card.yaml @@ -0,0 +1,53 @@ +title: moe_comm_overlap +validated_on: "2026-03-15" +summary: > + Megatron-Bridge supports MoE expert-parallel communication overlap through + overlap_moe_expert_parallel_comm, with optional delayed expert wgrad + scheduling, but the path depends on dispatcher choice, expert parallelism, + precision, and runtime support. +validation_status: + moe_overlap_validation: + - code_verified + flex_dispatcher_activation: + - code_verified + deepep_hybridep_helper_behavior: + - code_verified + end_to_end_recipe_smoke: + - unclear +feature_meaning: + moe_overlap: > + Overlap of expert-parallel token dispatch communication with expert compute. + delay_wgrad_compute: > + Delayed expert weight-gradient scheduling layered on top of MoE overlap. + flex_dispatcher: > + Dispatcher mode used for DeepEP or HybridEP style backends. +recommended_path: + comm_overlap.overlap_moe_expert_parallel_comm: true_for_moe_tuning + model.moe_shared_expert_overlap: false_when_overlap_is_enabled +known_constraints: + - expert_model_parallel_size must be greater than 1. + - num_moe_experts must be greater than 1. + - moe_token_dispatcher_type must be alltoall or flex. + - Precision must be BF16 or FP16. 
+ - moe_shared_expert_overlap must be false when overlap is enabled. + - PyTorch must be at least 2.6.0. + - If pipeline parallelism is used, virtual pipeline parallelism is required for the overlap path. + - recompute_granularity must not be 'full'. + - recompute_method must be None. + - recompute_num_layers must be None. + - mtp_num_layers must be None or 1. +known_limitations: + - Setting moe_flex_dispatcher_backend alone does not activate flex dispatch. + - Public recipes are often conservative and leave MoE overlap disabled by default. + - Repo evidence is stronger for validation logic than for end-to-end throughput gains. +evidence: + - docs/training/communication-overlap.md + - docs/parallelisms.md + - src/megatron/bridge/training/comm_overlap.py + - src/megatron/bridge/training/flex_dispatcher_backend.py + - src/megatron/bridge/training/config.py + - tests/unit_tests/training/test_comm_overlap.py + - tests/unit_tests/training/test_deepep.py +follow_up_validation: + - Add a positive Bridge functional smoke for overlap_moe_expert_parallel_comm. + - Add benchmark-backed guidance for at least one representative MoE family. diff --git a/skills/perf-techniques/hybrid-context-parallel/SKILL.md b/skills/perf-techniques/hybrid-context-parallel/SKILL.md new file mode 100644 index 0000000000..19013e8c7b --- /dev/null +++ b/skills/perf-techniques/hybrid-context-parallel/SKILL.md @@ -0,0 +1,154 @@ +--- +name: hybrid-context-parallel +description: Operational guide for enabling hierarchical context parallelism in Megatron-Bridge, including config knobs, code anchors, pitfalls, and verification. Use when the user asks about hierarchical_context_parallel_sizes, a2a+p2p, CP scaling beyond KV heads, or multi-level context parallelism. 
+--- + +# Hybrid / Hierarchical Context Parallel Skill + +For what HCP is, when to use it, and the decision tree (a2a+p2p vs pure a2a vs p2p), see: + +- `docs/training/hybrid-context-parallel.md` +- `card.yaml` (co-located) + +## Enablement + +Minimal Bridge override: + +```python +cfg.model.context_parallel_size = 4 +cfg.model.cp_comm_type = "a2a+p2p" +cfg.model.hierarchical_context_parallel_sizes = [2, 2] +cfg.dist.use_decentralized_pg = False +``` + +Required constraints: + +- `prod(hierarchical_context_parallel_sizes) == context_parallel_size` +- `seq_length % (2 * context_parallel_size) == 0` +- Transformer Engine `>= 1.12.0` + +## Code Anchors + +Upstream config and validation: + +```45:54:3rdparty/Megatron-LM/megatron/core/model_parallel_config.py +context_parallel_size: int = 1 +"""Splits network input along sequence dimension across GPU ranks.""" + +hierarchical_context_parallel_sizes: Optional[list[int]] = None +"""Degrees of the hierarchical context parallelism. Users should provide a list to specify + the sizes for different levels. Taking the a2a+p2p cp comm type as example, it contains + groups of two levels, so the first value of the list indicates the group size of the a2a + communication type, and the second value indicates the group size of the p2p communication + type. +""" +``` + +```428:433:3rdparty/Megatron-LM/megatron/training/arguments.py +if args.hierarchical_context_parallel_sizes: + from numpy import prod + assert args.context_parallel_size == prod(args.hierarchical_context_parallel_sizes) +if "a2a+p2p" in args.cp_comm_type: + assert args.hierarchical_context_parallel_sizes is not None, \ + "--hierarchical-context-parallel-sizes must be set when a2a+p2p is used in cp comm" +``` + +Bridge MPU path: + +```613:648:src/megatron/bridge/training/initialize.py +parallel_state.initialize_model_parallel( + ... 
+ context_parallel_size=model_config.context_parallel_size, + hierarchical_context_parallel_sizes=model_config.hierarchical_context_parallel_sizes, + ... +) +... +return ProcessGroupCollection.use_mpu_process_groups() +``` + +Bridge decentralized-PG path: + +```503:524:src/megatron/bridge/training/initialize.py +pg_collection = ProcessGroupCollection( + ... + cp=cp_pg, + tp_cp=tp_cp_pg, + hcp=None, + ep=ep_pg, + ... +) +``` + +## Implementation Map + +### Config definition + +`hierarchical_context_parallel_sizes` is declared in `ModelParallelConfig`: + +``` +# 3rdparty/Megatron-LM/megatron/core/model_parallel_config.py +hierarchical_context_parallel_sizes: Optional[list[int]] = None +# First value = a2a group size, second value = p2p group size. +# Product must equal context_parallel_size. +``` + +`cp_comm_type` is declared in `TransformerConfig`: + +``` +# 3rdparty/Megatron-LM/megatron/core/transformer/transformer_config.py +cp_comm_type: Optional[Union[str, List[str]]] = None +# Can be per-layer (List[str]) or uniform (str). +# Values: "p2p", "all_gather", "a2a", "a2a+p2p" +``` + +### Validation (MCore) + +`TransformerConfig.__post_init__` enforces that `a2a+p2p` requires HCP sizes and the product matches CP. + +### Process group creation + +`parallel_state.initialize_model_parallel` creates hierarchical CP sub-groups when HCP sizes are provided via `create_hierarchical_groups`. + +### TE integration + +`TEDotProductAttention` passes the hierarchical groups to Transformer Engine when `a2a+p2p` is used. Requires **Transformer Engine >= 1.12.0**. + +## Pitfalls + +1. **Different features**: `a2a+p2p` and upstream `hybrid_context_parallel=True` are different features. The latter is for balancing packed/variable-length workloads. +2. **Bridge HCP is MPU-only today**: If `use_decentralized_pg=True`, Bridge initializes flat CP groups and leaves HCP unset. +3. **No checked-in Bridge recipe** currently exercises HCP directly. +4. 
**Single-GPU load helpers** clear `hierarchical_context_parallel_sizes`. +5. **Silent broken training**: If you use `a2a+p2p` without setting `hierarchical_context_parallel_sizes`, MCore now asserts. Older versions would silently disable CP communication — each rank attended only to its local chunk, producing artificially high throughput but completely broken gradients. +6. **Product must match**: `prod(hierarchical_context_parallel_sizes)` must exactly equal `context_parallel_size`. A mismatch triggers an assertion. +7. **Verify in logs**: Look for the process group initialization output. You should see `HIERARCHICAL_CONTEXT_PARALLEL_GROUPS` being created. If you only see `CONTEXT_PARALLEL_GROUP`, HCP is not active. + +## Verification + +No dedicated Bridge end-to-end test exists yet for HCP (see `card.yaml` +`follow_up_validation`). Use the existing unit tests and log inspection instead. + +Run the decentralized-PG unit test to confirm the flat-CP behavior is preserved: + +```bash +uv run python -m pytest tests/unit_tests/training/test_decentralized_pg.py -q +``` + +For a manual smoke check, launch a 4-GPU run with a small recipe and +`cp_comm_type=a2a+p2p` plus `hierarchical_context_parallel_sizes=[2,2]`: + +```bash +CUDA_VISIBLE_DEVICES=0,1,2,3 uv run python -m torch.distributed.run --nproc_per_node=4 \ + scripts/training/run_recipe.py \ + --recipe llama32_1b_pretrain_config \ + model.context_parallel_size=4 \ + model.cp_comm_type=a2a+p2p \ + "model.hierarchical_context_parallel_sizes=[2,2]" \ + train.train_iters=2 +``` + +Success criteria: + +- Logs show `HIERARCHICAL_CONTEXT_PARALLEL_GROUPS` being created +- Training completes at least one step without error +- If you only see `CONTEXT_PARALLEL_GROUP`, HCP is not active diff --git a/skills/perf-techniques/hybrid-context-parallel/card.yaml b/skills/perf-techniques/hybrid-context-parallel/card.yaml new file mode 100644 index 0000000000..da882d12f8 --- /dev/null +++ 
b/skills/perf-techniques/hybrid-context-parallel/card.yaml @@ -0,0 +1,65 @@ +title: hybrid_context_parallel +validated_on: "2026-03-14" +summary: > + Megatron-Bridge currently supports hierarchical context parallelism + (`cp_comm_type="a2a+p2p"` plus `hierarchical_context_parallel_sizes`) only + through the MPU initialization path. The decentralized process-group path + remains flat and does not create hierarchical CP groups. +validation_status: + upstream_a2a_p2p_core: + - code_verified + bridge_mpu_passthrough: + - code_verified + bridge_mpu_runtime_groups: + - code_verified + bridge_decentralized_pg_hcp: + - code_verified + bridge_hcp_recipes_examples: + - unclear + bridge_hcp_docs: + - doc_only + bridge_hcp_end_to_end_training: + - unclear +feature_meaning: + a2a_p2p: > + Megatron-Core hierarchical context-parallel transport path used by + Transformer Engine attention and enabled by cp_comm_type="a2a+p2p". + hierarchical_context_parallel_sizes: > + Per-level subgroup sizes for hierarchical context parallelism. The product + must equal context_parallel_size. + hybrid_context_parallel_flag: > + Separate upstream feature for balancing packed or variable-length workloads. + It is not the same feature as a2a+p2p. +recommended_path: + model.context_parallel_size: 4 + model.cp_comm_type: a2a+p2p + model.hierarchical_context_parallel_sizes: + - 2 + - 2 + dist.use_decentralized_pg: false +known_constraints: + - Transformer Engine must be >= 1.12.0 for a2a+p2p. + - hierarchical_context_parallel_sizes must be set when cp_comm_type contains a2a+p2p. + - The product of hierarchical_context_parallel_sizes must equal context_parallel_size. + - seq_length must be divisible by 2 * context_parallel_size when CP > 1. + - Bridge HCP is MPU-path only today. +known_limitations: + - The decentralized-PG path initializes flat CP groups and leaves HCP unset. + - No checked-in Bridge recipe sets cp_comm_type=a2a+p2p. 
+ - No checked-in Bridge functional test runs an end-to-end HCP training step. + - Bridge docs do not currently call out the decentralized-PG limitation. +evidence: + - src/megatron/bridge/training/initialize.py + - src/megatron/bridge/training/config.py + - src/megatron/bridge/training/model_load_save.py + - docs/performance-guide.md + - tests/unit_tests/training/test_decentralized_pg.py + - 3rdparty/Megatron-LM/megatron/core/model_parallel_config.py + - 3rdparty/Megatron-LM/megatron/core/parallel_state.py + - 3rdparty/Megatron-LM/megatron/core/extensions/transformer_engine.py + - 3rdparty/Megatron-LM/megatron/training/arguments.py +follow_up_validation: + - Add a positive Bridge functional test that completes at least one HCP training step. + - Add Bridge-side validation that rejects HCP-looking config on use_decentralized_pg=true. + - Add a checked-in Bridge recipe or example for a2a+p2p. + - Validate model-family-specific HCP correctness beyond group initialization. diff --git a/skills/perf-techniques/megatron-fsdp/SKILL.md b/skills/perf-techniques/megatron-fsdp/SKILL.md new file mode 100644 index 0000000000..0be0e394fe --- /dev/null +++ b/skills/perf-techniques/megatron-fsdp/SKILL.md @@ -0,0 +1,122 @@ +--- +name: megatron-fsdp +description: Operational guide for enabling Megatron FSDP in Megatron-Bridge, including config knobs, code anchors, pitfalls, and verification. 
+--- + +# Megatron FSDP Skill + +For stable background and recommendation level, see: + +- `docs/training/megatron-fsdp.md` +- `card.yaml` (co-located) + +## Enablement + +Minimal Megatron FSDP override in Bridge: + +```python +cfg.dist.use_megatron_fsdp = True +cfg.ddp.use_megatron_fsdp = True +cfg.ddp.data_parallel_sharding_strategy = "optim_grads_params" +cfg.ddp.average_in_collective = False +cfg.checkpoint.ckpt_format = "fsdp_dtensor" +``` + +Example recipe fixup: + +```python +cfg = llama3_8b_pretrain_config() +cfg.dist.use_megatron_fsdp = True +cfg.ddp.use_megatron_fsdp = True +cfg.ddp.data_parallel_sharding_strategy = "optim_grads_params" +cfg.ddp.average_in_collective = False +cfg.checkpoint.ckpt_format = "fsdp_dtensor" +cfg.checkpoint.save = "/tmp/fsdp_ckpts" +cfg.checkpoint.load = None +``` + +Performance harness note: + +```bash +python scripts/performance/launch.py --use_megatron_fsdp true +``` + +## Code Anchors + +Bridge config definition: + +```148:154:src/megatron/bridge/training/config.py +use_megatron_fsdp: bool = False +"""Use Megatron's Fully Sharded Data Parallel. Cannot be used together with use_torch_fsdp2.""" + +use_torch_fsdp2: bool = False +"""Use the torch FSDP2 implementation. FSDP2 is not currently working with Pipeline Parallel. +It is still not in a stable release stage, and may therefore contain bugs or other +potential issues.""" +``` + +Bridge validation: + +```1533:1578:src/megatron/bridge/training/config.py +if self.dist.use_megatron_fsdp and self.dist.use_torch_fsdp2: + raise ValueError(...) +... +assert not self.dist.use_tp_pp_dp_mapping, "use_tp_pp_dp_mapping is not supported with Megatron FSDP" +... 
+assert self.checkpoint.ckpt_format == "fsdp_dtensor", ( + "Megatron FSDP only supports fsdp_dtensor checkpoint format" +) +``` + +Runtime wrapper selection: + +```217:243:src/megatron/bridge/models/common/unimodal.py +if use_megatron_fsdp: + DP = FullyShardedDataParallel +elif use_torch_fsdp2: + DP = TorchFullyShardedDataParallel +else: + DP = DistributedDataParallel +... +DP( + config=get_model_config(model_chunk), + ddp_config=ddp_config, + module=model_chunk, + ... + pg_collection=pg_collection, +) +``` + +Perf harness overrides: + +```74:98:scripts/performance/utils/overrides.py +recipe.ddp.use_megatron_fsdp = True +recipe.ddp.data_parallel_sharding_strategy = "optim_grads_params" +recipe.ddp.keep_fp8_transpose_cache = False +recipe.ddp.average_in_collective = False +... +recipe.checkpoint.load = None +``` + +## Pitfalls + +1. Public recipes often expose `use_megatron_fsdp` but still default to `ckpt_format="torch_dist"`. If save/load is enabled, switch to `fsdp_dtensor`. +2. `use_torch_fsdp2` exists, but on the validated branch Bridge still fails before training because `_ddp_wrap` passes `pg_collection`. +3. CPU offloading is only valid when `pipeline_model_parallel_size == 1` and activation recomputation is disabled. +4. Upstream warns that FSDP and TP/CP can want different `CUDA_DEVICE_MAX_CONNECTIONS` settings on Hopper and earlier. +5. Megatron FSDP and FSDP2 are mutually exclusive. 
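
The pitfalls above can be condensed into a small pre-flight check. This is a minimal sketch, not Bridge's actual validation code: `FsdpSettings` and `check_fsdp_settings` are hypothetical names introduced for illustration, `checkpointing_enabled` is an assumed flag, and the remaining fields only mirror the config knobs quoted in this skill.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class FsdpSettings:
    """Hypothetical stand-in for the Bridge config fields quoted in this skill."""

    use_megatron_fsdp: bool = False
    use_torch_fsdp2: bool = False
    use_tp_pp_dp_mapping: bool = False
    ckpt_format: str = "torch_dist"
    checkpointing_enabled: bool = True  # assumed flag: save or load is configured


def check_fsdp_settings(s: FsdpSettings) -> list[str]:
    """Return a list of problems; an empty list means the combination looks consistent."""
    problems: list[str] = []
    # The two FSDP implementations cannot be enabled together.
    if s.use_megatron_fsdp and s.use_torch_fsdp2:
        problems.append("use_megatron_fsdp and use_torch_fsdp2 are mutually exclusive")
    # Megatron FSDP does not support the TP-PP-DP rank mapping.
    if s.use_megatron_fsdp and s.use_tp_pp_dp_mapping:
        problems.append("use_tp_pp_dp_mapping is not supported with Megatron FSDP")
    # Recipes often default to torch_dist, which Megatron FSDP rejects for save/load.
    if s.use_megatron_fsdp and s.checkpointing_enabled and s.ckpt_format != "fsdp_dtensor":
        problems.append("Megatron FSDP save/load requires ckpt_format='fsdp_dtensor'")
    return problems


if __name__ == "__main__":
    # A recipe that enables Megatron FSDP but keeps the default checkpoint format.
    for problem in check_fsdp_settings(FsdpSettings(use_megatron_fsdp=True)):
        print(problem)
```

Returning a list instead of raising on the first failure lets a recipe author see every conflicting setting in one pass; the real Bridge validation raises or asserts instead.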
+ +## Verification + +Use the existing 2-GPU functional smoke test: + +```bash +CUDA_VISIBLE_DEVICES=0,1 uv run python -m torch.distributed.run --nproc_per_node=2 \ + -m pytest tests/functional_tests/training/test_megatron_fsdp.py::TestMegatronFSDP::test_fsdp_pretrain_basic -v -s +``` + +Success criteria: + +- Pytest reports `1 passed` +- The log shows finite loss at the last iteration +- The run finishes without a checkpoint format assertion diff --git a/skills/perf-techniques/megatron-fsdp/card.yaml b/skills/perf-techniques/megatron-fsdp/card.yaml new file mode 100644 index 0000000000..2638ed1b04 --- /dev/null +++ b/skills/perf-techniques/megatron-fsdp/card.yaml @@ -0,0 +1,50 @@ +title: megatron_fsdp +validated_on: "2026-03-14" +summary: > + Megatron FSDP is the practical FSDP path in Megatron-Bridge today. PyTorch + FSDP2 exists in code, but remains experimental and failed at runtime during + live validation on the current branch. +validation_status: + megatron_fsdp_core: + - code_verified + megatron_fsdp_runtime_smoke: + - code_verified + megatron_fsdp_recipe_defaults: + - unclear + megatron_fsdp_performance_claims: + - doc_only + torch_fsdp2_runtime: + - known_failure +feature_meaning: + megatron_fsdp: > + Megatron-Core custom FSDP path enabled through use_megatron_fsdp and + checkpointed through fsdp_dtensor. + torch_fsdp2: > + Megatron-Core wrapper around PyTorch fully_shard enabled through + use_torch_fsdp2. +recommended_path: + dist.use_megatron_fsdp: true + ddp.use_megatron_fsdp: true + ddp.data_parallel_sharding_strategy: optim_grads_params + checkpoint.ckpt_format: fsdp_dtensor +known_constraints: + - Megatron FSDP and Torch FSDP2 are mutually exclusive. + - Megatron FSDP save/load requires fsdp_dtensor. + - Megatron FSDP does not support use_tp_pp_dp_mapping. + - FSDP2 is upstream-blocked with PP, EP, distributed optimizer, and FP16. + - CPU offloading does not support PP>1 or activation recomputation. 
+known_limitations: + - Public recipes often expose use_megatron_fsdp but still default to torch_dist checkpoints. + - Bridge does not expose torch_dcp or the FSDP2 reshard_after_forward knob. + - Live validation of the current branch hit a Torch FSDP2 runtime TypeError from pg_collection. +evidence: + - src/megatron/bridge/training/config.py + - src/megatron/bridge/models/common/unimodal.py + - src/megatron/bridge/training/checkpointing.py + - tests/functional_tests/training/test_megatron_fsdp.py + - 3rdparty/Megatron-LM/megatron/training/arguments.py +follow_up_validation: + - Add a positive Bridge functional test for FSDP2. + - Fix recipe defaults to switch to fsdp_dtensor when Megatron FSDP is enabled. + - Benchmark DDP vs distributed optimizer vs Megatron FSDP. + - Validate TP / PP / CP / EP compatibility matrix explicitly. diff --git a/skills/perf-techniques/moe-comm-overlap/SKILL.md b/skills/perf-techniques/moe-comm-overlap/SKILL.md new file mode 100644 index 0000000000..e3f46ccaa9 --- /dev/null +++ b/skills/perf-techniques/moe-comm-overlap/SKILL.md @@ -0,0 +1,65 @@ +--- +name: moe-comm-overlap +description: MoE expert-parallel communication overlap in Megatron Bridge. Use when the user asks about overlap_moe_expert_parallel_comm, MoE dispatch overlap, flex dispatcher, DeepEP overlap, or expert wgrad scheduling. 
+--- + +# MoE Communication Overlap + +For what MoE communication overlap is and when to use it, see: + +- `docs/training/communication-overlap.md` +- `card.yaml` (co-located) + +## Enablement + +```python +cfg.comm_overlap.overlap_moe_expert_parallel_comm = True + +# Optional: delayed wgrad for additional overlap +cfg.model.moe_delay_wgrad_compute = True + +# IMPORTANT: disable shared expert overlap when using dispatch overlap +cfg.model.moe_shared_expert_overlap = False +``` + +### Prerequisites + +- `expert_model_parallel_size > 1` +- `num_moe_experts > 1` +- `moe_token_dispatcher_type` must be `"alltoall"` or `"flex"` +- Precision: BF16 or FP16 +- If PP is used, VPP (`virtual_pipeline_model_parallel_size > 1`) is required + +### Flex dispatcher activation + +Setting `moe_flex_dispatcher_backend` alone does **not** activate flex dispatch. +You must also set `moe_token_dispatcher_type = "flex"`. + +## Code Anchors + +- Overlap validation: `src/megatron/bridge/training/comm_overlap.py` +- Flex dispatcher backend: `src/megatron/bridge/training/flex_dispatcher_backend.py` +- Config: `src/megatron/bridge/training/config.py` +- Unit tests: `tests/unit_tests/training/test_comm_overlap.py` +- DeepEP tests: `tests/unit_tests/training/test_deepep.py` + +## Pitfalls + +1. **Shared expert overlap conflict**: `moe_shared_expert_overlap` and + `overlap_moe_expert_parallel_comm` can conflict. Disable shared expert + overlap when using the dispatch overlap path. + +2. **PP without VPP**: MoE overlap requires VPP when pipeline parallelism is + active. Without it, the overlap scheduling cannot interleave correctly. + +3. **Flex != backend flag**: `moe_flex_dispatcher_backend="deepep"` alone + does nothing if `moe_token_dispatcher_type` is still `"alltoall"`. + +4. **Conservative recipe defaults**: Most public recipes leave MoE overlap + disabled. You need to explicitly enable it via overrides. + +## Verification + +Look for overlap-related log messages during initialization. 
The comm overlap +validation in `comm_overlap.py` will raise if prerequisites are not met, so a +clean startup confirms the feature is active. diff --git a/skills/perf-techniques/moe-comm-overlap/card.yaml b/skills/perf-techniques/moe-comm-overlap/card.yaml new file mode 100644 index 0000000000..a70e26ce6f --- /dev/null +++ b/skills/perf-techniques/moe-comm-overlap/card.yaml @@ -0,0 +1,47 @@ +title: moe_comm_overlap +validated_on: "2026-03-15" +summary: > + Megatron-Bridge supports MoE expert-parallel communication overlap through + overlap_moe_expert_parallel_comm, with optional delayed expert wgrad + scheduling, but the path depends on dispatcher choice, expert parallelism, + precision, and runtime support. +validation_status: + moe_overlap_validation: + - code_verified + flex_dispatcher_activation: + - code_verified + deepep_hybridep_helper_behavior: + - code_verified + end_to_end_recipe_smoke: + - unclear +feature_meaning: + moe_overlap: > + Overlap of expert-parallel token dispatch communication with expert compute. + delay_wgrad_compute: > + Delayed expert weight-gradient scheduling layered on top of MoE overlap. + flex_dispatcher: > + Dispatcher mode used for DeepEP or HybridEP style backends. +recommended_path: + comm_overlap.overlap_moe_expert_parallel_comm: true_for_moe_tuning + model.moe_shared_expert_overlap: false_when_overlap_is_enabled +known_constraints: + - expert_model_parallel_size must be greater than 1. + - num_moe_experts must be greater than 1. + - moe_token_dispatcher_type must be alltoall or flex. + - Precision must be BF16 or FP16. + - If pipeline parallelism is used, virtual pipeline parallelism is required for the overlap path. +known_limitations: + - Setting moe_flex_dispatcher_backend alone does not activate flex dispatch. + - Public recipes are often conservative and leave MoE overlap disabled by default. + - Repo evidence is stronger for validation logic than for end-to-end throughput gains. 
+evidence: + - docs/training/communication-overlap.md + - docs/parallelisms.md + - src/megatron/bridge/training/comm_overlap.py + - src/megatron/bridge/training/flex_dispatcher_backend.py + - src/megatron/bridge/training/config.py + - tests/unit_tests/training/test_comm_overlap.py + - tests/unit_tests/training/test_deepep.py +follow_up_validation: + - Add a positive Bridge functional smoke for overlap_moe_expert_parallel_comm. + - Add benchmark-backed guidance for at least one representative MoE family. diff --git a/skills/perf-techniques/packed-sequences-long-context/SKILL.md b/skills/perf-techniques/packed-sequences-long-context/SKILL.md new file mode 100644 index 0000000000..05e5204a14 --- /dev/null +++ b/skills/perf-techniques/packed-sequences-long-context/SKILL.md @@ -0,0 +1,77 @@ +--- +name: packed-sequences-long-context +description: Sequence packing and long-context training in Megatron Bridge. Use when the user asks about packed sequences, sequence packing, long context training, PackedSequenceSpecs, pack_sequences_in_batch, or CP with packing. 
+--- + +# Packed Sequences & Long-Context Training + +For what packed sequences are, the three packing paths, and when to use them, see: + +- `docs/training/packed-sequences.md` +- `card.yaml` (co-located) + +## Enablement + +### Offline packed SFT + +```python +cfg.train.micro_batch_size = 1 +cfg.dataset.dataset_kwargs.pad_to_max_length = True +cfg.dataset.packed_sequence_specs.packed_sequence_size = 8192 # match seq_length +``` + +### VLM in-batch packing + +```python +cfg.dataset.pack_sequences_in_batch = True +cfg.train.micro_batch_size = 4 # must be > 1 +``` + +### CP + packing (finetuning) + +```python +cfg.model.context_parallel_size = 4 +cfg.model.calculate_per_token_loss = True +cfg.ddp.average_in_collective = False +cfg.dataset.packed_sequence_specs.pad_seq_to_mult = 2 * 4 # 2 * CP + +# If sequence_parallel is also enabled, pad_seq_to_mult must include TP: +# cfg.dataset.packed_sequence_specs.pad_seq_to_mult = 2 * CP * TP +``` + +## Code Anchors + +- Packed sequence dataset: `src/megatron/bridge/data/datasets/packed_sequence.py` +- SFT dataset: `src/megatron/bridge/data/datasets/sft.py` +- Packed seq utils: `src/megatron/bridge/training/utils/packed_seq_utils.py` +- GPT step (packing logic): `src/megatron/bridge/training/gpt_step.py` +- VLM step (packing logic): `src/megatron/bridge/training/vlm_step.py` +- Finetune utils: `src/megatron/bridge/recipes/utils/finetune_utils.py` +- Functional test: `tests/functional_tests/training/test_seqpacking_cp_example.py` + +## Pitfalls + +1. **MBS constraint**: Offline packed SFT requires `micro_batch_size == 1`. + VLM in-batch packing requires `micro_batch_size > 1`. Mixing these up + produces silent data corruption. + +2. **CP divisibility**: `seq_length` must be divisible by `2 * context_parallel_size`. + When sequence parallelism (SP) is also enabled, the divisor becomes + `2 * CP * TP`. Violations cause assertion errors during initialization. + +3. 
**Per-token loss with CP**: Finetuning with `CP > 1` requires + `calculate_per_token_loss=True` and `average_in_collective=False`. + Without these, loss scaling is wrong across CP ranks. + +4. **MTP incompatibility**: Sequence packing for finetuning is documented as + unsupported with multi-token prediction. + +5. **Model-family opt-outs**: Several model families explicitly disable packing: + Qwen3-Next SFT, GLM-4.5 SFT/PEFT, Qwen3.5-VL. Check model-specific + recipes before assuming packing is available. + +## Verification + +For offline packed SFT, verify that `cu_seqlens` and `seq_offsets` are +present in the batch dict during the forward pass. For CP + packing, look for +the `pad_seq_to_mult` validation message during config setup. diff --git a/skills/perf-techniques/packed-sequences-long-context/card.yaml b/skills/perf-techniques/packed-sequences-long-context/card.yaml new file mode 100644 index 0000000000..47ac32c583 --- /dev/null +++ b/skills/perf-techniques/packed-sequences-long-context/card.yaml @@ -0,0 +1,93 @@ +title: packed_sequences_long_context +validated_on: "2026-03-15" +summary: > + Megatron-Bridge currently supports two distinct packing paths: offline packed + SFT for text-only finetuning and in-batch packing for some VLM finetuning + paths. Long-context training is primarily expressed through context + parallelism, long-context Llama recipes, and memory tradeoff knobs like + recompute and CPU offloading. 
+validation_status: + offline_packed_sft_runtime: + - code_verified + vlm_in_batch_packing_runtime: + - code_verified + cp_and_packing_validation_rules: + - code_verified + packed_seq_helper_behavior: + - code_verified + vlm_packing_helper_behavior: + - code_verified + packed_cp_functional_smoke_in_tree: + - recipe_verified + long_context_llama_recipe_coverage: + - recipe_verified + public_cp_backend_guidance: + - doc_only + long_context_perf_claims: + - unclear +feature_meaning: + offline_packed_sft: > + Pre-tokenized packed finetuning datasets built through PackedSequenceSpecs + and consumed through THD packed-sequence metadata. + vlm_in_batch_packing: > + Runtime batch concatenation path for some VLM training flows controlled by + pack_sequences_in_batch=True. + long_context_training: > + Training at longer sequence lengths, primarily enabled through context + parallelism plus recipe-specific long-context presets and memory tuning + knobs. +recommended_path: + llm_packed_sft: + train.micro_batch_size: 1 + dataset.dataset_kwargs.pad_to_max_length: true + dataset.packed_sequence_specs.packed_sequence_size: match_seq_length + cp_finetuning: + model.calculate_per_token_loss: true + ddp.average_in_collective: false + dataset.packed_sequence_specs.pad_seq_to_mult: 2 * context_parallel_size + vlm_in_batch_packing: + dataset.pack_sequences_in_batch: true + train.micro_batch_size: ">1" +known_constraints: + - seq_length must be divisible by 2 * context_parallel_size when CP > 1. + - Offline packed SFT requires micro_batch_size == 1. + - VLM in-batch packing requires micro_batch_size > 1. + - For finetuning with CP > 1, calculate_per_token_loss must be true. + - For finetuning with CP > 1, ddp.average_in_collective must be false. + - pad_cu_seqlens=true also requires pad_to_max_length=true. + - Fine-tuning sequence packing is documented as unsupported with MTP. +known_limitations: + - Packing support is model-family-specific rather than universal. 
+ - Qwen3-Next SFT disables packed sequences. + - GLM-4.5 SFT and PEFT disable packed sequences. + - Qwen3.5-VL disables pack_sequences_in_batch. + - The repo does not contain checked-in benchmark results validating long-context throughput claims. +evidence: + - docs/training/packed-sequences.md + - docs/performance-guide.md + - docs/training/multi-token-prediction.md + - docs/models/llm/llama3.md + - docs/training/hybrid-context-parallel.md + - src/megatron/bridge/data/datasets/packed_sequence.py + - src/megatron/bridge/data/datasets/sft.py + - src/megatron/bridge/training/utils/packed_seq_utils.py + - src/megatron/bridge/training/gpt_step.py + - src/megatron/bridge/training/vlm_step.py + - src/megatron/bridge/training/config.py + - src/megatron/bridge/recipes/utils/finetune_utils.py + - src/megatron/bridge/recipes/common.py + - src/megatron/bridge/recipes/llama/llama3.py + - src/megatron/bridge/recipes/qwen/qwen3_next.py + - src/megatron/bridge/recipes/glm/glm45.py + - src/megatron/bridge/recipes/qwen_vl/qwen35_vl.py + - src/megatron/bridge/models/qwen_vl/modelling_qwen3_vl/model.py + - tests/functional_tests/training/test_seqpacking_cp_example.py + - tests/unit_tests/training/utils/test_packed_seq_utils.py + - tests/unit_tests/training/test_config.py + - tests/unit_tests/training/test_vlm_step.py + - scripts/performance/utils/overrides.py +follow_up_validation: + - Run the checked-in packed-plus-CP functional test and record whether it still passes on current infrastructure. + - Add a tiny no-download end-to-end smoke test for offline packed SFT. + - Add a checked-in long-context training smoke for at least one 16K or 64K recipe. + - Cross-link public packing docs to model-family opt-outs and the MTP incompatibility note. 
diff --git a/skills/perf-techniques/parallelism-strategies/SKILL.md b/skills/perf-techniques/parallelism-strategies/SKILL.md new file mode 100644 index 0000000000..427a736d3d --- /dev/null +++ b/skills/perf-techniques/parallelism-strategies/SKILL.md @@ -0,0 +1,233 @@ +--- +name: parallelism-strategies +description: Operational guide for choosing and combining parallelism strategies in Megatron Bridge, including sizing rules, hardware topology mapping, and combined parallelism configuration. +--- + +# Parallelism Strategy Selection Skill + +For stable background on each parallelism type, see: + +- `docs/parallelisms.md` +- `card.yaml` (co-located) + +## Decision by Model Size + +### Dense models + +| Model size | GPUs | Recommended starting point | +|---|---|---| +| < 1B | 1-8 | DP only | +| 1-10B | 8-16 | TP=2-4 + DP | +| 10-70B | 16-64 | TP=4-8 + PP=2-4 + DP | +| 70-175B | 64-256 | TP=8 + PP=4-8 + DP | +| 175-500B | 256-1024 | TP=8 + PP=8-16 + CP=2 + DP | + +### MoE models + +MoE parallelism differs from dense models. Because only a fraction of +parameters are active per token, TP can often stay at 1 or 2 — the active +parameter shard already fits on a single GPU. EP is the primary scaling +dimension, with PP handling cross-node layer distribution. + +| Model (total / active) | TP | PP | EP | Notes | +|---|---|---|---|---| +| OLMoE 7B / 1B | 1 | 1 | 8 | EP only, fits single node | +| Moonlight 16B / 3B | 2 | 1 | 8 | small TP for shared layers | +| DeepSeek-V2 236B / 21B | 1 | 4 | 32 | no TP at all | +| GLM-4.5 Air 106B / 12B | 1 | 4 | 8 | no TP at all | +| Qwen3 30B-A3B | 4 | 2 | 4 | | +| GLM-4.5 355B / 32B | 2 | 8 | 16 | | +| Qwen3 235B-A22B | 4 | 16 | 8 | CP=2 for pretrain | +| DeepSeek-V3 671B / 37B | 2 | 16 | 64 | TP=2, not 8 | +| Kimi-K2 1T | 2 | 16 | 32 | | + +Key patterns: + +- TP is sized by **active** params, not total params. A 671B MoE with + 37B active needs far less TP than a 70B dense model. +- EP scales with expert count. 
Common: EP = num_experts or + num_experts / experts_per_gpu. +- PP handles depth. Large MoE models use PP=8-16 across nodes. +- ETP (expert tensor parallelism) is rarely used. Llama 4 is an + exception (ETP=4). + +These are starting points, not hard rules. Always profile the first +iteration to verify memory and communication. + +## Decision by Hardware Topology + +Single node with NVLink: + +```python +cfg.model.tensor_model_parallel_size = 8 +``` + +Multiple nodes with InfiniBand: + +```python +cfg.model.tensor_model_parallel_size = 8 +cfg.model.pipeline_model_parallel_size = N +``` + +Limited network (Ethernet): + +```python +cfg.model.tensor_model_parallel_size = 4 +cfg.model.pipeline_model_parallel_size = M +``` + +The stable rule is: keep TP within a single NVLink domain. Use PP or DP +for cross-node scaling. TP across nodes is almost always a performance +loss. + +## Decision by Sequence Length + +| Sequence length | Recommendation | +|---|---| +| < 2K | standard TP + PP + DP | +| 2K-8K | add SP (`sequence_parallel=True`) | +| 8K-32K | add CP=2 | +| 32K+ | add CP=4-8, consider `a2a+p2p` for large CP | + +## Combined Parallelism Enablement + +3D parallelism (TP + PP + DP): + +```python +cfg.model.tensor_model_parallel_size = 4 +cfg.model.pipeline_model_parallel_size = 4 +cfg.model.sequence_parallel = True +``` + +4D parallelism (TP + PP + CP + DP): + +```python +cfg.model.tensor_model_parallel_size = 8 +cfg.model.pipeline_model_parallel_size = 8 +cfg.model.context_parallel_size = 2 +cfg.model.sequence_parallel = True +``` + +MoE with EP + PP (e.g. DeepSeek-V2 236B on 128 GPUs): + +```python +cfg.model.tensor_model_parallel_size = 1 +cfg.model.pipeline_model_parallel_size = 4 +cfg.model.expert_model_parallel_size = 32 +cfg.model.sequence_parallel = False +``` + +MoE with small TP + PP + EP (e.g. 
 DeepSeek-V3 671B on 1024 GPUs, since PP=16 × EP=64 already requires 1024 ranks): + +```python +cfg.model.tensor_model_parallel_size = 2 +cfg.model.pipeline_model_parallel_size = 16 +cfg.model.expert_model_parallel_size = 64 +cfg.model.sequence_parallel = True +``` + +DP size is always implicit: + +``` +data_parallel_size = world_size / (TP * PP * CP) +``` + +## Memory Estimation + +Without parallelism (70B model, FP16): + +``` +parameters: 140 GB +gradients: 140 GB +optimizer states: 280 GB (Adam) +activations: 48 GB (batch=1, seq=4K) +total: 608 GB +``` + +With TP=4, PP=4, DP=4 (64 GPUs): + +``` +parameters: 8.75 GB per GPU +gradients: 8.75 GB per GPU +optimizer states: 17.50 GB per GPU +activations: 3.00 GB per GPU +total: ~38 GB per GPU +``` + +## Code Anchors + +Parallelism dimensions set in model provider: + +```66:81:docs/parallelisms.md +model_config = GPTModelProvider( + tensor_model_parallel_size=2, + # ... other model parameters +) +``` + +DP size calculation: + +```424:436:docs/parallelisms.md +data_parallel_size = world_size / (tensor_model_parallel_size × pipeline_model_parallel_size × context_parallel_size) +``` + +Bridge initialization wires parallelism into process groups: + +```618:628:src/megatron/bridge/training/initialize.py +parallel_state.initialize_model_parallel( + tensor_model_parallel_size=model_config.tensor_model_parallel_size, + pipeline_model_parallel_size=model_config.pipeline_model_parallel_size, + ... + context_parallel_size=model_config.context_parallel_size, + hierarchical_context_parallel_sizes=model_config.hierarchical_context_parallel_sizes, + expert_model_parallel_size=model_config.expert_model_parallel_size, + ... +) +``` + +## Pitfalls + +1. TP across nodes destroys throughput. Always keep TP within a single + NVLink domain. + +2. PP without interleaving has large pipeline bubbles. Use + `virtual_pipeline_model_parallel_size` when possible. + +3. SP requires `tensor_model_parallel_size > 1`. Enabling SP alone + without TP is a config error. + +4.
CP requires `seq_length % (2 * context_parallel_size) == 0`. + +5. EP is only for MoE models. Setting `expert_model_parallel_size` on a + dense model is a no-op or error. + +6. The model-size-to-parallelism table above is a starting heuristic. + Always profile the first iteration to check memory and communication. + +7. `CUDA_DEVICE_MAX_CONNECTIONS` and related env vars interact with + overlap settings. See `skills/perf-techniques/tp-dp-comm-overlap/SKILL.md`. + +## Verification + +Quick sanity check that combined parallelism initializes correctly using +the smallest available recipe with overridden parallelism: + +```bash +CUDA_VISIBLE_DEVICES=0,1,2,3 uv run python -m torch.distributed.run --nproc_per_node=4 \ + scripts/training/run_recipe.py \ + --recipe llama32_1b_pretrain_config \ + model.tensor_model_parallel_size=2 \ + model.pipeline_model_parallel_size=2 \ + model.sequence_parallel=True \ + train.train_iters=3 train.global_batch_size=8 train.micro_batch_size=1 \ + scheduler.lr_warmup_iters=0 \ + validation.eval_iters=0 validation.eval_interval=0 \ + checkpoint.save_interval=0 \ + logger.log_interval=1 +``` + +Success criteria: + +- exit code 0 +- finite loss at iteration 3 (e.g. `lm loss: 1.003808E+01`) +- log shows TP=2 PP=2 DP=1 layout with 4 ranks diff --git a/skills/perf-techniques/parallelism-strategies/card.yaml b/skills/perf-techniques/parallelism-strategies/card.yaml new file mode 100644 index 0000000000..7803124967 --- /dev/null +++ b/skills/perf-techniques/parallelism-strategies/card.yaml @@ -0,0 +1,72 @@ +title: parallelism_strategies +validated_on: "2026-03-15" +summary: > + Megatron Bridge supports DP, TP, PP, SP, CP, and EP parallelism strategies + which can be combined for models from sub-1B to 500B+ parameters. The right + combination depends on model size, hardware topology, and sequence length. 
+validation_status: + dp_ddp_distributed_optimizer: + - code_verified + tp_config_and_runtime: + - code_verified + pp_interleaved_schedule: + - code_verified + sp_activation_partitioning: + - code_verified + cp_context_parallel: + - code_verified + ep_expert_parallel: + - code_verified + combined_parallelism_init: + - code_verified + sizing_heuristics: + - doc_only +feature_meaning: + data_parallel: > + Replicate model across GPUs, split data batches, synchronize gradients. + tensor_parallel: > + Split individual layer tensors across GPUs within a node. + pipeline_parallel: > + Assign consecutive layer groups to different GPUs, process microbatches + in a pipeline. + sequence_parallel: > + Partition activations along the sequence dimension within TP groups to + reduce activation memory. + context_parallel: > + Split long sequences across GPUs using ring attention or similar + communication patterns. + expert_parallel: > + Distribute MoE experts across GPUs, only applies to expert layers. +recommended_path: + dense_under_1b: DP only + dense_1b_to_10b: TP=2-4 + DP + dense_10b_to_70b: TP=4-8 + PP=2-4 + DP + dense_70b_to_175b: TP=8 + PP=4-8 + DP + dense_175b_plus: TP=8 + PP=8-16 + CP=2 + DP + moe_under_20b: EP only (TP=1, PP=1) + moe_20b_to_100b: TP=1-2 + PP=2-4 + EP=8-16 + moe_100b_to_500b: TP=2-4 + PP=8-16 + EP=8-32 + moe_500b_plus: TP=2 + PP=16 + EP=32-64 +known_constraints: + - TP should stay within a single NVLink domain for performance. + - SP requires tensor_model_parallel_size > 1. + - CP requires seq_length divisible by 2 * context_parallel_size. + - EP requires num_moe_experts > 0 and expert_model_parallel_size divides num_moe_experts. + - PP interleaved schedule requires virtual_pipeline_model_parallel_size > 1. + - Total parallelism dimensions must divide evenly into world_size. +known_limitations: + - Model-size-to-parallelism mapping is a heuristic, not a benchmark-proven table. 
+ - Not every parallelism combination has the same level of in-repo functional test coverage. + - Memory estimates assume standard Adam optimizer and FP16/BF16 parameters. +evidence: + - docs/parallelisms.md + - docs/performance-guide.md + - docs/training/communication-overlap.md + - docs/training/hybrid-context-parallel.md + - src/megatron/bridge/training/initialize.py + - src/megatron/bridge/training/config.py + - src/megatron/bridge/models/common/unimodal.py +follow_up_validation: + - Add a checked-in combined parallelism functional smoke for TP+PP+CP. + - Add benchmark-backed sizing guidance for at least one model family. + - Add explicit EP+TP+PP functional smoke for MoE models. diff --git a/skills/perf-techniques/sequence-packing/SKILL.md b/skills/perf-techniques/sequence-packing/SKILL.md new file mode 100644 index 0000000000..5f9877b961 --- /dev/null +++ b/skills/perf-techniques/sequence-packing/SKILL.md @@ -0,0 +1,137 @@ +--- +name: sequence-packing +description: Operational guide for enabling packed sequences and long-context config paths in Megatron-Bridge, including config knobs, code anchors, pitfalls, and verification. 
+--- + +# Sequence Packing Skill + +For stable background and recommendation level, see: + +- `docs/training/packed-sequences.md` +- `card.yaml` (co-located) + +## Enablement + +Offline packed SFT for LLM finetuning: + +```python +from megatron.bridge.data.datasets.packed_sequence import PackedSequenceSpecs + +cfg.train.micro_batch_size = 1 +cfg.dataset.seq_length = 4096 +cfg.model.seq_length = 4096 +cfg.dataset.dataset_kwargs = {"pad_to_max_length": True} +cfg.dataset.packed_sequence_specs = PackedSequenceSpecs( + packed_sequence_size=4096, + pad_seq_to_mult=1, +) +``` + +If CP is enabled: + +```python +cfg.model.context_parallel_size = 2 +cfg.model.calculate_per_token_loss = True +cfg.ddp.average_in_collective = False +cfg.dataset.packed_sequence_specs.pad_seq_to_mult = cfg.model.context_parallel_size * 2 +``` + +If CUDA graphs are enabled for this packed path: + +```python +cfg.dataset.packed_sequence_specs.pad_cu_seqlens = True +cfg.dataset.dataset_kwargs["pad_to_max_length"] = True +``` + +**Note:** `pad_cu_seqlens = True` also requires a metadata JSON file alongside +the packed dataset (asserted in `src/megatron/bridge/data/datasets/sft.py`). +Custom packed datasets that omit the metadata file will hit an assertion at +dataset initialization. 
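The offline packed SFT rules shown so far can be sketched as a small pre-flight check. This is a minimal sketch, not Bridge's own validation: `PackedSftSettings` and `check_packed_sft` are hypothetical names that flatten the `cfg.*` fields above into one dataclass. The `pad_seq_to_mult` rule is an assumption that generalizes the `context_parallel_size * 2` setting to any multiple of it.

```python
from dataclasses import dataclass


@dataclass
class PackedSftSettings:
    """Hypothetical stand-in for the cfg fields used above; not a Bridge class."""

    seq_length: int
    micro_batch_size: int
    context_parallel_size: int = 1
    pad_seq_to_mult: int = 1
    calculate_per_token_loss: bool = False
    average_in_collective: bool = False
    pad_cu_seqlens: bool = False
    pad_to_max_length: bool = False


def check_packed_sft(s: PackedSftSettings) -> list[str]:
    """Return human-readable violations of the offline packed SFT rules."""
    problems = []
    # Offline packing uses the THD layout, which expects one packed row per micro-batch.
    if s.micro_batch_size != 1:
        problems.append("offline packed SFT expects micro_batch_size == 1")
    if s.context_parallel_size > 1:
        multiple = 2 * s.context_parallel_size
        if s.seq_length % multiple != 0:
            problems.append(f"seq_length must be divisible by {multiple}")
        # Assumption: any multiple of 2 * CP is acceptable padding.
        if s.pad_seq_to_mult % multiple != 0:
            problems.append(f"pad_seq_to_mult should be a multiple of {multiple}")
        if not s.calculate_per_token_loss:
            problems.append("CP finetuning needs calculate_per_token_loss=True")
        if s.average_in_collective:
            problems.append("CP finetuning needs ddp.average_in_collective=False")
    if s.pad_cu_seqlens and not s.pad_to_max_length:
        problems.append("pad_cu_seqlens=True also needs pad_to_max_length=True")
    return problems
```

An empty list means the settings satisfy the rules listed here. It does not check the metadata JSON requirement, which depends on the dataset on disk.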
+ +In-batch packing for VLM finetuning: + +```python +cfg.dataset.pack_sequences_in_batch = True +cfg.train.micro_batch_size = 2 +``` + +Long-context baseline: + +```python +cfg.model.seq_length = 16384 +cfg.dataset.seq_length = 16384 +cfg.model.context_parallel_size = 2 +``` + +## Code Anchors + +LLM packed SFT config surface: + +```72:97:src/megatron/bridge/recipes/utils/finetune_utils.py +if packed_sequence: + dataset_kwargs = {"pad_to_max_length": True} + packed_sequence_specs = PackedSequenceSpecs(packed_sequence_size=seq_length, pad_seq_to_mult=pad_seq_to_mult) +else: + dataset_kwargs = {} + packed_sequence_specs = None +``` + +Bridge validation: + +```1617:1657:src/megatron/bridge/training/config.py +if self.model.context_parallel_size > 1: + assert self.model.seq_length % (self.model.context_parallel_size * 2) == 0, ... + if isinstance(self.dataset, FinetuningDatasetConfig): + assert self.model.calculate_per_token_loss, ... + assert not self.ddp.average_in_collective, ... +... +if ... packed_sequence_size > 0 and self.train.micro_batch_size > 1: + raise ValueError(...) +... +if getattr(self.dataset, "pack_sequences_in_batch", False) and self.train.micro_batch_size == 1: + raise ValueError(...) +``` + +VLM in-batch runtime: + +```308:327:src/megatron/bridge/training/vlm_step.py +if enable_packing: + ... + ) = pack_batch_sequences( + ... + pad_token_id=0, + pad_to_multiple_of=cp_size * 2 if cp_size > 1 else 1, + ) +``` + +Packed THD runtime constraint: + +```61:64:src/megatron/bridge/training/gpt_step.py +if cu_seqlens.dim() > 1 and cu_seqlens.size(0) != 1: + raise ValueError("Packed THD batches expect micro-batch size 1 for context-parallel slicing (THD layout)") +``` + +## Pitfalls + +1. Offline packed SFT and VLM in-batch packing are different features with opposite micro-batch rules. +2. When CP is enabled, packed sequence lengths must respect `2 * context_parallel_size` divisibility. +3. 
For finetuning with CP, `calculate_per_token_loss=True` and `ddp.average_in_collective=False` are required. +4. `pad_cu_seqlens=True` also requires `pad_to_max_length=True`. +5. Packing support is model-family-specific. `Qwen3-Next`, `GLM-4.5`, and `Qwen3.5-VL` contain explicit opt-outs in different paths. +6. MTP finetuning is documented as incompatible with packed sequences. + +## Verification + +Use the checked-in unit coverage: + +```bash +uv run python -m pytest tests/unit_tests/training/utils/test_packed_seq_utils.py -v && \ +uv run python -m pytest tests/unit_tests/training/test_config.py -k "packed_sequence or pack_sequences_in_batch or context_parallel_seq_length_divisibility or context_parallel_finetuning_validations" -v && \ +uv run python -m pytest tests/unit_tests/training/test_vlm_step.py -k "enable_packing" -v +``` + +Success criteria: + +- first command reports `8 passed` +- second command reports `14 passed` +- third command reports `2 passed` diff --git a/skills/perf-techniques/sequence-packing/card.yaml b/skills/perf-techniques/sequence-packing/card.yaml new file mode 100644 index 0000000000..47ac32c583 --- /dev/null +++ b/skills/perf-techniques/sequence-packing/card.yaml @@ -0,0 +1,93 @@ +title: packed_sequences_long_context +validated_on: "2026-03-15" +summary: > + Megatron-Bridge currently supports two distinct packing paths: offline packed + SFT for text-only finetuning and in-batch packing for some VLM finetuning + paths. Long-context training is primarily expressed through context + parallelism, long-context Llama recipes, and memory tradeoff knobs like + recompute and CPU offloading. 
+validation_status: + offline_packed_sft_runtime: + - code_verified + vlm_in_batch_packing_runtime: + - code_verified + cp_and_packing_validation_rules: + - code_verified + packed_seq_helper_behavior: + - code_verified + vlm_packing_helper_behavior: + - code_verified + packed_cp_functional_smoke_in_tree: + - recipe_verified + long_context_llama_recipe_coverage: + - recipe_verified + public_cp_backend_guidance: + - doc_only + long_context_perf_claims: + - unclear +feature_meaning: + offline_packed_sft: > + Pre-tokenized packed finetuning datasets built through PackedSequenceSpecs + and consumed through THD packed-sequence metadata. + vlm_in_batch_packing: > + Runtime batch concatenation path for some VLM training flows controlled by + pack_sequences_in_batch=True. + long_context_training: > + Training at longer sequence lengths, primarily enabled through context + parallelism plus recipe-specific long-context presets and memory tuning + knobs. +recommended_path: + llm_packed_sft: + train.micro_batch_size: 1 + dataset.dataset_kwargs.pad_to_max_length: true + dataset.packed_sequence_specs.packed_sequence_size: match_seq_length + cp_finetuning: + model.calculate_per_token_loss: true + ddp.average_in_collective: false + dataset.packed_sequence_specs.pad_seq_to_mult: 2 * context_parallel_size + vlm_in_batch_packing: + dataset.pack_sequences_in_batch: true + train.micro_batch_size: ">1" +known_constraints: + - seq_length must be divisible by 2 * context_parallel_size when CP > 1. + - Offline packed SFT requires micro_batch_size == 1. + - VLM in-batch packing requires micro_batch_size > 1. + - For finetuning with CP > 1, calculate_per_token_loss must be true. + - For finetuning with CP > 1, ddp.average_in_collective must be false. + - pad_cu_seqlens=true also requires pad_to_max_length=true. + - Fine-tuning sequence packing is documented as unsupported with MTP. +known_limitations: + - Packing support is model-family-specific rather than universal. 
+ - Qwen3-Next SFT disables packed sequences. + - GLM-4.5 SFT and PEFT disable packed sequences. + - Qwen3.5-VL disables pack_sequences_in_batch. + - The repo does not contain checked-in benchmark results validating long-context throughput claims. +evidence: + - docs/training/packed-sequences.md + - docs/performance-guide.md + - docs/training/multi-token-prediction.md + - docs/models/llm/llama3.md + - docs/training/hybrid-context-parallel.md + - src/megatron/bridge/data/datasets/packed_sequence.py + - src/megatron/bridge/data/datasets/sft.py + - src/megatron/bridge/training/utils/packed_seq_utils.py + - src/megatron/bridge/training/gpt_step.py + - src/megatron/bridge/training/vlm_step.py + - src/megatron/bridge/training/config.py + - src/megatron/bridge/recipes/utils/finetune_utils.py + - src/megatron/bridge/recipes/common.py + - src/megatron/bridge/recipes/llama/llama3.py + - src/megatron/bridge/recipes/qwen/qwen3_next.py + - src/megatron/bridge/recipes/glm/glm45.py + - src/megatron/bridge/recipes/qwen_vl/qwen35_vl.py + - src/megatron/bridge/models/qwen_vl/modelling_qwen3_vl/model.py + - tests/functional_tests/training/test_seqpacking_cp_example.py + - tests/unit_tests/training/utils/test_packed_seq_utils.py + - tests/unit_tests/training/test_config.py + - tests/unit_tests/training/test_vlm_step.py + - scripts/performance/utils/overrides.py +follow_up_validation: + - Run the checked-in packed-plus-CP functional test and record whether it still passes on current infrastructure. + - Add a tiny no-download end-to-end smoke test for offline packed SFT. + - Add a checked-in long-context training smoke for at least one 16K or 64K recipe. + - Cross-link public packing docs to model-family opt-outs and the MTP incompatibility note. 
diff --git a/skills/perf-techniques/tp-dp-comm-overlap/SKILL.md b/skills/perf-techniques/tp-dp-comm-overlap/SKILL.md new file mode 100644 index 0000000000..51f8305faf --- /dev/null +++ b/skills/perf-techniques/tp-dp-comm-overlap/SKILL.md @@ -0,0 +1,117 @@ +--- +name: tp-dp-comm-overlap +description: Operational guide for enabling TP, DP, and PP communication overlap in Megatron-Bridge, including config knobs, code anchors, pitfalls, and verification. +--- + +# TP / DP / PP Communication Overlap Skill + +For stable background and recommendation level, see: + +- `docs/training/communication-overlap.md` + +## Enablement + +Minimal Bridge override: + +```python +from megatron.bridge.training.comm_overlap import CommOverlapConfig + +cfg.model.tensor_model_parallel_size = 4 +cfg.model.sequence_parallel = True +cfg.model.pipeline_model_parallel_size = 4 +cfg.model.virtual_pipeline_model_parallel_size = 2 + +cfg.comm_overlap = CommOverlapConfig( + tp_comm_overlap=True, +) + +cfg.ddp.use_distributed_optimizer = True +cfg.ddp.overlap_grad_reduce = True +cfg.ddp.overlap_param_gather = True +``` + +Optional TP preset: + +```python +from megatron.bridge.training.comm_overlap import userbuffers_bf16_h100_h12288_tp4_mbs1_seqlen2048 + +cfg.comm_overlap.tp_comm_overlap_cfg = userbuffers_bf16_h100_h12288_tp4_mbs1_seqlen2048 +``` + +Precision knobs belong to mixed precision: + +```python +cfg.mixed_precision.grad_reduce_in_fp32 = False +cfg.mixed_precision.fp8_param_gather = False +``` + +## Code Anchors + +Bridge overlap gating: + +```439:449:src/megatron/bridge/training/comm_overlap.py +if self.user_comm_overlap_cfg.tp_comm_overlap is True: + if model_cfg.tensor_model_parallel_size < 2: + ... + elif not model_cfg.sequence_parallel: + ... + elif not HAVE_TE: + ... 
+``` + +PP overlap selection: + +```451:458:src/megatron/bridge/training/comm_overlap.py +if model_cfg.pipeline_model_parallel_size > 1: + if vp_size > 1: + comm_overlap_cfg.overlap_p2p_comm = True + comm_overlap_cfg.batch_p2p_comm = False + else: + comm_overlap_cfg.overlap_p2p_comm = False + comm_overlap_cfg.batch_p2p_comm = True +``` + +DP overlap defaults: + +```572:579:src/megatron/bridge/training/comm_overlap.py +if self.data_parallel_size > 1: + comm_overlap_cfg.bucket_size = 128 * 1024 * 1024 + comm_overlap_cfg.overlap_grad_reduce = True + comm_overlap_cfg.overlap_param_gather = True +``` + +Launch-time env tuning: + +```570:609:src/megatron/bridge/recipes/run_plugins.py +executor.env_vars["CUDA_DEVICE_MAX_CONNECTIONS"] = str(cuda_device_max_connections) +... +executor.env_vars["NVTE_FWD_LAYERNORM_SM_MARGIN"] = str(self.layernorm_sm_margin) +executor.env_vars["NVTE_BWD_LAYERNORM_SM_MARGIN"] = str(self.layernorm_sm_margin) +``` + +## Pitfalls + +1. TP overlap silently disables itself if `sequence_parallel=False` or Transformer Engine is unavailable. +2. PP overlap is not enabled for all PP cases. Bridge only auto-selects `overlap_p2p_comm=True` when `PP > 1` and `VPP > 1`. +3. `bucket_size` is a parameter-count knob, not a byte-size knob. +4. `grad_reduce_in_fp32` and `fp8_param_gather` should be set through mixed precision, not as standalone DDP tuning first. +5. `CUDA_DEVICE_MAX_CONNECTIONS` and LayerNorm SM margin are launch-time plugin settings, not `CommOverlapConfig` fields. 
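The gating rules in the code anchors and pitfalls above can be sketched as one pure function, which may help when predicting what a given layout ends up with. This is a minimal sketch under stated assumptions: `predict_overlap` is a hypothetical helper, not part of `comm_overlap.py`. It treats `vpp=1` as "no interleaving" and leaves the PP flags as `None` when `pp == 1`, because the quoted snippet does not cover that case.

```python
from typing import Optional


def predict_overlap(
    tp: int,
    sequence_parallel: bool,
    have_te: bool,
    pp: int,
    vpp: int,
    dp: int,
    tp_comm_overlap_requested: bool = True,
) -> dict:
    """Predict overlap flags from the gating rules quoted in the code anchors."""
    # TP overlap survives only if TP >= 2, SP is on, and Transformer Engine is available.
    tp_overlap = (
        tp_comm_overlap_requested and tp >= 2 and sequence_parallel and have_te
    )

    # PP: overlapped p2p only with an interleaved schedule; otherwise batched p2p.
    overlap_p2p: Optional[bool] = None  # None: not decided by the quoted snippet
    batch_p2p: Optional[bool] = None
    if pp > 1:
        overlap_p2p = vpp > 1
        batch_p2p = not overlap_p2p

    # DP: grad-reduce and param-gather overlap default to on when DP > 1.
    dp_overlap = dp > 1

    return {
        "tp_comm_overlap": tp_overlap,
        "overlap_p2p_comm": overlap_p2p,
        "batch_p2p_comm": batch_p2p,
        "overlap_grad_reduce": dp_overlap,
        "overlap_param_gather": dp_overlap,
    }
```

For example, `predict_overlap(tp=4, sequence_parallel=False, have_te=True, pp=4, vpp=1, dp=2)` reports `tp_comm_overlap=False` and `batch_p2p_comm=True`, which matches pitfalls 1 and 2.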
+ +## Verification + +Use the checked-in overlap unit coverage first: + +```bash +uv run python -m pytest tests/unit_tests/training/test_comm_overlap.py -q +``` + +Optional second check if `nemo_run` is available: + +```bash +uv run python -m pytest tests/unit_tests/recipes/test_run_plugins.py -q +``` + +Success criteria: + +- first command reports `26 passed` +- second command validates plugin-owned env wiring when not skipped diff --git a/skills/perf-techniques/tp-dp-comm-overlap/card.yaml b/skills/perf-techniques/tp-dp-comm-overlap/card.yaml new file mode 100644 index 0000000000..473a27b592 --- /dev/null +++ b/skills/perf-techniques/tp-dp-comm-overlap/card.yaml @@ -0,0 +1,51 @@ +title: tp_dp_comm_overlap +validated_on: "2026-03-15" +summary: > + Megatron-Bridge exposes communication overlap across tensor parallel, data + parallel, and pipeline parallel paths through CommOverlapConfig, but the + available behavior and defaults differ by mode. +validation_status: + tp_overlap_gating: + - code_verified + dp_overlap_defaults: + - code_verified + pp_overlap_auto_selection: + - code_verified + launch_env_wiring: + - code_verified + overlap_perf_claims: + - doc_only +feature_meaning: + tp_overlap: > + Overlap of tensor-parallel communication with GEMM work, typically tied to + sequence parallelism. + dp_overlap: > + Overlap of gradient reduce-scatter and parameter all-gather on the + distributed-optimizer path. + pp_overlap: > + Overlap of pipeline send and receive behavior, especially relevant for + interleaved pipeline schedules. +recommended_path: + comm_overlap.tp_comm_overlap: true_when_tp_and_sp_are_enabled + ddp.use_distributed_optimizer: true_for_dp_overlap +known_constraints: + - TP overlap requires tensor_model_parallel_size > 1. + - TP overlap requires sequence_parallel=True. + - TP overlap requires Transformer Engine to be available. + - DP overlap is tied to the distributed-optimizer path. 
+ - PP overlap behavior depends on the pipeline schedule and is not identical for every PP setup. + - Launch-time environment tuning is part of practical overlap behavior. +known_limitations: + - Not every public recipe enables overlap even when the feature exists. + - Repo docs do not provide benchmark-backed proof for optimal overlap settings. +evidence: + - docs/training/communication-overlap.md + - docs/performance-guide.md + - src/megatron/bridge/training/comm_overlap.py + - src/megatron/bridge/training/config.py + - src/megatron/bridge/training/mixed_precision.py + - src/megatron/bridge/recipes/run_plugins.py + - tests/unit_tests/training/test_comm_overlap.py +follow_up_validation: + - Add benchmark-backed overlap guidance for at least one representative model family. + - Add a functional PP smoke for interleaved pipeline overlap. diff --git a/skills/resiliency/SKILL.md b/skills/resiliency/SKILL.md new file mode 100644 index 0000000000..d00bd1c6f6 --- /dev/null +++ b/skills/resiliency/SKILL.md @@ -0,0 +1,305 @@ +--- +name: resiliency +description: Resiliency features in Megatron Bridge including fault tolerance, straggler detection, in-process restart, preemption, and re-run state machine. Use when the user asks about fault tolerance, straggler detection, hang detection, automatic restart, preemption, in-process restart, checkpoint recovery, or nvidia-resiliency-ext. +--- + +# Resiliency + +Stable docs: `docs/training/resiliency.md`, `docs/training/checkpointing.md` +Card: `card.yaml` (co-located) + +## Enablement + +### Fault tolerance (Slurm only) + +#### Option 1: NeMo Run plugin (recommended) + +```python +from megatron.bridge.recipes.run_plugins import FaultTolerancePlugin +import nemo_run as run + +task = run.Script(...) 
+run_plugins = [ + FaultTolerancePlugin( + enable_ft_package=True, + calc_ft_timeouts=True, + num_in_job_restarts=3, + num_job_retries_on_failure=2, + initial_rank_heartbeat_timeout=1800, + rank_heartbeat_timeout=300, + ) +] +run.run(task, plugins=run_plugins, executor=executor) +``` + +| Plugin parameter | Default | Description | +|---|---|---| +| `num_in_job_restarts` | 3 | Max restarts within same job | +| `num_job_retries_on_failure` | 2 | Max new job launches on failure | +| `initial_rank_heartbeat_timeout` | 1800 | First heartbeat timeout (seconds) | +| `rank_heartbeat_timeout` | 300 | Subsequent heartbeat timeout (seconds) | + +#### Option 2: Direct config + ft_launcher + +```python +from megatron.bridge.training.config import FaultToleranceConfig + +cfg.ft = FaultToleranceConfig( + enable_ft_package=True, + calc_ft_timeouts=True, + simulate_fault=False, + simulated_fault_type="random", +) +``` + +Launch with `ft_launcher` (not `torchrun`): + +```bash +export GROUP_RANK=0 # required for non-Slurm +ft_launcher \ + --rdzv_backend=c10d --rdzv_endpoint=${MASTER_ADDR}:${MASTER_PORT} \ + --nnodes=${NUM_NODES} --nproc-per-node=${NUM_GPUS_PER_NODE} \ + --ft-rank_section_timeouts=setup:600,step:180,checkpointing:420 \ + --ft-rank_out_of_section_timeout=300 \ + your_training_script.py +``` + +| Config parameter | Default | Description | +|---|---|---| +| `enable_ft_package` | False | Enable fault tolerance | +| `calc_ft_timeouts` | False | Auto-compute optimal timeouts | +| `simulate_fault` | False | Enable fault simulation for testing | +| `simulated_fault_type` | `"random"` | `"rank_hung"`, `"rank_killed"`, or `"random"` | +| `simulated_fault_rank` | None | Specific rank to fault (random if None) | +| `simulated_fault_base_delay` | 0 | Base delay before simulating fault | + +Section-based timeout monitoring covers setup, training steps, checkpointing, +and out-of-section time independently. 
Timeouts are saved to `ft_state.json` +for subsequent runs when `calc_ft_timeouts=True`. + +### NVRx straggler detection + +```python +from megatron.bridge.training.config import NVRxStragglerDetectionConfig + +cfg.nvrx_straggler = NVRxStragglerDetectionConfig( + enabled=True, + report_time_interval=300.0, + calc_relative_gpu_perf=True, + calc_individual_gpu_perf=True, + num_gpu_perf_scores_to_print=5, + gpu_relative_perf_threshold=0.7, + gpu_individual_perf_threshold=0.7, + stop_if_detected=False, + enable_logging=True, +) +``` + +| Parameter | Default | Description | +|---|---|---| +| `enabled` | False | Enable straggler detection | +| `report_time_interval` | 300.0 | Seconds between straggler checks | +| `calc_relative_gpu_perf` | True | Compare ranks against each other | +| `calc_individual_gpu_perf` | True | Track per-rank degradation over time | +| `gpu_relative_perf_threshold` | 0.7 | Threshold for relative performance (0-1) | +| `gpu_individual_perf_threshold` | 0.7 | Threshold for individual performance (0-1) | +| `stop_if_detected` | False | Terminate training on straggler | +| `num_gpu_perf_scores_to_print` | 5 | Number of best/worst scores to print | +| `profiling_interval` | 1 | Profiling interval for detector | + +### Preemption + +#### Plugin (Slurm) + +```python +from megatron.bridge.recipes.run_plugins import PreemptionPlugin + +plugins = [ + PreemptionPlugin( + preempt_time=60, + enable_exit_handler=True, + enable_exit_handler_for_data_loader=False, + ) +] +``` + +| Plugin parameter | Default | Description | +|---|---|---| +| `preempt_time` | 60 | Seconds before job limit to send signal | +| `enable_exit_handler` | True | Enable signal handler in training | +| `enable_exit_handler_for_data_loader` | False | Enable for dataloader workers | + +#### Direct config + +```python +import signal +cfg.train.exit_signal_handler = True +cfg.train.exit_signal = signal.SIGTERM +cfg.train.exit_signal_handler_for_dataloader = False +``` + +### Re-run state 
machine (experimental) + +```python +from megatron.bridge.training.config import RerunStateMachineConfig + +cfg.rerun_state_machine = RerunStateMachineConfig( + rerun_mode="validate_results", + check_for_nan_in_loss=True, + check_for_spiky_loss=False, + spiky_loss_factor=10.0, +) +``` + +| Parameter | Default | Description | +|---|---|---| +| `rerun_mode` | `"disabled"` | `"disabled"`, `"validate_results"`, `"report_determinism_stats"` | +| `check_for_nan_in_loss` | True | Check for NaN in loss | +| `check_for_spiky_loss` | False | Check for unexpectedly large loss | +| `spiky_loss_factor` | 10.0 | Loss flagged if > factor * max observed (increase for large models) | + +Exit codes: 16 = resume to disambiguate, 17 = failed validation. + +### In-process restart (experimental) + +```python +from megatron.bridge.training.config import InProcessRestartConfig + +cfg.inprocess_restart = InProcessRestartConfig( + enabled=True, + granularity="node", + soft_timeout=60.0, + hard_timeout=90.0, +) +``` + +| Parameter | Default | Description | +|---|---|---| +| `enabled` | False | Enable in-process restart | +| `active_world_size` | None | Ranks executing workload (rest are warm reserves) | +| `granularity` | `"node"` | `"node"` or `"rank"` restart granularity | +| `max_iterations` | None | Max restart attempts (None = unlimited) | +| `soft_timeout` | 60.0 | Detect GIL-released hangs (seconds) | +| `hard_timeout` | 90.0 | Force-terminate hung ranks (seconds) | +| `heartbeat_interval` | 30.0 | Heartbeat interval (seconds) | +| `heartbeat_timeout` | 60.0 | Missing heartbeat timeout (seconds) | +| `barrier_timeout` | 120.0 | Distributed barrier timeout (seconds) | +| `completion_timeout` | 120.0 | Completion barrier timeout (seconds) | +| `empty_cuda_cache` | True | Clear CUDA cache during restart | +| `max_rank_faults` | None | Max rank faults before terminating | +| `monitor_process_logdir` | None | Directory for monitor logs | + +Required environment variables: + +```bash 
+export TORCH_CPP_LOG_LEVEL=error +export TORCH_NCCL_RETHROW_CUDA_ERRORS=0 +export NCCL_NVLS_ENABLE=0 +``` + +The PyTorch NCCL watchdog timeout must exceed `hard_timeout`. NeMo-Run's +Slurm Executor is not supported; launch directly with `srun --kill-on-bad-exit=0`. + +### Async checkpoint save + +```python +cfg.checkpoint.async_save = True +cfg.checkpoint.ckpt_format = "torch_dist" +``` + +### Local checkpointing (NVRx) + +```python +cfg.checkpoint.non_persistent_local_ckpt_dir = "/local/scratch/ckpt" +cfg.checkpoint.non_persistent_local_ckpt_algo = "fully_parallel" +``` + +## Code Anchors + +### Fault tolerance +- Config: `src/megatron/bridge/training/config.py` — `FaultToleranceConfig` +- Runtime: `src/megatron/bridge/training/fault_tolerance.py` +- Plugin: `src/megatron/bridge/recipes/run_plugins.py` — `FaultTolerancePlugin` +- Perf plugin: `scripts/performance/resiliency_plugins.py` +- Tests: `tests/unit_tests/training/test_fault_tolerance.py` +- Example: `examples/resiliency/fault_tolerance/` + +### Straggler detection +- Config: `src/megatron/bridge/training/config.py` — `NVRxStragglerDetectionConfig` +- Runtime: `src/megatron/bridge/training/nvrx_straggler.py` +- Train loop: `src/megatron/bridge/training/train.py` — `check_nvrx_straggler_detection` +- Tests: `tests/unit_tests/training/test_nvrx_straggler.py`, `tests/functional_tests/training/test_nvrx_straggler.py` +- Example: `examples/resiliency/straggler_detection/` + +### In-process restart +- Config: `src/megatron/bridge/training/config.py` — `InProcessRestartConfig` +- Runtime: `src/megatron/bridge/training/inprocess_restart.py` +- Entry point: `src/megatron/bridge/training/pretrain.py` — `maybe_wrap_for_inprocess_restart` +- Tests: `tests/unit_tests/training/test_inprocess_restart.py`, `tests/functional_tests/training/test_inprocess_restart.py` + +### Preemption +- Plugin: `src/megatron/bridge/recipes/run_plugins.py` — `PreemptionPlugin` +- Signal handler: 
`src/megatron/bridge/training/utils/sig_utils.py` +- Tests: `tests/unit_tests/recipes/test_run_plugins.py` + +### Re-run state machine +- Config: `src/megatron/bridge/training/config.py` — `RerunStateMachineConfig` +- Init: `src/megatron/bridge/training/initialize.py` — `init_rerun_state` + +### Checkpointing +- Async save: `src/megatron/bridge/training/checkpointing.py` — `schedule_async_save` +- Local ckpt: `src/megatron/bridge/training/checkpointing.py` — `LocalCheckpointManager` +- Tests: `tests/functional_tests/training/test_local_checkpointing.py` + +## Pitfalls + +1. **ft_launcher, not torchrun**: Direct `FaultToleranceConfig` requires + `ft_launcher`. Using `torchrun` silently disables FT. For non-Slurm, + set `GROUP_RANK=0`. + +2. **Async save requires torch_dist**: `async_save=True` only works with + `ckpt_format="torch_dist"`. Other formats silently fail or error. + +3. **IPR + NeMo-Run**: In-process restart is not compatible with NeMo-Run + or Slurm preemption plugins. Requires specific PyTorch/NCCL versions + and env vars. + +4. **NVRx vs legacy straggler**: Two detectors exist. Use NVRx + (`nvrx_straggler`); do not enable both. + +5. **stop_if_detected default**: NVRx logs but does not stop training by + default. Set `stop_if_detected=True` for automatic termination. + +6. **NCCL watchdog vs hard_timeout**: For IPR, NCCL watchdog timeout must + exceed `hard_timeout` or PyTorch kills the process before recovery. + +7. **Rerun state machine is alpha**: Use `check_for_nan_in_loss=True` for + NaN detection, but don't rely on full rerun workflows yet. + +## Verification + +### Fault tolerance +```bash +./examples/resiliency/fault_tolerance/run_fault_tolerance.sh +./examples/resiliency/fault_tolerance/run_fault_tolerance.sh --simulate-fault +``` +Look for `[FaultTolerance]` / `[RankMonitorServer]` log lines with section +timeouts. Simulated fault should trigger restart from checkpoint. 
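The log check above can be sketched as a small filter over captured output. This is a minimal sketch: `find_ft_lines` and `count_by_marker` are hypothetical helpers, and the only strings taken from this guide are the two bracketed markers. The rest of each log line is not assumed to have any particular format.

```python
FT_MARKERS = ("[FaultTolerance]", "[RankMonitorServer]")


def find_ft_lines(log_text: str) -> list[str]:
    """Return the log lines that contain either fault-tolerance marker."""
    return [
        line
        for line in log_text.splitlines()
        if any(marker in line for marker in FT_MARKERS)
    ]


def count_by_marker(log_text: str) -> dict[str, int]:
    """Count how many lines mention each marker."""
    counts = {marker: 0 for marker in FT_MARKERS}
    for line in find_ft_lines(log_text):
        for marker in FT_MARKERS:
            if marker in line:
                counts[marker] += 1
    return counts
```

Pass in the contents of the run's log file. Zero counts for both markers would suggest that fault tolerance never engaged, for example because the job was launched with `torchrun` instead of `ft_launcher`.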
+ +### Straggler detection +```bash +uv run python -m torch.distributed.run --nproc_per_node=2 \ + examples/resiliency/straggler_detection/straggler_detection_example.py +``` +Look for `GPU relative performance` and `GPU individual performance` reports +with per-rank scores. + +### Async checkpoint +Look for `Scheduling async checkpoint save` in logs. Training iterations +should continue while checkpoint files are being written. + +### In-process restart +```bash +pytest tests/functional_tests/training/test_inprocess_restart.py -v +``` +Requires compatible PyTorch/NCCL versions. diff --git a/skills/resiliency/card.yaml b/skills/resiliency/card.yaml new file mode 100644 index 0000000000..06b33903d2 --- /dev/null +++ b/skills/resiliency/card.yaml @@ -0,0 +1,121 @@ +title: resiliency +validated_on: "2026-03-16" +validation_method: file_existence_only +summary: > + Megatron Bridge integrates nvidia-resiliency-ext for fault tolerance (hang + detection + auto restart), NVRx straggler detection, in-process restart + (experimental), preemption (graceful shutdown), and re-run state machine + (experimental NaN attribution). Async checkpoint save and local checkpointing + support faster recovery. 
+validation_status: + fault_tolerance_config: + - file_exists + fault_tolerance_plugin: + - file_exists + fault_tolerance_unit_tests: + - file_exists + fault_tolerance_example: + - file_exists + nvrx_straggler_config: + - file_exists + nvrx_straggler_unit_tests: + - file_exists + nvrx_straggler_functional_test: + - file_exists + nvrx_straggler_example: + - file_exists + inprocess_restart_config: + - file_exists + inprocess_restart_unit_tests: + - file_exists + inprocess_restart_functional_test: + - file_exists + preemption_plugin: + - file_exists + preemption_unit_tests: + - file_exists + rerun_state_machine_config: + - file_exists + rerun_state_machine_runtime: + - unclear + async_checkpoint_save: + - file_exists + local_checkpointing: + - file_exists + local_checkpointing_functional_test: + - file_exists +feature_meaning: + fault_tolerance: > + Hang detection and automatic job restart via ft_launcher and + nvidia-resiliency-ext RankMonitorClient. Slurm-only. + nvrx_straggler_detection: > + GPU performance monitoring that identifies slow ranks and optionally + terminates training. Uses nvidia-resiliency-ext attribution module. + inprocess_restart: > + Experimental restart within the same process using + nvidia-resiliency-ext inprocess module. Does not require a new job. + preemption: > + Graceful shutdown on SIGTERM with checkpoint save before exit. + Slurm preemption support via PreemptionPlugin. + rerun_state_machine: > + Experimental NaN and spiky loss detection with automatic rerun + attribution. Alpha-level feature. + async_checkpoint_save: > + Non-blocking checkpoint writes using persistent workers. Overlaps + save I/O with training compute. + local_checkpointing: > + Fast local checkpoint save/load using nvidia-resiliency-ext local + checkpointing with replication strategies. 
+recommended_path:
+  fault_tolerance:
+    ft.enable_ft_package: true
+    ft.calc_ft_timeouts: true
+    plugin: FaultTolerancePlugin
+    launcher: ft_launcher
+  straggler_detection:
+    nvrx_straggler.enabled: true
+    nvrx_straggler.report_time_interval: 300.0
+    nvrx_straggler.stop_if_detected: false
+  async_checkpoint:
+    checkpoint.async_save: true
+    checkpoint.ckpt_format: torch_dist
+known_constraints:
+  - Fault tolerance requires Slurm and ft_launcher (not torchrun).
+  - Async save requires ckpt_format=torch_dist.
+  - In-process restart requires PyTorch >= 2.5.1 and NCCL >= 2.26.2.
+  - In-process restart is incompatible with NeMo-Run and Slurm preemption.
+  - NVRx straggler and legacy StragglerDetector should not both be enabled.
+  - NCCL watchdog timeout must exceed in-process restart hard_timeout.
+  - nvidia-resiliency-ext (~0.5.0) is required for FT, straggler, IPR, and local ckpt.
+known_limitations:
+  - No torchrun-based fault tolerance path exists.
+  - Re-run state machine integration is alpha-level.
+  - In-process restart functional test is excluded from default CI.
+  - Not all recipes enable resiliency features by default.
+  - Preemption plugin is Slurm-specific.
+evidence:
+  - docs/training/resiliency.md
+  - docs/training/checkpointing.md
+  - src/megatron/bridge/training/fault_tolerance.py
+  - src/megatron/bridge/training/nvrx_straggler.py
+  - src/megatron/bridge/training/inprocess_restart.py
+  - src/megatron/bridge/training/checkpointing.py
+  - src/megatron/bridge/training/config.py
+  - src/megatron/bridge/training/utils/sig_utils.py
+  - src/megatron/bridge/recipes/run_plugins.py
+  - scripts/performance/resiliency_plugins.py
+  - tests/unit_tests/training/test_fault_tolerance.py
+  - tests/unit_tests/training/test_nvrx_straggler.py
+  - tests/unit_tests/training/test_inprocess_restart.py
+  - tests/unit_tests/recipes/test_run_plugins.py
+  - tests/functional_tests/training/test_nvrx_straggler.py
+  - tests/functional_tests/training/test_inprocess_restart.py
+  - tests/functional_tests/training/test_local_checkpointing.py
+  - examples/resiliency/fault_tolerance/
+  - examples/resiliency/straggler_detection/
+follow_up_validation:
+  - Add a Slurm-based FT end-to-end test with actual hang recovery.
+  - Add a checked-in recipe that enables FT + straggler detection by default.
+  - Validate in-process restart on current container and NCCL versions.
+  - Promote re-run state machine from alpha once runtime integration is complete.
+  - Add benchmark for async save overhead vs sync save.
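Reviewer note: several entries under `known_constraints` in the card are mechanical enough to check before a job is launched. The sketch below is one possible pre-flight check over a flat dictionary of settings, not part of this patch and not a Megatron Bridge API. The keys `checkpoint.async_save`, `checkpoint.ckpt_format`, and `nvrx_straggler.enabled` mirror the card's `recommended_path`. Every other key (`legacy_straggler.enabled`, `inprocess_restart.enabled`, `inprocess_restart.hard_timeout`, `nccl_watchdog_timeout`, `torch_version`, `nccl_version`) and both function names are hypothetical, chosen only for illustration.

```python
from __future__ import annotations

from typing import Any


def parse_version(text: str) -> tuple[int, ...]:
    """Turn '2.5.1' into (2, 5, 1), dropping a local suffix such as '+cu124'."""
    core = text.split("+", 1)[0]
    # Non-numeric components (for example '0a0') are skipped, not guessed at.
    return tuple(int(part) for part in core.split(".") if part.isdigit())


def check_resiliency_constraints(cfg: dict[str, Any]) -> list[str]:
    """Return one message per violated constraint from the card (hypothetical helper)."""
    problems: list[str] = []

    # "Async save requires ckpt_format=torch_dist."
    if cfg.get("checkpoint.async_save") and cfg.get("checkpoint.ckpt_format") != "torch_dist":
        problems.append("async save requires ckpt_format=torch_dist")

    # "NVRx straggler and legacy StragglerDetector should not both be enabled."
    # The legacy key name is an assumption.
    if cfg.get("nvrx_straggler.enabled") and cfg.get("legacy_straggler.enabled"):
        problems.append("enable only one of NVRx straggler and legacy StragglerDetector")

    if cfg.get("inprocess_restart.enabled"):
        # "In-process restart requires PyTorch >= 2.5.1 and NCCL >= 2.26.2."
        if parse_version(str(cfg.get("torch_version", "0"))) < (2, 5, 1):
            problems.append("in-process restart requires PyTorch >= 2.5.1")
        if parse_version(str(cfg.get("nccl_version", "0"))) < (2, 26, 2):
            problems.append("in-process restart requires NCCL >= 2.26.2")

        # "NCCL watchdog timeout must exceed in-process restart hard_timeout."
        hard_timeout = cfg.get("inprocess_restart.hard_timeout")
        watchdog = cfg.get("nccl_watchdog_timeout")
        if hard_timeout is not None and watchdog is not None and watchdog <= hard_timeout:
            problems.append("NCCL watchdog timeout must exceed hard_timeout")

    return problems


if __name__ == "__main__":
    settings = {
        "checkpoint.async_save": True,
        "checkpoint.ckpt_format": "torch_dist",
        "nvrx_straggler.enabled": True,
    }
    for message in check_resiliency_constraints(settings):
        print("constraint violated:", message)
```

Returning a list of messages instead of raising on the first failure lets a launcher report every violated constraint in one pass. The Slurm, `ft_launcher`, and NeMo-Run constraints are left out because they depend on the launch environment, not on configuration values.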