
Commit 2d23be7

akoumpa and jgerh authored
docs: explain patch_inner_model and patch_causal_lm_model more (#1133)
* explain patch_inner_model and patch_causal_lm_model more
  Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>
* Update docs/guides/pipelining.md
  Co-authored-by: jgerh <163925524+jgerh@users.noreply.github.com>
* Update pipelining.md
---------
Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>
Co-authored-by: jgerh <163925524+jgerh@users.noreply.github.com>
1 parent 7536edf commit 2d23be7

File tree

1 file changed (+53, -26 lines)

docs/guides/pipelining.md

Lines changed: 53 additions & 26 deletions
@@ -6,14 +6,12 @@ As large language models continue to grow in size, training and fine-tuning them
 
 Pipeline parallelism addresses these challenges by splitting a model's layers across different devices and processing them in a pipelined fashion. Each device processes a different stage of the model, enabling training of models that wouldn't fit on a single device while maintaining high GPU utilization through overlapped computation.
 
-AutoPipeline is NeMo AutoModel's high-level pipeline parallelism interface specifically designed for HuggingFace models, making pipeline parallelism as simple as data parallelism. Built on PyTorch's native `torch.distributed.pipelining`, AutoPipeline provides seamless pipeline parallelism support for any HuggingFace decoder-only causal language model with minimal code changes.
+AutoPipeline is NeMo AutoModel's high-level pipeline parallelism interface specifically designed for Hugging Face models, making pipeline parallelism as simple as data parallelism. Built on PyTorch's native `torch.distributed.pipelining`, AutoPipeline provides seamless pipeline parallelism support for any Hugging Face decoder-only causal language model with minimal code changes.
 
 For custom models and more granular control, the functional API in `nemo_automodel.components.distributed.pipelining.functional` provides modular, accessible building blocks that can be used with any PyTorch model architecture.
 
-This guide walks you through the complete process of using AutoPipeline for HuggingFace models and the functional API for custom models. You'll learn how to configure pipeline stages, integrate with existing training workflows, optimize performance, and combine pipeline parallelism with other parallelization strategies.
+This guide walks you through the complete process of using AutoPipeline for Hugging Face models and the functional API for custom models. You'll learn how to configure pipeline stages, integrate with existing training workflows, optimize performance, and combine pipeline parallelism with other parallelization strategies.
 
-:::{important}
-Before proceeding with this guide, please ensure that you have NeMo AutoModel installed on your machine.
 
 **Prerequisites:**
 
@@ -28,24 +26,25 @@ uv pip install nemo-automodel
 # Or install from source for the latest features
 uv pip install git+https://github.com/NVIDIA-NeMo/Automodel.git
 ```
-
+:::{important}
+Before proceeding with this guide, please ensure that you have NeMo AutoModel installed on your machine.
 For a complete guide and additional options please consult the AutoModel [Installation Guide](./installation.md).
 :::
 
 ## Key Features
 
 AutoPipeline provides enterprise-grade pipeline parallelism with the following features:
 
-- **Universal HuggingFace Support**: Works with any HuggingFace decoder-only causal language model including Llama, Qwen, Mistral, Gemma, and more
+- **Universal Hugging Face Support**: Works with any Hugging Face decoder-only causal language model including Llama, Qwen, Mistral, Gemma, and more
 - **PyTorch Native Integration**: Built on PyTorch's `torch.distributed.pipelining` for optimal performance
 - **Flexible Configuration**: Multiple scheduling strategies, configurable microbatch sizes, and automatic or manual layer splitting
 - **Mixed Parallelism Support**: Combine pipeline parallelism with data parallelism, tensor parallelism, and FSDP
 - **Modular Functional API**: For custom models, the functional module provides accessible, low-level building blocks
 - **Minimal Opinions**: Easy to extend and integrate with existing training workflows
 
-## Quick Start with AutoPipeline (HuggingFace Models)
+## Quick Start with AutoPipeline (Hugging Face Models)
 
-Here's a minimal example to get started with AutoPipeline using 2 pipeline stages with a HuggingFace model:
+Here's a minimal example to get started with AutoPipeline using 2 pipeline stages with a Hugging Face model:
 
 ```python
 import torch
@@ -127,13 +126,41 @@ ap = AutoPipeline(
     layers_per_stage=None,               # Layers per stage (None for auto)
     module_fqns_per_model_part=None,     # Manual module assignment
 
-    # Model patching
-    patch_inner_model=True,              # Patch HF model internals
-    patch_causal_lm_model=True,          # Patch causal LM wrapper
+    # Model patching (HF-specific)
+    patch_inner_model=True,              # Make decoder forward stage-friendly
+    patch_causal_lm_model=True,          # Make CausalLM wrapper return tensors (hidden/logits)
 ).build(model, loss_fn=loss_fn)
 ```
 
-### Automatic vs Manual Layer Distribution
+### Model Patching (`patch_inner_model`, `patch_causal_lm_model`)
+
+AutoPipeline splits a model by deep-copying it per stage and pruning away modules that don't belong to that stage. Many Hugging Face models assume the full module tree is present and return `ModelOutput` objects; after pruning, their original `forward()` often breaks (or returns objects that are awkward to pipeline).
+
+These two flags switch AutoPipeline to lightweight, pipeline-friendly `forward()` implementations that return tensors (see `nemo_automodel.components.distributed.pipelining.hf_utils.patch_hf_model_for_pp`):
+
+- **`patch_inner_model`**: patches the *decoder module* (`model.model` for `...ForCausalLM`, otherwise the module itself) so each stage can run even after pruning.
+  - **Stage 0** (has `embed_tokens`): takes token IDs and produces hidden states.
+  - **Middle stages** (no `embed_tokens`): take hidden states from the previous stage (via `inputs_embeds`, or a float tensor passed through `input_ids`) and produce hidden states.
+  - Handles sliced layer containers (e.g., `layers` becoming dict-like after stage pruning) and returns a **tensor** of hidden states so stages can be chained.
+
+For compilation/performance, this patched forward prefers a precomputed `causal_mask_mapping` dict (it will fall back to computing masks and warn if you don't provide it).
+
+- **`patch_causal_lm_model`**: patches the *`...ForCausalLM` wrapper* forward (the module that owns `lm_head`) so pipeline stages return tensors:
+  - Returns **hidden states** when `lm_head` is absent on that stage.
+  - Returns **logits** when `lm_head` is present (typically only the last stage).
+  - Supports `logits_to_keep` to compute logits for only the last `k` tokens.
+
+Note: this is only used when the module you pipeline is a `...ForCausalLM`-style wrapper (i.e., it has a `.model` attribute). If you pass a base decoder module directly, `patch_causal_lm_model` typically has no effect.
+
+#### When Should I Change These?
+
+- **Leave both `True` (default)** for standard Hugging Face `AutoModelForCausalLM` / `...ForCausalLM` models. This is the common case and gives the expected behavior: token IDs -> hidden states -> logits across stages.
+- **Set both `False`** when your model already has a pipeline-friendly forward (returns tensors and can accept hidden states when embeddings are absent) or it needs custom kwargs/paths that the HF patch doesn't preserve (common for NeMo AutoModel-native model implementations, packed-sequence/`thd` paths, extra args like `padding_mask`, etc.). Many benchmark configs for NeMo-native models do this (for example `examples/benchmark/configs/qwen3_moe_30b_torch.yaml`).
+- **Set `patch_inner_model=False, patch_causal_lm_model=True`** when your inner model is already stage-friendly, but the wrapper forward still returns a `ModelOutput` and you only want the wrapper simplified to “hidden states or logits”.
+
+If you disable `patch_causal_lm_model`, your last stage will typically output hidden states instead of logits; in that case, make sure your `loss_fn` (or your last-stage module) applies the LM head explicitly.
+
+### Automatic vs. Manual Layer Distribution
 
 AutoPipeline offers flexible control over how your model is split across pipeline stages:
 
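For illustration, here is a minimal sketch of the loss-function note above: when `patch_causal_lm_model=False` and the final stage therefore emits hidden states, a `loss_fn` can apply the LM head itself. This helper is hypothetical and not part of this change; how you obtain the `lm_head` handle and the label/ignore-index convention are assumptions.

```python
import torch
import torch.nn.functional as F

def make_loss_fn(lm_head: torch.nn.Module):
    """Hypothetical helper: wrap the last stage's (assumed) lm_head into a loss_fn."""
    def loss_fn(hidden_states: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
        logits = lm_head(hidden_states)                # [batch, seq, vocab]
        shift_logits = logits[:, :-1, :].contiguous()  # each position predicts the next token
        shift_labels = labels[:, 1:].contiguous()
        return F.cross_entropy(
            shift_logits.view(-1, shift_logits.size(-1)),
            shift_labels.view(-1),
            ignore_index=-100,                         # assumed label-mask value
        )
    return loss_fn
```

With `patch_causal_lm_model=True` (the default), the last stage already returns logits, so a plain cross-entropy `loss_fn` is enough.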
@@ -251,13 +278,13 @@ Key observations:
 
 ## Using the Functional API for Custom Models
 
-While AutoPipeline is specifically designed as a high-level interface for HuggingFace models, the functional API in `nemo_automodel.components.distributed.pipelining.functional` provides more modular and accessible building blocks that can be used with any PyTorch model, including custom architectures. This separation allows for cleaner code organization where AutoPipeline handles HuggingFace-specific optimizations while the functional module remains model-agnostic.
+While AutoPipeline is specifically designed as a high-level interface for Hugging Face models, the functional API in `nemo_automodel.components.distributed.pipelining.functional` provides more modular and accessible building blocks that can be used with any PyTorch model, including custom architectures. This separation allows for cleaner code organization where AutoPipeline handles Hugging Face-specific optimizations while the functional module remains model-agnostic.
 
 ### Key Functional API Components
 
 The functional API provides several utilities for building custom pipeline parallel systems:
 
-#### 1. **Stage ID Calculation**
+#### 1. Stage ID Calculation
 ```python
 from nemo_automodel.components.distributed.pipelining.functional import stage_ids_this_rank
 
@@ -271,7 +298,7 @@ stage_ids = stage_ids_this_rank(pp_rank=0, pp_size=4, num_stages=8, style="v")
 # Returns: (0, 7) - rank 0 gets stages 0 and 7
 ```
 
-#### 2. **Module Name Generation**
+#### 2. Module Name Generation
 ```python
 from nemo_automodel.components.distributed.pipelining.functional import (
     generate_hf_model_fqn_per_model_part
@@ -288,7 +315,7 @@ module_names = generate_hf_model_fqn_per_model_part(
 )
 ```
 
-#### 3. **Virtual Stage Calculation**
+#### 3. Virtual Stage Calculation
 ```python
 from nemo_automodel.components.distributed.pipelining.functional import calculate_virtual_stages
 
@@ -302,7 +329,7 @@ num_virtual_stages, stages_per_rank = calculate_virtual_stages(
 )
 ```
 
-#### 4. **Pipeline Schedule Building**
+#### 4. Pipeline Schedule Building
 ```python
 from nemo_automodel.components.distributed.pipelining.functional import build_pipeline_schedule
 
@@ -477,8 +504,8 @@ schedule, model_parts, has_first, has_last, stages = pipeline_model(
     loss_fn=loss_fn,
     parallelize_fn=custom_parallelize_fn,
     module_fqns_per_model_part=None,   # Provide custom module names
-    patch_inner_model=False,           # Disable HF-specific patching
-    patch_causal_lm_model=False,       # Disable HF-specific patching
+    patch_inner_model=False,           # Custom model: don't apply HF forward patches
+    patch_causal_lm_model=False,       # Custom model: don't apply HF forward patches
 )
 ```
 
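To make the "custom model, patches disabled" case concrete, here is a minimal, hypothetical sketch of a module whose forward is already pipeline-friendly in the sense described earlier: it accepts token IDs on the first stage or a hidden-state tensor on later stages, and always returns a plain tensor. The toy architecture is an assumption for illustration only.

```python
import torch
import torch.nn as nn

class ToyDecoder(nn.Module):
    """Toy example (not a real model) of a pipeline-friendly forward."""

    def __init__(self, vocab_size: int = 1000, hidden: int = 256, n_layers: int = 4):
        super().__init__()
        self.embed_tokens = nn.Embedding(vocab_size, hidden)
        self.layers = nn.ModuleList(nn.Linear(hidden, hidden) for _ in range(n_layers))
        self.lm_head = nn.Linear(hidden, vocab_size)

    def forward(self, input_ids: torch.Tensor) -> torch.Tensor:
        if input_ids.is_floating_point():
            hidden_states = input_ids                     # later stage: hidden states in
        else:
            hidden_states = self.embed_tokens(input_ids)  # first stage: token IDs in
        for layer in self.layers:                         # only the layers kept on this stage
            hidden_states = torch.relu(layer(hidden_states))
        if getattr(self, "lm_head", None) is not None:
            return self.lm_head(hidden_states)            # last stage: logits out
        return hidden_states                              # middle stage: hidden states out
```

A forward with this shape matches what `patch_inner_model`/`patch_causal_lm_model` would otherwise enforce for Hugging Face models, which is why both flags can stay `False` here.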
@@ -492,7 +519,7 @@ The functional API is designed to be more accessible and modular than AutoPipeli
 4. **Flexibility**: The functional API gives you complete control over how models are split and parallelized
 5. **Testing**: Start with a small model and verify correct splitting before scaling up
 
-The functional module's modular design makes it easier to integrate pipeline parallelism into existing custom model training workflows without the HuggingFace-specific assumptions that AutoPipeline makes.
+The functional module's modular design makes it easier to integrate pipeline parallelism into existing custom model training workflows without the Hugging Face-specific assumptions that AutoPipeline makes.
 
 ## Mixed Parallelism
 
@@ -623,7 +650,7 @@ autopipeline:
 
 ### Mixed Parallelism Examples
 
-#### Pipeline + Data Parallelism (4 GPUs total)
+#### Pipeline + Data Parallelism (4 GPUs Total)
 ```bash
 uv run torchrun --nproc_per_node=4 examples/llm/finetune.py \
   --config your_config.yaml \
@@ -632,7 +659,7 @@ uv run torchrun --nproc_per_node=4 examples/llm/finetune.py \
   --dataloader.batch_size 16
 ```
 
-#### Pipeline + Tensor Parallelism (4 GPUs total)
+#### Pipeline + Tensor Parallelism (4 GPUs Total)
 ```bash
 uv run torchrun --nproc_per_node=4 examples/llm/finetune.py \
   --config your_config.yaml \
@@ -641,7 +668,7 @@ uv run torchrun --nproc_per_node=4 examples/llm/finetune.py \
   --dataloader.batch_size 8
 ```
 
-#### Full Hybrid: PP + DP + TP (8 GPUs total)
+#### Full Hybrid: PP + DP + TP (8 GPUs Total)
 ```bash
 uv run torchrun --nproc_per_node=8 examples/llm/finetune.py \
   --config your_config.yaml \
@@ -718,11 +745,11 @@ uv run torchrun --nproc_per_node=2 examples/llm/finetune.py --config config.yaml
 
 ## Conclusion
 
-AutoPipeline and the functional API together provide a complete pipeline parallelism solution for both HuggingFace and custom models. AutoPipeline offers a high-level, optimized interface specifically for HuggingFace models, while the functional module provides modular, accessible building blocks for custom architectures.
+AutoPipeline and the functional API together provide a complete pipeline parallelism solution for both Hugging Face and custom models. AutoPipeline offers a high-level, optimized interface specifically for Hugging Face models, while the functional module provides modular, accessible building blocks for custom architectures.
 
 Key takeaways:
 - Pipeline parallelism enables training of models too large for a single GPU
-- AutoPipeline provides a simple API for HuggingFace models with powerful customization options
+- AutoPipeline provides a simple API for Hugging Face models with powerful customization options
 - The functional API offers modular components for implementing pipeline parallelism with any PyTorch model
 - Both can be combined with other parallelization strategies for optimal performance
-- Use built-in monitoring tools to understand and optimize your pipeline
+- Use built-in monitoring tools to understand and optimize your pipeline

0 commit comments
