
Commit 7536edf

akoumpa and jgerh authored
docs: 🤗 Hugging Face Transformers API compatibility (#1146)
Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com> Co-authored-by: jgerh <163925524+jgerh@users.noreply.github.com>
1 parent 2da5f18 commit 7536edf

File tree

2 files changed: +230, -0 lines changed

docs/guides/huggingface-api-compatibility.md

Lines changed: 229 additions & 0 deletions
# 🤗 Hugging Face Transformers API Compatibility

NeMo AutoModel is built to work with the 🤗 Hugging Face ecosystem.
In practice, compatibility comes in two layers:

- **API compatibility**: for many workflows, you can keep your existing `transformers` code and swap in NeMo AutoModel “drop-in” wrappers (`NeMoAutoModel*`, `NeMoAutoTokenizer`) with minimal changes.
- **Artifact compatibility**: NeMo AutoModel produces **Hugging Face-compatible checkpoints** (config + tokenizer + safetensors) that can be loaded by Hugging Face Transformers and downstream tools (vLLM, SGLang, etc.).

This page summarizes what "HF compatibility" means in NeMo AutoModel, calls out differences you should be aware of, and provides side-by-side examples.

## Transformers Version Compatibility: v4 and v5

### Transformers v4 (Current Default)

NeMo AutoModel currently pins Hugging Face Transformers to the **v4** major line (see `pyproject.toml`, currently `transformers<=4.57.5`).

This means:

- NeMo AutoModel is primarily tested and released against **Transformers v4.x** (a quick runtime check is sketched below)
- New model releases on the Hugging Face Hub that require a newer Transformers release may also require upgrading NeMo AutoModel (just as they would require upgrading `transformers` directly)
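If you need to confirm that an environment matches this pinned v4 line before running NeMo AutoModel, a quick runtime check (a minimal sketch, assuming the common `packaging` helper library is installed) looks like this:

```python
import transformers
from packaging.version import Version

# Sketch: verify the installed Transformers release is on the v4 line that
# NeMo AutoModel currently pins in pyproject.toml.
installed = Version(transformers.__version__)
if not (Version("4.0.0") <= installed < Version("5.0.0")):
    raise RuntimeError(f"Expected a Transformers v4.x release, found {installed}")
print(f"Transformers {installed} is on the v4 line")
```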
### Transformers v5 (Forward-Compatibility and Checkpoint Interoperability)

Transformers **v5** introduces breaking changes across some internal utilities (e.g., cache APIs) and adds or reshapes tokenizer backends for some model families.

NeMo AutoModel addresses this in three complementary ways:

- **Forward-compatibility shims**: NeMo AutoModel includes small compatibility patches to smooth over known API differences across Transformers releases (for example, cache utility method names). The built-in recipes apply these patches automatically.
- **Backports where needed**: for some model families, NeMo AutoModel may vendor/backport Hugging Face code that originated in the v5 development line so users can run those models while staying on a pinned v4 dependency.
- **Stable artifact format**: NeMo AutoModel checkpoints are written in Hugging Face-compatible `save_pretrained` layouts (config + tokenizer + safetensors). These artifacts are designed to be loadable by both Transformers **v4** and **v5** (and non-Transformers tools that consume HF-style model repos).

:::{note}
If you are running Transformers v5 in another environment, you can still use NeMo AutoModel-produced consolidated checkpoints with Transformers' standard loading APIs. For details on the checkpoint layouts, see [checkpointing](checkpointing.md).
:::
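For example, a consolidated checkpoint directory written by a NeMo AutoModel recipe can be loaded with the standard Transformers `from_pretrained` APIs; the directory path below is a placeholder for your own output location:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder path: a consolidated, HF-compatible checkpoint directory
# produced by a NeMo AutoModel training recipe.
ckpt_dir = "/path/to/consolidated_checkpoint"

tokenizer = AutoTokenizer.from_pretrained(ckpt_dir)
model = AutoModelForCausalLM.from_pretrained(ckpt_dir)
```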
## Drop-In Compatibility and Key Differences

### Drop-In (Same Mental Model as Transformers)

- **Load by model ID or local path**: `from_pretrained(...)`
- **Standard HF config objects**: `AutoConfig` / `config.json` (see the short example below)
- **Tokenizers**: standard `PreTrainedTokenizerBase` behavior, including `__call__` to create tensors and `decode`/`batch_decode`
- **Generation**: `model.generate(...)` and the usual generation kwargs
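For instance, configuration objects remain plain Hugging Face configs backed by `config.json`, so the usual `AutoConfig` entry point works unchanged (shown here with the public `gpt2` checkpoint):

```python
from transformers import AutoConfig

# Standard Hugging Face config object, read from config.json on the Hub.
cfg = AutoConfig.from_pretrained("gpt2")
print(cfg.model_type, cfg.n_layer, cfg.n_embd)
```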
### Differences (Where NeMo AutoModel Adds Value or Has Constraints)

- **Performance features**: NeMo AutoModel can automatically apply optional kernel patches/optimizations (e.g., SDPA selection, Liger kernels, DeepEP, etc.) while keeping the public model API the same.
- **Distributed training stack**: NeMo AutoModel's recipes/CLI are designed for multi-GPU/multi-node fine-tuning with PyTorch-native distributed features (FSDP2, pipeline parallelism, etc.).
- **CUDA expectation**: NeMo AutoModel's `NeMoAutoModel*` wrappers are primarily optimized for NVIDIA GPU workflows; CPU workflows are also supported, with the caveat noted below.

:::{important}
`NeMoAutoModelForCausalLM.from_pretrained(...)` currently assumes CUDA is available (it uses `torch.cuda.current_device()` internally). If you need CPU-only inference, use Hugging Face `transformers` directly.
:::
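If the same script has to run on both GPU and CPU machines, one pattern that follows the guidance above (a sketch, not a NeMo AutoModel-specific API) is to branch on CUDA availability:

```python
import torch

model_id = "gpt2"

if torch.cuda.is_available():
    # GPU path: use the NeMo AutoModel drop-in wrapper.
    from nemo_automodel import NeMoAutoModelForCausalLM
    model = NeMoAutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)
else:
    # CPU-only path: fall back to Hugging Face Transformers directly.
    from transformers import AutoModelForCausalLM
    model = AutoModelForCausalLM.from_pretrained(model_id)
```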
## API Mapping (Transformers and NeMo AutoModel)

### API Name Mapping

:::{raw} html
<table>
<thead>
<tr>
<th style="width: 45%;">🤗 Hugging Face (<code>transformers</code>)</th>
<th style="width: 45%;">NeMo AutoModel (<code>nemo_automodel</code>)</th>
<th style="width: 10%;">Status</th>
</tr>
</thead>
<tbody>
<tr>
<td><code>transformers.AutoModelForCausalLM</code></td>
<td><code>nemo_automodel.NeMoAutoModelForCausalLM</code></td>
<td>✅</td>
</tr>
<tr>
<td><code>transformers.AutoModelForImageTextToText</code></td>
<td><code>nemo_automodel.NeMoAutoModelForImageTextToText</code></td>
<td>✅</td>
</tr>
<tr>
<td><code>transformers.AutoModelForSequenceClassification</code></td>
<td><code>nemo_automodel.NeMoAutoModelForSequenceClassification</code></td>
<td>✅</td>
</tr>
<tr>
<td><code>transformers.AutoModelForTextToWaveform</code></td>
<td><code>nemo_automodel.NeMoAutoModelForTextToWaveform</code></td>
<td>✅</td>
</tr>
<tr>
<td><code>transformers.AutoTokenizer.from_pretrained(...)</code></td>
<td><code>nemo_automodel.NeMoAutoTokenizer.from_pretrained(...)</code></td>
<td>✅</td>
</tr>
<tr>
<td><code>model.generate(...)</code></td>
<td><code>model.generate(...)</code></td>
<td>🚧</td>
</tr>
<tr>
<td><code>model.save_pretrained(path)</code></td>
<td><code>model.save_pretrained(path, checkpointer=...)</code></td>
<td>🚧</td>
</tr>
</tbody>
</table>
:::

## Side-by-Side Examples

### Load a Model and Tokenizer (Transformers v4)

:::{raw} html
<table>
<thead>
<tr>
<th style="width: 50%;">🤗 Hugging Face (<code>transformers</code>)</th>
<th style="width: 50%;">NeMo AutoModel (<code>nemo_automodel</code>)</th>
</tr>
</thead>
<tbody>
<tr>
<td style="vertical-align: top;">
<div class="highlight"><pre><code>import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "gpt2"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
)</code></pre></div>
</td>
<td style="vertical-align: top;">
<div class="highlight"><pre><code>import torch
from nemo_automodel import NeMoAutoModelForCausalLM, NeMoAutoTokenizer

model_id = "gpt2"

tokenizer = NeMoAutoTokenizer.from_pretrained(model_id)
model = NeMoAutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
)</code></pre></div>
</td>
</tr>
</tbody>
</table>
:::

### Text Generation

This snippet assumes you already have a `model` and `tokenizer` (see the loading snippet above).

:::{raw} html
<table>
<thead>
<tr>
<th style="width: 50%;">🤗 Hugging Face (<code>transformers</code>)</th>
<th style="width: 50%;">NeMo AutoModel (<code>nemo_automodel</code>)</th>
</tr>
</thead>
<tbody>
<tr>
<td style="vertical-align: top; padding-top: 0;">
<div class="highlight" style="margin-top: 0;"><pre style="margin: 0;"><code>import torch

prompt = "Write a haiku about GPU kernels."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

with torch.inference_mode():
    out = model.generate(**inputs, max_new_tokens=64)

print(tokenizer.decode(out[0], skip_special_tokens=True))</code></pre></div>
</td>
<td style="vertical-align: top; padding-top: 0;">
<div class="highlight" style="margin-top: 0;"><pre style="margin: 0;"><code>import torch

prompt = "Write a haiku about GPU kernels."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

with torch.inference_mode():
    out = model.generate(**inputs, max_new_tokens=64)

print(tokenizer.decode(out[0], skip_special_tokens=True))</code></pre></div>
</td>
</tr>
</tbody>
</table>
:::

### Tokenizers (Transformers vs NeMo AutoModel)

NeMo AutoModel provides `NeMoAutoTokenizer` as a Transformers-like auto-tokenizer with a small registry for specialized backends (and a safe fallback when no specialization is needed).

:::{raw} html
<table>
<thead>
<tr>
<th style="width: 50%;">🤗 Hugging Face (<code>transformers</code>)</th>
<th style="width: 50%;">NeMo AutoModel (<code>nemo_automodel</code>)</th>
</tr>
</thead>
<tbody>
<tr>
<td style="vertical-align: top;">
<div class="highlight"><pre><code>from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1")</code></pre></div>
</td>
<td style="vertical-align: top;">
<div class="highlight"><pre><code>from nemo_automodel import NeMoAutoTokenizer

tok = NeMoAutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1")</code></pre></div>
</td>
</tr>
</tbody>
</table>
:::

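In both cases, the returned tokenizer follows standard `PreTrainedTokenizerBase` behavior, so the usual encode/decode round trip works unchanged:

```python
# Works the same whether `tok` came from AutoTokenizer or NeMoAutoTokenizer.
ids = tok("Write a haiku about GPU kernels.", return_tensors="pt")
print(tok.decode(ids["input_ids"][0], skip_special_tokens=True))
```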
## Checkpoints: Save in NeMo AutoModel, Load Everywhere

NeMo AutoModel training recipes write checkpoints in Hugging Face-compatible layouts, including consolidated safetensors that you can load directly with Transformers:

- See [checkpointing](checkpointing.md) for checkpoint formats and example directory layouts.
- See [model coverage](../model-coverage/overview.md) for notes on how model support depends on the pinned Transformers version.

If your goal is to **train/fine-tune in NeMo AutoModel → deploy in the HF ecosystem**, the recommended workflow is to enable consolidated safetensors checkpoints and then load them with the standard HF APIs or downstream inference engines.
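As one illustration of the deployment side, an HF-compatible consolidated checkpoint directory can typically be served with a downstream inference engine such as vLLM; the path below is a placeholder, and the snippet assumes vLLM is installed:

```python
from vllm import LLM, SamplingParams

# Placeholder path: a consolidated, HF-compatible checkpoint directory
# produced by a NeMo AutoModel fine-tuning recipe.
llm = LLM(model="/path/to/consolidated_checkpoint")

outputs = llm.generate(
    ["Write a haiku about GPU kernels."],
    SamplingParams(max_tokens=64),
)
print(outputs[0].outputs[0].text)
```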

docs/index.md

Lines changed: 1 addition & 0 deletions
````diff
@@ -8,6 +8,7 @@
 repository-structure.md
 guides/installation.md
 guides/configuration.md
+guides/huggingface-api-compatibility.md
 launcher/local-workstation.md
 launcher/cluster.md
 ```
````
