
Commit 07195df

Merge branch 'main' into akoumparouli/feat_backport_devstral_to_v4
2 parents a337a4d + b7031bb commit 07195df

File tree

10 files changed: +242 -173 lines changed

.github/CODEOWNERS

Lines changed: 4 additions & 3 deletions
@@ -2,6 +2,7 @@
 docker/ @nvidia-nemo/automation
 pyproject.toml @nvidia-nemo/automation
 
-nemo_automodel @akoumpa @HuiyingLi @adil-a @hemildesai @ybabakhin @shan-nvidia
-examples @akoumpa @HuiyingLi @adil-a @hemildesai @ybabakhin @shan-nvidia
-README.md @akoumpa @HuiyingLi
+docs @akoumpa @jgerh
+nemo_automodel @akoumpa @HuiyingLi @adil-a @hemildesai @ybabakhin @shan-nvidia @rnyak @oliverholworthy @gabrielspmoreira
+examples @akoumpa @HuiyingLi @adil-a @hemildesai @ybabakhin @shan-nvidia @rnyak @oliverholworthy @gabrielspmoreira
+README.md @akoumpa @HuiyingLi @snowmanwwg

.github/workflows/cicd-main.yml

Lines changed: 4 additions & 3 deletions
@@ -45,7 +45,7 @@ jobs:
       - name: Set up UV
         uses: astral-sh/setup-uv@v1
         with:
-          version: 0.7.2
+          version: 0.8.22
       - name: Install ruff
         env:
           UV_PROJECT_ENVIRONMENT: ./venv
@@ -60,8 +60,9 @@ jobs:
       - name: Run ruff
         run: |
           source ./venv/bin/activate
-          uv run ruff check . --verbose
-          uv run ruff format --check . --verbose
+          uv run --active ruff --version
+          uv run --active ruff check --verbose .
+          uv run --active ruff format --check --verbose .
 
   import_linting:
     runs-on: ubuntu-latest

docs/guides/dataset-overview.md

Lines changed: 136 additions & 11 deletions
@@ -1,6 +1,6 @@
 # Dataset Overview: LLM and VLM Datasets in NeMo Automodel
 
-This page summarizes the datasets already supported in NeMo Automodel for LLM and VLM, and shows how to plug in your own datasets using simple Python functions or directly through YAML using the `_target_` mechanism.
+This page summarizes the datasets supported in NeMo Automodel for LLM and VLM and shows how to plug in your own datasets using Python functions or the YAML `_target_` mechanism.
 
 - See also: [LLM datasets](llm/dataset.md) and [VLM datasets](vlm/dataset.md) for deeper, task-specific guides.
 
@@ -23,7 +23,7 @@ dataset:
   split: train
 ```
 
-- **SQuAD-style QA (instruction SFT)**
+- **SQuAD-style Question Answering (QA) (instruction SFT)**
   - Factory: `nemo_automodel.components.datasets.llm.squad.make_squad_dataset`
   - Use case: instruction/QA tuning with either prompt+answer formatting or chat-template formatting
   - Notes:
@@ -57,7 +57,133 @@ dataset:
   answer_only_loss_mask: true
   start_of_turn_token: "<|assistant|>"
 ```
-- See the detailed guide, [Column-Mapped Text Instruction Dataset](llm/column-mapped-text-instruction-dataset.md), for more information.
+  See the detailed guide, [Column-Mapped Text Instruction Dataset](llm/column-mapped-text-instruction-dataset.md), for more information.
+
+- **ChatDataset (multi-turn conversations and tool calling)**
+  - Class: `nemo_automodel.components.datasets.llm.ChatDataset`
+  - Use case: multi-turn conversations and tool calling in OpenAI chat format
+  - Sources: local JSON/JSONL or Hugging Face Hub dataset ID
+  - Key args:
+    - `path_or_dataset_id`: path to local file(s) or Hugging Face dataset ID
+    - `tokenizer`: tokenizer instance (required; must have chat-template support)
+    - `split`: dataset split (e.g., "train", "validation")
+    - `name`: dataset configuration/subset name
+    - `seq_length`: maximum sequence length for padding/truncation
+    - `padding`: padding strategy ("do_not_pad", "max_length", etc.)
+    - `truncation`: truncation strategy ("do_not_truncate", "longest_first", etc.)
+    - `start_of_turn_token`: token marking the start of the assistant response (for answer-only loss)
+    - `chat_template`: optional override for the tokenizer's chat template
+  - Notes:
+    - Requires a tokenizer with chat-template support
+    - Supports both single-turn and multi-turn tool calling
+    - Tool definitions are provided in a `tools` field at the conversation level
+    - Tool calls appear in assistant messages via the `tool_calls` field
+    - Tool responses use the `tool` role
+  - Example YAML:
+```yaml
+dataset:
+  _target_: nemo_automodel.components.datasets.llm.ChatDataset
+  path_or_dataset_id: Salesforce/xlam-function-calling-60k
+  split: train
+  tokenizer:
+    _target_: transformers.AutoTokenizer.from_pretrained
+    pretrained_model_name_or_path: google/functiongemma-270m-it
+  seq_length: 2048
+  start_of_turn_token: "<start_of_turn>"
+```
+  - Expected data format (OpenAI messages format):
+```json
+{
+  "messages": [
+    {
+      "role": "user",
+      "content": "What's the weather in Seattle?"
+    },
+    {
+      "role": "assistant",
+      "content": "",
+      "tool_calls": [
+        {
+          "id": "call_1",
+          "type": "function",
+          "function": {
+            "name": "get_weather",
+            "arguments": "{\"city\": \"Seattle\"}"
+          }
+        }
+      ]
+    },
+    {
+      "role": "tool",
+      "tool_call_id": "call_1",
+      "content": "{\"temperature\": 65, \"condition\": \"cloudy\"}"
+    },
+    {
+      "role": "assistant",
+      "content": "It's 65°F and cloudy in Seattle."
+    }
+  ],
+  "tools": [
+    {
+      "type": "function",
+      "function": {
+        "name": "get_weather",
+        "description": "Get current weather for a city",
+        "parameters": {
+          "type": "object",
+          "properties": {
+            "city": {"type": "string"}
+          },
+          "required": ["city"]
+        }
+      }
+    }
+  ]
+}
+```
+  - For single-turn tool calling (one tool call per conversation), omit the tool response and final assistant message:
+```json
+{
+  "messages": [
+    {
+      "role": "user",
+      "content": "Book a table for two at 7pm in Seattle."
+    },
+    {
+      "role": "assistant",
+      "content": "",
+      "tool_calls": [
+        {
+          "id": "call_1",
+          "type": "function",
+          "function": {
+            "name": "book_table",
+            "arguments": "{\"party_size\": 2, \"time\": \"19:00\", \"city\": \"Seattle\"}"
+          }
+        }
+      ]
+    }
+  ],
+  "tools": [
+    {
+      "type": "function",
+      "function": {
+        "name": "book_table",
+        "description": "Book a restaurant table",
+        "parameters": {
+          "type": "object",
+          "properties": {
+            "party_size": {"type": "integer"},
+            "time": {"type": "string"},
+            "city": {"type": "string"}
+          }
+        }
+      }
+    }
+  ]
+}
+```
+  See the [Function Calling guide](llm/toolcalling.md) for an end-to-end example with FunctionGemma.
 
 - **NanoGPT Binary Shards (pretraining)**
   - Class: `nemo_automodel.components.datasets.llm.nanogpt_dataset.NanogptDataset`
@@ -69,7 +195,7 @@ dataset:
 - **Megatron (pretraining; interoperable with pre-tokenized Megatron data)**
   - Class: `nemo_automodel.components.datasets.llm.megatron_dataset.MegatronPretraining`
   - Use case: large-scale LM pretraining over Megatron-LM formatted tokenized corpora
-  - Interoperability: if your corpus has already been tokenized/indexed for Megatron (i.e., `.bin`/`.idx` pairs), you can point Automodel to those assets directly; no re-tokenization required
+  - Interoperability: If your corpus has already been tokenized/indexed for Megatron (i.e., `.bin`/`.idx` pairs), you can point Automodel to those assets directly. No re-tokenization required.
   - Key args: `paths` (single path, glob, weighted list, or per-split dict), `seq_length`, `tokenizer`, `split`, `index_mapping_dir`, `splits_to_build`
   - Example YAML:
 ```yaml
@@ -84,9 +210,7 @@ dataset:
   split: "0.99, 0.01, 0.00" # train, validation, test
   splits_to_build: "train"
 ```
-- See the detailed pretraining guide, [Megatron Core Dataset Pretraining](llm/pretraining.md), which uses MegatronPretraining data.
-
-> ⚠️ Note: Multi-turn conversational and tool-calling/function-calling dataset support is coming soon.
+  See the detailed [pretraining guide](llm/pretraining.md), which uses MegatronPretraining data.
 
 ## Packed Sequence Support
 To reduce padding and improve throughput with variable-length sequences:
@@ -111,9 +235,10 @@ VLM datasets are represented as conversations (message lists) that combine text
 
 Built-in dataset makers (return lists of `conversation` dicts):
 - **RDR items**: `nemo_automodel.components.datasets.vlm.datasets.make_rdr_dataset` (HF: `quintend/rdr-items`)
-- **CORD-V2 receipts**: `nemo_automodel.components.datasets.vlm.datasets.make_cord_v2_dataset` (HF: `naver-clova-ix/cord-v2`)
-- **MedPix-VQA (medical)**: `nemo_automodel.components.datasets.vlm.datasets.make_medpix_dataset`
-- **CommonVoice 17 (audio)**: `nemo_automodel.components.datasets.vlm.datasets.make_cv17_dataset`
+- **CORD-V2 receipts (Consolidated Receipt Dataset for Post-OCR Parsing)**: `nemo_automodel.components.datasets.vlm.datasets.make_cord_v2_dataset` (HF: `naver-clova-ix/cord-v2`)
+- **MedPix-VQA (Medical Pixel Question Answering)**: `nemo_automodel.components.datasets.vlm.datasets.make_medpix_dataset`
+- **CommonVoice 17 (CV17) (audio)**: `nemo_automodel.components.datasets.vlm.datasets.make_cv17_dataset`
+
 
 Each example follows the conversation schema expected by `apply_chat_template`, e.g.:
 ```python
@@ -188,7 +313,7 @@ dataset:
 Where `build_my_dataset` returns either a `datasets.Dataset` or a list/iterator of conversation dicts (for VLM).
 
 ### 3) Use ColumnMappedTextInstructionDataset for most instruction datasets (LLM)
-- Ideal when your data has columns like `instruction`, `input`, `output` but with arbitrary names
+- Ideal when your data has columns like `instruction`, `input`, or `output` but with arbitrary names
 - Supports local JSON/JSONL and HF Hub
 ```yaml
 dataset:
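
For readers tracking the `ChatDataset` addition in `docs/guides/dataset-overview.md` above, here is a minimal Python sketch mirroring the YAML example from the diff. It is illustrative only: the argument names come from the documented key args, and anything beyond them (defaults, return type) is an assumption rather than a confirmed API contract.

```python
# Hypothetical usage sketch for the ChatDataset entry documented above.
# Arguments mirror the YAML example; defaults and return types are assumptions.
from transformers import AutoTokenizer

from nemo_automodel.components.datasets.llm import ChatDataset

# The tokenizer must ship a chat template (per the Notes in the docs above).
tokenizer = AutoTokenizer.from_pretrained("google/functiongemma-270m-it")

dataset = ChatDataset(
    path_or_dataset_id="Salesforce/xlam-function-calling-60k",  # or a local JSON/JSONL path
    tokenizer=tokenizer,
    split="train",
    seq_length=2048,
    start_of_turn_token="<start_of_turn>",  # enables answer-only loss masking
)
```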

examples/llm_finetune/llama3_3/custom_llama3_3_70b_instruct_peft_benchmark.yaml

Lines changed: 2 additions & 2 deletions
@@ -28,7 +28,7 @@ rng:
   ranked: true
 
 model:
-  _target_: nemo_automodel.components.models.llama.model.build_llama_model
+  _target_: nemo_automodel.NeMoAutoModelForCausalLM.from_pretrained
   pretrained_model_name_or_path: meta-llama/Llama-3.3-70B-Instruct
   torch_dtype: bf16
 
@@ -87,4 +87,4 @@ optimizer:
 
 lr_scheduler:
   lr_decay_style: cosine
-  min_lr: 1.0e-6
+  min_lr: 1.0e-6

examples/llm_finetune/llama3_3/custom_llama3_3_70b_instruct_peft_benchmark_2nodes.yaml

Lines changed: 2 additions & 2 deletions
@@ -28,7 +28,7 @@ rng:
   ranked: true
 
 model:
-  _target_: nemo_automodel.components.models.llama.model.build_llama_model
+  _target_: nemo_automodel.NeMoAutoModelForCausalLM.from_pretrained
   pretrained_model_name_or_path: meta-llama/Llama-3.3-70B-Instruct
   torch_dtype: bf16
 
@@ -87,4 +87,4 @@ optimizer:
 
 lr_scheduler:
   lr_decay_style: cosine
-  min_lr: 1.0e-6
+  min_lr: 1.0e-6

nemo_automodel/_transformers/auto_model.py

Lines changed: 47 additions & 10 deletions
@@ -18,6 +18,7 @@
 import logging
 import os
 import types
+from contextlib import contextmanager
 from typing import List, Optional, Union
 
 import torch
@@ -36,10 +37,7 @@
 import nemo_automodel.components.distributed.utils as dist_utils
 from nemo_automodel import __version__
 from nemo_automodel._transformers.registry import ModelRegistry
-from nemo_automodel.components.distributed.init_utils import (
-    get_local_world_size_preinit,
-    get_world_size_safe,
-)
+from nemo_automodel.components.distributed.init_utils import get_local_world_size_preinit, get_world_size_safe
 from nemo_automodel.components.utils.model_utils import resolve_trust_remote_code
 from nemo_automodel.shared.import_utils import safe_import
 from nemo_automodel.shared.utils import dtype_from_str
@@ -49,6 +47,33 @@
 logger = logging.getLogger(__name__)
 
 
+@contextmanager
+def local_torch_dtype(
+    dtype: torch.dtype, model_class_name: str | None = None, default_dtype: torch.dtype = torch.bfloat16
+):
+    """
+    Locally change the torch default dtype to `dtype`, and restore the old one upon exiting the context.
+    If `model_class_name` is provided, it is used to give a more helpful error message if `dtype` is not valid.
+    """
+    # Raise a more helpful error before `torch.set_default_dtype`, which would crash in this case
+    if isinstance(dtype, str):
+        dtype = default_dtype
+    if not dtype.is_floating_point:
+        if model_class_name is not None:
+            error_message = (
+                f"{model_class_name} cannot be instantiated under `dtype={dtype}` as it's not a floating-point dtype"
+            )
+        else:
+            error_message = f"Cannot set `{dtype}` as torch's default as it's not a floating-point dtype"
+        raise ValueError(error_message)
+    original_dtype = torch.get_default_dtype()
+    try:
+        torch.set_default_dtype(dtype)
+        yield
+    finally:
+        torch.set_default_dtype(original_dtype)
+
+
 def _assert_same_signature(original, patched):
     """
     Raise AssertionError if the two call signatures differ.
@@ -157,15 +182,17 @@ def _get_next_fallback_attn(attn_implementation: str) -> str:
     return priorities[0]
 
 
-def _prepare_hf_config_and_flag(pretrained_model_name_or_path, force_hf, kwargs):
+def _prepare_hf_config_and_flag(pretrained_model_name_or_path, force_hf, kwargs, attn_implementation):
     """
     Resolve trust_remote_code default, fetch HF config and determine if model is HF-based.
     """
     kwargs["trust_remote_code"] = kwargs.get(
         "trust_remote_code", resolve_trust_remote_code(pretrained_model_name_or_path)
    )
     hf_config = kwargs.pop("config", None) or AutoConfig.from_pretrained(
-        pretrained_model_name_or_path, trust_remote_code=kwargs["trust_remote_code"]
+        pretrained_model_name_or_path,
+        **kwargs,
+        attn_implementation=attn_implementation,
     )
     architectures = getattr(hf_config, "architectures", None) or []
     is_hf_model = (not architectures or architectures[0] not in ModelRegistry.model_arch_name_to_cls) or force_hf
@@ -358,7 +385,9 @@ def from_pretrained(
         `use_liger_kernel=False` or `use_sdpa_patching=False`
     """
     torch_dtype = dtype_from_str(torch_dtype) if torch_dtype != "auto" else torch_dtype
-    hf_config, is_hf_model = _prepare_hf_config_and_flag(pretrained_model_name_or_path, force_hf, kwargs)
+    hf_config, is_hf_model = _prepare_hf_config_and_flag(
+        pretrained_model_name_or_path, force_hf, kwargs, attn_implementation=attn_implementation
+    )
     tp_size, cp_size, has_packed_sequence = _pop_tp_cp_has_packed(kwargs)
     attn_implementation, use_liger_kernel = _apply_preload_overrides(
         is_hf_model, tp_size, cp_size, has_packed_sequence, attn_implementation, use_liger_kernel
@@ -400,7 +429,10 @@ def _retry(**override):
         _download_model_weights(hf_config, pretrained_model_name_or_path)
         logger.info(f"Using custom model implementation for {architectures[0]}")
         kwargs.pop("trust_remote_code", None)
-        return ModelRegistry.model_arch_name_to_cls[architectures[0]](hf_config, *model_args, **kwargs)
+        # TODO(@akoumpa): restore weights after initialization.
+        model_cls = ModelRegistry.model_arch_name_to_cls[architectures[0]]
+        with local_torch_dtype(torch_dtype, model_cls.__name__):
+            return model_cls(hf_config)
 
     # 3. fallback to parent class
     model = None
@@ -533,7 +565,11 @@ def _retry(**override):
 
     # handle model_id passed as config
     if isinstance(config, str):
-        config = AutoConfig.from_pretrained(config, trust_remote_code=kwargs.get("trust_remote_code", False))
+        config = AutoConfig.from_pretrained(
+            config,
+            trust_remote_code=kwargs.get("trust_remote_code", False),
+            attn_implementation=attn_implementation,
+        )
     # 1. if force_hf is True, we will use the parent class to load and return the model as is
     if force_hf:
         return cls._from_config_parent_class(
@@ -547,7 +583,8 @@ def _retry(**override):
     # 2. If we have a custom model implementation available, we prioritize that over HF
     architectures = get_architectures(config)
     if len(architectures) > 0 and architectures[0] in ModelRegistry.model_arch_name_to_cls:
-        return ModelRegistry.model_arch_name_to_cls[architectures[0]](config, *model_args, **kwargs)
+        with local_torch_dtype(torch_dtype, ModelRegistry.model_arch_name_to_cls[architectures[0]].__name__):
+            return ModelRegistry.model_arch_name_to_cls[architectures[0]](config)
 
     # 3. fallback to parent class
     model = None
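
The new `local_torch_dtype` context manager is small enough to demonstrate in isolation. Below is a self-contained sketch of the same pattern using only the public torch API; it keeps just the dtype swap and restore, dropping the string fallback and model-class error message of the actual helper.

```python
from contextlib import contextmanager

import torch


@contextmanager
def local_torch_dtype(dtype: torch.dtype):
    # Same core behavior as the helper added in this commit: swap torch's
    # global default dtype for the duration of the block, then restore it
    # even if the body raises.
    original_dtype = torch.get_default_dtype()
    try:
        torch.set_default_dtype(dtype)
        yield
    finally:
        torch.set_default_dtype(original_dtype)


# Modules constructed inside the block allocate their parameters in bf16
# without an explicit dtype argument, which is why from_pretrained/from_config
# wrap the custom-model constructors with it.
with local_torch_dtype(torch.bfloat16):
    layer = torch.nn.Linear(4, 4)

assert layer.weight.dtype == torch.bfloat16
assert torch.get_default_dtype() == torch.float32  # assuming the usual float32 default
```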

nemo_automodel/components/models/llama/__init__.py

Lines changed: 0 additions & 6 deletions
@@ -11,9 +11,3 @@
 # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 # See the License for the specific language governing permissions and
 # limitations under the License.
-
-"""Custom Llama model implementation for NeMo Automodel."""
-
-from nemo_automodel.components.models.llama.model import LlamaForCausalLM, build_llama_model
-
-__all__ = ["LlamaForCausalLM", "build_llama_model"]
