Changes from 14 commits
572 commits
0d88bca
fix: fixing the sequence parallel related issue in mcore path (#1487)
youngeunkwon0405 Nov 14, 2025
aadbebf
fix: improve local eval config and doc (#1528)
yuki-97 Nov 17, 2025
dff9072
docs: Refactor Home Page and New About Section (#1338)
jgerh Nov 17, 2025
9605fa4
fix: Incompatible configuration between reward normalization and the …
ffrujeri Nov 18, 2025
0a8915a
feat: Support for nano-v2 (#1514)
yfw Nov 18, 2025
49f3ab2
fix: Update Penguin tests to use renamed resource server (#1540)
shashank3959 Nov 19, 2025
f543b37
fix: honor mlflow server artifact_location (#1536) (#1538)
clumsy Nov 19, 2025
8b235a0
build: Update docker file to include OSS NOTICES.txt (#1544)
chtruong814 Nov 19, 2025
da22cf0
perf: perf script change for qwen30b-a3b (#1526)
youngeunkwon0405 Nov 20, 2025
cda42b0
fix: removed sliding_window_overwrite (#1541)
ahmadki Nov 22, 2025
35e8a05
chore: add a research template project (#1278)
terrykong Nov 23, 2025
660698d
docs: remove doc pyproject toml (#1561)
lbliii Nov 24, 2025
5d83ff9
perf: [Perf script] QWEN3 30B-A3B tensor_parallel_size from 4 to 2 (#…
youngeunkwon0405 Nov 24, 2025
7457e87
feat: per-worker active/idle timeline + IFB size logging (#1534)
youngeunkwon0405 Nov 25, 2025
48a44a3
chore: Improve checkpoint loading error messages with common issue an…
ahmadki Nov 25, 2025
e254950
feat: Fp8 moe rollout (#1446)
guyueh1 Nov 27, 2025
d1710c2
fix: Fix the sequence padding for FP8 case (#1569)
guyueh1 Nov 28, 2025
43cb404
build: Use dynamic engine for generate. (#1502)
shanmugamr1992 Dec 1, 2025
92a84d5
docs: Create performance-summary.md for NeMo RL (#1560)
snowmanwwg Dec 1, 2025
ceab63e
docs: Update nvidia-sphinx-theme (#1584)
chtruong814 Dec 2, 2025
d60c621
feat: KV cache quantization support in fp8 rollout in GRPO (#1212)
sharonyu-115 Dec 2, 2025
11f4e59
fix: Use Float16Module even when defer_fp32_logits=True (#1537)
yfw Dec 3, 2025
b740a54
feat: plot vllm internal metrics to the wandb log (#1567)
youngeunkwon0405 Dec 3, 2025
b1aad0c
docs: add v0.4 news and minor touch up to front page readme (#1268)
euronymous-aithal Dec 3, 2025
3f0dfc7
feat: Add moe load balancing metrics (#1520)
yfw Dec 3, 2025
b4cb62b
feat: force on-policy ratio to 1 (#1529)
yfw Dec 4, 2025
444672b
fix: ADDING DOCS (#1595)
shanmugamr1992 Dec 4, 2025
06c7efc
refactor: Introduce BasePolicyWorker (#1585)
ashors1 Dec 4, 2025
e7c1c7b
chore: rename penguin -> nemo_gym and add the gym submodule (#1587)
terrykong Dec 5, 2025
6949de2
feat: allow uv-less execution and fingerprint the environment (#1491)
terrykong Dec 5, 2025
6537fd7
add dep for causal-conv1d
Dec 8, 2025
f500593
add conversation-based dataset
yuanhangsu1986 Dec 12, 2025
beb2501
add avlm config yaml
yuanhangsu1986 Dec 12, 2025
ce17500
import bugfix
Dec 12, 2025
4f08ca6
indentation fix
yuanhangsu1986 Dec 12, 2025
b588ec8
add GeneralConversationsJsonlDataset initializer
yuanhangsu1986 Dec 12, 2025
d4ea08a
bugfix
Dec 12, 2025
84eda79
process multimodal data
yuanhangsu1986 Dec 12, 2025
23b64db
use decord for video and audio loading
yuanhangsu1986 Dec 12, 2025
ca941a7
move the sample processing to sft_processor
yuanhangsu1986 Dec 13, 2025
cd5bc3d
video output bugfix
Dec 12, 2025
374632e
move multimodal functions to multimodal_utils.py; add video, audio se…
yuanhangsu1986 Dec 13, 2025
628712a
bugfix
Dec 13, 2025
47af67d
bugfix reported by coderabbitai
yuanhangsu1986 Jan 8, 2026
ec621d5
feat: log generation ISL/OSL histogram to wandb (#1594)
youngeunkwon0405 Dec 5, 2025
550d8e8
feat: Enable Ray dashboard for Ray state API (#1602)
pjin-nvidia Dec 5, 2025
6337574
docs: update roadmap post v0.4 (#1607)
euronymous-aithal Dec 7, 2025
dad90f0
fix: add H200 TFLOPS (#1543)
clumsy Dec 9, 2025
2dab255
fix: Set validation accuracy to mean of rewards to handle non-[0,1] r…
alexandery-nvidia Dec 11, 2025
02bf9bd
feat: LoRA SFT support for DTensorV2 path (#1556)
samodi-nv Dec 13, 2025
b1255c6
fix: swanlab logger error caused by `define_metric` (#1615)
Zeyi-Lin Dec 13, 2025
a36a058
refactor: refactor env and data processor & add nemotron super 49b re…
yuki-97 Dec 13, 2025
f6743e6
fix: Sort rollout outputs to match inputs order + gym bump (#1627)
yfw Dec 14, 2025
e60de8c
chore: update megatron dev (11/21/2025) / mbridge (11/28/2025) (#1568)
yaoyu-33 Dec 14, 2025
91d228d
docs: Add SkyRL to inspired libraries list (#1632)
snowmanwwg Dec 15, 2025
6ad57e5
fix: Set use_flashinfer_fused_rope to False (#1636)
shanmugamr1992 Dec 15, 2025
c26200a
chore: Enable LoRA Nightly Test (#1634)
RayenTian Dec 15, 2025
0fc7f84
docs: Revise news section for nemotron v3 and DAPO algorithm support …
snowmanwwg Dec 16, 2025
91421ec
chore: fix grpo functional test metric (#1643)
RayenTian Dec 16, 2025
d31a010
feat: add support from building images using vllm from private repos …
terrykong Dec 17, 2025
ba50efb
feat: Necessary changes for Gym GRPO tutorial (#1630)
bxyu-nvidia Dec 17, 2025
4dd9658
perf: Add qwen3 30b-a3b async-8-off recipe (#1642)
youngeunkwon0405 Dec 17, 2025
1fbc75d
feat: Add GPT-OSS support via mcore (#1452)
ashors1 Dec 17, 2025
72476ea
chore: Bump vllm to 0.11.2, torch to 2.9, transformers to 4.57.1 (#1563)
yfw Dec 18, 2025
be8eaca
fix: Support datasets saved with save_to_disk in ResponseDataset (#1610)
sahgerlad Dec 18, 2025
4029cfe
fix: Handle disabled validation in SFT training (#1611)
sahgerlad Dec 19, 2025
5ee9272
fix: Fix crash when using cp in dtensor path (#1663)
yfw Dec 19, 2025
c0d933b
fix: Fix Fp8 sequence padding for PP>1 case (#1579)
guyueh1 Dec 20, 2025
3cfce26
test: Perf recipe for v0.5 (#1667)
guyueh1 Dec 20, 2025
ca91716
fix: Fix fp8 after vllm v0.11.2 bump (#1660)
guyueh1 Dec 20, 2025
7efbdd3
fix: Fix crash when using activation_checkpointing (#1676)
yfw Dec 22, 2025
4580984
feat: add dapo recipe and test (#1617)
ZhiyuLi-Nvidia Dec 22, 2025
e422e47
feat: DTensorPolicyV2 GPT-OSS SFT support (#1470)
adil-a Dec 23, 2025
267e700
fix: grad norm calculation for dtensor v2 (#1693)
hemildesai Dec 24, 2025
01f8d95
feat: Add Nemotron‑3 Nano 30B A3B BF16 SFT nightly tests (FSDP2, +LoR…
RayenTian Dec 24, 2025
6d0eac6
feat: Support prefetching of specific envs (#1692)
hemildesai Dec 25, 2025
4e1895c
fix: Fix DTensor slice crash after PyTorch 2.9 bump (#1689)
zpqiu Jan 2, 2026
ad6bd9e
fix: grad norm check for automodel gpt oss nightly (#1708)
hemildesai Jan 5, 2026
a3d532b
fix: relax nanov3 nightly test metrics strict (#1712)
RayenTian Jan 5, 2026
5e8ee64
fix: on GB200 use single-thread checkpoint save to avoid Cpu OOM (#1703)
guyueh1 Jan 5, 2026
0a80425
perf: [Perf recipe] Change TP 16->32 for deepseek GB200 sync benchmar…
guyueh1 Jan 5, 2026
b2695c1
docs: Add doc for nano-v3 (#1694)
yfw Jan 5, 2026
486555a
fix: Disable cudnn sdpa backend when using activation checkpointing (…
yfw Jan 6, 2026
c42514d
fix: log metrics that can be coerced to scalars (#1723)
terrykong Jan 6, 2026
c900202
fix: use median instead of mean for logprob error for stability in ni…
terrykong Jan 7, 2026
2f8fb44
fix: gemma3 27b must now have skip_tokenizer_init=False in vllm (#1721)
terrykong Jan 7, 2026
83b9476
fix: fix several nightly tests that were flaky (#1724)
terrykong Jan 7, 2026
4115085
fix: apply offloading change from v2 to v1 (#1726)
terrykong Jan 7, 2026
0a47e76
fix: mcore generation config restored in nightly test (#1720)
terrykong Jan 8, 2026
949380c
feat: Megatron SFT LoRA (#1629)
arendu Jan 8, 2026
c905d54
build: Update aiohttp and urlib3 (#1746)
chtruong814 Jan 9, 2026
de09033
fix: patch pytorch aten.alias.default shard strategy (#1728)
RayenTian Jan 9, 2026
e59175c
feat: RL support for custom moe models in dtensor v2 (#1695)
hemildesai Jan 9, 2026
0bbe2ee
fix: split dtensorv1 vllm dependency (#1638)
yuki-97 Jan 10, 2026
137bf66
build: Resolve CVEs for gnupg and aiohttp (#1755)
chtruong814 Jan 10, 2026
78e6142
build: Bump mamba to d68d16e and causal-conv1d to 67e0a9d (#1759)
chtruong814 Jan 12, 2026
7d14c21
ci: Clean up disk space for lint check (#1768)
chtruong814 Jan 13, 2026
380e22b
docs: Adding dtensor TP debugging summary (#1767)
joyang-nv Jan 15, 2026
02e310d
docs: Update image syntax in dtensor TP accuracy guide for consistenc…
RayenTian Jan 15, 2026
28edf65
fix: fix formatting for async docs (#1783)
parthchadha Jan 15, 2026
c8a2c01
ci: Add nightly and release tests for gb200 (#1788)
chtruong814 Jan 16, 2026
3797917
feat: NeMo Gym refresh 20260113 (#1773)
bxyu-nvidia Jan 18, 2026
0600598
perf: DeepEP interface in megatron backend (#1794)
guyueh1 Jan 20, 2026
6d870b7
feat: refactor init of dtensor policy v2 (#1709)
hemildesai Jan 20, 2026
7ef7501
build: Update pyasn1 to >= 0.6.2 (#1791)
chtruong814 Jan 20, 2026
9c97e47
docs: Adding k8 guide (#1764)
vinhngx Jan 20, 2026
d1ec03a
test: Add grpo-qwen3-30ba3b-4n8g-40k config to performance test suite…
sfawzy-nv Jan 21, 2026
b5c91a2
docs: v0.5 performance results update (#1772)
guyueh1 Jan 21, 2026
deb8af1
docs: model support page (#1799)
terrykong Jan 21, 2026
57ffb0b
refactor: split train and val dataset in response dataset (#1649)
yuki-97 Jan 22, 2026
f34986e
docs: fix pytorch anchor link: PYTORCH_CUDA_ALLOC_CONF->PYTORCH_ALLOC…
terrykong Jan 22, 2026
d24b812
fix: log validation data (#1805)
parthchadha Jan 22, 2026
0b562e7
feat: Add SGLang rollout backend and tests (#1674)
RolaoDenthu Jan 22, 2026
3b16569
refactor: reuse setup data (#1808)
yuki-97 Jan 23, 2026
2633175
feat: refactor megatron init (#1646)
ashors1 Jan 23, 2026
3122477
build: Bump setuptools >= 80.10.1 and wheel >= 0.46.2 (#1822)
chtruong814 Jan 25, 2026
3dec4d9
build: Bump setuptools to 80.10.2 (#1830)
chtruong814 Jan 27, 2026
3e34e07
feat: refactor common data utilities of dtensor policy v2 (#1710)
hemildesai Jan 28, 2026
bb8fa12
feat: add FT launcher config and resiliency dependency [1/4] (#1824)
yashaswikarnati Jan 28, 2026
fd44882
fix: move ft_config.yaml outside examples/configs (#1839)
yashaswikarnati Jan 29, 2026
f0f5bc4
docs: Add notes for FP8 recipe in docs/fp8.md (#1829)
guyueh1 Jan 29, 2026
3844367
feat: Timer for the data sharding and job submission (#1802)
guyueh1 Jan 29, 2026
9386219
feat: Allow loading of more general data types (#1834)
nathan-az Jan 30, 2026
1af304e
chore: add assert for dtensor v2 cpu offload (#1817)
yuki-97 Jan 30, 2026
17ea691
build: Bump protobuf to 6.33.5 and python-multipart to 0.0.22 (#1850)
chtruong814 Jan 30, 2026
5fa4b13
feat: refactor megatron data utils (#1651)
ashors1 Jan 31, 2026
604e979
feat: support stateless group and decouple vLLM in train backend (#1842)
shuyixiong Jan 31, 2026
dc97cd5
docs: update readme post 0.5 (#1856)
euronymous-aithal Feb 1, 2026
6e4fa59
docs: fix readme post 0.5 (#1858)
euronymous-aithal Feb 2, 2026
3974004
feat: Support lora in dtensor grpo workflow by merging weight (#1797)
RayenTian Feb 2, 2026
e1106f2
chore: add nanov3 lora sft recipe to doc (#1860)
RayenTian Feb 2, 2026
8bd6a5d
ci: Allow repo to self publish docs (#1821)
chtruong814 Feb 2, 2026
9033633
fix: fix statistic of probs_ratio_clamped_min/max (#1818)
yuki-97 Feb 3, 2026
1f2826f
feat: support multiple datasets for response dataset (#1691)
yuki-97 Feb 3, 2026
759c14e
refactor: unify entrypoint for different envs (#1841)
yuki-97 Feb 3, 2026
d624f88
feat: add lora config for dpo dtensor backend (#1826)
RayenTian Feb 3, 2026
7876c84
fix: add log_plot to the logger interface (#1862)
terrykong Feb 3, 2026
2462f16
add preprocessor
yuanhangsu1986 Feb 4, 2026
f6ac015
bugfix
Feb 4, 2026
f6bb285
add working example configs for video
Feb 10, 2026
33e1c08
add unit tesets
yuanhangsu1986 Feb 10, 2026
1910ed8
refactor: split train and val dataset in preference dataset (#1763)
yuki-97 Feb 4, 2026
b3833c0
chore: add assert for tp4 batch variant accuracy issue (#1861)
yuki-97 Feb 4, 2026
1d56d3f
fix: prevent crash in rollout metric calculation when just 1 value (#…
terrykong Feb 4, 2026
a0e99c9
feat: add val_at_end for all algorithms (#1863)
terrykong Feb 4, 2026
06b7076
ci: Add secrets detector (#1854)
chtruong814 Feb 4, 2026
d97e109
feat: Add bisecting tooling for nightly test regressions (#1223)
terrykong Feb 5, 2026
9315b36
docs: add release runs to front page readme for 0.5 (#1879)
terrykong Feb 5, 2026
58cd571
fix: Remove redundant nested loop in `move_model` (#1880)
nathan-az Feb 6, 2026
294cee9
docs: Fix a step time number for deepseek (#1890)
guyueh1 Feb 6, 2026
a7ae356
feat: refactor train utilities for dtensor policy v2 (#1757)
hemildesai Feb 6, 2026
312f3c3
feat: add speculative decoding during post-training (#1785)
isomap Feb 6, 2026
29a10cc
feat: Add Nemotron‑3 Nano 30B A3B GRPO nightly tests (FSDP2, +LoRA) …
RayenTian Feb 7, 2026
345119a
ci: Fix docs publishing (#1898)
chtruong814 Feb 7, 2026
560cf3b
feat: Implement ProRLv2 recipe (#1809)
hijkzzz Feb 7, 2026
89c4ff5
feat: add way of excluding generation backends (#1855)
terrykong Feb 9, 2026
9f9047e
feat: Update mlflow to work better with env vars, manual run id, fix …
nathan-az Feb 10, 2026
91e18c3
feat: unify nemogym dataset (#1807)
yuki-97 Feb 10, 2026
f1ab10b
feat: improve dataset (#1893)
yuki-97 Feb 10, 2026
a53eb72
fix: fix enable_seq_packing and apply_temperature_scaling in DTensor …
yuki-97 Feb 10, 2026
2d9c6e1
chore: Centralize OmegaConf resolver registration (#1882)
RayenTian Feb 10, 2026
2294a23
fix: Fix DCP-to-HF conversion for model-wrapped checkpoints (#1881)
RayenTian Feb 10, 2026
f0fca1a
add support of split_validation_size
yuanhangsu1986 Feb 11, 2026
82e4d92
add configs for testing general_conversation_dataset
yuanhangsu1986 Feb 11, 2026
79561fd
change valid batch size
yuanhangsu1986 Feb 11, 2026
8df3f86
update to working config
Feb 11, 2026
ba31c0c
add interleaved multiturn test and singleturn test
yuanhangsu1986 Feb 11, 2026
aa4623e
bugfix for general_conversations_data
Feb 11, 2026
dbdaa8f
add daily-omni unit test
yuanhangsu1986 Feb 11, 2026
fe29d2e
add interleaved multiturn test and singleturn test
yuanhangsu1986 Feb 11, 2026
53148cb
fix: add missing functional test (#1883)
yuki-97 Feb 11, 2026
5428505
fix: fix and re-enable rm env functional test (#1905)
RayenTian Feb 11, 2026
8d27913
feat: start nemo gym and other environments with cached venvs (#1927)
terrykong Feb 11, 2026
25dbcc0
fix: Mxfp8 training fix sequence padding (#1884)
guyueh1 Feb 11, 2026
ca96880
fix: use seq_length instead of padded_seq_length for topk output padd…
zpqiu Feb 12, 2026
c032a1c
fix: Update sglang source (#1926)
RolaoDenthu Feb 12, 2026
8ffb6e4
chore: bump mcore and mbridge (#1902)
yfw Feb 13, 2026
d0bbda9
feat: refactor mcore train/forward utilities (#1654)
ashors1 Feb 13, 2026
3452719
docs: Document Gym + RL integration design (#1762)
ananthsub Feb 13, 2026
8803231
feat: retry rollout if generation_logprobs contains NaN (#1885)
guyueh1 Feb 13, 2026
da6c08c
feat: Support build custom flashinfer (#1886)
guyueh1 Feb 14, 2026
c784627
fix: async llm engine didnt have get_metrics() (#1943)
terrykong Feb 14, 2026
f08d0d1
feat: Mask sequences with high logprob error (#1838)
yfw Feb 14, 2026
c88ffdc
feat: ProRLv2 - add seq-mask-tis truncated importance sampling type (…
hijkzzz Feb 16, 2026
171fd51
ci: Update release-docs workflow to use FW-CI-templates v0.72.0 (#1965)
chtruong814 Feb 17, 2026
2752d38
fix: speedup minimize and minimize-check in config_cli (#1964)
hemildesai Feb 17, 2026
bc572ef
docs: update features.md to reflect v0.5 release and v0.6 roadmap (#1…
seonjinn Feb 17, 2026
4e9791b
fix: add mask seq with high logp err to nemo gym config (#1980)
cmunley1 Feb 18, 2026
8796b22
chore: upgrade wandb to 0.25+ (#1979)
Kipok Feb 18, 2026
5e6bfa9
feat: Remove do_not_average_loss (#1988)
yfw Feb 20, 2026
6ebfc25
chore: rename penguin -> nemo_gym and add the gym submodule (#1587)
terrykong Dec 5, 2025
3789fb7
feat: allow uv-less execution and fingerprint the environment (#1491)
terrykong Dec 5, 2025
66951de
add conversation-based dataset
yuanhangsu1986 Dec 12, 2025
0e2e450
add GeneralConversationsJsonlDataset initializer
yuanhangsu1986 Dec 12, 2025
15a4f08
bugfix
Dec 12, 2025
9e99c72
process multimodal data
yuanhangsu1986 Dec 12, 2025
225c715
use decord for video and audio loading
yuanhangsu1986 Dec 12, 2025
ec210e5
video output bugfix
Dec 12, 2025
fe0626e
move multimodal functions to multimodal_utils.py; add video, audio se…
yuanhangsu1986 Dec 13, 2025
e8eb181
bugfix
Dec 13, 2025
2b8c9d2
bugfix reported by coderabbitai
yuanhangsu1986 Jan 8, 2026
b0d9c34
feat: log generation ISL/OSL histogram to wandb (#1594)
youngeunkwon0405 Dec 5, 2025
be7d057
feat: LoRA SFT support for DTensorV2 path (#1556)
samodi-nv Dec 13, 2025
f9ba596
refactor: refactor env and data processor & add nemotron super 49b re…
yuki-97 Dec 13, 2025
08b6541
fix: Sort rollout outputs to match inputs order + gym bump (#1627)
yfw Dec 14, 2025
eb8555a
chore: update megatron dev (11/21/2025) / mbridge (11/28/2025) (#1568)
yaoyu-33 Dec 14, 2025
83aef80
fix: Set use_flashinfer_fused_rope to False (#1636)
shanmugamr1992 Dec 15, 2025
e837538
chore: Enable LoRA Nightly Test (#1634)
RayenTian Dec 15, 2025
84ba2f7
chore: Bump vllm to 0.11.2, torch to 2.9, transformers to 4.57.1 (#1563)
yfw Dec 18, 2025
a28ed19
fix: Handle disabled validation in SFT training (#1611)
sahgerlad Dec 19, 2025
f223ce8
feat: DTensorPolicyV2 GPT-OSS SFT support (#1470)
adil-a Dec 23, 2025
afba602
feat: Megatron SFT LoRA (#1629)
arendu Jan 8, 2026
6ee2d73
feat: RL support for custom moe models in dtensor v2 (#1695)
hemildesai Jan 9, 2026
acfa334
fix: split dtensorv1 vllm dependency (#1638)
yuki-97 Jan 10, 2026
19c4e63
feat: NeMo Gym refresh 20260113 (#1773)
bxyu-nvidia Jan 18, 2026
534422d
perf: DeepEP interface in megatron backend (#1794)
guyueh1 Jan 20, 2026
edf4412
feat: refactor init of dtensor policy v2 (#1709)
hemildesai Jan 20, 2026
f6052a3
refactor: split train and val dataset in response dataset (#1649)
yuki-97 Jan 22, 2026
356b838
feat: Add SGLang rollout backend and tests (#1674)
RolaoDenthu Jan 22, 2026
a4b45b4
refactor: reuse setup data (#1808)
yuki-97 Jan 23, 2026
e54a937
feat: refactor common data utilities of dtensor policy v2 (#1710)
hemildesai Jan 28, 2026
b3f25fc
feat: add FT launcher config and resiliency dependency [1/4] (#1824)
yashaswikarnati Jan 28, 2026
0a9c02b
fix: move ft_config.yaml outside examples/configs (#1839)
yashaswikarnati Jan 29, 2026
e0aad8f
feat: refactor megatron data utils (#1651)
ashors1 Jan 31, 2026
4a56388
feat: support stateless group and decouple vLLM in train backend (#1842)
shuyixiong Jan 31, 2026
e7d4501
feat: Support lora in dtensor grpo workflow by merging weight (#1797)
RayenTian Feb 2, 2026
c712c40
indentation bugfix
yuanhangsu1986 Feb 3, 2026
ddc6815
feat: support multiple datasets for response dataset (#1691)
yuki-97 Feb 3, 2026
efa7935
refactor: unify entrypoint for different envs (#1841)
yuki-97 Feb 3, 2026
cf15a2d
add preprocessor
yuanhangsu1986 Feb 4, 2026
66ce0da
bugfix
Feb 4, 2026
b516a7c
add working example configs for video
Feb 10, 2026
ee790aa
refactor: split train and val dataset in preference dataset (#1763)
yuki-97 Feb 4, 2026
810fd9c
feat: refactor train utilities for dtensor policy v2 (#1757)
hemildesai Feb 6, 2026
761a063
feat: Implement ProRLv2 recipe (#1809)
hijkzzz Feb 7, 2026
3a793d2
feat: add way of excluding generation backends (#1855)
terrykong Feb 9, 2026
2ce8b83
feat: unify nemogym dataset (#1807)
yuki-97 Feb 10, 2026
28630e2
add support of split_validation_size
yuanhangsu1986 Feb 11, 2026
3473c25
add daily-omni dataset unit test; add general_conversations_dataset u…
yuanhangsu1986 Feb 11, 2026
2eb770e
fix: add missing functional test (#1883)
yuki-97 Feb 11, 2026
6dd98b0
add preprocessor to setup_response_data for rl training
yuanhangsu1986 Feb 13, 2026
fab4e46
add preprocessor for preference datasets as well
yuanhangsu1986 Feb 13, 2026
ecc68e9
lint fixes
yuanhangsu1986 Feb 13, 2026
513a94f
lint fixes
yuanhangsu1986 Feb 20, 2026
a5e0341
Merge remote-tracking branch 'upstream/main' into yuanhangs_dev
yuanhangsu1986 Feb 21, 2026
5f2744d
Merge remote-tracking branch 'upstream/main' into yuanhangs_dev
Feb 21, 2026
a6f105d
merge with the yuanhangs_dev
yuanhangsu1986 Feb 21, 2026
a7295d1
update Megatron-LM to the latest commit
yuanhangsu1986 Feb 21, 2026
b355e22
docstring fix
yuanhangsu1986 Feb 21, 2026
d8267b3
move load_video_kwargs,load_audio_kwargs from global to get_multimoda…
yuanhangsu1986 Feb 22, 2026
14 changes: 14 additions & 0 deletions examples/configs/sft_avlm.yaml
@@ -0,0 +1,14 @@
defaults:
- sft_vlm_3B.yaml

policy:
tokenizer:
video:
num_frames: 16

data:
dataset_name: GeneralConversationsJsonlDataset
train_data_path: /lustre/fsw/portfolios/llmservice/users/yuanhangs/codes/megatron-lm-omcat/megatron-lm-vlm2/examples/multimodal/avlm/test/datasets/miradata_bat1_filtered_vision_5min_10000.jsonl
val_data_path: /lustre/fsw/portfolios/llmservice/users/yuanhangs/codes/megatron-lm-omcat/megatron-lm-vlm2/examples/multimodal/avlm/test/datasets/miradata_bat1_filtered_vision_5min_100.jsonl
train_media_data_dir: /lustre/fsw/portfolios/edgeai/projects/edgeai_riva_rivamlops/data/videomme/MiraData/video/batch1/5min
val_media_data_dir: /lustre/fsw/portfolios/edgeai/projects/edgeai_riva_rivamlops/data/videomme/MiraData/video/batch1/5min
coderabbitai (bot, Contributor) commented on Dec 15, 2025:

⚠️ Potential issue | 🟠 Major

Replace hardcoded user-specific paths with placeholders.

The example config contains absolute paths to user-specific directories on an internal filesystem (/lustre/fsw/portfolios/llmservice/users/yuanhangs/...). These paths won't work for other users and expose internal infrastructure details.

Consider using placeholder paths that users can customize:

 data:
   dataset_name: GeneralConversationsJsonlDataset
-  train_data_path: /lustre/fsw/portfolios/llmservice/users/yuanhangs/codes/megatron-lm-omcat/megatron-lm-vlm2/examples/multimodal/avlm/test/datasets/miradata_bat1_filtered_vision_5min_10000.jsonl
-  val_data_path: /lustre/fsw/portfolios/llmservice/users/yuanhangs/codes/megatron-lm-omcat/megatron-lm-vlm2/examples/multimodal/avlm/test/datasets/miradata_bat1_filtered_vision_5min_100.jsonl
-  train_media_data_dir: /lustre/fsw/portfolios/edgeai/projects/edgeai_riva_rivamlops/data/videomme/MiraData/video/batch1/5min
-  val_media_data_dir: /lustre/fsw/portfolios/edgeai/projects/edgeai_riva_rivamlops/data/videomme/MiraData/video/batch1/5min
+  train_data_path: /path/to/train_data.jsonl  # Replace with your training data path
+  val_data_path: /path/to/val_data.jsonl  # Replace with your validation data path
+  train_media_data_dir: /path/to/train_media/  # Directory containing training media files
+  val_media_data_dir: /path/to/val_media/  # Directory containing validation media files
🤖 Prompt for AI Agents
In examples/configs/sft_avlm.yaml around lines 11 to 14, the file contains
hardcoded, user-specific absolute paths that expose internal infrastructure and
will fail for other users; replace those literal paths with configurable
placeholders (e.g., ${DATA_DIR}/train.jsonl, ${DATA_DIR}/val.jsonl,
${MEDIA_DIR}/train_media, ${MEDIA_DIR}/val_media) or relative paths and update
any docs/README to instruct users to set the environment variables or edit the
placeholders to point to their local dataset and media directories before
running.
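
The `${DATA_DIR}`-style placeholders suggested above can be expanded from environment variables before the paths are used. A minimal sketch, assuming nothing about the real NeMo RL config loader — the helper name and config shape here are illustrative only:

```python
import os

def resolve_placeholders(cfg: dict) -> dict:
    """Expand ${VAR}-style environment placeholders in string config values."""
    return {
        key: os.path.expandvars(value) if isinstance(value, str) else value
        for key, value in cfg.items()
    }

# Usage: the user sets DATA_DIR once instead of editing each path.
os.environ["DATA_DIR"] = "/my/datasets"
cfg = {
    "train_data_path": "${DATA_DIR}/train.jsonl",
    "val_data_path": "${DATA_DIR}/val.jsonl",
}
resolved = resolve_placeholders(cfg)
print(resolved["train_data_path"])  # /my/datasets/train.jsonl
```

Projects built on OmegaConf (as this config stack appears to be) could alternatively register a custom resolver so the interpolation happens at config-load time.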

Contributor commented:

Agree with coderabbit's comment here about not putting our internal paths in the config. Is this data also available publicly somewhere? It may be useful to provide details (e.g. under https://github.com/NVIDIA-NeMo/RL/tree/main/docs/guides) for how someone could obtain and prepare this dataset. If this data isn't available publicly, is there an alternative publicly available dataset in a similar format we can use as an example?

coderabbitai (bot) commented:

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Search for documentation about datasets in the repo
fd -t f README.md docs/guides

# Search for any references to MiraData in the codebase
rg -n -C2 "MiraData" --type yaml --type py --type md

# Look for dataset documentation or examples
fd -t f -e md . docs/

# Check if there's any documentation about GeneralConversationsJsonlDataset format
rg -n -C5 "GeneralConversationsJsonlDataset" --type md --type py -g '!*.pyc'

Length of output: 8080


🏁 Script executed:

#!/bin/bash
# Get the full docstring and format details from GeneralConversationsJsonlDataset
cat nemo_rl/data/datasets/response_datasets/general_conversations_dataset.py | head -80

# Check for any example datasets mentioned in docs
rg -n -C3 -i "dataset.*example|example.*dataset|public.*dataset" docs/ --type md

Length of output: 13938


🌐 Web query:

MiraData video dataset public availability

💡 Result:

Short answer: MiraData's metadata and code are publicly released, but the raw video files are not hosted directly — the project publishes meta CSVs (versions: 330K, 93K, 42K, 9K), download scripts, and instructions so users can fetch the original videos themselves; the repo notes videos come from Internet sources and are subject to copyright and license restrictions (GPL‑3.0). [1][2][3]

Sources:

  • Official MiraData GitHub (meta files, download scripts, license, note about videos). [1]
  • Mira / MiraData project repo (release notes calling out MiraData‑v1 and dataset availability). [2]
  • MiraData arXiv paper describing dataset composition and availability. [3]

🌐 Web query:

public video conversation dataset QA format

💡 Result:

Here are common public video (conversation / video‑QA) datasets and the typical QA annotation formats they use:

  • AVSD (Audio‑Visual Scene‑Aware Dialog) — dialog-style QA: each sample contains video_id, caption/summary and a Dialog array of 10 rounds (question, answer). Also a multiple‑choice “options” variant with answer_options and gt_index. [1]

  • TVQA — clip-level multiple‑choice QA for video+subtitles: each item has video clip id, start/end (localized moment), question, four answer candidates and the correct answer index. Frames and subtitles provided alongside. [2][3]

  • TGIF‑QA — short GIFs, multiple QA tasks (FrameQA, Count, Action, Transition): annotations provided as CSV/TSV per task (video_name, question, answer, etc.). [4][5]

  • MSVD‑QA / MSRVTT‑QA / ActivityNet‑QA — open‑ended QA: JSON files with entries like {video_id / video_name, question, answer, id}; train/val/test splits. Many repos supply CSV variants for model input. [6][7]

  • HowToVQA / HowToVQA69M / HowTo100M‑derived — large narrated video QA for pretraining: usually distributed as CSV/PKL lists (video_id, timestamp, question, answer) and separate feature files (S3D/S3D features). [8]

Typical formats and fields you’ll encounter (practical templates):

  • Multiple‑choice (CSV/JSON): {video_id, start_time, end_time, question, option0, option1, option2, option3, answer_index, qa_id}.
  • Open‑ended (JSON/CSV): {video_id, question, answer, question_id, question_type (optional)}.
  • Dialog (AVSD style JSON): {video_id, caption, summary, dialog: [{turn:1, question: "...", answer: "..."}, ...], (optional) answer_options, gt_index}.

If you want, I can:

  • fetch/download one specific dataset and show an exact example QA file, or
  • convert any of the above formats into a uniform CSV/JSON schema you can use for model training. Which would you like?

Sources: AVSD (dataset page/readme) [1]; TVQA (papers & dataset page) [2][3]; TGIF‑QA repo & description [4][5]; MSVD/MSRVTT/TGIF formats and examples from repos [6][7]; HowToVQA/HowTo100M details [8].


🏁 Script executed:

#!/bin/bash
# Check if there are any other VLM example configs that might use public datasets
fd -e yaml . examples/configs/ -x cat {} \; | head -200

# Look for any VLM or multimodal dataset references in docs
rg -n -C3 -i "video|audio|multimodal|vlm" docs/guides/sft.md

Length of output: 6349


🏁 Script executed:

#!/bin/bash
# Get all yaml config files in examples/configs
fd -e yaml . examples/configs/ --exec echo "=== {} ===" \; --exec head -30 {} \;

# Check if there's any documentation about VLM/multimodal support
rg -n -i "vlm|vision|multimodal|video|audio|image" docs/ --type md -C2 | head -100

Length of output: 77485


@yfw Based on my research, MiraData's metadata and code are publicly released, but the raw video files are not hosted directly — users must download the original videos themselves using provided scripts. The videos are subject to copyright and license restrictions.

For a better user experience, I'd recommend:

1. Use placeholder paths in the example config

Replace the hardcoded paths with placeholders as CodeRabbit suggested, e.g.:

train_data_path: /path/to/train_data.jsonl  # Replace with your training data path
val_data_path: /path/to/val_data.jsonl  # Replace with your validation data path
train_media_data_dir: /path/to/train_media/  # Directory containing training media files
val_media_data_dir: /path/to/val_media/  # Directory containing validation media files

2. Create documentation for VLM datasets

I notice there's no guide for VLM dataset preparation in docs/guides/. Consider creating a guide (e.g., docs/guides/vlm-datasets.md) that explains:

  • The expected GeneralConversationsJsonlDataset format (the docstring in nemo_rl/data/datasets/response_datasets/general_conversations_dataset.py lines 30-73 already provides a good example)
  • How to obtain and prepare MiraData (with download instructions)
  • Alternative publicly available video conversation datasets like:
    • AVSD (Audio-Visual Scene-Aware Dialog) - dialog-style video QA
    • MSVD-QA / MSRVTT-QA - open-ended video QA in JSON format
    • ActivityNet-QA - temporal video understanding
  • Format conversion examples for these datasets

3. Consider a small public example

If MiraData requires complex setup, consider including a smaller, fully public example dataset (or instructions to create one) that users can quickly test with.

Would you like me to create an issue to track the documentation work for VLM dataset preparation?


🧠 Learnings used
Learnt from: CR
Repo: NVIDIA-NeMo/RL PR: 0
File: CODING_GUIDELINES.md:0-0
Timestamp: 2025-11-24T17:24:41.976Z
Learning: Applies to examples/configs/recipes/vlm/*.yaml : Recipe YAML files should follow the naming pattern: vlm_<algo>-<model>-<nodes>n<gpus>g-<strategy>[-modifiers][.vN].yaml for VLM recipes

Author

@yfw Good catch. Let me find some public available video datasets. Feel free to suggest as well.

Contributor

Seems like the humans are having a chat. I'll hop back into my burrow for now. If you need me again, just tag @coderabbitai in a new comment, and I'll come hopping out!

12 changes: 8 additions & 4 deletions examples/run_sft.py
Original file line number Diff line number Diff line change
Expand Up @@ -110,13 +110,17 @@ def setup_data(tokenizer: AutoTokenizer, data_config: DataConfig, seed: int):
)

# add preprocessor if needed
datum_preprocessor = None
datum_preprocessor_train = None
datum_preprocessor_val = None
if "dataset_name" in data_config and data_config["dataset_name"] == "clevr_cogent":
from nemo_rl.data.datasets.response_datasets.clevr import (
format_clevr_cogent_dataset,
)

datum_preprocessor = partial(format_clevr_cogent_dataset, return_pil=True)
datum_preprocessor_train = datum_preprocessor_val = partial(format_clevr_cogent_dataset, return_pil=True)
elif hasattr(data, "datum_preprocessor"):
datum_preprocessor_train = data.datum_preprocessor["train"]
datum_preprocessor_val = data.datum_preprocessor["val"]

train_dataset = AllTaskProcessedDataset(
train_dataset,
Expand All @@ -127,7 +131,7 @@ def setup_data(tokenizer: AutoTokenizer, data_config: DataConfig, seed: int):
add_bos=data_config["add_bos"],
add_eos=data_config["add_eos"],
add_generation_prompt=data_config["add_generation_prompt"],
datum_preprocessor=datum_preprocessor,
datum_preprocessor=datum_preprocessor_train,
),
max_seq_length=data_config["max_input_seq_length"],
)
Expand All @@ -141,7 +145,7 @@ def setup_data(tokenizer: AutoTokenizer, data_config: DataConfig, seed: int):
add_bos=data_config.get("add_bos", True),
add_eos=data_config.get("add_eos", True),
add_generation_prompt=data_config["add_generation_prompt"],
datum_preprocessor=datum_preprocessor,
datum_preprocessor=datum_preprocessor_val,
),
max_seq_length=data_config["max_input_seq_length"],
)
Expand Down
25 changes: 25 additions & 0 deletions nemo_rl/algorithms/utils.py
Original file line number Diff line number Diff line change
Expand Up @@ -320,6 +320,31 @@ def get_tokenizer(
processor.bos_token_id = tokenizer.bos_token_id
# copy name_or_path from tokenizer to processor for logging
processor.name_or_path = tokenizer.name_or_path
if hasattr(processor, "feature_extractor") and "audio" in tokenizer_config:
if "sampling_rate" in tokenizer_config["audio"] and \
tokenizer_config["audio"]["sampling_rate"] != processor.feature_extractor.sampling_rate:
new_sampling_rate = tokenizer_config["audio"]["sampling_rate"]
warnings.warn(
f"Overriding audio sampling rate from {processor.feature_extractor.sampling_rate} to {new_sampling_rate}"
)
processor.feature_extractor.sampling_rate = new_sampling_rate
if hasattr(processor, "video_processor") and "video" in tokenizer_config:
if "fps" in tokenizer_config["video"] and \
tokenizer_config["video"]["fps"] != processor.video_processor.fps:
# override the video loading fps
new_fps = tokenizer_config["video"]["fps"]
warnings.warn(
f"Overriding video fps from {processor.video_processor.fps} to {new_fps}"
)
processor.video_processor.fps = new_fps
# fps and num_frames cannot co-exist, but let it crash later
if "num_frames" in tokenizer_config["video"] and \
tokenizer_config["video"]["num_frames"] != processor.video_processor.num_frames:
new_num_frames = tokenizer_config["video"]["num_frames"]
warnings.warn(
f"Overriding video num_frames from {processor.video_processor.num_frames} to {new_num_frames}"
)
processor.video_processor.num_frames = new_num_frames

return tokenizer if processor is None else processor

Expand Down
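The audio/video overrides added above all follow one pattern: compare a config value to the processor attribute, warn, then overwrite. A minimal sketch of just the sampling-rate branch, using a `SimpleNamespace` stand-in instead of a real Hugging Face processor (the objects and the helper name here are assumptions, not the repo's API):

```python
import warnings
from types import SimpleNamespace

def apply_audio_override(processor, tokenizer_config):
    # Mirrors the feature_extractor.sampling_rate override in get_tokenizer.
    audio_cfg = tokenizer_config.get("audio", {})
    current = processor.feature_extractor.sampling_rate
    if "sampling_rate" in audio_cfg and audio_cfg["sampling_rate"] != current:
        warnings.warn(
            f"Overriding audio sampling rate from {current} to {audio_cfg['sampling_rate']}"
        )
        processor.feature_extractor.sampling_rate = audio_cfg["sampling_rate"]
    return processor

proc = SimpleNamespace(feature_extractor=SimpleNamespace(sampling_rate=16000))
apply_audio_override(proc, {"audio": {"sampling_rate": 22050}})
```

The video `fps`/`num_frames` branches are the same shape against `processor.video_processor`.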
20 changes: 20 additions & 0 deletions nemo_rl/data/datasets/response_datasets/__init__.py
Original file line number Diff line number Diff line change
Expand Up @@ -26,6 +26,7 @@
)
from nemo_rl.data.datasets.response_datasets.refcoco import RefCOCODataset
from nemo_rl.data.datasets.response_datasets.response_dataset import ResponseDataset
from nemo_rl.data.datasets.response_datasets.general_conversations_dataset import GeneralConversationsJsonlDataset
from nemo_rl.data.datasets.response_datasets.squad import SquadDataset
from nemo_rl.data.datasets.utils import get_extra_kwargs

Expand Down Expand Up @@ -113,6 +114,25 @@ def load_response_dataset(data_config, seed: int = 42):
train_data_path=data_config["train_data_path"],
**extra_kwargs,
)
elif dataset_name == "GeneralConversationsJsonlDataset":
if "train_data_path" not in data_config:
raise ValueError(
"train_data_path is required when dataset_name is not one of the built-ins."
)
extra_kwargs = get_extra_kwargs(
data_config,
[
"val_data_path",
"train_split",
"val_split",
"train_media_data_dir",
"val_media_data_dir",
],
)
base_dataset = GeneralConversationsJsonlDataset(
train_data_path=data_config["train_data_path"],
**extra_kwargs,
)
else:
raise ValueError(
f"Unsupported {dataset_name=}. "
Expand Down
131 changes: 131 additions & 0 deletions nemo_rl/data/datasets/response_datasets/conversation_base.py
Original file line number Diff line number Diff line change
@@ -0,0 +1,131 @@
# Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved.
Contributor

Is there a reason we separate the converation_base.py from generation_conversation_dataset.py? It seems everything here is only used in general_conversation_dataset.py. If that's the case, I would prefer keeping it in one file for now. We can always refactor it later if we want to extract out some common functionality.

Author

sounds good. Let me merge.

#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

import os
import re
import io
import copy
import warnings
import dataclasses
from PIL import Image
from pathlib import Path
from collections import defaultdict
from typing import Any, Dict, Callable, Optional

from nemo_rl.data import multimodal_utils


# map the senders from the sample to the allowed ones
conversation_sender_mapping_sample_to_allowed = {
Contributor

Can you explain why this is needed? Is it just some of the data has these keys instead of 'user' / 'assistant'?

Author

Yes, there are some legacy datasets that use keys such as 'human', 'gpt', or 'agent'

'human': 'user',
'gpt': 'assistant',
'agent': 'assistant',
}


# convert raw metadata keys and inline media tags to the allowed naming
def convert_metadata(metadata: Dict[str, Any], return_inplace=False):
data = metadata
if not return_inplace:
data = metadata.copy()

for tag in multimodal_utils.media_tags_to_allowed:
if tag in data:
tag_mapped = multimodal_utils.media_tags_to_allowed[tag]
if tag_mapped not in data:
data[tag_mapped] = data[tag]
del data[tag]
else:
warnings.warn(
f"Trying to map {tag} to {tag_mapped}, but {tag_mapped} already exists in the raw data. Mapping is not carried out."
)

for idx, message in enumerate(data["conversations"]):
msg_str = message["value"]
for tag in multimodal_utils.media_tags_to_allowed:
tag_str = '<' + tag + '>'
if tag_str in msg_str:
tag_str_mapped = multimodal_utils.media_tags[
multimodal_utils.media_tags_to_allowed[tag]
]
msg_str = msg_str.replace(tag_str, tag_str_mapped)
message["value"] = msg_str
data["conversations"][idx] = message

if not return_inplace:
return data


def conversation_process_message(
metadata: Dict[str, Any],
message: Dict[str, str],
media_index: dict,
raw: Dict[str, Any] = {},
allow_empty_text: bool = False,
check_if_media_file_exist: bool = True,
tried_default_extensions: set = set(),
tags_mapping_sample_to_allowed: Dict[str, str] = multimodal_utils.media_tags_to_allowed,
Contributor

🛠️ Refactor suggestion | 🟠 Major

Unused parameter tags_mapping_sample_to_allowed.

This parameter is declared but never used in the function body. Either remove it or implement the intended functionality.

🧰 Tools
🪛 Ruff (0.14.8)

78-78: Unused function argument: tags_mapping_sample_to_allowed

(ARG001)

🤖 Prompt for AI Agents
In nemo_rl/data/datasets/response_datasets/conversation_base.py around line 78,
the parameter tags_mapping_sample_to_allowed is declared but never used; either
remove it from the function signature and update any callers, or implement its
intended behavior by applying the mapping to the sample's tags (e.g., map each
sample tag via tags_mapping_sample_to_allowed to an allowed tag and use that
result wherever allowed tags are computed/validated), and then adjust type hints
and add/modify tests to cover the mapping behavior.

process_message_fragment: Callable = lambda tag, fragment: [{tag: fragment}],
) -> list[Dict[str, Any]]:
Contributor

⚠️ Potential issue | 🔴 Critical

Critical: Mutable default arguments will cause bugs.

raw={} and tried_default_extensions=set() are mutable defaults. The tried_default_extensions set is modified on line 106, meaning the same set instance will be shared across all calls that don't explicitly pass this argument, causing incorrect behavior.

 def conversation_process_message(
     metadata: Dict[str, Any],
     message: Dict[str, str],
     media_index: dict,
-    raw: Dict[str, Any] = {},
+    raw: Optional[Dict[str, Any]] = None,
     allow_empty_text: bool = False,
     check_if_media_file_exist: bool = True,
-    tried_default_extensions: set = set(),
+    tried_default_extensions: Optional[set] = None,
     tags_mapping_sample_to_allowed: Dict[str, str] = multimodal_utils.media_tags_to_allowed,
     process_message_fragment: Callable = lambda tag, fragment: [{tag: fragment}],
 ) -> list[Dict[str, Any]]:
     """
     Args:
         raw: dictionary with all webdataset compliant keys of a sample. 
Empty for jsonl dataset, non-empty otherwise
         metadata: 
     """
+    if raw is None:
+        raw = {}
+    if tried_default_extensions is None:
+        tried_default_extensions = set()
     fragments = []    
🧰 Tools
🪛 Ruff (0.14.8)

74-74: Do not use mutable data structures for argument defaults

Replace with None; initialize within function

(B006)


77-77: Do not use mutable data structures for argument defaults

Replace with None; initialize within function

(B006)


78-78: Unused function argument: tags_mapping_sample_to_allowed

(ARG001)

🤖 Prompt for AI Agents
In nemo_rl/data/datasets/response_datasets/conversation_base.py around lines 70
to 80, the function uses mutable default arguments raw={} and
tried_default_extensions=set(), which leads to shared state across calls
(tried_default_extensions is mutated on line 106). Change both defaults to None
in the signature, then inside the function initialize raw = {} if raw is None
and tried_default_extensions = set() if tried_default_extensions is None so each
call gets a fresh object; ensure any downstream logic continues to use these
local variables.
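The shared-state failure this comment describes is easy to reproduce outside the repo; a minimal sketch:

```python
def remember(item, seen=set()):  # buggy: the default set is created once
    seen.add(item)
    return seen

first = remember("a")
second = remember("b")   # mutates the SAME set: now {"a", "b"}

def remember_fixed(item, seen=None):
    if seen is None:
        seen = set()     # fresh set on every call
    seen.add(item)
    return seen
```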

"""
Args:
raw: dictionary with all webdataset compliant keys of a sample.
Empty for jsonl dataset, non-empty otherwise
metadata:
"""
fragments = []
parts = re.split(multimodal_utils.media_tag_pattern, message["value"])

# Convert the parts to message fragments
empty_text = True
for i, part in enumerate(parts):
if part in multimodal_utils.media_tags.values():
# process multimodal tags
tag = multimodal_utils.media_tags_reversed[part]
if not isinstance(metadata[tag], list):
metadata[tag] = [metadata[tag]]
Contributor

⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

cat -n nemo_rl/data/datasets/response_datasets/conversation_base.py | sed -n '80,110p'

Repository: NVIDIA-NeMo/RL

Length of output: 1585


🏁 Script executed:

cat -n nemo_rl/data/datasets/response_datasets/conversation_base.py | sed -n '65,95p'

Repository: NVIDIA-NeMo/RL

Length of output: 1333


🏁 Script executed:

rg -n "conversation_process_message" nemo_rl/data/datasets/response_datasets/

Repository: NVIDIA-NeMo/RL

Length of output: 306


🏁 Script executed:

cat -n nemo_rl/data/datasets/response_datasets/general_conversations_dataset.py | sed -n '130,160p'

Repository: NVIDIA-NeMo/RL

Length of output: 1302


🏁 Script executed:

cat -n nemo_rl/data/datasets/response_datasets/conversation_base.py | sed -n '1,70p'

Repository: NVIDIA-NeMo/RL

Length of output: 2819


🏁 Script executed:

rg -n "media_tags" nemo_rl/data/multimodal_utils.py | head -20

Repository: NVIDIA-NeMo/RL

Length of output: 277


🏁 Script executed:

sed -n '33,70p' nemo_rl/data/multimodal_utils.py

Repository: NVIDIA-NeMo/RL

Length of output: 1062


Add validation to prevent KeyError when media tag is in message but missing from metadata.

At line 96, metadata[tag] is accessed without checking if tag exists in the metadata dictionary. If a message contains a media tag like <image> but the metadata dict lacks the corresponding key, a KeyError will be raised. Add a check before accessing metadata[tag] to handle missing tags gracefully (either skip the tag, raise a clear error, or provide a default value).

🤖 Prompt for AI Agents
In nemo_rl/data/datasets/response_datasets/conversation_base.py around lines
96-97, the code accesses metadata[tag] without verifying tag exists which can
raise KeyError; add a presence check before accessing metadata[tag] (e.g., if
tag not in metadata: either continue to skip processing that tag, or set a
sensible default like metadata[tag]=[] or raise a clear ValueError with a
descriptive message), then proceed with the existing list-wrapping logic.
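One possible shape for the suggested guard; the helper name and error text below are hypothetical, not part of the PR:

```python
def require_media_entry(metadata: dict, tag: str) -> list:
    # Fail with a clear message instead of a bare KeyError when a message
    # references a media tag the metadata does not provide.
    if tag not in metadata:
        raise ValueError(
            f"Message references <{tag}> but metadata has no '{tag}' entry; "
            f"available keys: {sorted(metadata)}"
        )
    value = metadata[tag]
    # Preserve the existing list-wrapping behavior.
    return value if isinstance(value, list) else [value]
```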

# try to extract the media object from the shard
ext = os.path.basename(metadata[tag][media_index[tag]]).split('.', 1)[1]
if raw and ext not in raw and \
tag not in tried_default_extensions and \
tag in multimodal_utils.default_media_extensions:
# try the default extension
for ext in multimodal_utils.default_media_extensions[tag]:
if ext in raw:
tried_default_extensions.add(ext)
break
media_file = None
if ext in raw:
media_file = ext
elif isinstance(metadata[tag][media_index[tag]], str) and \
os.path.isfile(metadata[tag][media_index[tag]]):
# if cannot get it from the shard files, try to find the local file
media_file = metadata[tag][media_index[tag]]
elif check_if_media_file_exist:
sample_to_print = raw if raw else metadata
raise ValueError(f"Cannot find the media file {metadata[tag][media_index[tag]]} from {sample_to_print} or locally.")
else:
media_file = metadata[tag][media_index[tag]]
media_index[tag] += 1
fragments += process_message_fragment(tag, media_file)
else:
# process text
if part.strip():
fragments += process_message_fragment('text', part)
empty_text = False

if not allow_empty_text and empty_text:
fragments += process_message_fragment('text', ' ')

return fragments