feat: Omni dataloader for HF models #1639

Status: Open

yuanhangsu1986 wants to merge 572 commits into NVIDIA-NeMo:main from yuanhangsu1986:main

Changes from all commits (572 commits)
0d88bca
fix: fixing the sequence parallel related issue in mcore path (#1487)
youngeunkwon0405 aadbebf
fix: improve local eval config and doc (#1528)
yuki-97 dff9072
docs: Refactor Home Page and New About Section (#1338)
jgerh 9605fa4
fix: Incompatible configuration between reward normalization and the …
ffrujeri 0a8915a
feat: Support for nano-v2 (#1514)
yfw 49f3ab2
fix: Update Penguin tests to use renamed resource server (#1540)
shashank3959 f543b37
fix: honor mlflow server artifact_location (#1536) (#1538)
clumsy 8b235a0
build: Update docker file to include OSS NOTICES.txt (#1544)
chtruong814 da22cf0
perf: perf script change for qwen30b-a3b (#1526)
youngeunkwon0405 cda42b0
fix: removed sliding_window_overwrite (#1541)
ahmadki 35e8a05
chore: add a research template project (#1278)
terrykong 660698d
docs: remove doc pyproject toml (#1561)
lbliii 5d83ff9
perf: [Perf script] QWEN3 30B-A3B tensor_parallel_size from 4 to 2 (#…
youngeunkwon0405 7457e87
feat: per-worker active/idle timeline + IFB size logging (#1534)
youngeunkwon0405 48a44a3
chore: Improve checkpoint loading error messages with common issue an…
ahmadki e254950
feat: Fp8 moe rollout (#1446)
guyueh1 d1710c2
fix: Fix the sequence padding for FP8 case (#1569)
guyueh1 43cb404
build: Use dynamic engine for generate. (#1502)
shanmugamr1992 92a84d5
docs: Create performance-summary.md for NeMo RL (#1560)
snowmanwwg ceab63e
docs: Update nvidia-sphinx-theme (#1584)
chtruong814 d60c621
feat: KV cache quantization support in fp8 rollout in GRPO (#1212)
sharonyu-115 11f4e59
fix: Use Float16Module even when defer_fp32_logits=True (#1537)
yfw b740a54
feat: plot vllm internal metrics to the wandb log (#1567)
youngeunkwon0405 b1aad0c
docs: add v0.4 news and minor touch up to front page readme (#1268)
euronymous-aithal 3f0dfc7
feat: Add moe load balancing metrics (#1520)
yfw b4cb62b
feat: force on-policy ratio to 1 (#1529)
yfw 444672b
fix: ADDING DOCS (#1595)
shanmugamr1992 06c7efc
refactor: Introduce BasePolicyWorker (#1585)
ashors1 e7c1c7b
chore: rename penguin -> nemo_gym and add the gym submodule (#1587)
terrykong 6949de2
feat: allow uv-less execution and fingerprint the environment (#1491)
terrykong 6537fd7
add dep for causal-conv1d
f500593
add conversation-based dataset
yuanhangsu1986 beb2501
add avlm config yaml
yuanhangsu1986 ce17500
import bugfix
4f08ca6
indentation fix
yuanhangsu1986 b588ec8
add GeneralConversationsJsonlDataset initializer
yuanhangsu1986 d4ea08a
bugfix
84eda79
process multimodal data
yuanhangsu1986 23b64db
use decord for video and audio loading
yuanhangsu1986 ca941a7
move the sample processing to sft_processor
yuanhangsu1986 cd5bc3d
video output bugfix
374632e
move multimodal functions to multimodal_utils.py; add video, audio se…
yuanhangsu1986 628712a
bugfix
47af67d
bugfix reported by coderabbitai
yuanhangsu1986 ec621d5
feat: log generation ISL/OSL histogram to wandb (#1594)
youngeunkwon0405 550d8e8
feat: Enable Ray dashboard for Ray state API (#1602)
pjin-nvidia 6337574
docs: update roadmap post v0.4 (#1607)
euronymous-aithal dad90f0
fix: add H200 TFLOPS (#1543)
clumsy 2dab255
fix: Set validation accuracy to mean of rewards to handle non-[0,1] r…
alexandery-nvidia 02bf9bd
feat: LoRA SFT support for DTensorV2 path (#1556)
samodi-nv b1255c6
fix: swanlab logger error caused by `define_metric` (#1615)
Zeyi-Lin a36a058
refactor: refactor env and data processor & add nemotron super 49b re…
yuki-97 f6743e6
fix: Sort rollout outputs to match inputs order + gym bump (#1627)
yfw e60de8c
chore: update megatron dev (11/21/2025) / mbridge (11/28/2025) (#1568)
yaoyu-33 91d228d
docs: Add SkyRL to inspired libraries list (#1632)
snowmanwwg 6ad57e5
fix: Set use_flashinfer_fused_rope to False (#1636)
shanmugamr1992 c26200a
chore: Enable LoRA Nightly Test (#1634)
RayenTian 0fc7f84
docs: Revise news section for nemotron v3 and DAPO algorithm support …
snowmanwwg 91421ec
chore: fix grpo functional test metric (#1643)
RayenTian d31a010
feat: add support from building images using vllm from private repos …
terrykong ba50efb
feat: Necessary changes for Gym GRPO tutorial (#1630)
bxyu-nvidia 4dd9658
perf: Add qwen3 30b-a3b async-8-off recipe (#1642)
youngeunkwon0405 1fbc75d
feat: Add GPT-OSS support via mcore (#1452)
ashors1 72476ea
chore: Bump vllm to 0.11.2, torch to 2.9, transformers to 4.57.1 (#1563)
yfw be8eaca
fix: Support datasets saved with save_to_disk in ResponseDataset (#1610)
sahgerlad 4029cfe
fix: Handle disabled validation in SFT training (#1611)
sahgerlad 5ee9272
fix: Fix crash when using cp in dtensor path (#1663)
yfw c0d933b
fix: Fix Fp8 sequence padding for PP>1 case (#1579)
guyueh1 3cfce26
test: Perf recipe for v0.5 (#1667)
guyueh1 ca91716
fix: Fix fp8 after vllm v0.11.2 bump (#1660)
guyueh1 7efbdd3
fix: Fix crash when using activation_checkpointing (#1676)
yfw 4580984
feat: add dapo recipe and test (#1617)
ZhiyuLi-Nvidia e422e47
feat: DTensorPolicyV2 GPT-OSS SFT support (#1470)
adil-a 267e700
fix: grad norm calculation for dtensor v2 (#1693)
hemildesai 01f8d95
feat: Add Nemotron‑3 Nano 30B A3B BF16 SFT nightly tests (FSDP2, +LoR…
RayenTian 6d0eac6
feat: Support prefetching of specific envs (#1692)
hemildesai 4e1895c
fix: Fix DTensor slice crash after PyTorch 2.9 bump (#1689)
zpqiu ad6bd9e
fix: grad norm check for automodel gpt oss nightly (#1708)
hemildesai a3d532b
fix: relax nanov3 nightly test metrics strict (#1712)
RayenTian 5e8ee64
fix: on GB200 use single-thread checkpoint save to avoid Cpu OOM (#1703)
guyueh1 0a80425
perf: [Perf recipe] Change TP 16->32 for deepseek GB200 sync benchmar…
guyueh1 b2695c1
docs: Add doc for nano-v3 (#1694)
yfw 486555a
fix: Disable cudnn sdpa backend when using activation checkpointing (…
yfw c42514d
fix: log metrics that can be coerced to scalars (#1723)
terrykong c900202
fix: use median instead of mean for logprob error for stability in ni…
terrykong 2f8fb44
fix: gemma3 27b must now have skip_tokenizer_init=False in vllm (#1721)
terrykong 83b9476
fix: fix several nightly tests that were flaky (#1724)
terrykong 4115085
fix: apply offloading change from v2 to v1 (#1726)
terrykong 0a47e76
fix: mcore generation config restored in nightly test (#1720)
terrykong 949380c
feat: Megatron SFT LoRA (#1629)
arendu c905d54
build: Update aiohttp and urlib3 (#1746)
chtruong814 de09033
fix: patch pytorch aten.alias.default shard strategy (#1728)
RayenTian e59175c
feat: RL support for custom moe models in dtensor v2 (#1695)
hemildesai 0bbe2ee
fix: split dtensorv1 vllm dependency (#1638)
yuki-97 137bf66
build: Resolve CVEs for gnupg and aiohttp (#1755)
chtruong814 78e6142
build: Bump mamba to d68d16e and causal-conv1d to 67e0a9d (#1759)
chtruong814 7d14c21
ci: Clean up disk space for lint check (#1768)
chtruong814 380e22b
docs: Adding dtensor TP debugging summary (#1767)
joyang-nv 02e310d
docs: Update image syntax in dtensor TP accuracy guide for consistenc…
RayenTian 28edf65
fix: fix formatting for async docs (#1783)
parthchadha c8a2c01
ci: Add nightly and release tests for gb200 (#1788)
chtruong814 3797917
feat: NeMo Gym refresh 20260113 (#1773)
bxyu-nvidia 0600598
perf: DeepEP interface in megatron backend (#1794)
guyueh1 6d870b7
feat: refactor init of dtensor policy v2 (#1709)
hemildesai 7ef7501
build: Update pyasn1 to >= 0.6.2 (#1791)
chtruong814 9c97e47
docs: Adding k8 guide (#1764)
vinhngx d1ec03a
test: Add grpo-qwen3-30ba3b-4n8g-40k config to performance test suite…
sfawzy-nv b5c91a2
docs: v0.5 performance results update (#1772)
guyueh1 deb8af1
docs: model support page (#1799)
terrykong 57ffb0b
refactor: split train and val dataset in response dataset (#1649)
yuki-97 f34986e
docs: fix pytorch anchor link: PYTORCH_CUDA_ALLOC_CONF->PYTORCH_ALLOC…
terrykong d24b812
fix: log validation data (#1805)
parthchadha 0b562e7
feat: Add SGLang rollout backend and tests (#1674)
RolaoDenthu 3b16569
refactor: reuse setup data (#1808)
yuki-97 2633175
feat: refactor megatron init (#1646)
ashors1 3122477
build: Bump setuptools >= 80.10.1 and wheel >= 0.46.2 (#1822)
chtruong814 3dec4d9
build: Bump setuptools to 80.10.2 (#1830)
chtruong814 3e34e07
feat: refactor common data utilities of dtensor policy v2 (#1710)
hemildesai bb8fa12
feat: add FT launcher config and resiliency dependency [1/4] (#1824)
yashaswikarnati fd44882
fix: move ft_config.yaml outside examples/configs (#1839)
yashaswikarnati f0f5bc4
docs: Add notes for FP8 recipe in docs/fp8.md (#1829)
guyueh1 3844367
feat: Timer for the data sharding and job submission (#1802)
guyueh1 9386219
feat: Allow loading of more general data types (#1834)
nathan-az 1af304e
chore: add assert for dtensor v2 cpu offload (#1817)
yuki-97 17ea691
build: Bump protobuf to 6.33.5 and python-multipart to 0.0.22 (#1850)
chtruong814 5fa4b13
feat: refactor megatron data utils (#1651)
ashors1 604e979
feat: support stateless group and decouple vLLM in train backend (#1842)
shuyixiong dc97cd5
docs: update readme post 0.5 (#1856)
euronymous-aithal 6e4fa59
docs: fix readme post 0.5 (#1858)
euronymous-aithal 3974004
feat: Support lora in dtensor grpo workflow by merging weight (#1797)
RayenTian e1106f2
chore: add nanov3 lora sft recipe to doc (#1860)
RayenTian 8bd6a5d
ci: Allow repo to self publish docs (#1821)
chtruong814 9033633
fix: fix statistic of probs_ratio_clamped_min/max (#1818)
yuki-97 1f2826f
feat: support multiple datasets for response dataset (#1691)
yuki-97 759c14e
refactor: unify entrypoint for different envs (#1841)
yuki-97 d624f88
feat: add lora config for dpo dtensor backend (#1826)
RayenTian 7876c84
fix: add log_plot to the logger interface (#1862)
terrykong 2462f16
add preprocessor
yuanhangsu1986 f6ac015
bugfix
f6bb285
add working example configs for video
33e1c08
add unit tesets
yuanhangsu1986 1910ed8
refactor: split train and val dataset in preference dataset (#1763)
yuki-97 b3833c0
chore: add assert for tp4 batch variant accuracy issue (#1861)
yuki-97 1d56d3f
fix: prevent crash in rollout metric calculation when just 1 value (#…
terrykong a0e99c9
feat: add val_at_end for all algorithms (#1863)
terrykong 06b7076
ci: Add secrets detector (#1854)
chtruong814 d97e109
feat: Add bisecting tooling for nightly test regressions (#1223)
terrykong 9315b36
docs: add release runs to front page readme for 0.5 (#1879)
terrykong 58cd571
fix: Remove redundant nested loop in `move_model` (#1880)
nathan-az 294cee9
docs: Fix a step time number for deepseek (#1890)
guyueh1 a7ae356
feat: refactor train utilities for dtensor policy v2 (#1757)
hemildesai 312f3c3
feat: add speculative decoding during post-training (#1785)
isomap 29a10cc
feat: Add Nemotron‑3 Nano 30B A3B GRPO nightly tests (FSDP2, +LoRA) …
RayenTian 345119a
ci: Fix docs publishing (#1898)
chtruong814 560cf3b
feat: Implement ProRLv2 recipe (#1809)
hijkzzz 89c4ff5
feat: add way of excluding generation backends (#1855)
terrykong 9f9047e
feat: Update mlflow to work better with env vars, manual run id, fix …
nathan-az 91e18c3
feat: unify nemogym dataset (#1807)
yuki-97 f1ab10b
feat: improve dataset (#1893)
yuki-97 a53eb72
fix: fix enable_seq_packing and apply_temperature_scaling in DTensor …
yuki-97 2d9c6e1
chore: Centralize OmegaConf resolver registration (#1882)
RayenTian 2294a23
fix: Fix DCP-to-HF conversion for model-wrapped checkpoints (#1881)
RayenTian f0fca1a
add support of split_validation_size
yuanhangsu1986 82e4d92
add configs for testing general_conversation_dataset
yuanhangsu1986 79561fd
change valid batch size
yuanhangsu1986 8df3f86
update to working config
ba31c0c
add interleaved multiturn test and singleturn test
yuanhangsu1986 aa4623e
bugfix for general_conversations_data
dbdaa8f
add daily-omni unit test
yuanhangsu1986 fe29d2e
add interleaved multiturn test and singleturn test
yuanhangsu1986 53148cb
fix: add missing functional test (#1883)
yuki-97 5428505
fix: fix and re-enable rm env functional test (#1905)
RayenTian 8d27913
feat: start nemo gym and other environments with cached venvs (#1927)
terrykong 25dbcc0
fix: Mxfp8 training fix sequence padding (#1884)
guyueh1 ca96880
fix: use seq_length instead of padded_seq_length for topk output padd…
zpqiu c032a1c
fix: Update sglang source (#1926)
RolaoDenthu 8ffb6e4
chore: bump mcore and mbridge (#1902)
yfw d0bbda9
feat: refactor mcore train/forward utilities (#1654)
ashors1 3452719
docs: Document Gym + RL integration design (#1762)
ananthsub 8803231
feat: retry rollout if generation_logprobs contains NaN (#1885)
guyueh1 da6c08c
feat: Support build custom flashinfer (#1886)
guyueh1 c784627
fix: async llm engine didnt have get_metrics() (#1943)
terrykong f08d0d1
feat: Mask sequences with high logprob error (#1838)
yfw c88ffdc
feat: ProRLv2 - add seq-mask-tis truncated importance sampling type (…
hijkzzz 171fd51
ci: Update release-docs workflow to use FW-CI-templates v0.72.0 (#1965)
chtruong814 2752d38
fix: speedup minimize and minimize-check in config_cli (#1964)
hemildesai bc572ef
docs: update features.md to reflect v0.5 release and v0.6 roadmap (#1…
seonjinn 4e9791b
fix: add mask seq with high logp err to nemo gym config (#1980)
cmunley1 8796b22
chore: upgrade wandb to 0.25+ (#1979)
Kipok 5e6bfa9
feat: Remove do_not_average_loss (#1988)
yfw 6ebfc25
chore: rename penguin -> nemo_gym and add the gym submodule (#1587)
terrykong 3789fb7
feat: allow uv-less execution and fingerprint the environment (#1491)
terrykong 66951de
add conversation-based dataset
yuanhangsu1986 0e2e450
add GeneralConversationsJsonlDataset initializer
yuanhangsu1986 15a4f08
bugfix
9e99c72
process multimodal data
yuanhangsu1986 225c715
use decord for video and audio loading
yuanhangsu1986 ec210e5
video output bugfix
fe0626e
move multimodal functions to multimodal_utils.py; add video, audio se…
yuanhangsu1986 e8eb181
bugfix
2b8c9d2
bugfix reported by coderabbitai
yuanhangsu1986 b0d9c34
feat: log generation ISL/OSL histogram to wandb (#1594)
youngeunkwon0405 be7d057
feat: LoRA SFT support for DTensorV2 path (#1556)
samodi-nv f9ba596
refactor: refactor env and data processor & add nemotron super 49b re…
yuki-97 08b6541
fix: Sort rollout outputs to match inputs order + gym bump (#1627)
yfw eb8555a
chore: update megatron dev (11/21/2025) / mbridge (11/28/2025) (#1568)
yaoyu-33 83aef80
fix: Set use_flashinfer_fused_rope to False (#1636)
shanmugamr1992 e837538
chore: Enable LoRA Nightly Test (#1634)
RayenTian 84ba2f7
chore: Bump vllm to 0.11.2, torch to 2.9, transformers to 4.57.1 (#1563)
yfw a28ed19
fix: Handle disabled validation in SFT training (#1611)
sahgerlad f223ce8
feat: DTensorPolicyV2 GPT-OSS SFT support (#1470)
adil-a afba602
feat: Megatron SFT LoRA (#1629)
arendu 6ee2d73
feat: RL support for custom moe models in dtensor v2 (#1695)
hemildesai acfa334
fix: split dtensorv1 vllm dependency (#1638)
yuki-97 19c4e63
feat: NeMo Gym refresh 20260113 (#1773)
bxyu-nvidia 534422d
perf: DeepEP interface in megatron backend (#1794)
guyueh1 edf4412
feat: refactor init of dtensor policy v2 (#1709)
hemildesai f6052a3
refactor: split train and val dataset in response dataset (#1649)
yuki-97 356b838
feat: Add SGLang rollout backend and tests (#1674)
RolaoDenthu a4b45b4
refactor: reuse setup data (#1808)
yuki-97 e54a937
feat: refactor common data utilities of dtensor policy v2 (#1710)
hemildesai b3f25fc
feat: add FT launcher config and resiliency dependency [1/4] (#1824)
yashaswikarnati 0a9c02b
fix: move ft_config.yaml outside examples/configs (#1839)
yashaswikarnati e0aad8f
feat: refactor megatron data utils (#1651)
ashors1 4a56388
feat: support stateless group and decouple vLLM in train backend (#1842)
shuyixiong e7d4501
feat: Support lora in dtensor grpo workflow by merging weight (#1797)
RayenTian c712c40
indentation bugfix
yuanhangsu1986 ddc6815
feat: support multiple datasets for response dataset (#1691)
yuki-97 efa7935
refactor: unify entrypoint for different envs (#1841)
yuki-97 cf15a2d
add preprocessor
yuanhangsu1986 66ce0da
bugfix
b516a7c
add working example configs for video
ee790aa
refactor: split train and val dataset in preference dataset (#1763)
yuki-97 810fd9c
feat: refactor train utilities for dtensor policy v2 (#1757)
hemildesai 761a063
feat: Implement ProRLv2 recipe (#1809)
hijkzzz 3a793d2
feat: add way of excluding generation backends (#1855)
terrykong 2ce8b83
feat: unify nemogym dataset (#1807)
yuki-97 28630e2
add support of split_validation_size
yuanhangsu1986 3473c25
add daily-omni dataset unit test; add general_conversations_dataset u…
yuanhangsu1986 2eb770e
fix: add missing functional test (#1883)
yuki-97 6dd98b0
add preprocessor to setup_response_data for rl training
yuanhangsu1986 fab4e46
add preprocessor for preference datasets as well
yuanhangsu1986 ecc68e9
lint fixes
yuanhangsu1986 513a94f
lint fixes
yuanhangsu1986 a5e0341
Merge remote-tracking branch 'upstream/main' into yuanhangs_dev
yuanhangsu1986 5f2744d
Merge remote-tracking branch 'upstream/main' into yuanhangs_dev
a6f105d
merge with the yuanhangs_dev
yuanhangsu1986 a7295d1
update Megatron-LM to the latest commit
yuanhangsu1986 b355e22
docstring fix
yuanhangsu1986 d8267b3
move load_video_kwargs,load_audio_kwargs from global to get_multimoda…
yuanhangsu1986
Binary file added (not shown): examples/configs/recipes/llm/performance/.grpo-deepseek-v3-32n4g.yaml.swp (+12 KB)
Binary file added (not shown): examples/configs/recipes/llm/performance/.grpo-deepseek-v3-32n8g.yaml.swp (+12 KB)
New config file (29 lines added; indentation reconstructed from the flattened diff):

```yaml
defaults:
  - sft_vlm_3B.yaml

sft:
  val_batches: 2
  val_global_batch_size: 8

policy:
  max_total_sequence_length: 32768
  train_global_batch_size: 8
  dtensor_cfg:
    tensor_parallel_size: 1
  dynamic_batching:
    enabled: true
  tokenizer:
    video:
      num_frames: 16

data:
  # dataset
  train:
    dataset_name: daily-omni
    split: train
    split_validation_size: 0.05  # use 5% of the training data as validation data
    seed: 42  # seed for train/validation split when split_validation_size > 0
  validation: null
  # default settings for all datasets
  default:
    prompt_file: null
```
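The `split_validation_size` and `seed` options above carve a validation subset out of the training data. The actual logic lives in `RawDataset.split_train_validation`, which is not shown in this diff, so the following is only a minimal standalone sketch of the expected behavior (the function name and signature here are illustrative, not the real API):

```python
import random

def split_train_validation(samples, split_validation_size, seed=42):
    """Deterministically hold out a validation fraction of the training samples.

    Hypothetical helper mirroring what `split_validation_size: 0.05` with
    `seed: 42` is expected to do; the real implementation is in RawDataset.
    """
    if split_validation_size <= 0:
        return samples, None
    # Shuffle indices with a seeded RNG so the split is reproducible.
    indices = list(range(len(samples)))
    random.Random(seed).shuffle(indices)
    n_val = max(1, int(len(samples) * split_validation_size))
    val_idx = set(indices[:n_val])
    train = [s for i, s in enumerate(samples) if i not in val_idx]
    val = [s for i, s in enumerate(samples) if i in val_idx]
    return train, val

train, val = split_train_validation(list(range(100)), 0.05, seed=42)
print(len(train), len(val))  # 95 5
```

With the config values above (5% of the data, seed 42), rerunning the job reproduces the same train/validation partition.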
New file (123 lines added):

```python
# Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

import os
import tarfile
from typing import Any

from huggingface_hub import snapshot_download

from nemo_rl.data.datasets.raw_dataset import RawDataset
from nemo_rl.data.datasets.utils import (
    get_huggingface_cache_path,
    load_dataset_from_path,
)


class DailyOmniDataset(RawDataset):
    """Simple wrapper around the Daily-Omni dataset.

    Args:
        split: Split name for the dataset, default is "train"
    """

    task_name = "daily-omni"

    def __init__(
        self,
        split: str = "train",
        split_validation_size: float = 0,
        seed: int = 42,
        **kwargs,
    ):
        # Only the "train" split is provided upstream.
        SPLIT_TO_HF_NAME = {
            "train": "liarliar/Daily-Omni",
        }
        if split not in SPLIT_TO_HF_NAME:
            raise ValueError(f"Invalid split: {split}. Please use 'train'.")

        self.hf_cache_dir = get_huggingface_cache_path(SPLIT_TO_HF_NAME[split])
        if not self.hf_cache_dir:
            # Download the dataset from the Hugging Face Hub.
            self.hf_cache_dir = snapshot_download(
                repo_id=SPLIT_TO_HF_NAME[split], repo_type="dataset"
            )
            if not self.hf_cache_dir:
                raise ValueError("Cannot download DailyOmniDataset.")

        json_file = os.path.join(self.hf_cache_dir, "qa.json")
        if not os.path.isfile(json_file):
            raise ValueError(f"{json_file} cannot be found.")

        files_folder = os.path.join(self.hf_cache_dir, "Videos")
        if not os.path.isdir(files_folder):
            # Prepare the dataset by extracting the video archive.
            # TODO: move untar/unzip helpers to utils?
            archive_filename = os.path.join(self.hf_cache_dir, "Videos.tar")
            if not os.path.isfile(archive_filename):
                raise ValueError(f"{archive_filename} cannot be found.")
            try:
                with tarfile.open(archive_filename, "r:*") as tar:
                    # Extract all contents into the cache directory.
                    tar.extractall(path=self.hf_cache_dir)
                if os.path.isdir(files_folder):
                    print(f"Successfully extracted '{archive_filename}' to '{files_folder}'")
                else:
                    raise ValueError(
                        f"Cannot find the extracted folder {files_folder}. Extraction failed."
                    )
            except tarfile.ReadError:
                raise tarfile.ReadError(
                    "Error: Could not read the tar file. It might be corrupted or not a tar file."
                )
            except Exception as e:
                raise Exception(f"An unexpected error occurred: {e}")

        self.dataset = load_dataset_from_path(json_file)

        # Tag every row with the task name; kept as a plain column to avoid
        # schema conflicts.
        self.dataset = self.dataset.add_column(
            "task_name", [self.task_name] * len(self.dataset)
        )

        self.preprocessor = self.format_data

        # `self.val_dataset` is used (not None) only when this dataset serves
        # both training and validation.
        self.val_dataset = None
        self.split_train_validation(split_validation_size, seed)

    @classmethod
    def get_prompt(cls, data: dict[str, Any]) -> str:
        # WARNING: a given model may prefer a different prompt format.
        prompt = data["Question"] + "\n" + "\n".join(data["Choice"])
        candidate_answers = [chr(ord("A") + idx) for idx in range(len(data["Choice"]))]
        candidate_answers_all_but_last = ",".join(candidate_answers[:-1])
        prompt += (
            "\nYour replies must contain only a single letter "
            f"(either {candidate_answers_all_but_last} or {candidate_answers[-1]})."
        )
        return prompt

    def format_data(self, data: dict[str, Any]) -> dict[str, Any]:
        user_content = [
            {
                "type": "video",
                "video": os.path.join(
                    self.hf_cache_dir,
                    "Videos",
                    data["video_id"],
                    data["video_id"] + "_video.mp4",
                ),
            },
            {
                "type": "text",
                "text": self.get_prompt(data),
            },
        ]
        return {
            "messages": [
                {"role": "user", "content": user_content},
                {"role": "assistant", "content": data["Answer"]},
            ],
            "task_name": self.task_name,
        }
```