feat: Omni dataloader for HF models #1639
base: main
Changes from 14 commits
New file (diff `@@ -0,0 +1,14 @@`):

```yaml
defaults:
  - sft_vlm_3B.yaml

policy:
  tokenizer:
    video:
      num_frames: 16

data:
  dataset_name: GeneralConversationsJsonlDataset
  train_data_path: /lustre/fsw/portfolios/llmservice/users/yuanhangs/codes/megatron-lm-omcat/megatron-lm-vlm2/examples/multimodal/avlm/test/datasets/miradata_bat1_filtered_vision_5min_10000.jsonl
  val_data_path: /lustre/fsw/portfolios/llmservice/users/yuanhangs/codes/megatron-lm-omcat/megatron-lm-vlm2/examples/multimodal/avlm/test/datasets/miradata_bat1_filtered_vision_5min_100.jsonl
  train_media_data_dir: /lustre/fsw/portfolios/edgeai/projects/edgeai_riva_rivamlops/data/videomme/MiraData/video/batch1/5min
  val_media_data_dir: /lustre/fsw/portfolios/edgeai/projects/edgeai_riva_rivamlops/data/videomme/MiraData/video/batch1/5min
```
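For context, a record in the conversations jsonl file referenced above plausibly looks like the following. This is a hedged sketch inferred from the sender names (`human`/`gpt`) and media tags handled by the loader code in this PR, not the authoritative schema:

```python
import json

# A hypothetical single line of the conversations jsonl file.
# Field names are inferred from the loader code in this PR, not from docs.
record_line = json.dumps({
    "video": "clip_0001.mp4",  # resolved relative to train_media_data_dir
    "conversations": [
        {"from": "human", "value": "<video>\nDescribe what happens in the clip."},
        {"from": "gpt", "value": "A person walks along a beach at sunset."},
    ],
})

record = json.loads(record_line)

# The PR maps raw sender names onto the allowed chat roles:
sender_mapping = {"human": "user", "gpt": "assistant", "agent": "assistant"}
roles = [sender_mapping[m["from"]] for m in record["conversations"]]
print(roles)  # ['user', 'assistant']
```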
New file (diff `@@ -0,0 +1,131 @@`):

```python
# Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

import os
import re
import io
import copy
import warnings
import dataclasses
from PIL import Image
from pathlib import Path
from collections import defaultdict
from typing import Any, Dict, Callable, Optional

from nemo_rl.data import multimodal_utils


# map the senders from the sample to the allowed ones
conversation_sender_mapping_sample_to_allowed = {
    'human': 'user',
    'gpt': 'assistant',
    'agent': 'assistant',
}


def convert_metadata(metadata: Dict[str, Any], return_inplace=False):
    data = metadata
    if not return_inplace:
        data = metadata.copy()

    for tag in multimodal_utils.media_tags_to_allowed:
        if tag in data:
            tag_mapped = multimodal_utils.media_tags_to_allowed[tag]
            if tag_mapped not in data:
                data[tag_mapped] = data[tag]
                del data[tag]
            else:
                warnings.warn(
                    f"Trying to map {tag} to {tag_mapped}, but {tag_mapped} already exists in the raw data. Mapping is not carried out."
                )

    for idx, message in enumerate(data["conversations"]):
        msg_str = message["value"]
        for tag in multimodal_utils.media_tags_to_allowed:
            tag_str = '<' + tag + '>'
            if tag_str in msg_str:
                tag_str_mapped = multimodal_utils.media_tags[
                    multimodal_utils.media_tags_to_allowed[tag]
                ]
                msg_str = msg_str.replace(tag_str, tag_str_mapped)
        message["value"] = msg_str
        data["conversations"][idx] = message

    if not return_inplace:
        return data


def conversation_process_message(
    metadata: Dict[str, Any],
    message: Dict[str, str],
    media_index: dict,
    raw: Dict[str, Any] = {},
    allow_empty_text: bool = False,
    check_if_media_file_exist: bool = True,
    tried_default_extensions: set = set(),
    tags_mapping_sample_to_allowed: Dict[str, str] = multimodal_utils.media_tags_to_allowed,
    process_message_fragment: Callable = lambda tag, fragment: [{tag: fragment}],
) -> list[Dict[str, Any]]:
    """
    Args:
        raw: dictionary with all webdataset compliant keys of a sample.
            Empty for jsonl dataset, non-empty otherwise.
        metadata:
    """
    fragments = []
    parts = re.split(multimodal_utils.media_tag_pattern, message["value"])

    # Convert the parts to message fragments
    empty_text = True
    for i, part in enumerate(parts):
        if part in multimodal_utils.media_tags.values():
            # process multimodal tags
            tag = multimodal_utils.media_tags_reversed[part]
            if not isinstance(metadata[tag], list):
                metadata[tag] = [metadata[tag]]

            # try to extract the media object from the shard
            ext = os.path.basename(metadata[tag][media_index[tag]]).split('.', 1)[1]
            if raw and ext not in raw and \
                    tag not in tried_default_extensions and \
                    tag in multimodal_utils.default_media_extensions:
                # try the default extension
                for ext in multimodal_utils.default_media_extensions[tag]:
                    if ext in raw:
                        tried_default_extensions.add(ext)
                        break
            media_file = None
            if ext in raw:
                media_file = ext
            elif isinstance(metadata[tag][media_index[tag]], str) and \
                    os.path.isfile(metadata[tag][media_index[tag]]):
                # if cannot get it from the shard files, try to find the local file
                media_file = metadata[tag][media_index[tag]]
            elif check_if_media_file_exist:
                sample_to_print = raw if raw else metadata
                raise ValueError(f"Cannot find the media file {metadata[tag][media_index[tag]]} from {sample_to_print} or locally.")
            else:
                media_file = metadata[tag][media_index[tag]]
            media_index[tag] += 1
            fragments += process_message_fragment(tag, media_file)
        else:
            # process text
            if part.strip():
                fragments += process_message_fragment('text', part)
                empty_text = False

    if not allow_empty_text and empty_text:
        fragments += process_message_fragment('text', ' ')

    return fragments
```
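To make the fragment-splitting behavior above concrete, here is a minimal self-contained sketch. The tag table and regex below are stand-ins for the real `multimodal_utils` definitions, which are not shown in this diff, so treat the exact names and pattern as assumptions:

```python
import re

# Stand-ins for multimodal_utils (assumed shapes, not the real module)
media_tags = {"image": "<image>", "video": "<video>", "audio": "<audio>"}
media_tags_reversed = {v: k for k, v in media_tags.items()}
# Capturing group so re.split keeps the matched tags in the output
media_tag_pattern = r"(<image>|<video>|<audio>)"

def split_message(value: str) -> list[dict]:
    """Split a message string into interleaved media/text fragments."""
    fragments = []
    for part in re.split(media_tag_pattern, value):
        if part in media_tags_reversed:
            # media placeholder; the real code resolves an actual file here
            fragments.append({media_tags_reversed[part]: None})
        elif part.strip():
            fragments.append({"text": part})
    return fragments

print(split_message("<video> What is shown here?"))
# [{'video': None}, {'text': ' What is shown here?'}]
```

The capturing group in the pattern is what makes `re.split` return the tags themselves as parts, which is what lets the loop above dispatch on media vs. text.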
Replace hardcoded user-specific paths with placeholders.
The example config contains absolute paths to user-specific directories on an internal filesystem (/lustre/fsw/portfolios/llmservice/users/yuanhangs/...). These paths won't work for other users and expose internal infrastructure details. Consider using placeholder paths that users can customize.
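One possible shape for such placeholders, sketched here as an illustration (the actual committable suggestion is collapsed above, and any placeholder scheme would do):

```yaml
data:
  dataset_name: GeneralConversationsJsonlDataset
  train_data_path: <PATH_TO_DATASETS>/train_conversations.jsonl
  val_data_path: <PATH_TO_DATASETS>/val_conversations.jsonl
  train_media_data_dir: <PATH_TO_MEDIA>/videos
  val_media_data_dir: <PATH_TO_MEDIA>/videos
```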
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agree with coderabbit's comment here about not putting our internal paths in the config. Is this data also available publicly somewhere? It may be useful to provide details (e.g. under https://github.com/NVIDIA-NeMo/RL/tree/main/docs/guides) for how someone could obtain and prepare this dataset. If this data isn't available publicly, is there an alternative publicly available dataset in a similar format we can use as an example?
🌐 Web query result:
Short answer: MiraData's metadata and code are publicly released, but the raw video files are not hosted directly — the project publishes meta CSVs (versions: 330K, 93K, 42K, 9K), download scripts, and instructions so users can fetch the original videos themselves; the repo notes videos come from Internet sources and are subject to copyright and license restrictions (GPL‑3.0). [1][2][3]
🌐 Web query result:
Here are common public video (conversation / video‑QA) datasets and the typical QA annotation formats they use:
AVSD (Audio‑Visual Scene‑Aware Dialog) — dialog-style QA: each sample contains video_id, caption/summary and a Dialog array of 10 rounds (question, answer). Also a multiple‑choice “options” variant with answer_options and gt_index. [1]
TVQA — clip-level multiple‑choice QA for video+subtitles: each item has video clip id, start/end (localized moment), question, four answer candidates and the correct answer index. Frames and subtitles provided alongside. [2][3]
TGIF‑QA — short GIFs, multiple QA tasks (FrameQA, Count, Action, Transition): annotations provided as CSV/TSV per task (video_name, question, answer, etc.). [4][5]
MSVD‑QA / MSRVTT‑QA / ActivityNet‑QA — open‑ended QA: JSON files with entries like {video_id / video_name, question, answer, id}; train/val/test splits. Many repos supply CSV variants for model input. [6][7]
HowToVQA / HowToVQA69M / HowTo100M‑derived — large narrated video QA for pretraining: usually distributed as CSV/PKL lists (video_id, timestamp, question, answer) and separate feature files (S3D/S3D features). [8]
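As a concrete illustration of the open-ended `{video_id, question, answer, id}` shape described for MSVD-QA / MSRVTT-QA / ActivityNet-QA above, a single annotation entry might look like this (the concrete values are invented for illustration):

```python
import json

# One open-ended VideoQA annotation entry in the shape described above;
# the values here are made up, not taken from any real split.
entry = {
    "video_id": "video7010",
    "question": "what is the man riding?",
    "answer": "bicycle",
    "id": 42,
}
line = json.dumps(entry)
print(json.loads(line)["answer"])  # bicycle
```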
Sources: AVSD (dataset page/readme) [1]; TVQA (papers & dataset page) [2][3]; TGIF‑QA repo & description [4][5]; MSVD/MSRVTT/TGIF formats and examples from repos [6][7]; HowToVQA/HowTo100M details [8].
@yfw Based on my research, MiraData's metadata and code are publicly released, but the raw video files are not hosted directly — users must download the original videos themselves using provided scripts. The videos are subject to copyright and license restrictions.
For a better user experience, I'd recommend:
1. Use placeholder paths in the example config
Replace the hardcoded paths with placeholders as CodeRabbit suggested.
2. Create documentation for VLM datasets
I notice there's no guide for VLM dataset preparation in `docs/guides/`. Consider creating a guide (e.g., `docs/guides/vlm-datasets.md`) that explains the `GeneralConversationsJsonlDataset` format (the docstring in `nemo_rl/data/datasets/response_datasets/general_conversations_dataset.py`, lines 30-73, already provides a good example).
3. Consider a small public example
If MiraData requires complex setup, consider including a smaller, fully public example dataset (or instructions to create one) that users can quickly test with.
Would you like me to create an issue to track the documentation work for VLM dataset preparation?
@yfw Good catch. Let me find some publicly available video datasets. Feel free to suggest as well.