
feat: Omni dataloader for HF models #2016

Open
yuanhangsu1986 wants to merge 2 commits into main from yuanhangs_dev

Conversation


@yuanhangsu1986 yuanhangsu1986 commented Feb 23, 2026

What does this PR do?

Add video and audio dataloading support for HF models.

Usage

export NRL_FORCE_REBUILD_VENVS=true
uv venv
uv run python ./examples/run_vlm_sft.py cluster.gpus_per_node=8 --config ./examples/configs/sft_avlm.yaml

Additional Information

Major changes:

  • nemo_rl/data/datasets/response_datasets/general_conversations_dataset.py: base class GeneralConversationsJsonlDataset for loading the conversation-based data structure described in the file's comments.
  • nemo_rl/data/datasets/response_datasets/daily_omni.py: base class and functions for the Daily Omni dataset (a public video benchmarking dataset).
  • nemo_rl/data/datasets/processed_dataset.py: adds preprocessor logic to AllTaskProcessedDataset, which converts each dataset from its raw structure to the OpenAI-compatible message format. Putting the logic here avoids repeating the preprocessing in every downstream processor.
  • nemo_rl/data/datasets/raw_dataset.py: adds a preprocessor attribute to RawDataset, so each dataset can optionally define its own preprocessor. If left undefined, no preprocessing is applied by default.
  • nemo_rl/data/multimodal_utils.py: functions for loading omni data.
  • nemo_rl/data/llm_message_utils.py: loading omni data for HF-based LLM models.
  • examples/configs/sft_avlm.yaml: example config for loading a video conversation-based dataset for Qwen3-VL with 16 frames per video.
  • nemo_rl/algorithms/utils.py: reads audio and video configs from the tokenizer config in the config YAML file.
  • tests/unit/data/datasets/test_response_dataset.py: unit tests for the Daily Omni dataset.
  • tests/unit/data/datasets/test_general_conversations_dataset.py: unit tests for GeneralConversationsJsonlDataset.
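As an illustration of the preprocessing step described above, a minimal sketch of a per-task preprocessor is shown below. The record fields and the function body are assumptions for illustration only, not the actual Daily Omni schema or the PR's implementation:

```python
from typing import Any, Protocol


class TaskDataPreProcessFnCallable(Protocol):
    """Sketch of the per-task preprocessor interface (signature assumed)."""

    def __call__(self, raw_datum: dict[str, Any]) -> dict[str, Any]: ...


def example_video_qa_preprocessor(raw_datum: dict[str, Any]) -> dict[str, Any]:
    # Hypothetical: map a raw {"video", "question", "answer"} record into an
    # OpenAI-compatible message list before task-specific processing runs.
    return {
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "video", "video": raw_datum["video"]},
                    {"type": "text", "text": raw_datum["question"]},
                ],
            },
            {"role": "assistant", "content": raw_datum["answer"]},
        ]
    }


sample = {"video": "clip_001.mp4", "question": "What happens?", "answer": "A dog barks."}
print(example_video_qa_preprocessor(sample)["messages"][0]["role"])  # → user
```

Because the conversion happens once in AllTaskProcessedDataset, downstream processors only ever see the OpenAI-compatible message form.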

This is a resubmission of the following PR:
#1639

Dependencies

  • Added decord library for video processing support.

Summary by CodeRabbit

Release Notes

  • New Features

    • Added support for Daily-Omni and General Conversations JSONL datasets
    • Enhanced multimodal data processing for audio, video, and image content
    • New supervised fine-tuning configuration example for audio-visual models
    • Introduced per-task data preprocessing pipeline support
  • Tests

    • Added comprehensive unit tests for multimodal dataset handling
  • Chores

    • Added decord library dependency
    • Updated Docker build configuration

@yuanhangsu1986 yuanhangsu1986 requested review from a team as code owners February 23, 2026 23:18
@github-actions github-actions bot added the CI Relating to CI label Feb 23, 2026
@yuanhangsu1986 yuanhangsu1986 added CI:L1 Run doctests, unit tests, and functional tests community-request labels Feb 23, 2026
@yuanhangsu1986 yuanhangsu1986 added the needs-follow-up Issue needs follow-up label Feb 23, 2026
@coderabbitai
Copy link
Contributor

coderabbitai bot commented Feb 23, 2026

📝 Walkthrough


This PR introduces per-task preprocessing support and multimodal dataset infrastructure to NeMo RL. It adds new dataset classes (DailyOmniDataset, GeneralConversationsJsonlDataset), multimodal utilities for media loading, a TaskDataPreProcessFnCallable interface, and propagates preprocessor mappings through the training and validation pipelines. Includes Docker build optimization and comprehensive test coverage.

Changes

Cohort / File(s) Summary
Multimodal Preprocessing Infrastructure
nemo_rl/data/interfaces.py, nemo_rl/data/multimodal_utils.py
Introduces TaskDataPreProcessFnCallable protocol and comprehensive multimodal utilities including media tagging constants, functions to extract/load media from messages (images, audio, video), and default settings extraction from processors.
Dataset Base Classes & Dataset Updates
nemo_rl/data/datasets/raw_dataset.py, nemo_rl/data/datasets/processed_dataset.py, nemo_rl/data/datasets/utils.py
Adds preprocessor attribute to RawDataset, extends AllTaskProcessedDataset to accept and apply per-task preprocessors, and updates dataset setup utilities to track and propagate preprocessor mappings during training/validation.
New Dataset Implementations
nemo_rl/data/datasets/response_datasets/daily_omni.py, nemo_rl/data/datasets/response_datasets/general_conversations_dataset.py, nemo_rl/data/datasets/response_datasets/__init__.py
Adds DailyOmniDataset for video QA tasks and GeneralConversationsJsonlDataset for multimodal conversations with message processing, metadata conversion, and media extraction utilities.
Message and Data Processing
nemo_rl/data/llm_message_utils.py
Replaces image-specific processing with generalized media handling using multimodal utils; removes get_images_from_message and updates to load/process any media type dynamically.
Training Pipeline Integration
examples/run_sft.py, nemo_rl/algorithms/utils.py
Extends SFT training to pass task_data_preprocessors to datasets; adds runtime overrides for audio/video processor configurations in tokenizer initialization.
Configuration & Dependencies
examples/configs/sft_avlm.yaml, nemo_rl/models/policy/__init__.py, pyproject.toml, .github/workflows/cicd-main.yml
Adds audio/video fields to TokenizerConfig, adds decord dependency, introduces AVLM config example, and optimizes Docker build with SKIP_SGLANG_BUILD flag.
Test Coverage
tests/unit/data/datasets/test_general_conversations_dataset.py, tests/unit/data/datasets/test_response_dataset.py
Adds comprehensive unit tests for general-conversation-jsonl preprocessing and DailyOmniDataset dataset loading with multimodal validation.

Sequence Diagram

sequenceDiagram
    participant DataLoader as Data Loader
    participant RawDataset as RawDataset<br/>(+ preprocessor)
    participant Preprocessor as Task Preprocessor
    participant AllTaskProcessedDataset as AllTaskProcessedDataset<br/>(+ task_data_preprocessors)
    participant Processor as Task Processor
    participant MessageUtils as Message Utils<br/>(+ Media Loading)

    DataLoader->>RawDataset: Load dataset<br/>(e.g., DailyOmni)
    RawDataset-->>DataLoader: Return data + preprocessor
    DataLoader->>AllTaskProcessedDataset: Pass task_data_preprocessors<br/>mapping
    
    loop For each data sample
        AllTaskProcessedDataset->>Preprocessor: __call__(raw_datum)
        Preprocessor-->>AllTaskProcessedDataset: preprocessed_datum
        AllTaskProcessedDataset->>Processor: __call__(preprocessed_datum)
        Processor->>MessageUtils: load_media_from_message()
        MessageUtils-->>Processor: extracted media dict<br/>(images, audio, video)
        Processor-->>AllTaskProcessedDataset: formatted output
        AllTaskProcessedDataset-->>DataLoader: processed item
    end
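In plain Python, the per-sample dispatch in the diagram amounts to the following hedged sketch. Names follow the diagram; the actual signatures in the PR may differ:

```python
from typing import Any, Callable

Processor = Callable[[dict[str, Any]], dict[str, Any]]


def process_sample(
    raw_datum: dict[str, Any],
    task_name: str,
    task_data_preprocessors: dict[str, Processor],
    task_data_processors: dict[str, Processor],
) -> dict[str, Any]:
    # Optional per-task preprocessing: tasks without a registered
    # preprocessor pass through unchanged (the default behavior).
    pre = task_data_preprocessors.get(task_name)
    datum = pre(raw_datum) if pre is not None else raw_datum
    # Task-specific processing (tokenization, media loading, ...).
    return task_data_processors[task_name](datum)


# Toy usage: one task with a preprocessor, one without.
preprocessors = {"daily_omni": lambda d: {**d, "normalized": True}}
processors = {"daily_omni": lambda d: d, "plain": lambda d: d}
print(process_sample({"x": 1}, "daily_omni", preprocessors, processors))  # → {'x': 1, 'normalized': True}
print(process_sample({"x": 1}, "plain", preprocessors, processors))  # → {'x': 1}
```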

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related PRs

  • PR #977: Introduces the AllTaskProcessedDataset refactor and base dataset infrastructure that this PR extends with per-task preprocessing hooks and new dataset implementations.
  • PR #1807: Modifies setup_response_data in nemo_rl/data/utils.py; this PR further extends it to propagate per-task preprocessor mappings through the training pipeline.
  • PR #1649: Involves RawDataset and response-dataset pipeline modifications; this PR adds complementary dataset implementations and preprocessing API enhancements to the same infrastructure.

Suggested labels

ci/cd, data, multimodal, datasets, preprocessing

Suggested reviewers

  • yuki-97
  • terrykong
  • odelalleau
🚥 Pre-merge checks | ✅ 2 | ❌ 2

❌ Failed checks (2 warnings)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 53.33% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
Test Results For Major Changes ⚠️ Warning PR includes major changes (new dataset classes, multimodal utilities, preprocessor infrastructure) but PR description lacks test results or testing information; tests exist but no pass/fail status or validation results documented. Update PR description to include test execution status (e.g., 'all unit tests pass'), specific test scenarios run, and any performance validation results to confirm no regression.
✅ Passed checks (2 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title 'feat: Omni dataloader for HF models' accurately captures the main change: adding video/audio dataloading support for Hugging Face models via the Omni dataset infrastructure.




@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 9

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (4)
nemo_rl/models/policy/__init__.py (1)

201-208: ⚠️ Potential issue | 🟡 Minor

Document new audio/video config keys in TokenizerConfig.

Please add a Google-style class docstring (or expand an existing one) to document purpose, valid values/types, and recommended defaults for audio and video, and ensure exemplar YAMLs reflect those defaults.

As per coding guidelines "Use Google style docstrings for classes and functions" and "When adding a new config key to a TypedDict subclass, document the key's purpose, valid values/types, recommended default, and reflect the default in exemplar YAMLs under examples/configs/*.yaml".

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@nemo_rl/models/policy/__init__.py` around lines 201 - 208, Add a Google-style
class docstring to the TokenizerConfig TypedDict describing the purpose of the
TypedDict and documenting the new audio and video keys: state that audio and
video are optional multimodal config dicts, list valid value types (dict[str,
Any] or None), describe recommended defaults (e.g., audio: {} or None to
disable, video: {} or None to disable, and any recommended subkeys like
sample_rate, channels for audio or frame_rate, resolution for video), and
briefly mention chat_template_kwargs usage; then update the exemplar YAML files
under examples/configs/*.yaml to include the audio and video keys with the
recommended default values so the examples reflect the docstring defaults.
nemo_rl/data/utils.py (1)

221-248: ⚠️ Potential issue | 🟠 Major

Reset val_task_data_preprocessors per validation dataset.

In the val_data_paths loop, val_task_data_preprocessors is reused across iterations. If a later dataset has no preprocessor, the previous one leaks into its AllTaskProcessedDataset. Reinitialize per iteration.

Suggested fix
-    val_task_data_preprocessors = {}
     if "val_data_paths" in data_config and data_config["val_data_paths"]:
         ...
         for val_dataset_name, val_dataset_path in val_data_paths.items():
             ...
+            val_task_data_preprocessors = {}
             if hasattr(val_data, "preprocessor") and val_data.preprocessor is not None:
-                val_task_data_preprocessors = {
-                    val_data.task_name: val_data.preprocessor
-                }
+                val_task_data_preprocessors[val_data.task_name] = val_data.preprocessor
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@nemo_rl/data/utils.py` around lines 221 - 248, The bug is that
val_task_data_preprocessors is declared once outside the val_data_paths loop and
may leak preprocessors across datasets; fix it by moving/reinitializing
val_task_data_preprocessors inside the for val_dataset_name, val_dataset_path in
val_data_paths.items() loop (set to {} at start of each iteration), then
populate it only when hasattr(val_data, "preprocessor") and
val_data.preprocessor is not None before passing it to AllTaskProcessedDataset;
ensure val_task_data_processors is also built per-iteration using
load_preference_dataset and val_data.task_name so each AllTaskProcessedDataset
gets only its own processors and preprocessors.
pyproject.toml (1)

18-53: ⚠️ Potential issue | 🔴 Critical

Pin decord to a specific version, with awareness of Python 3.12 compatibility.

The latest stable decord release is 0.6.0 (June 2021), which lacks pre-built wheels for Python 3.12. Installing decord on Python 3.12 will require building from source, which may fail depending on your FFmpeg setup and build toolchain. Pin decord==0.6.0 explicitly, and document any required build steps for Python 3.12, or verify that source builds work reliably in your environment before merging.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pyproject.toml` around lines 18 - 53, Pin the decord dependency to a specific
version by changing the "decord" entry to "decord==0.6.0" and add a short
comment noting Python 3.12 has no prebuilt wheels (so building from source may
require FFmpeg and a proper toolchain), or alternatively verify/source-build in
CI and document required build steps for Python 3.12 to avoid breakage; update
the dependency line for "decord" and include the explanatory comment nearby.
nemo_rl/data/multimodal_utils.py (1)

1-1: ⚠️ Potential issue | 🟡 Minor

Copyright year is 2025; current year is 2026.

The file has substantial new code. Consider updating the copyright header to 2026.

As per coding guidelines, "Add the NVIDIA copyright header (with current year) to all Python files and shell scripts, excluding tests."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@nemo_rl/data/multimodal_utils.py` at line 1, Update the copyright header at
the top of the file to use the current year 2026: replace the 2025 year in the
existing NVIDIA copyright header comment (the file-level header at the very top
of nemo_rl/data/multimodal_utils.py) so the header reads 2026 instead of 2025,
ensuring it matches the project's required header format for Python files.
🧹 Nitpick comments (12)
nemo_rl/data/datasets/raw_dataset.py (1)

26-35: Initialize preprocessor in the constructor to satisfy class-member init guideline.

RawDataset exposes preprocessor publicly; per guidelines, it should be initialized in __init__ alongside other members. Consider adding an explicit constructor (or a dataclass) to set these fields.

As per coding guidelines "Initialize all externally visible members of a class in the constructor".

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@nemo_rl/data/datasets/raw_dataset.py` around lines 26 - 35, Add an explicit
constructor for class RawDataset that initializes all externally visible members
(data_config, dataset, val_dataset, processor, task_spec) and sets preprocessor
to None (or a provided value) so preprocessor is not left uninitialized;
implement __init__ on RawDataset to accept and assign these fields (or convert
the class to a dataclass with defaults) to satisfy the "initialize members in
constructor" guideline.
nemo_rl/algorithms/utils.py (2)

323-355: Add stacklevel=2 to warnings.warn calls so warnings point to the caller.

All three warnings.warn calls (lines 330, 341, 352) default to stacklevel=1, which causes the warning to reference this utility function rather than the calling code that configured the override.

Proposed fix (example for line 330; apply similarly to lines 341 and 352)
                 warnings.warn(
-                    f"Overriding audio sampling rate from {processor.feature_extractor.sampling_rate} to {new_sampling_rate}"
+                    f"Overriding audio sampling rate from {processor.feature_extractor.sampling_rate} to {new_sampling_rate}",
+                    stacklevel=2,
                 )
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@nemo_rl/algorithms/utils.py` around lines 323 - 355, The three warnings.warn
calls in the function that adjusts processor feature/video settings (the calls
that override processor.feature_extractor.sampling_rate,
processor.video_processor.fps, and processor.video_processor.num_frames) should
include stacklevel=2 so the warning points at the caller; update each
warnings.warn invocation (the ones emitting "Overriding audio sampling rate...",
"Overriding video fps...", and "Overriding video num_frames...") to pass
stacklevel=2 while keeping the existing message and variable usage
(tokenizer_config checks and assignments to
processor.feature_extractor.sampling_rate, processor.video_processor.fps,
processor.video_processor.num_frames).

345-355: Consider validating mutual exclusivity of fps and num_frames upfront.

The comment on line 345 acknowledges the conflict but defers it. An explicit check with a clear error message here would be more user-friendly than a cryptic failure later in the video processor.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@nemo_rl/algorithms/utils.py` around lines 345 - 355, Check for mutual
exclusivity of fps and num_frames in tokenizer_config["video"] before mutating
processor.video_processor.num_frames: if both "fps" and "num_frames" are
present, validate they are not contradictory and raise a clear ValueError (or
choose one deterministically) instead of letting the video processor fail later.
Specifically, in the block handling tokenizer_config["video"] (referencing
tokenizer_config, processor.video_processor.num_frames, "fps", and
"num_frames"), add an upfront check that raises a descriptive error like "Cannot
set both fps and num_frames in tokenizer_config['video']" (or compare values and
only warn/override when consistent) before assigning
processor.video_processor.num_frames.
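The upfront check suggested in this comment could look like the following minimal sketch. The helper name validate_video_override and the config shape are assumptions, not code from the PR:

```python
from typing import Any


def validate_video_override(video_cfg: dict[str, Any]) -> None:
    # Hypothetical helper: fail fast when both frame-sampling knobs are set,
    # instead of letting the video processor fail later with a cryptic error.
    if "fps" in video_cfg and "num_frames" in video_cfg:
        raise ValueError(
            "Cannot set both 'fps' and 'num_frames' in tokenizer_config['video']; "
            "choose one frame-sampling strategy."
        )


validate_video_override({"num_frames": 16})  # OK: only one knob set
try:
    validate_video_override({"fps": 2, "num_frames": 16})
except ValueError as e:
    print("rejected:", e)
```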
nemo_rl/data/datasets/processed_dataset.py (1)

33-58: Document task_data_preprocessors in the class docstring.

The Args section (lines 36–44) documents task_data_processors and max_seq_length but omits the new task_data_preprocessors parameter. Adding a brief entry keeps the docstring consistent with the constructor signature.

📝 Proposed docstring addition
         task_data_processors: Either a single TaskDataProcessFnCallable for single-task,
             or a dict mapping task names to (TaskDataSpec, TaskDataProcessFnCallable) for multi-task
+        task_data_preprocessors: Optional preprocessing hook applied before task-specific processing.
+            Either a single TaskDataPreProcessFnCallable for all tasks,
+            or a dict mapping task names to TaskDataPreProcessFnCallable.
         max_seq_length: Maximum sequence length for tokenized outputs
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@nemo_rl/data/datasets/processed_dataset.py` around lines 33 - 58, The class
docstring for AllTaskProcessedDataset is missing documentation for the
constructor parameter task_data_preprocessors; update the Args section to add a
short entry describing task_data_preprocessors (type: Optional[Union[dict[str,
TaskDataPreProcessFnCallable], TaskDataPreProcessFnCallable]], default: None),
explaining it can be a single preprocessor applied to all examples or a dict
mapping task names to task-specific preprocessors and that missing tasks fall
back to default behavior; reference the parameter name task_data_preprocessors
and keep the wording consistent with the existing entries for
task_data_processors and default_task_data_spec.
nemo_rl/data/datasets/response_datasets/general_conversations_dataset.py (2)

38-67: convert_metadata returns None when return_inplace=True — confusing API.

When return_inplace=True, the function mutates metadata in place and returns None implicitly. When False, it returns a new dict. The parameter name is also inverted from the typical inplace convention. Consider renaming to inplace (default False) or always returning the result to avoid caller confusion.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@nemo_rl/data/datasets/response_datasets/general_conversations_dataset.py`
around lines 38 - 67, The function convert_metadata currently uses
return_inplace (inverted naming) and returns None when return_inplace=True,
causing a confusing API; rename the parameter to inplace: bool = False (or keep
name but invert semantics) and ensure convert_metadata always returns the
processed dict (variable data) even when mutating the input, so callers get the
result; update logic that chooses data = metadata or metadata.copy(), keep the
mapping loops that reference multimodal_utils.MEDIA_TAGS_TO_ALLOWED and
multimodal_utils.MEDIA_TAGS unchanged, and update any callers to the new
parameter name if renamed.

26-26: Unused _DEBUG variable — remove before merging.

_DEBUG = True is never referenced in this file.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@nemo_rl/data/datasets/response_datasets/general_conversations_dataset.py` at
line 26, The file defines an unused module-level variable named _DEBUG which is
never referenced; remove the _DEBUG = True declaration from
general_conversations_dataset (delete the unused symbol) to eliminate dead code
and avoid misleading debug flags.
nemo_rl/data/llm_message_utils.py (1)

607-613: Consider making the media-key-to-kwarg mapping explicit and extensible.

The hardcoded if/elif chain for mapping "image"→"images", "audio"→"audio", "video"→"videos" will need updating when new modalities are added. A small mapping dict would be more maintainable.

♻️ Example
-            media_kwargs = {}
-            if "image" in media_cur_message:
-                media_kwargs["images"] = media_cur_message["image"]
-            if "audio" in media_cur_message:
-                media_kwargs["audio"] = media_cur_message["audio"]
-            if "video" in media_cur_message:
-                media_kwargs["videos"] = media_cur_message["video"]
+            MEDIA_KEY_TO_KWARG = {"image": "images", "audio": "audio", "video": "videos"}
+            media_kwargs = {
+                MEDIA_KEY_TO_KWARG[k]: v
+                for k, v in media_cur_message.items()
+                if k in MEDIA_KEY_TO_KWARG
+            }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@nemo_rl/data/llm_message_utils.py` around lines 607 - 613, Replace the
hardcoded if-chain that builds media_kwargs from media_cur_message with an
explicit mapping dict (e.g., MEDIA_KEY_TO_KWARG) and iterate over its items to
populate media_kwargs; locate the code block that references media_cur_message
and media_kwargs in nemo_rl.data.llm_message_utils (the snippet that currently
checks "image", "audio", "video") and change it to consult the mapping so new
modalities can be added by updating the dict rather than editing conditional
logic.
examples/run_sft.py (1)

104-153: Validation preprocessor wiring is correct.

Both paths (split-from-train and explicit validation config) properly collect and propagate preprocessors.

Minor note: since RawDataset declares preprocessor as a class attribute (line 33 of raw_dataset.py), the hasattr checks on lines 89 and 142 are always True for RawDataset subclasses. You could simplify to just data.preprocessor is not None, but the current form is safely defensive.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@examples/run_sft.py` around lines 104 - 153, The hasattr(...) checks for
"preprocessor" are unnecessary because RawDataset declares preprocessor as a
class attribute; replace the two occurrences where you do hasattr(data,
"preprocessor") and hasattr(val_data, "preprocessor") with direct checks that
the attribute is not None (e.g., data.preprocessor is not None and
val_data.preprocessor is not None) so the code uses the actual None test on the
preprocessor before wiring it into val_task_data_preprocessors (references:
RawDataset, preprocessor, variables data and val_data, and
val_task_data_preprocessors).
nemo_rl/data/datasets/response_datasets/daily_omni.py (1)

73-78: Add extraction filter to prevent path-traversal risks (Ruff S202).

tarfile.extractall() without a filter can be exploited for path traversal attacks. Python 3.12+ supports filter='data' to block absolute paths and parent-directory traversals. Even though the tar is downloaded from a trusted HuggingFace repository, add the filter for defense in depth.

Apply filter parameter
                 with tarfile.open(archive_filename, "r:*") as tar:
-                    # Extract all contents to the specified path
-                    tar.extractall(path=self.hf_cache_dir)
+                    tar.extractall(path=self.hf_cache_dir, filter="data")
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@nemo_rl/data/datasets/response_datasets/daily_omni.py` around lines 73 - 78,
The tar extraction in the try block uses
tarfile.extractall(path=self.hf_cache_dir) which is vulnerable to
path-traversal; update the call to pass the filter parameter to block absolute
and parent-directory paths (use filter='data' on Python 3.12+ or equivalent safe
extraction logic) so extraction of archive_filename into self.hf_cache_dir is
validated before writing (affects the block referencing archive_filename,
self.hf_cache_dir and files_folder).
nemo_rl/data/multimodal_utils.py (3)

321-325: Missing docstring for load_media_from_message.

This function is part of the public API consumed by llm_message_utils.py. A Google-style docstring explaining the parameters, return value, and the fallback behavior would help maintainability.

As per coding guidelines, "For interfaces that may be used outside a file, prefer docstrings over comments" and "Use Google style docstrings for classes and functions, which can be parsed by Sphinx."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@nemo_rl/data/multimodal_utils.py` around lines 321 - 325, Add a Google-style
docstring to the public function load_media_from_message describing parameters,
return value, and fallback behavior: document message (expected keys/structure),
processor (type and when used), multimodal_load_kwargs (shape: mapping of media
type to kwargs and defaults), and the returned dict[str, list[Any]] format;
include behavior when media is missing or when processor is None, examples of
supported media types, and any exceptions raised or swallowed. Place the
docstring immediately under the def load_media_from_message(...) signature so
external modules like llm_message_utils.py can rely on the documented API.

219-261: Missing docstring for a public function used outside this file; consider replacing function-attribute caching.

get_multimodal_default_settings_from_processor is referenced by llm_message_utils.py per the summary, so it should have a docstring per coding guidelines.

Also, the function-attribute caching pattern (lines 237–242, 249–254) is unconventional and not thread-safe. A module-level functools.lru_cache or a simple module-level variable would be more idiomatic. Additionally, the list comprehensions [param for param in ...] are redundant — list(...) suffices.

♻️ Simplify signature caching
+import functools
+
+
+@functools.lru_cache(maxsize=1)
+def _load_video_param_names() -> list[str]:
+    return list(inspect.signature(load_video).parameters)
+
+
+@functools.lru_cache(maxsize=1)
+def _load_audio_param_names() -> list[str]:
+    return list(inspect.signature(load_audio).parameters)
+
+
 def get_multimodal_default_settings_from_processor(
     processor,
 ) -> dict[str, dict[str, Any]]:
+    """Extract default video/audio loading kwargs from a processor's sub-components."""
     ...
-        if not hasattr(
-            get_multimodal_default_settings_from_processor, "load_video_kwargs"
-        ):
-            get_multimodal_default_settings_from_processor.load_video_kwargs = [
-                param for param in inspect.signature(load_video).parameters
-            ]
         default_settings["video"] = {
             arg: video_settings_dict[arg]
-            for arg in get_multimodal_default_settings_from_processor.load_video_kwargs
+            for arg in _load_video_param_names()
             if arg in video_settings_dict
         }

(Apply analogous change for load_audio_kwargs.)

As per coding guidelines, "For interfaces that may be used outside a file, prefer docstrings over comments."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@nemo_rl/data/multimodal_utils.py` around lines 219 - 261, The public function
get_multimodal_default_settings_from_processor lacks a docstring and uses unsafe
function-attribute caching and redundant list comprehensions; add a concise
docstring describing purpose, args, and return value, and replace the
function-attribute caches
get_multimodal_default_settings_from_processor.load_video_kwargs and
.load_audio_kwargs with a module-level cache (either simple module-level
variables or a `@functools.lru_cache-decorated` helper) that computes
list(inspect.signature(load_video).parameters) and
list(inspect.signature(load_audio).parameters) once, and update the list
comprehensions to use list(...) instead of [param for param in ...]; ensure the
rest of the function (video_settings_dict/feature_extractor usage and
default_settings keys) remains unchanged.

315-317: Conditional expression used solely for side effects hurts readability.

The ternary media[tag].extend(...) if isinstance(...) else media[tag].append(...) is used as a statement for its side effects only — the return value is discarded. This is a known anti-pattern in Python. A regular if/else block is clearer here.

♻️ Replace ternary statement with explicit if/else
         tag = item["type"]
         if tag in MEDIA_TAGS:
-            media[tag].extend(list(item[tag])) if isinstance(
-                item[tag], (list, tuple)
-            ) else media[tag].append(item[tag])
+            value = item[tag]
+            if isinstance(value, (list, tuple)):
+                media[tag].extend(value)
+            else:
+                media[tag].append(value)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@nemo_rl/data/multimodal_utils.py` around lines 315 - 317, Replace the ternary
used for side effects with a clear if/else: locate the statement that currently
does "media[tag].extend(list(item[tag])) if isinstance(item[tag], (list, tuple))
else media[tag].append(item[tag])" and change it to an explicit if
isinstance(item[tag], (list, tuple)): media[tag].extend(list(item[tag])) else:
media[tag].append(item[tag]). This preserves the same behavior for
media[tag].extend/append and improves readability; keep the same variable names
(media, tag, item) and surrounding logic.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@nemo_rl/data/datasets/response_datasets/daily_omni.py`:
- Line 1: The file daily_omni.py has an incorrect copyright header starting with
"##" instead of the repo-standard single "#" header; open
nemo_rl/data/datasets/response_datasets/daily_omni.py and replace the top header
so it matches the other Python files (use "# Copyright (c) 2025, NVIDIA
CORPORATION.  All rights reserved." with a single leading '#' and same
spacing/phrasing), ensuring the header sits at the very top of the file.
- Around line 27-32: The class docstring for DailyOmniDataset incorrectly
mentions "CLEVR-CoGenT" due to a copy-paste error; update the docstring in class
DailyOmniDataset to describe the Daily Omni dataset (e.g., replace "Simple
wrapper around the CLEVR-CoGenT dataset." with a concise description referencing
the Daily Omni dataset) and keep the Args section (split) unchanged.
- Around line 85-90: Fix the typo and enable exception chaining in the tar
handling block: change the misspelled tarfile.ReadErro to tarfile.ReadError in
the except block and re-raise the ReadError with the new message using exception
chaining (raise tarfile.ReadError("...") from e). Likewise, in the generic
except Exception as e block, re-raise the new Exception with the formatted
message using "from e" so the original traceback is preserved; locate the
tarfile handling where tarfile.ReadError and the variable e are used in
daily_omni.py.

In `@nemo_rl/data/datasets/response_datasets/general_conversations_dataset.py`:
- Around line 218-234: The return type annotation for process_message_fragment
is wrong: the function builds and returns a list (ret) of dicts, not a single
dict. Update the signature of process_message_fragment to return list[dict[str,
Any]] (or Sequence[Mapping[str, Any]] if you prefer an abstract type) and ensure
any callers expecting dict are adjusted; reference the function name
process_message_fragment and the local variable ret and the loop over
tag.split("-") to locate the code to change.
- Around line 148-153: Fix the typos in the class docstring for
GeneralConversationsJsonlDataset (subclass of RawDataset): change
"converstaions" to "conversations" and "requiement" to "requirement" so the
documentation reads correctly about jsonl datasets and media tag placement.
- Around line 107-118: The code is adding the file extension string to
tried_default_extensions and also assumes filenames always contain a '.',
causing an IndexError; update the logic in general_conversations_dataset.py
where ext is computed from metadata[ tag ][ media_index[tag] ] to safely extract
the extension (use os.path.splitext or check for '.' and handle missing
extension by setting ext to an empty string) and change the set insertion to
tried_default_extensions.add(tag) (not ext) so the guard `tag not in
tried_default_extensions` works as intended; keep the loop over
multimodal_utils.DEFAULT_MEDIA_EXTENSIONS[tag] unchanged but ensure you only add
the media type tag to tried_default_extensions when you decide to skip trying
defaults.

In `@nemo_rl/data/datasets/utils.py`:
- Around line 18-24: The function get_huggingface_cache_path should be updated
to use the public API and improve typing and error handling: replace the private
_scan_cached_repo import/use with huggingface_hub.scan_cache_dir (update the
import), add a Google-style docstring describing parameters/return and behavior,
add precise type hints for parameters and return (Python 3.12+ style), narrow
the broad except Exception to specific exceptions such as OSError and
ValueError, and guard against calling max() on an empty revs mapping by
returning an appropriate fallback or raising a clear ValueError; locate these
changes around the get_huggingface_cache_path function and any imports
referencing _scan_cached_repo.
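A public-API rewrite might look like the sketch below (the function signature and fallback behavior are assumptions; the real `get_huggingface_cache_path` may differ):

```python
from pathlib import Path
from typing import Optional

from huggingface_hub import scan_cache_dir
from huggingface_hub.utils import CacheNotFound


def get_huggingface_cache_path(repo_id: str, repo_type: str = "dataset") -> Optional[Path]:
    """Return the newest cached snapshot path for ``repo_id``, or ``None``.

    Uses the public ``scan_cache_dir`` API instead of the private
    ``_scan_cached_repo`` helper, narrows the exception handling, and guards
    against calling ``max()`` on an empty revisions set.
    """
    try:
        cache_info = scan_cache_dir()
    except (CacheNotFound, OSError, ValueError):
        return None
    for repo in cache_info.repos:
        if repo.repo_id == repo_id and repo.repo_type == repo_type:
            if not repo.revisions:  # avoid max() on an empty collection
                return None
            newest = max(repo.revisions, key=lambda rev: rev.last_modified)
            return newest.snapshot_path
    return None
```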

In `@nemo_rl/data/multimodal_utils.py`:
- Around line 344-347: Replace the runtime assertion that checks for "audio" and
"sampling_rate" in multimodal_load_kwargs with explicit validation and a clear
exception: verify that "audio" is a key in multimodal_load_kwargs and that
"sampling_rate" is present in multimodal_load_kwargs["audio"], and if either
check fails raise a ValueError with a descriptive message (e.g., referencing
multimodal_load_kwargs, "audio", and "sampling_rate") instead of using assert so
the check cannot be stripped under python -O; update the location where this
change is made around the multimodal_load_kwargs handling in multimodal_utils.py
(the block containing the current assert).
- Around line 348-364: Narrow the broad except in the load_audio fallback: catch
only expected errors from load_audio (e.g., RuntimeError, FileNotFoundError,
ValueError) instead of bare Exception; either log the caught error (use the
module logger via logger.warning with the exception info) or drop the unused "as
e" binding; replace the print("audio loading failed...") with logger.warning
including the error details and context (e.g., which file/path and that we are
falling back to decord). Also ensure the call to
get_dim_to_pack_along(processor, "audio") is safe when processor is None by
adding an explicit guard or documenting that processor may be None (or
defaulting to 0) so the fallback code's slicing behavior remains clear.

---

Outside diff comments:
In `@nemo_rl/data/multimodal_utils.py`:
- Line 1: Update the copyright header at the top of the file to use the current
year 2026: replace the 2025 year in the existing NVIDIA copyright header comment
(the file-level header at the very top of nemo_rl/data/multimodal_utils.py) so
the header reads 2026 instead of 2025, ensuring it matches the project's
required header format for Python files.

In `@nemo_rl/data/utils.py`:
- Around line 221-248: The bug is that val_task_data_preprocessors is declared
once outside the val_data_paths loop and may leak preprocessors across datasets;
fix it by moving/reinitializing val_task_data_preprocessors inside the for
val_dataset_name, val_dataset_path in val_data_paths.items() loop (set to {} at
start of each iteration), then populate it only when hasattr(val_data,
"preprocessor") and val_data.preprocessor is not None before passing it to
AllTaskProcessedDataset; ensure val_task_data_processors is also built
per-iteration using load_preference_dataset and val_data.task_name so each
AllTaskProcessedDataset gets only its own processors and preprocessors.
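The per-iteration reset can be demonstrated in isolation (the dataclass and function names below are stand-ins for the real symbols in `nemo_rl/data/utils.py`):

```python
from dataclasses import dataclass
from typing import Any, Callable, Optional


@dataclass
class FakeValData:  # stand-in for a loaded validation dataset
    task_name: str
    preprocessor: Optional[Callable[[dict], dict]] = None


def build_preprocessor_maps(val_datasets: list[FakeValData]) -> list[dict[str, Any]]:
    maps = []
    for val_data in val_datasets:
        # Reinitialize inside the loop so one dataset's preprocessor
        # cannot leak into the next dataset's map.
        val_task_data_preprocessors: dict[str, Any] = {}
        if val_data.preprocessor is not None:
            val_task_data_preprocessors[val_data.task_name] = val_data.preprocessor
        maps.append(val_task_data_preprocessors)
    return maps
```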

In `@nemo_rl/models/policy/__init__.py`:
- Around line 201-208: Add a Google-style class docstring to the TokenizerConfig
TypedDict describing the purpose of the TypedDict and documenting the new audio
and video keys: state that audio and video are optional multimodal config dicts,
list valid value types (dict[str, Any] or None), describe recommended defaults
(e.g., audio: {} or None to disable, video: {} or None to disable, and any
recommended subkeys like sample_rate, channels for audio or frame_rate,
resolution for video), and briefly mention chat_template_kwargs usage; then
update the exemplar YAML files under examples/configs/*.yaml to include the
audio and video keys with the recommended default values so the examples reflect
the docstring defaults.

In `@pyproject.toml`:
- Around line 18-53: Pin the decord dependency to a specific version by changing
the "decord" entry to "decord==0.6.0" and add a short comment noting Python 3.12
has no prebuilt wheels (so building from source may require FFmpeg and a proper
toolchain), or alternatively verify/source-build in CI and document required
build steps for Python 3.12 to avoid breakage; update the dependency line for
"decord" and include the explanatory comment nearby.

---

Nitpick comments:
In `@examples/run_sft.py`:
- Around line 104-153: The hasattr(...) checks for "preprocessor" are
unnecessary because RawDataset declares preprocessor as a class attribute;
replace the two occurrences where you do hasattr(data, "preprocessor") and
hasattr(val_data, "preprocessor") with direct checks that the attribute is not
None (e.g., data.preprocessor is not None and val_data.preprocessor is not None)
so the code uses the actual None test on the preprocessor before wiring it into
val_task_data_preprocessors (references: RawDataset, preprocessor, variables
data and val_data, and val_task_data_preprocessors).

In `@nemo_rl/algorithms/utils.py`:
- Around line 323-355: The three warnings.warn calls in the function that
adjusts processor feature/video settings (the calls that override
processor.feature_extractor.sampling_rate, processor.video_processor.fps, and
processor.video_processor.num_frames) should include stacklevel=2 so the warning
points at the caller; update each warnings.warn invocation (the ones emitting
"Overriding audio sampling rate...", "Overriding video fps...", and "Overriding
video num_frames...") to pass stacklevel=2 while keeping the existing message
and variable usage (tokenizer_config checks and assignments to
processor.feature_extractor.sampling_rate, processor.video_processor.fps,
processor.video_processor.num_frames).
- Around line 345-355: Check for mutual exclusivity of fps and num_frames in
tokenizer_config["video"] before mutating processor.video_processor.num_frames:
if both "fps" and "num_frames" are present, validate they are not contradictory
and raise a clear ValueError (or choose one deterministically) instead of
letting the video processor fail later. Specifically, in the block handling
tokenizer_config["video"] (referencing tokenizer_config,
processor.video_processor.num_frames, "fps", and "num_frames"), add an upfront
check that raises a descriptive error like "Cannot set both fps and num_frames
in tokenizer_config['video']" (or compare values and only warn/override when
consistent) before assigning processor.video_processor.num_frames.

In `@nemo_rl/data/datasets/processed_dataset.py`:
- Around line 33-58: The class docstring for AllTaskProcessedDataset is missing
documentation for the constructor parameter task_data_preprocessors; update the
Args section to add a short entry describing task_data_preprocessors (type:
Optional[Union[dict[str, TaskDataPreProcessFnCallable],
TaskDataPreProcessFnCallable]], default: None), explaining it can be a single
preprocessor applied to all examples or a dict mapping task names to
task-specific preprocessors and that missing tasks fall back to default
behavior; reference the parameter name task_data_preprocessors and keep the
wording consistent with the existing entries for task_data_processors and
default_task_data_spec.

In `@nemo_rl/data/datasets/raw_dataset.py`:
- Around line 26-35: Add an explicit constructor for class RawDataset that
initializes all externally visible members (data_config, dataset, val_dataset,
processor, task_spec) and sets preprocessor to None (or a provided value) so
preprocessor is not left uninitialized; implement __init__ on RawDataset to
accept and assign these fields (or convert the class to a dataclass with
defaults) to satisfy the "initialize members in constructor" guideline.
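An explicit constructor could be sketched as follows (field types are assumptions; the real `RawDataset` members may be typed differently):

```python
from typing import Any, Callable, Optional


class RawDataset:
    """Base class with all externally visible members set in the constructor."""

    def __init__(
        self,
        data_config: Optional[dict[str, Any]] = None,
        dataset: Any = None,
        val_dataset: Any = None,
        processor: Any = None,
        task_spec: Any = None,
        preprocessor: Optional[Callable] = None,
    ) -> None:
        self.data_config = data_config
        self.dataset = dataset
        self.val_dataset = val_dataset
        self.processor = processor
        self.task_spec = task_spec
        self.preprocessor = preprocessor  # defaults to None, never undefined
```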

In `@nemo_rl/data/datasets/response_datasets/daily_omni.py`:
- Around line 73-78: The tar extraction in the try block uses
tarfile.extractall(path=self.hf_cache_dir) which is vulnerable to
path-traversal; update the call to pass the filter parameter to block absolute
and parent-directory paths (use filter='data' on Python 3.12+ or equivalent safe
extraction logic) so extraction of archive_filename into self.hf_cache_dir is
validated before writing (affects the block referencing archive_filename,
self.hf_cache_dir and files_folder).

In `@nemo_rl/data/datasets/response_datasets/general_conversations_dataset.py`:
- Around line 38-67: The function convert_metadata currently uses return_inplace
(inverted naming) and returns None when return_inplace=True, causing a confusing
API; rename the parameter to inplace: bool = False (or keep name but invert
semantics) and ensure convert_metadata always returns the processed dict
(variable data) even when mutating the input, so callers get the result; update
logic that chooses data = metadata or metadata.copy(), keep the mapping loops
that reference multimodal_utils.MEDIA_TAGS_TO_ALLOWED and
multimodal_utils.MEDIA_TAGS unchanged, and update any callers to the new
parameter name if renamed.
- Line 26: The file defines an unused module-level variable named _DEBUG which
is never referenced; remove the _DEBUG = True declaration from
general_conversations_dataset (delete the unused symbol) to eliminate dead code
and avoid misleading debug flags.
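The renamed `convert_metadata` API suggested above could be sketched like this (the media-tag mapping body is elided; only the inplace/return contract is shown):

```python
from typing import Any


def convert_metadata(metadata: dict[str, Any], inplace: bool = False) -> dict[str, Any]:
    """Always return the processed dict, whether or not the input was mutated."""
    data = metadata if inplace else dict(metadata)
    # ... mapping over MEDIA_TAGS_TO_ALLOWED / MEDIA_TAGS would happen here ...
    return data
```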

In `@nemo_rl/data/llm_message_utils.py`:
- Around line 607-613: Replace the hardcoded if-chain that builds media_kwargs
from media_cur_message with an explicit mapping dict (e.g., MEDIA_KEY_TO_KWARG)
and iterate over its items to populate media_kwargs; locate the code block that
references media_cur_message and media_kwargs in nemo_rl.data.llm_message_utils
(the snippet that currently checks "image", "audio", "video") and change it to
consult the mapping so new modalities can be added by updating the dict rather
than editing conditional logic.
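The mapping-based approach could look like the sketch below; the kwarg names (`images`, `audio`, `videos`) are assumptions about what the processor expects, not confirmed from the source:

```python
from typing import Any

# Hypothetical mapping from media keys in the message to processor kwarg names.
MEDIA_KEY_TO_KWARG = {
    "image": "images",
    "audio": "audio",
    "video": "videos",
}


def build_media_kwargs(media_cur_message: dict[str, Any]) -> dict[str, Any]:
    """Populate media kwargs from a mapping instead of a hardcoded if-chain."""
    media_kwargs: dict[str, Any] = {}
    for media_key, kwarg_name in MEDIA_KEY_TO_KWARG.items():
        if media_cur_message.get(media_key):
            media_kwargs[kwarg_name] = media_cur_message[media_key]
    return media_kwargs
```

New modalities can then be supported by adding one entry to the dict.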

In `@nemo_rl/data/multimodal_utils.py`:
- Around line 321-325: Add a Google-style docstring to the public function
load_media_from_message describing parameters, return value, and fallback
behavior: document message (expected keys/structure), processor (type and when
used), multimodal_load_kwargs (shape: mapping of media type to kwargs and
defaults), and the returned dict[str, list[Any]] format; include behavior when
media is missing or when processor is None, examples of supported media types,
and any exceptions raised or swallowed. Place the docstring immediately under
the def load_media_from_message(...) signature so external modules like
llm_message_utils.py can rely on the documented API.
- Around line 219-261: The public function
get_multimodal_default_settings_from_processor lacks a docstring and uses unsafe
function-attribute caching and redundant list comprehensions; add a concise
docstring describing purpose, args, and return value, and replace the
function-attribute caches
get_multimodal_default_settings_from_processor.load_video_kwargs and
.load_audio_kwargs with a module-level cache (either simple module-level
variables or a `@functools.lru_cache-decorated` helper) that computes
list(inspect.signature(load_video).parameters) and
list(inspect.signature(load_audio).parameters) once, and update the list
comprehensions to use list(...) instead of [param for param in ...]; ensure the
rest of the function (video_settings_dict/feature_extractor usage and
default_settings keys) remains unchanged.
- Around line 315-317: Replace the ternary used for side effects with a clear
if/else: locate the statement that currently does
"media[tag].extend(list(item[tag])) if isinstance(item[tag], (list, tuple)) else
media[tag].append(item[tag])" and change it to an explicit if
isinstance(item[tag], (list, tuple)): media[tag].extend(list(item[tag])) else:
media[tag].append(item[tag]). This preserves the same behavior for
media[tag].extend/append and improves readability; keep the same variable names
(media, tag, item) and surrounding logic.

ℹ️ Review info

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 9148186 and ee953ca.

⛔ Files ignored due to path filters (1)
  • uv.lock is excluded by !**/*.lock
📒 Files selected for processing (20)
  • .github/workflows/cicd-main.yml
  • examples/configs/recipes/llm/performance/.grpo-deepseek-v3-32n4g.yaml.swp
  • examples/configs/recipes/llm/performance/.grpo-deepseek-v3-32n8g.yaml.swp
  • examples/configs/sft_avlm.yaml
  • examples/run_sft.py
  • nemo_rl/algorithms/utils.py
  • nemo_rl/data/datasets/processed_dataset.py
  • nemo_rl/data/datasets/raw_dataset.py
  • nemo_rl/data/datasets/response_datasets/__init__.py
  • nemo_rl/data/datasets/response_datasets/daily_omni.py
  • nemo_rl/data/datasets/response_datasets/general_conversations_dataset.py
  • nemo_rl/data/datasets/utils.py
  • nemo_rl/data/interfaces.py
  • nemo_rl/data/llm_message_utils.py
  • nemo_rl/data/multimodal_utils.py
  • nemo_rl/data/utils.py
  • nemo_rl/models/policy/__init__.py
  • pyproject.toml
  • tests/unit/data/datasets/test_general_conversations_dataset.py
  • tests/unit/data/datasets/test_response_dataset.py

@@ -0,0 +1,140 @@
## Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved.
⚠️ Potential issue | 🟡 Minor

Copyright header starts with ## instead of #.

All other files in the repo use # Copyright (c) 2025, NVIDIA CORPORATION. — this file has ## Copyright, which is inconsistent. As per coding guidelines, the NVIDIA copyright header should be added to all Python source files under nemo_rl/.

Proposed fix
-## Copyright (c) 2025, NVIDIA CORPORATION.  All rights reserved.
+# Copyright (c) 2025, NVIDIA CORPORATION.  All rights reserved.
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
-## Copyright (c) 2025, NVIDIA CORPORATION.  All rights reserved.
+# Copyright (c) 2025, NVIDIA CORPORATION.  All rights reserved.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@nemo_rl/data/datasets/response_datasets/daily_omni.py` at line 1, The file
daily_omni.py has an incorrect copyright header starting with "##" instead of
the repo-standard single "#" header; open
nemo_rl/data/datasets/response_datasets/daily_omni.py and replace the top header
so it matches the other Python files (use "# Copyright (c) 2025, NVIDIA
CORPORATION.  All rights reserved." with a single leading '#' and same
spacing/phrasing), ensuring the header sits at the very top of the file.

Comment on lines 27 to 32
class DailyOmniDataset(RawDataset):
    """Simple wrapper around the CLEVR-CoGenT dataset.

    Args:
        split: Split name for the dataset, default is "train"
    """
⚠️ Potential issue | 🟡 Minor

Docstring refers to "CLEVR-CoGenT" — this is the Daily Omni dataset.

Copy-paste artifact from another dataset class.

Proposed fix
 class DailyOmniDataset(RawDataset):
-    """Simple wrapper around the CLEVR-CoGenT dataset.
+    """Dataset wrapper for the Daily-Omni public video benchmarking dataset.
 
     Args:
         split: Split name for the dataset, default is "train"
📝 Committable suggestion


Suggested change
 class DailyOmniDataset(RawDataset):
-    """Simple wrapper around the CLEVR-CoGenT dataset.
+    """Dataset wrapper for the Daily-Omni public video benchmarking dataset.

     Args:
         split: Split name for the dataset, default is "train"
     """
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@nemo_rl/data/datasets/response_datasets/daily_omni.py` around lines 27 - 32,
The class docstring for DailyOmniDataset incorrectly mentions "CLEVR-CoGenT" due
to a copy-paste error; update the docstring in class DailyOmniDataset to
describe the Daily Omni dataset (e.g., replace "Simple wrapper around the
CLEVR-CoGenT dataset." with a concise description referencing the Daily Omni
dataset) and keep the Args section (split) unchanged.

Comment on lines 85 to 90
            except tarfile.ReadError:
                raise tarfile.ReadErro(
                    "Error: Could not read the tar file. It might be corrupted or not a tar file."
                )
            except Exception as e:
                raise Exception(f"An unexpected error occurred: {e}")

⚠️ Potential issue | 🔴 Critical

Critical typo: tarfile.ReadErro → tarfile.ReadError — will raise NameError at runtime.

Line 86 has a typo that turns this error-handling path into an unhandled NameError. Additionally, both except blocks should chain the original exception using from.

🐛 Proposed fix
             except tarfile.ReadError:
-                raise tarfile.ReadErro(
+                raise tarfile.ReadError(
                     "Error: Could not read the tar file. It might be corrupted or not a tar file."
-                )
-            except Exception as e:
-                raise Exception(f"An unexpected error occurred: {e}")
+                ) from None
+            except Exception as e:
+                raise RuntimeError(f"An unexpected error occurred: {e}") from e
🧰 Tools
🪛 Ruff (0.15.2)

[warning] 86-88: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling

(B904)


[warning] 86-88: Avoid specifying long messages outside the exception class

(TRY003)


[warning] 89-89: Do not catch blind exception: Exception

(BLE001)


[warning] 90-90: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling

(B904)


[warning] 90-90: Create your own exception

(TRY002)


[warning] 90-90: Avoid specifying long messages outside the exception class

(TRY003)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@nemo_rl/data/datasets/response_datasets/daily_omni.py` around lines 85 - 90,
Fix the typo and enable exception chaining in the tar handling block: change the
misspelled tarfile.ReadErro to tarfile.ReadError in the except block and
re-raise the ReadError with the new message using exception chaining (raise
tarfile.ReadError("...") from e). Likewise, in the generic except Exception as e
block, re-raise the new Exception with the formatted message using "from e" so
the original traceback is preserved; locate the tarfile handling where
tarfile.ReadError and the variable e are used in daily_omni.py.

Comment on lines 107 to 118
            ext = os.path.basename(metadata[tag][media_index[tag]]).split(".", 1)[1]
            if (
                raw
                and ext not in raw
                and tag not in tried_default_extensions
                and tag in multimodal_utils.DEFAULT_MEDIA_EXTENSIONS
            ):
                # try the default extension
                for ext in multimodal_utils.DEFAULT_MEDIA_EXTENSIONS[tag]:
                    if ext in raw:
                        tried_default_extensions.add(ext)
                        break
@coderabbitai coderabbitai bot Feb 23, 2026

⚠️ Potential issue | 🟠 Major

Bug: tried_default_extensions.add(ext) should be tried_default_extensions.add(tag) — the guard on line 111 checks tag, not ext.

Line 111 checks tag not in tried_default_extensions to avoid retrying default extensions for the same media type. But line 117 adds ext (the file extension string, e.g. "mp4") instead of tag (the media type, e.g. "video"). Since these are different values, the guard will never match, and default extension resolution will be retried on every occurrence.

Also, line 107 will raise IndexError if the media filename contains no . (e.g., a path or identifier with no extension).

🐛 Proposed fix
-            ext = os.path.basename(metadata[tag][media_index[tag]]).split(".", 1)[1]
+            basename = os.path.basename(metadata[tag][media_index[tag]])
+            ext = basename.split(".", 1)[1] if "." in basename else ""
             if (
                 raw
                 and ext not in raw
                 and tag not in tried_default_extensions
                 and tag in multimodal_utils.DEFAULT_MEDIA_EXTENSIONS
             ):
                 # try the default extension
                 for ext in multimodal_utils.DEFAULT_MEDIA_EXTENSIONS[tag]:
                     if ext in raw:
-                        tried_default_extensions.add(ext)
+                        tried_default_extensions.add(tag)
                         break
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@nemo_rl/data/datasets/response_datasets/general_conversations_dataset.py`
around lines 107 - 118, The code is adding the file extension string to
tried_default_extensions and also assumes filenames always contain a '.',
causing an IndexError; update the logic in general_conversations_dataset.py
where ext is computed from metadata[ tag ][ media_index[tag] ] to safely extract
the extension (use os.path.splitext or check for '.' and handle missing
extension by setting ext to an empty string) and change the set insertion to
tried_default_extensions.add(tag) (not ext) so the guard `tag not in
tried_default_extensions` works as intended; keep the loop over
multimodal_utils.DEFAULT_MEDIA_EXTENSIONS[tag] unchanged but ensure you only add
the media type tag to tried_default_extensions when you decide to skip trying
defaults.

Author

This is a false alarm. add(ext) is correct since we need to check the extension tried, not tag.

Contributor

@yuanhangsu1986, thank you for the clarification. I understand you want to track extensions rather than tags. However, I notice there's a mismatch in the current implementation:

  • Line 111 checks: tag not in tried_default_extensions (checking if the media type tag like "video" is in the set)
  • Line 117 adds: tried_default_extensions.add(ext) (adding the extension string like "mp4")

Since tag (e.g., "video") and ext (e.g., "mp4") are different values, the guard on line 111 will always pass even after adding ext to the set, which means default extensions will be retried on every occurrence of the same tag.

If the intention is to track which extensions have been tried (as you mentioned), would it make sense to also update line 111 to check ext not in tried_default_extensions instead of tag not in tried_default_extensions? This would make the guard consistent with what's being added to the set.

Let me know if there's a specific behavior I'm missing!

Comment on lines 148 to 153
class GeneralConversationsJsonlDataset(RawDataset):
    """Loads general conversation datasets that have the json (manifest) files and media files in separate files (jsonl datasets).

    Each sample can be single/multi-turn converstaions with multiple modalities.
    Each modality can have one or more number of media objects.
    There is no requiement of where the media tag (e.g. '<sound>') should appear in the conversations.
⚠️ Potential issue | 🟡 Minor

Docstring typos: "converstaions" → "conversations", "requiement" → "requirement".

Proposed fix
-    Each sample can be single/multi-turn converstaions with multiple modalities.
+    Each sample can be single/multi-turn conversations with multiple modalities.
     Each modality can have one or more number of media objects.
-    There is no requiement of where the media tag (e.g. '<sound>') should appear in the conversations.
+    There is no requirement of where the media tag (e.g. '<sound>') should appear in the conversations.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@nemo_rl/data/datasets/response_datasets/general_conversations_dataset.py`
around lines 148 - 153, Fix the typos in the class docstring for
GeneralConversationsJsonlDataset (subclass of RawDataset): change
"converstaions" to "conversations" and "requiement" to "requirement" so the
documentation reads correctly about jsonl datasets and media tag placement.

Comment on lines 218 to 234
    @classmethod
    def process_message_fragment(
        cls, tag: str, fragment: Any, media_directory: Optional[str] = None
    ) -> dict[str, Any]:
        if (
            media_directory is not None
            and tag in multimodal_utils.MEDIA_TAGS
            and isinstance(fragment, str)
            and not os.path.isfile(fragment)
        ):
            media_path = os.path.join(media_directory, fragment)
            if os.path.isfile(media_path):
                fragment = media_path
        ret = []
        for t in tag.split("-"):
            ret.append({"type": t, t: fragment})
        return ret

⚠️ Potential issue | 🟡 Minor

Return type annotation is incorrect: returns list[dict[str, Any]], not dict[str, Any].

process_message_fragment builds and returns ret (a list), but the annotation says -> dict[str, Any].

Proposed fix
     @classmethod
     def process_message_fragment(
         cls, tag: str, fragment: Any, media_directory: Optional[str] = None
-    ) -> dict[str, Any]:
+    ) -> list[dict[str, Any]]:
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@nemo_rl/data/datasets/response_datasets/general_conversations_dataset.py`
around lines 218 - 234, The return type annotation for process_message_fragment
is wrong: the function builds and returns a list (ret) of dicts, not a single
dict. Update the signature of process_message_fragment to return list[dict[str,
Any]] (or Sequence[Mapping[str, Any]] if you prefer an abstract type) and ensure
any callers expecting dict are adjusted; reference the function name
process_message_fragment and the local variable ret and the loop over
tag.split("-") to locate the code to change.

Comment on lines +18 to 24
from pathlib import Path
from typing import Optional, Union

import torch
from datasets import DatasetDict, load_dataset, load_from_disk
from huggingface_hub.utils._cache_manager import _scan_cached_repo
from PIL import Image

⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

# First, locate and read the file in question
find . -type f -name "utils.py" -path "*/nemo_rl/data/datasets/*" | head -5

Repository: NVIDIA-NeMo/RL

Length of output: 90


🏁 Script executed:

# Read the utils.py file to see the imports and get_huggingface_cache_path function
cat -n nemo_rl/data/datasets/utils.py | sed -n '1,50p'

Repository: NVIDIA-NeMo/RL

Length of output: 2233


🏁 Script executed:

# Read the function in question (lines 148-176)
cat -n nemo_rl/data/datasets/utils.py | sed -n '140,180p'

Repository: NVIDIA-NeMo/RL

Length of output: 1918


🌐 Web query:

huggingface_hub public API cache inspection alternatives to _scan_cached_repo

💡 Result:

_scan_cached_repo is a private helper. The public, supported ways to inspect the Hugging Face Hub cache are:

1) Scan the whole cache (recommended): scan_cache_dir()

Use the public API that returns a structured HFCacheInfo report (repos, revisions, sizes, timestamps, paths). (huggingface.co)

from huggingface_hub import scan_cache_dir

info = scan_cache_dir()          # or scan_cache_dir(cache_dir="...")

# list cached repos
for repo in sorted(info.repos, key=lambda r: r.size_on_disk, reverse=True):
    print(repo.repo_id, repo.repo_type, repo.size_on_disk, repo.repo_path)

# find one repo
repo = next(r for r in info.repos if r.repo_id == "google/fleurs" and r.repo_type == "dataset")
print(repo.revisions)  # commit hashes + snapshot paths + files

2) Check whether a specific file is already cached: try_to_load_from_cache()

If you don’t need a full scan, this is the public “is it cached, and where?” helper. (huggingface.co)

from huggingface_hub import try_to_load_from_cache

path = try_to_load_from_cache("bert-base-uncased", "config.json")
print(path)  # local path or None (or special sentinel)

3) CLI inspection (same cache system)

  • huggingface-cli scan-cache prints a cache report (repo id/type, disk usage, refs, local path). (huggingface.co)
  • Newer CLI also documents hf cache commands (e.g., list/remove) depending on your installed CLI version. (hugging-face.cn)

4) If you also want cleanup after inspection

scan_cache_dir() returns an HFCacheInfo that can prepare a deletion plan via delete_revisions(...) (dry-run style via the returned strategy, then execute()). (huggingface.co)


🌐 Web query:

huggingface_hub CachedRepoInfo cache directory public API

💡 Result:

In huggingface_hub (Python), CachedRepoInfo is part of the public cache-scanning API:

  • Use huggingface_hub.scan_cache_dir(cache_dir=...) to scan a cache root and get an HFCacheInfo, whose .repos contains CachedRepoInfo objects (with fields like repo_path, size_on_disk, revisions, etc.). [1]

To get / control the Hub cache directory (the “cache root” that scan_cache_dir() defaults to):

  • Configure via environment variables:

    • HF_HOME (root; default ~/.cache/huggingface)
    • HF_HUB_CACHE (Hub repos cache; default $HF_HOME/hub, i.e. typically ~/.cache/huggingface/hub) [2]
  • Programmatic (public) access to the resolved default path:

    • from huggingface_hub.constants import HF_HUB_CACHE (prints the effective Hub cache dir, honoring env overrides). [3]

Example

from huggingface_hub import scan_cache_dir
from huggingface_hub.constants import HF_HUB_CACHE

info = scan_cache_dir()              # defaults to HF_HUB_CACHE
print(HF_HUB_CACHE)                  # cache root (e.g. ~/.cache/huggingface/hub)

for repo in info.repos:
    print(repo.repo_id, repo.repo_path)

Sources: [1] [2] [3]


Add docstring, use public HF API, and narrow exception handling in get_huggingface_cache_path.

This function lacks a Google-style docstring and uses the private _scan_cached_repo API. Replace it with the public scan_cache_dir() from huggingface_hub, which is the supported, documented way to inspect the local cache. Additionally, the broad Exception catch on line 174 masks real errors; narrow it to specific exceptions (e.g., OSError, ValueError). Also add type hints to match Python 3.12+ best practices, and guard against max() on an empty sequence if the revs dict is empty after filtering.

Refactor sketch
+from huggingface_hub import scan_cache_dir
+
-def get_huggingface_cache_path(repo_id, branch="main", repo_type="datasets"):
+def get_huggingface_cache_path(
+    repo_id: str, branch: str | None = "main", repo_type: str = "datasets"
+) -> str | None:
+    """Return the latest cached snapshot path for a HF repo if present.
+
+    Args:
+        repo_id: Repository ID in "org/name" form.
+        branch: Optional branch/tag name to filter revisions.
+        repo_type: Repository type ("datasets", "models", etc.).
+
+    Returns:
+        Path to the snapshot directory if cached, else None.
+    """
     cache_path = None
     try:
-        # ... existing cache detection logic ...
-        hf_cache_info = _scan_cached_repo(repo_path=repo_path)
+        hf_cache_info = scan_cache_dir()
+        # Note: scan_cache_dir() reports repo_type in singular form
+        # ("dataset"/"model"), unlike the pluralized "datasets" default here.
+        repo = next((r for r in hf_cache_info.repos
+                     if r.repo_id == repo_id
+                     and r.repo_type == repo_type.rstrip("s")), None)
+        if not repo:
+            return None
-        revs = {r.refs: r for r in hf_cache_info.revisions}
+        revs = {r.refs: r for r in repo.revisions}
         if branch is not None:
             revs = {refs: r for refs, r in revs.items() if branch in refs}
+        if not revs:
+            return None
         rev2keep = max(revs.values(), key=lambda r: r.last_modified)
         return str(rev2keep.snapshot_path)
-    except Exception as e:
+    except (OSError, ValueError) as e:
         print(f"{type(e)}: {e}")
         return None
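For reference, the lookup-and-guard logic above can be exercised without touching a real cache by duck-typing the objects that scan_cache_dir() returns. A minimal, hypothetical sketch (the Fake* classes and latest_snapshot_path name are illustrative stand-ins, not part of huggingface_hub):

```python
from dataclasses import dataclass, field
from pathlib import Path

# Hypothetical stand-ins mirroring the shape of huggingface_hub's
# CachedRevisionInfo / CachedRepoInfo; real code would iterate
# scan_cache_dir().repos instead.
@dataclass(frozen=True)
class FakeRevision:
    refs: frozenset          # branch/tag names pointing at this revision
    last_modified: float     # unix timestamp
    snapshot_path: Path

@dataclass
class FakeRepo:
    repo_id: str
    repo_type: str           # scan_cache_dir() uses singular: "dataset"/"model"
    revisions: list = field(default_factory=list)

def latest_snapshot_path(repos, repo_id, repo_type="dataset", branch="main"):
    """Return the newest cached snapshot path for repo_id, else None."""
    repo = next(
        (r for r in repos if r.repo_id == repo_id and r.repo_type == repo_type),
        None,
    )
    if repo is None:
        return None
    revs = [
        rev for rev in repo.revisions
        if branch is None or any(branch in ref for ref in rev.refs)
    ]
    if not revs:  # guard: max() on an empty sequence raises ValueError
        return None
    return str(max(revs, key=lambda rev: rev.last_modified).snapshot_path)

repos = [
    FakeRepo(
        "org/demo-data",
        "dataset",
        [
            FakeRevision(frozenset({"main"}), 100.0, Path("/cache/snap-old")),
            FakeRevision(frozenset({"main"}), 200.0, Path("/cache/snap-new")),
            FakeRevision(frozenset({"dev"}), 300.0, Path("/cache/snap-dev")),
        ],
    )
]
print(latest_snapshot_path(repos, "org/demo-data"))  # /cache/snap-new
print(latest_snapshot_path(repos, "org/missing"))    # None
```

The empty-revisions guard is the part the original function was missing: filtering by branch can leave nothing to pass to max().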
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@nemo_rl/data/datasets/utils.py` around lines 18 - 24, The function
get_huggingface_cache_path should be updated to use the public API and improve
typing and error handling: replace the private _scan_cached_repo import/use with
huggingface_hub.scan_cache_dir (update the import), add a Google-style docstring
describing parameters/return and behavior, add precise type hints for parameters
and return (Python 3.12+ style), narrow the broad except Exception to specific
exceptions such as OSError and ValueError, and guard against calling max() on an
empty revs mapping by returning an appropriate fallback or raising a clear
ValueError; locate these changes around the get_huggingface_cache_path function
and any imports referencing _scan_cached_repo.

Comment on lines 344 to 347
assert (
    "audio" in multimodal_load_kwargs
    and "sampling_rate" in multimodal_load_kwargs["audio"]
)
Contributor

⚠️ Potential issue | 🟠 Major

assert used for runtime input validation — will silently pass under python -O.

Assertions are stripped when Python runs with optimizations enabled (-O / -OO). If sampling_rate is genuinely required, raise a ValueError instead.

Proposed fix
-                assert (
-                    "audio" in multimodal_load_kwargs
-                    and "sampling_rate" in multimodal_load_kwargs["audio"]
-                )
+                if (
+                    "audio" not in multimodal_load_kwargs
+                    or "sampling_rate" not in multimodal_load_kwargs.get("audio", {})
+                ):
+                    raise ValueError(
+                        "multimodal_load_kwargs must include 'audio' with a 'sampling_rate' "
+                        "key to load audio from file path."
+                    )
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
assert (
    "audio" in multimodal_load_kwargs
    and "sampling_rate" in multimodal_load_kwargs["audio"]
)
if (
    "audio" not in multimodal_load_kwargs
    or "sampling_rate" not in multimodal_load_kwargs.get("audio", {})
):
    raise ValueError(
        "multimodal_load_kwargs must include 'audio' with a 'sampling_rate' "
        "key to load audio from file path."
    )
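The validate-then-raise pattern can also be factored into a small helper so the check survives python -O and stays testable. A sketch (the require_audio_sampling_rate name is illustrative, not from the PR):

```python
def require_audio_sampling_rate(multimodal_load_kwargs: dict) -> int:
    """Validate nested audio kwargs with a real exception instead of assert.

    Unlike an assert statement, this check still runs when Python is
    started with -O / -OO.
    """
    audio_kwargs = multimodal_load_kwargs.get("audio")
    if not isinstance(audio_kwargs, dict) or "sampling_rate" not in audio_kwargs:
        raise ValueError(
            "multimodal_load_kwargs must include 'audio' with a 'sampling_rate' "
            "key to load audio from a file path."
        )
    return audio_kwargs["sampling_rate"]

print(require_audio_sampling_rate({"audio": {"sampling_rate": 16000}}))  # 16000
```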
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@nemo_rl/data/multimodal_utils.py` around lines 344 - 347, Replace the runtime
assertion that checks for "audio" and "sampling_rate" in multimodal_load_kwargs
with explicit validation and a clear exception: verify that "audio" is a key in
multimodal_load_kwargs and that "sampling_rate" is present in
multimodal_load_kwargs["audio"], and if either check fails raise a ValueError
with a descriptive message (e.g., referencing multimodal_load_kwargs, "audio",
and "sampling_rate") instead of using assert so the check cannot be stripped
under python -O; update the location where this change is made around the
multimodal_load_kwargs handling in multimodal_utils.py (the block containing the
current assert).

Comment on lines 348 to 364
try:
    loaded_media["audio"].append(
        load_audio(aud, **multimodal_load_kwargs["audio"])
    )
except Exception as e:
    print("audio loading failed. Fall back to decord.")
    # use decord
    loaded_audio = decord.AudioReader(
        aud,
        sample_rate=multimodal_load_kwargs["audio"]["sampling_rate"],
        mono=True,
    )
    loaded_media["audio"].append(
        loaded_audio[:].asnumpy()[
            get_dim_to_pack_along(processor, "audio")
        ]
    )
Contributor

⚠️ Potential issue | 🟠 Major

Bare Exception catch, unused variable, and print instead of logging.

Three issues in this block, two confirmed by static analysis (Ruff BLE001, F841):

  1. Bare Exception — in Python 3, KeyboardInterrupt and SystemExit derive from BaseException rather than Exception, so they still propagate, but every other error (typos, attribute errors, bad kwargs) is swallowed by this handler and masks real bugs. Narrow to expected failure types (e.g., RuntimeError, FileNotFoundError, OSError).
  2. Unused e — either log it or drop as e.
  3. print(...) in library code — use logging.warning(...) (or logger.warning) so callers can control log levels.

Additionally, on line 362, get_dim_to_pack_along(processor, "audio") is called, but processor can be None (parameter default on line 323). While get_dim_to_pack_along would not crash on None (it returns 0), it's semantically unclear — consider guarding or documenting.

Proposed fix
+import logging
+
+logger = logging.getLogger(__name__)
+
 ...
-                except Exception as e:
-                    print("audio loading failed. Fall back to decord.")
+                except (RuntimeError, FileNotFoundError, OSError) as e:
+                    logger.warning("Audio loading via transformers failed (%s). Falling back to decord.", e)
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
try:
    loaded_media["audio"].append(
        load_audio(aud, **multimodal_load_kwargs["audio"])
    )
except Exception as e:
    print("audio loading failed. Fall back to decord.")
    # use decord
    loaded_audio = decord.AudioReader(
        aud,
        sample_rate=multimodal_load_kwargs["audio"]["sampling_rate"],
        mono=True,
    )
    loaded_media["audio"].append(
        loaded_audio[:].asnumpy()[
            get_dim_to_pack_along(processor, "audio")
        ]
    )
try:
    loaded_media["audio"].append(
        load_audio(aud, **multimodal_load_kwargs["audio"])
    )
except (RuntimeError, FileNotFoundError, OSError) as e:
    logger.warning("Audio loading via transformers failed (%s). Falling back to decord.", e)
    # use decord
    loaded_audio = decord.AudioReader(
        aud,
        sample_rate=multimodal_load_kwargs["audio"]["sampling_rate"],
        mono=True,
    )
    loaded_media["audio"].append(
        loaded_audio[:].asnumpy()[
            get_dim_to_pack_along(processor, "audio")
        ]
    )
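The try-primary-then-fallback pattern generalizes; a minimal, self-contained sketch with a module-level logger and a narrowed except clause (load_with_fallback and the toy loaders are hypothetical, not NeMo-RL or decord APIs):

```python
import logging

logger = logging.getLogger(__name__)

def load_with_fallback(path, primary, fallback):
    """Try the primary loader; on expected I/O errors, warn and fall back.

    Only anticipated failure modes are caught, so genuine bugs
    (e.g., a TypeError from a bad call) still surface to the caller.
    """
    try:
        return primary(path)
    except (RuntimeError, FileNotFoundError, OSError) as e:
        logger.warning(
            "Primary audio load failed for %s (%s); falling back.", path, e
        )
        return fallback(path)

# Toy loaders standing in for transformers' load_audio and decord.AudioReader.
def flaky_loader(path):
    raise FileNotFoundError(path)

def decord_like_loader(path):
    return [0.0, 0.1]  # pretend waveform

print(load_with_fallback("clip.wav", flaky_loader, decord_like_loader))  # [0.0, 0.1]
```

Using logger.warning instead of print lets downstream callers route or silence the message through standard logging configuration.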
🧰 Tools
🪛 Ruff (0.15.2)

[warning] 352-352: Do not catch blind exception: Exception

(BLE001)


[error] 352-352: Local variable e is assigned to but never used

Remove assignment to unused variable e

(F841)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@nemo_rl/data/multimodal_utils.py` around lines 348 - 364, Narrow the broad
except in the load_audio fallback: catch only expected errors from load_audio
(e.g., RuntimeError, FileNotFoundError, ValueError) instead of bare Exception;
either log the caught error (use the module logger via logger.warning with the
exception info) or drop the unused "as e" binding; replace the print("audio
loading failed...") with logger.warning including the error details and context
(e.g., which file/path and that we are falling back to decord). Also ensure the
call to get_dim_to_pack_along(processor, "audio") is safe when processor is None
by adding an explicit guard or documenting that processor may be None (or
defaulting to 0) so the fallback code's slicing behavior remains clear.

@chtruong814 chtruong814 removed the needs-follow-up Issue needs follow-up label Feb 24, 2026
@yuanhangsu1986 yuanhangsu1986 changed the title Yuanhangs dev feat: Omni dataloader for HF models Feb 24, 2026
@yuanhangsu1986 yuanhangsu1986 added CI:L1 Run doctests, unit tests, and functional tests and removed CI:L1 Run doctests, unit tests, and functional tests labels Feb 24, 2026
Signed-off-by: Yuki Huang <yukih@nvidia.com>
Signed-off-by: Yuki Huang <yukih@nvidia.com>
@github-actions github-actions bot removed the CI Relating to CI label Feb 24, 2026
@yuki-97 yuki-97 added CI:L1 Run doctests, unit tests, and functional tests and removed CI:L1 Run doctests, unit tests, and functional tests labels Feb 24, 2026

Labels

CI:L1 Run doctests, unit tests, and functional tests

5 participants