
feat: Omni dataloader for HF models #2016

Open
yuanhangsu1986 wants to merge 2 commits into main from yuanhangs_dev

Conversation


@yuanhangsu1986 yuanhangsu1986 commented Feb 23, 2026

What does this PR do?

Add video and audio dataloading support for HF models.

Usage

export NRL_FORCE_REBUILD_VENVS=true
uv venv
uv run python ./examples/run_vlm_sft.py cluster.gpus_per_node=8 --config ./examples/configs/sft_avlm.yaml

Additional Information

Major changes:

  • nemo_rl/data/datasets/response_datasets/general_conversations_dataset.py: base class GeneralConversationsJsonlDataset for loading the conversation-based data structure described in the file's comments.
  • nemo_rl/data/datasets/response_datasets/daily_omni.py: base class and functions for the Daily Omni dataset (a public video benchmarking dataset).
  • nemo_rl/data/datasets/processed_dataset.py: adds preprocessor logic to AllTaskProcessedDataset, which converts each dataset from its raw structure to the OpenAI-compatible message format. Putting the logic here avoids repeating the preprocessing in every downstream processor.
  • nemo_rl/data/datasets/raw_dataset.py: adds a preprocessor attribute to RawDataset, so each dataset can optionally define its own preprocessor. If left undefined, no preprocessing is applied by default.
  • nemo_rl/data/multimodal_utils.py: functions for loading omni data.
  • nemo_rl/data/llm_message_utils.py: loading omni data for HF-based LLM models.
  • examples/configs/sft_avlm.yaml: example config for loading a video conversation-based dataset for Qwen3-VL with 16 frames per video.
  • nemo_rl/algorithms/utils.py: reads audio and video configs from the tokenizer config in the config YAML file.
  • tests/unit/data/datasets/test_response_dataset.py: unit tests for the Daily Omni dataset.
  • tests/unit/data/datasets/test_general_conversations_dataset.py: unit tests for GeneralConversationsJsonlDataset.
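As an illustration of the preprocessing step described above, a minimal sketch of a per-task preprocessor is shown below. The record fields and the function body are assumptions for illustration only, not the actual Daily Omni schema or the PR's implementation:

```python
from typing import Any, Protocol


class TaskDataPreProcessFnCallable(Protocol):
    """Sketch of the per-task preprocessor interface (signature assumed)."""

    def __call__(self, raw_datum: dict[str, Any]) -> dict[str, Any]: ...


def example_video_qa_preprocessor(raw_datum: dict[str, Any]) -> dict[str, Any]:
    # Hypothetical: map a raw {"video", "question", "answer"} record into an
    # OpenAI-compatible message list before task-specific processing runs.
    return {
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "video", "video": raw_datum["video"]},
                    {"type": "text", "text": raw_datum["question"]},
                ],
            },
            {"role": "assistant", "content": raw_datum["answer"]},
        ]
    }


sample = {"video": "clip_001.mp4", "question": "What happens?", "answer": "A dog barks."}
print(example_video_qa_preprocessor(sample)["messages"][0]["role"])  # → user
```

Because the conversion happens once in AllTaskProcessedDataset, downstream processors only ever see the OpenAI-compatible message form.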

This is a resubmission of the following PR:
#1639

Dependencies

  • Added decord library for video processing support.

Summary by CodeRabbit

Release Notes

  • New Features

    • Added support for Daily-Omni and General Conversations JSONL datasets
    • Enhanced multimodal data processing for audio, video, and image content
    • New supervised fine-tuning configuration example for audio-visual models
    • Introduced per-task data preprocessing pipeline support
  • Tests

    • Added comprehensive unit tests for multimodal dataset handling
  • Chores

    • Added decord library dependency
    • Updated Docker build configuration

@yuanhangsu1986 yuanhangsu1986 requested review from a team as code owners February 23, 2026 23:18
@github-actions github-actions bot added the CI Relating to CI label Feb 23, 2026
@yuanhangsu1986 yuanhangsu1986 added CI:L1 Run doctests, unit tests, and functional tests community-request labels Feb 23, 2026
@yuanhangsu1986 yuanhangsu1986 added the needs-follow-up Issue needs follow-up label Feb 23, 2026
@coderabbitai
Copy link
Contributor

coderabbitai bot commented Feb 23, 2026

📝 Walkthrough


This PR introduces per-task preprocessing support and multimodal dataset infrastructure to NeMo RL. It adds new dataset classes (DailyOmniDataset, GeneralConversationsJsonlDataset), multimodal utilities for media loading, a TaskDataPreProcessFnCallable interface, and propagates preprocessor mappings through the training and validation pipelines. Includes Docker build optimization and comprehensive test coverage.

Changes

Cohort / File(s) Summary
Multimodal Preprocessing Infrastructure
nemo_rl/data/interfaces.py, nemo_rl/data/multimodal_utils.py
Introduces TaskDataPreProcessFnCallable protocol and comprehensive multimodal utilities including media tagging constants, functions to extract/load media from messages (images, audio, video), and default settings extraction from processors.
Dataset Base Classes & Dataset Updates
nemo_rl/data/datasets/raw_dataset.py, nemo_rl/data/datasets/processed_dataset.py, nemo_rl/data/datasets/utils.py
Adds preprocessor attribute to RawDataset, extends AllTaskProcessedDataset to accept and apply per-task preprocessors, and updates dataset setup utilities to track and propagate preprocessor mappings during training/validation.
New Dataset Implementations
nemo_rl/data/datasets/response_datasets/daily_omni.py, nemo_rl/data/datasets/response_datasets/general_conversations_dataset.py, nemo_rl/data/datasets/response_datasets/__init__.py
Adds DailyOmniDataset for video QA tasks and GeneralConversationsJsonlDataset for multimodal conversations with message processing, metadata conversion, and media extraction utilities.
Message and Data Processing
nemo_rl/data/llm_message_utils.py
Replaces image-specific processing with generalized media handling using multimodal utils; removes get_images_from_message and updates to load/process any media type dynamically.
Training Pipeline Integration
examples/run_sft.py, nemo_rl/algorithms/utils.py
Extends SFT training to pass task_data_preprocessors to datasets; adds runtime overrides for audio/video processor configurations in tokenizer initialization.
Configuration & Dependencies
examples/configs/sft_avlm.yaml, nemo_rl/models/policy/__init__.py, pyproject.toml, .github/workflows/cicd-main.yml
Adds audio/video fields to TokenizerConfig, adds decord dependency, introduces AVLM config example, and optimizes Docker build with SKIP_SGLANG_BUILD flag.
Test Coverage
tests/unit/data/datasets/test_general_conversations_dataset.py, tests/unit/data/datasets/test_response_dataset.py
Adds comprehensive unit tests for general-conversation-jsonl preprocessing and DailyOmniDataset dataset loading with multimodal validation.

Sequence Diagram

sequenceDiagram
    participant DataLoader as Data Loader
    participant RawDataset as RawDataset<br/>(+ preprocessor)
    participant Preprocessor as Task Preprocessor
    participant AllTaskProcessedDataset as AllTaskProcessedDataset<br/>(+ task_data_preprocessors)
    participant Processor as Task Processor
    participant MessageUtils as Message Utils<br/>(+ Media Loading)

    DataLoader->>RawDataset: Load dataset<br/>(e.g., DailyOmni)
    RawDataset-->>DataLoader: Return data + preprocessor
    DataLoader->>AllTaskProcessedDataset: Pass task_data_preprocessors<br/>mapping
    
    loop For each data sample
        AllTaskProcessedDataset->>Preprocessor: __call__(raw_datum)
        Preprocessor-->>AllTaskProcessedDataset: preprocessed_datum
        AllTaskProcessedDataset->>Processor: __call__(preprocessed_datum)
        Processor->>MessageUtils: load_media_from_message()
        MessageUtils-->>Processor: extracted media dict<br/>(images, audio, video)
        Processor-->>AllTaskProcessedDataset: formatted output
        AllTaskProcessedDataset-->>DataLoader: processed item
    end
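In plain Python, the per-sample dispatch in the diagram amounts to the following hedged sketch. Names follow the diagram; the actual signatures in the PR may differ:

```python
from typing import Any, Callable

Processor = Callable[[dict[str, Any]], dict[str, Any]]


def process_sample(
    raw_datum: dict[str, Any],
    task_name: str,
    task_data_preprocessors: dict[str, Processor],
    task_data_processors: dict[str, Processor],
) -> dict[str, Any]:
    # Optional per-task preprocessing: tasks without a registered
    # preprocessor pass through unchanged (the default behavior).
    pre = task_data_preprocessors.get(task_name)
    datum = pre(raw_datum) if pre is not None else raw_datum
    # Task-specific processing (tokenization, media loading, ...).
    return task_data_processors[task_name](datum)


# Toy usage: one task with a preprocessor, one without.
preprocessors = {"daily_omni": lambda d: {**d, "normalized": True}}
processors = {"daily_omni": lambda d: d, "plain": lambda d: d}
print(process_sample({"x": 1}, "daily_omni", preprocessors, processors))  # → {'x': 1, 'normalized': True}
print(process_sample({"x": 1}, "plain", preprocessors, processors))  # → {'x': 1}
```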

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related PRs

  • PR #977: Introduces the AllTaskProcessedDataset refactor and base dataset infrastructure that this PR extends with per-task preprocessing hooks and new dataset implementations.
  • PR #1807: Modifies setup_response_data in nemo_rl/data/utils.py; this PR further extends it to propagate per-task preprocessor mappings through the training pipeline.
  • PR #1649: Involves RawDataset and response-dataset pipeline modifications; this PR adds complementary dataset implementations and preprocessing API enhancements to the same infrastructure.

Suggested labels

ci/cd, data, multimodal, datasets, preprocessing

Suggested reviewers

  • yuki-97
  • terrykong
  • odelalleau
🚥 Pre-merge checks | ✅ 2 | ❌ 2

❌ Failed checks (2 warnings)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 53.33% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
Test Results For Major Changes ⚠️ Warning PR includes major changes (new dataset classes, multimodal utilities, preprocessor infrastructure) but PR description lacks test results or testing information; tests exist but no pass/fail status or validation results documented. Update PR description to include test execution status (e.g., 'all unit tests pass'), specific test scenarios run, and any performance validation results to confirm no regression.
✅ Passed checks (2 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title 'feat: Omni dataloader for HF models' accurately captures the main change: adding video/audio dataloading support for Hugging Face models via the Omni dataset infrastructure.




@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 9

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (4)
nemo_rl/models/policy/__init__.py (1)

201-208: ⚠️ Potential issue | 🟡 Minor

Document new audio/video config keys in TokenizerConfig.

Please add a Google-style class docstring (or expand an existing one) to document purpose, valid values/types, and recommended defaults for audio and video, and ensure exemplar YAMLs reflect those defaults.

As per coding guidelines "Use Google style docstrings for classes and functions" and "When adding a new config key to a TypedDict subclass, document the key's purpose, valid values/types, recommended default, and reflect the default in exemplar YAMLs under examples/configs/*.yaml".

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@nemo_rl/models/policy/__init__.py` around lines 201 - 208, Add a Google-style
class docstring to the TokenizerConfig TypedDict describing the purpose of the
TypedDict and documenting the new audio and video keys: state that audio and
video are optional multimodal config dicts, list valid value types (dict[str,
Any] or None), describe recommended defaults (e.g., audio: {} or None to
disable, video: {} or None to disable, and any recommended subkeys like
sample_rate, channels for audio or frame_rate, resolution for video), and
briefly mention chat_template_kwargs usage; then update the exemplar YAML files
under examples/configs/*.yaml to include the audio and video keys with the
recommended default values so the examples reflect the docstring defaults.
nemo_rl/data/utils.py (1)

221-248: ⚠️ Potential issue | 🟠 Major

Reset val_task_data_preprocessors per validation dataset.

In the val_data_paths loop, val_task_data_preprocessors is reused across iterations. If a later dataset has no preprocessor, the previous one leaks into its AllTaskProcessedDataset. Reinitialize per iteration.

Suggested fix
-    val_task_data_preprocessors = {}
     if "val_data_paths" in data_config and data_config["val_data_paths"]:
         ...
         for val_dataset_name, val_dataset_path in val_data_paths.items():
             ...
+            val_task_data_preprocessors = {}
             if hasattr(val_data, "preprocessor") and val_data.preprocessor is not None:
-                val_task_data_preprocessors = {
-                    val_data.task_name: val_data.preprocessor
-                }
+                val_task_data_preprocessors[val_data.task_name] = val_data.preprocessor
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@nemo_rl/data/utils.py` around lines 221 - 248, The bug is that
val_task_data_preprocessors is declared once outside the val_data_paths loop and
may leak preprocessors across datasets; fix it by moving/reinitializing
val_task_data_preprocessors inside the for val_dataset_name, val_dataset_path in
val_data_paths.items() loop (set to {} at start of each iteration), then
populate it only when hasattr(val_data, "preprocessor") and
val_data.preprocessor is not None before passing it to AllTaskProcessedDataset;
ensure val_task_data_processors is also built per-iteration using
load_preference_dataset and val_data.task_name so each AllTaskProcessedDataset
gets only its own processors and preprocessors.
pyproject.toml (1)

18-53: ⚠️ Potential issue | 🔴 Critical

Pin decord to a specific version, with awareness of Python 3.12 compatibility.

The latest stable decord release is 0.6.0 (June 2021), which lacks pre-built wheels for Python 3.12. Installing decord on Python 3.12 will require building from source, which may fail depending on your FFmpeg setup and build toolchain. Pin decord==0.6.0 explicitly, and document any required build steps for Python 3.12, or verify that source builds work reliably in your environment before merging.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pyproject.toml` around lines 18 - 53, Pin the decord dependency to a specific
version by changing the "decord" entry to "decord==0.6.0" and add a short
comment noting Python 3.12 has no prebuilt wheels (so building from source may
require FFmpeg and a proper toolchain), or alternatively verify/source-build in
CI and document required build steps for Python 3.12 to avoid breakage; update
the dependency line for "decord" and include the explanatory comment nearby.
nemo_rl/data/multimodal_utils.py (1)

1-1: ⚠️ Potential issue | 🟡 Minor

Copyright year is 2025; current year is 2026.

The file has substantial new code. Consider updating the copyright header to 2026.

As per coding guidelines, "Add the NVIDIA copyright header (with current year) to all Python files and shell scripts, excluding tests."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@nemo_rl/data/multimodal_utils.py` at line 1, Update the copyright header at
the top of the file to use the current year 2026: replace the 2025 year in the
existing NVIDIA copyright header comment (the file-level header at the very top
of nemo_rl/data/multimodal_utils.py) so the header reads 2026 instead of 2025,
ensuring it matches the project's required header format for Python files.
🧹 Nitpick comments (12)
nemo_rl/data/datasets/raw_dataset.py (1)

26-35: Initialize preprocessor in the constructor to satisfy class-member init guideline.

RawDataset exposes preprocessor publicly; per guidelines, it should be initialized in __init__ alongside other members. Consider adding an explicit constructor (or a dataclass) to set these fields.

As per coding guidelines "Initialize all externally visible members of a class in the constructor".

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@nemo_rl/data/datasets/raw_dataset.py` around lines 26 - 35, Add an explicit
constructor for class RawDataset that initializes all externally visible members
(data_config, dataset, val_dataset, processor, task_spec) and sets preprocessor
to None (or a provided value) so preprocessor is not left uninitialized;
implement __init__ on RawDataset to accept and assign these fields (or convert
the class to a dataclass with defaults) to satisfy the "initialize members in
constructor" guideline.
nemo_rl/algorithms/utils.py (2)

323-355: Add stacklevel=2 to warnings.warn calls so warnings point to the caller.

All three warnings.warn calls (lines 330, 341, 352) default to stacklevel=1, which causes the warning to reference this utility function rather than the calling code that configured the override.

Proposed fix (example for line 330; apply similarly to lines 341 and 352)
                 warnings.warn(
-                    f"Overriding audio sampling rate from {processor.feature_extractor.sampling_rate} to {new_sampling_rate}"
+                    f"Overriding audio sampling rate from {processor.feature_extractor.sampling_rate} to {new_sampling_rate}",
+                    stacklevel=2,
                 )
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@nemo_rl/algorithms/utils.py` around lines 323 - 355, The three warnings.warn
calls in the function that adjusts processor feature/video settings (the calls
that override processor.feature_extractor.sampling_rate,
processor.video_processor.fps, and processor.video_processor.num_frames) should
include stacklevel=2 so the warning points at the caller; update each
warnings.warn invocation (the ones emitting "Overriding audio sampling rate...",
"Overriding video fps...", and "Overriding video num_frames...") to pass
stacklevel=2 while keeping the existing message and variable usage
(tokenizer_config checks and assignments to
processor.feature_extractor.sampling_rate, processor.video_processor.fps,
processor.video_processor.num_frames).

345-355: Consider validating mutual exclusivity of fps and num_frames upfront.

The comment on line 345 acknowledges the conflict but defers it. An explicit check with a clear error message here would be more user-friendly than a cryptic failure later in the video processor.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@nemo_rl/algorithms/utils.py` around lines 345 - 355, Check for mutual
exclusivity of fps and num_frames in tokenizer_config["video"] before mutating
processor.video_processor.num_frames: if both "fps" and "num_frames" are
present, validate they are not contradictory and raise a clear ValueError (or
choose one deterministically) instead of letting the video processor fail later.
Specifically, in the block handling tokenizer_config["video"] (referencing
tokenizer_config, processor.video_processor.num_frames, "fps", and
"num_frames"), add an upfront check that raises a descriptive error like "Cannot
set both fps and num_frames in tokenizer_config['video']" (or compare values and
only warn/override when consistent) before assigning
processor.video_processor.num_frames.
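The upfront check suggested in this comment could look like the following minimal sketch. The helper name validate_video_override and the config shape are assumptions, not code from the PR:

```python
from typing import Any


def validate_video_override(video_cfg: dict[str, Any]) -> None:
    # Hypothetical helper: fail fast when both frame-sampling knobs are set,
    # instead of letting the video processor fail later with a cryptic error.
    if "fps" in video_cfg and "num_frames" in video_cfg:
        raise ValueError(
            "Cannot set both 'fps' and 'num_frames' in tokenizer_config['video']; "
            "choose one frame-sampling strategy."
        )


validate_video_override({"num_frames": 16})  # OK: only one knob set
try:
    validate_video_override({"fps": 2, "num_frames": 16})
except ValueError as e:
    print("rejected:", e)
```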
nemo_rl/data/datasets/processed_dataset.py (1)

33-58: Document task_data_preprocessors in the class docstring.

The Args section (lines 36–44) documents task_data_processors and max_seq_length but omits the new task_data_preprocessors parameter. Adding a brief entry keeps the docstring consistent with the constructor signature.

📝 Proposed docstring addition
         task_data_processors: Either a single TaskDataProcessFnCallable for single-task,
             or a dict mapping task names to (TaskDataSpec, TaskDataProcessFnCallable) for multi-task
+        task_data_preprocessors: Optional preprocessing hook applied before task-specific processing.
+            Either a single TaskDataPreProcessFnCallable for all tasks,
+            or a dict mapping task names to TaskDataPreProcessFnCallable.
         max_seq_length: Maximum sequence length for tokenized outputs
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@nemo_rl/data/datasets/processed_dataset.py` around lines 33 - 58, The class
docstring for AllTaskProcessedDataset is missing documentation for the
constructor parameter task_data_preprocessors; update the Args section to add a
short entry describing task_data_preprocessors (type: Optional[Union[dict[str,
TaskDataPreProcessFnCallable], TaskDataPreProcessFnCallable]], default: None),
explaining it can be a single preprocessor applied to all examples or a dict
mapping task names to task-specific preprocessors and that missing tasks fall
back to default behavior; reference the parameter name task_data_preprocessors
and keep the wording consistent with the existing entries for
task_data_processors and default_task_data_spec.
nemo_rl/data/datasets/response_datasets/general_conversations_dataset.py (2)

38-67: convert_metadata returns None when return_inplace=True — confusing API.

When return_inplace=True, the function mutates metadata in place and returns None implicitly. When False, it returns a new dict. The parameter name is also inverted from the typical inplace convention. Consider renaming to inplace (default False) or always returning the result to avoid caller confusion.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@nemo_rl/data/datasets/response_datasets/general_conversations_dataset.py`
around lines 38 - 67, The function convert_metadata currently uses
return_inplace (inverted naming) and returns None when return_inplace=True,
causing a confusing API; rename the parameter to inplace: bool = False (or keep
name but invert semantics) and ensure convert_metadata always returns the
processed dict (variable data) even when mutating the input, so callers get the
result; update logic that chooses data = metadata or metadata.copy(), keep the
mapping loops that reference multimodal_utils.MEDIA_TAGS_TO_ALLOWED and
multimodal_utils.MEDIA_TAGS unchanged, and update any callers to the new
parameter name if renamed.

26-26: Unused _DEBUG variable — remove before merging.

_DEBUG = True is never referenced in this file.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@nemo_rl/data/datasets/response_datasets/general_conversations_dataset.py` at
line 26, The file defines an unused module-level variable named _DEBUG which is
never referenced; remove the _DEBUG = True declaration from
general_conversations_dataset (delete the unused symbol) to eliminate dead code
and avoid misleading debug flags.
nemo_rl/data/llm_message_utils.py (1)

607-613: Consider making the media-key-to-kwarg mapping explicit and extensible.

The hardcoded if/elif chain for mapping "image"→"images", "audio"→"audio", "video"→"videos" will need updating when new modalities are added. A small mapping dict would be more maintainable.

♻️ Example
-            media_kwargs = {}
-            if "image" in media_cur_message:
-                media_kwargs["images"] = media_cur_message["image"]
-            if "audio" in media_cur_message:
-                media_kwargs["audio"] = media_cur_message["audio"]
-            if "video" in media_cur_message:
-                media_kwargs["videos"] = media_cur_message["video"]
+            MEDIA_KEY_TO_KWARG = {"image": "images", "audio": "audio", "video": "videos"}
+            media_kwargs = {
+                MEDIA_KEY_TO_KWARG[k]: v
+                for k, v in media_cur_message.items()
+                if k in MEDIA_KEY_TO_KWARG
+            }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@nemo_rl/data/llm_message_utils.py` around lines 607 - 613, Replace the
hardcoded if-chain that builds media_kwargs from media_cur_message with an
explicit mapping dict (e.g., MEDIA_KEY_TO_KWARG) and iterate over its items to
populate media_kwargs; locate the code block that references media_cur_message
and media_kwargs in nemo_rl.data.llm_message_utils (the snippet that currently
checks "image", "audio", "video") and change it to consult the mapping so new
modalities can be added by updating the dict rather than editing conditional
logic.
examples/run_sft.py (1)

104-153: Validation preprocessor wiring is correct.

Both paths (split-from-train and explicit validation config) properly collect and propagate preprocessors.

Minor note: since RawDataset declares preprocessor as a class attribute (line 33 of raw_dataset.py), the hasattr checks on lines 89 and 142 are always True for RawDataset subclasses. You could simplify to just data.preprocessor is not None, but the current form is safely defensive.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@examples/run_sft.py` around lines 104 - 153, The hasattr(...) checks for
"preprocessor" are unnecessary because RawDataset declares preprocessor as a
class attribute; replace the two occurrences where you do hasattr(data,
"preprocessor") and hasattr(val_data, "preprocessor") with direct checks that
the attribute is not None (e.g., data.preprocessor is not None and
val_data.preprocessor is not None) so the code uses the actual None test on the
preprocessor before wiring it into val_task_data_preprocessors (references:
RawDataset, preprocessor, variables data and val_data, and
val_task_data_preprocessors).
nemo_rl/data/datasets/response_datasets/daily_omni.py (1)

73-78: Add extraction filter to prevent path-traversal risks (Ruff S202).

tarfile.extractall() without a filter can be exploited for path traversal attacks. Python 3.12+ supports filter='data' to block absolute paths and parent-directory traversals. Even though the tar is downloaded from a trusted HuggingFace repository, add the filter for defense in depth.

Apply filter parameter
                 with tarfile.open(archive_filename, "r:*") as tar:
-                    # Extract all contents to the specified path
-                    tar.extractall(path=self.hf_cache_dir)
+                    tar.extractall(path=self.hf_cache_dir, filter="data")
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@nemo_rl/data/datasets/response_datasets/daily_omni.py` around lines 73 - 78,
The tar extraction in the try block uses
tarfile.extractall(path=self.hf_cache_dir) which is vulnerable to
path-traversal; update the call to pass the filter parameter to block absolute
and parent-directory paths (use filter='data' on Python 3.12+ or equivalent safe
extraction logic) so extraction of archive_filename into self.hf_cache_dir is
validated before writing (affects the block referencing archive_filename,
self.hf_cache_dir and files_folder).
nemo_rl/data/multimodal_utils.py (3)

321-325: Missing docstring for load_media_from_message.

This function is part of the public API consumed by llm_message_utils.py. A Google-style docstring explaining the parameters, return value, and the fallback behavior would help maintainability.

As per coding guidelines, "For interfaces that may be used outside a file, prefer docstrings over comments" and "Use Google style docstrings for classes and functions, which can be parsed by Sphinx."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@nemo_rl/data/multimodal_utils.py` around lines 321 - 325, Add a Google-style
docstring to the public function load_media_from_message describing parameters,
return value, and fallback behavior: document message (expected keys/structure),
processor (type and when used), multimodal_load_kwargs (shape: mapping of media
type to kwargs and defaults), and the returned dict[str, list[Any]] format;
include behavior when media is missing or when processor is None, examples of
supported media types, and any exceptions raised or swallowed. Place the
docstring immediately under the def load_media_from_message(...) signature so
external modules like llm_message_utils.py can rely on the documented API.

219-261: Missing docstring for a public function used outside this file; consider replacing function-attribute caching.

get_multimodal_default_settings_from_processor is referenced by llm_message_utils.py per the summary, so it should have a docstring per coding guidelines.

Also, the function-attribute caching pattern (lines 237–242, 249–254) is unconventional and not thread-safe. A module-level functools.lru_cache or a simple module-level variable would be more idiomatic. Additionally, the list comprehensions [param for param in ...] are redundant — list(...) suffices.

♻️ Simplify signature caching
+import functools
+
+
+@functools.lru_cache(maxsize=1)
+def _load_video_param_names() -> list[str]:
+    return list(inspect.signature(load_video).parameters)
+
+
+@functools.lru_cache(maxsize=1)
+def _load_audio_param_names() -> list[str]:
+    return list(inspect.signature(load_audio).parameters)
+
+
 def get_multimodal_default_settings_from_processor(
     processor,
 ) -> dict[str, dict[str, Any]]:
+    """Extract default video/audio loading kwargs from a processor's sub-components."""
     ...
-        if not hasattr(
-            get_multimodal_default_settings_from_processor, "load_video_kwargs"
-        ):
-            get_multimodal_default_settings_from_processor.load_video_kwargs = [
-                param for param in inspect.signature(load_video).parameters
-            ]
         default_settings["video"] = {
             arg: video_settings_dict[arg]
-            for arg in get_multimodal_default_settings_from_processor.load_video_kwargs
+            for arg in _load_video_param_names()
             if arg in video_settings_dict
         }

(Apply analogous change for load_audio_kwargs.)

As per coding guidelines, "For interfaces that may be used outside a file, prefer docstrings over comments."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@nemo_rl/data/multimodal_utils.py` around lines 219 - 261, The public function
get_multimodal_default_settings_from_processor lacks a docstring and uses unsafe
function-attribute caching and redundant list comprehensions; add a concise
docstring describing purpose, args, and return value, and replace the
function-attribute caches
get_multimodal_default_settings_from_processor.load_video_kwargs and
.load_audio_kwargs with a module-level cache (either simple module-level
variables or a `@functools.lru_cache-decorated` helper) that computes
list(inspect.signature(load_video).parameters) and
list(inspect.signature(load_audio).parameters) once, and update the list
comprehensions to use list(...) instead of [param for param in ...]; ensure the
rest of the function (video_settings_dict/feature_extractor usage and
default_settings keys) remains unchanged.

315-317: Conditional expression used solely for side effects hurts readability.

The ternary media[tag].extend(...) if isinstance(...) else media[tag].append(...) is used as a statement for its side effects only — the return value is discarded. This is a known anti-pattern in Python. A regular if/else block is clearer here.

♻️ Replace ternary statement with explicit if/else
         tag = item["type"]
         if tag in MEDIA_TAGS:
-            media[tag].extend(list(item[tag])) if isinstance(
-                item[tag], (list, tuple)
-            ) else media[tag].append(item[tag])
+            value = item[tag]
+            if isinstance(value, (list, tuple)):
+                media[tag].extend(value)
+            else:
+                media[tag].append(value)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@nemo_rl/data/multimodal_utils.py` around lines 315 - 317, Replace the ternary
used for side effects with a clear if/else: locate the statement that currently
does "media[tag].extend(list(item[tag])) if isinstance(item[tag], (list, tuple))
else media[tag].append(item[tag])" and change it to an explicit if
isinstance(item[tag], (list, tuple)): media[tag].extend(list(item[tag])) else:
media[tag].append(item[tag]). This preserves the same behavior for
media[tag].extend/append and improves readability; keep the same variable names
(media, tag, item) and surrounding logic.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@nemo_rl/data/datasets/response_datasets/daily_omni.py`:
- Line 1: The file daily_omni.py has an incorrect copyright header starting with
"##" instead of the repo-standard single "#" header; open
nemo_rl/data/datasets/response_datasets/daily_omni.py and replace the top header
so it matches the other Python files (use "# Copyright (c) 2025, NVIDIA
CORPORATION.  All rights reserved." with a single leading '#' and same
spacing/phrasing), ensuring the header sits at the very top of the file.
- Around line 27-32: The class docstring for DailyOmniDataset incorrectly
mentions "CLEVR-CoGenT" due to a copy-paste error; update the docstring in class
DailyOmniDataset to describe the Daily Omni dataset (e.g., replace "Simple
wrapper around the CLEVR-CoGenT dataset." with a concise description referencing
the Daily Omni dataset) and keep the Args section (split) unchanged.
- Around line 85-90: Fix the typo and enable exception chaining in the tar
handling block: change the misspelled tarfile.ReadErro to tarfile.ReadError in
the except block and re-raise the ReadError with the new message using exception
chaining (raise tarfile.ReadError("...") from e). Likewise, in the generic
except Exception as e block, re-raise the new Exception with the formatted
message using "from e" so the original traceback is preserved; locate the
tarfile handling where tarfile.ReadError and the variable e are used in
daily_omni.py.

In `@nemo_rl/data/datasets/response_datasets/general_conversations_dataset.py`:
- Around line 218-234: The return type annotation for process_message_fragment
is wrong: the function builds and returns a list (ret) of dicts, not a single
dict. Update the signature of process_message_fragment to return list[dict[str,
Any]] (or Sequence[Mapping[str, Any]] if you prefer an abstract type) and ensure
any callers expecting dict are adjusted; reference the function name
process_message_fragment and the local variable ret and the loop over
tag.split("-") to locate the code to change.
- Around line 148-153: Fix the typos in the class docstring for
GeneralConversationsJsonlDataset (subclass of RawDataset): change
"converstaions" to "conversations" and "requiement" to "requirement" so the
documentation reads correctly about jsonl datasets and media tag placement.
- Around line 107-118: The code is adding the file extension string to
tried_default_extensions and also assumes filenames always contain a '.',
causing an IndexError; update the logic in general_conversations_dataset.py
where ext is computed from metadata[ tag ][ media_index[tag] ] to safely extract
the extension (use os.path.splitext or check for '.' and handle missing
extension by setting ext to an empty string) and change the set insertion to
tried_default_extensions.add(tag) (not ext) so the guard `tag not in
tried_default_extensions` works as intended; keep the loop over
multimodal_utils.DEFAULT_MEDIA_EXTENSIONS[tag] unchanged but ensure you only add
the media type tag to tried_default_extensions when you decide to skip trying
defaults.

In `@nemo_rl/data/datasets/utils.py`:
- Around line 18-24: The function get_huggingface_cache_path should be updated
to use the public API and improve typing and error handling: replace the private
_scan_cached_repo import/use with huggingface_hub.scan_cache_dir (update the
import), add a Google-style docstring describing parameters/return and behavior,
add precise type hints for parameters and return (Python 3.12+ style), narrow
the broad except Exception to specific exceptions such as OSError and
ValueError, and guard against calling max() on an empty revs mapping by
returning an appropriate fallback or raising a clear ValueError; locate these
changes around the get_huggingface_cache_path function and any imports
referencing _scan_cached_repo.
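A public-API rewrite might look like the sketch below (the function signature and fallback behavior are assumptions; the real `get_huggingface_cache_path` may differ):

```python
from pathlib import Path
from typing import Optional

from huggingface_hub import scan_cache_dir
from huggingface_hub.utils import CacheNotFound


def get_huggingface_cache_path(repo_id: str, repo_type: str = "dataset") -> Optional[Path]:
    """Return the newest cached snapshot path for ``repo_id``, or ``None``.

    Uses the public ``scan_cache_dir`` API instead of the private
    ``_scan_cached_repo`` helper, narrows the exception handling, and guards
    against calling ``max()`` on an empty revisions set.
    """
    try:
        cache_info = scan_cache_dir()
    except (CacheNotFound, OSError, ValueError):
        return None
    for repo in cache_info.repos:
        if repo.repo_id == repo_id and repo.repo_type == repo_type:
            if not repo.revisions:  # avoid max() on an empty collection
                return None
            newest = max(repo.revisions, key=lambda rev: rev.last_modified)
            return newest.snapshot_path
    return None
```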

In `@nemo_rl/data/multimodal_utils.py`:
- Around line 344-347: Replace the runtime assertion that checks for "audio" and
"sampling_rate" in multimodal_load_kwargs with explicit validation and a clear
exception: verify that "audio" is a key in multimodal_load_kwargs and that
"sampling_rate" is present in multimodal_load_kwargs["audio"], and if either
check fails raise a ValueError with a descriptive message (e.g., referencing
multimodal_load_kwargs, "audio", and "sampling_rate") instead of using assert so
the check cannot be stripped under python -O; update the location where this
change is made around the multimodal_load_kwargs handling in multimodal_utils.py
(the block containing the current assert).
- Around line 348-364: Narrow the broad except in the load_audio fallback: catch
only expected errors from load_audio (e.g., RuntimeError, FileNotFoundError,
ValueError) instead of bare Exception; either log the caught error (use the
module logger via logger.warning with the exception info) or drop the unused "as
e" binding; replace the print("audio loading failed...") with logger.warning
including the error details and context (e.g., which file/path and that we are
falling back to decord). Also ensure the call to
get_dim_to_pack_along(processor, "audio") is safe when processor is None by
adding an explicit guard or documenting that processor may be None (or
defaulting to 0) so the fallback code's slicing behavior remains clear.

---

Outside diff comments:
In `@nemo_rl/data/multimodal_utils.py`:
- Line 1: Update the copyright header at the top of the file to use the current
year 2026: replace the 2025 year in the existing NVIDIA copyright header comment
(the file-level header at the very top of nemo_rl/data/multimodal_utils.py) so
the header reads 2026 instead of 2025, ensuring it matches the project's
required header format for Python files.

In `@nemo_rl/data/utils.py`:
- Around line 221-248: The bug is that val_task_data_preprocessors is declared
once outside the val_data_paths loop and may leak preprocessors across datasets;
fix it by moving/reinitializing val_task_data_preprocessors inside the for
val_dataset_name, val_dataset_path in val_data_paths.items() loop (set to {} at
start of each iteration), then populate it only when hasattr(val_data,
"preprocessor") and val_data.preprocessor is not None before passing it to
AllTaskProcessedDataset; ensure val_task_data_processors is also built
per-iteration using load_preference_dataset and val_data.task_name so each
AllTaskProcessedDataset gets only its own processors and preprocessors.
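The per-iteration reset can be demonstrated in isolation (the dataclass and function names below are stand-ins for the real symbols in `nemo_rl/data/utils.py`):

```python
from dataclasses import dataclass
from typing import Any, Callable, Optional


@dataclass
class FakeValData:  # stand-in for a loaded validation dataset
    task_name: str
    preprocessor: Optional[Callable[[dict], dict]] = None


def build_preprocessor_maps(val_datasets: list[FakeValData]) -> list[dict[str, Any]]:
    maps = []
    for val_data in val_datasets:
        # Reinitialize inside the loop so one dataset's preprocessor
        # cannot leak into the next dataset's map.
        val_task_data_preprocessors: dict[str, Any] = {}
        if val_data.preprocessor is not None:
            val_task_data_preprocessors[val_data.task_name] = val_data.preprocessor
        maps.append(val_task_data_preprocessors)
    return maps
```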

In `@nemo_rl/models/policy/__init__.py`:
- Around line 201-208: Add a Google-style class docstring to the TokenizerConfig
TypedDict describing the purpose of the TypedDict and documenting the new audio
and video keys: state that audio and video are optional multimodal config dicts,
list valid value types (dict[str, Any] or None), describe recommended defaults
(e.g., audio: {} or None to disable, video: {} or None to disable, and any
recommended subkeys like sample_rate, channels for audio or frame_rate,
resolution for video), and briefly mention chat_template_kwargs usage; then
update the exemplar YAML files under examples/configs/*.yaml to include the
audio and video keys with the recommended default values so the examples reflect
the docstring defaults.

In `@pyproject.toml`:
- Around line 18-53: Pin the decord dependency to a specific version by changing
the "decord" entry to "decord==0.6.0" and add a short comment noting Python 3.12
has no prebuilt wheels (so building from source may require FFmpeg and a proper
toolchain), or alternatively verify/source-build in CI and document required
build steps for Python 3.12 to avoid breakage; update the dependency line for
"decord" and include the explanatory comment nearby.

---

Nitpick comments:
In `@examples/run_sft.py`:
- Around line 104-153: The hasattr(...) checks for "preprocessor" are
unnecessary because RawDataset declares preprocessor as a class attribute;
replace the two occurrences where you do hasattr(data, "preprocessor") and
hasattr(val_data, "preprocessor") with direct checks that the attribute is not
None (e.g., data.preprocessor is not None and val_data.preprocessor is not None)
so the code uses the actual None test on the preprocessor before wiring it into
val_task_data_preprocessors (references: RawDataset, preprocessor, variables
data and val_data, and val_task_data_preprocessors).

In `@nemo_rl/algorithms/utils.py`:
- Around line 323-355: The three warnings.warn calls in the function that
adjusts processor feature/video settings (the calls that override
processor.feature_extractor.sampling_rate, processor.video_processor.fps, and
processor.video_processor.num_frames) should include stacklevel=2 so the warning
points at the caller; update each warnings.warn invocation (the ones emitting
"Overriding audio sampling rate...", "Overriding video fps...", and "Overriding
video num_frames...") to pass stacklevel=2 while keeping the existing message
and variable usage (tokenizer_config checks and assignments to
processor.feature_extractor.sampling_rate, processor.video_processor.fps,
processor.video_processor.num_frames).
- Around line 345-355: Check for mutual exclusivity of fps and num_frames in
tokenizer_config["video"] before mutating processor.video_processor.num_frames:
if both "fps" and "num_frames" are present, validate they are not contradictory
and raise a clear ValueError (or choose one deterministically) instead of
letting the video processor fail later. Specifically, in the block handling
tokenizer_config["video"] (referencing tokenizer_config,
processor.video_processor.num_frames, "fps", and "num_frames"), add an upfront
check that raises a descriptive error like "Cannot set both fps and num_frames
in tokenizer_config['video']" (or compare values and only warn/override when
consistent) before assigning processor.video_processor.num_frames.

In `@nemo_rl/data/datasets/processed_dataset.py`:
- Around line 33-58: The class docstring for AllTaskProcessedDataset is missing
documentation for the constructor parameter task_data_preprocessors; update the
Args section to add a short entry describing task_data_preprocessors (type:
Optional[Union[dict[str, TaskDataPreProcessFnCallable],
TaskDataPreProcessFnCallable]], default: None), explaining it can be a single
preprocessor applied to all examples or a dict mapping task names to
task-specific preprocessors and that missing tasks fall back to default
behavior; reference the parameter name task_data_preprocessors and keep the
wording consistent with the existing entries for task_data_processors and
default_task_data_spec.

In `@nemo_rl/data/datasets/raw_dataset.py`:
- Around line 26-35: Add an explicit constructor for class RawDataset that
initializes all externally visible members (data_config, dataset, val_dataset,
processor, task_spec) and sets preprocessor to None (or a provided value) so
preprocessor is not left uninitialized; implement __init__ on RawDataset to
accept and assign these fields (or convert the class to a dataclass with
defaults) to satisfy the "initialize members in constructor" guideline.
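An explicit constructor could be sketched as follows (field types are assumptions; the real `RawDataset` members may be typed differently):

```python
from typing import Any, Callable, Optional


class RawDataset:
    """Base class with all externally visible members set in the constructor."""

    def __init__(
        self,
        data_config: Optional[dict[str, Any]] = None,
        dataset: Any = None,
        val_dataset: Any = None,
        processor: Any = None,
        task_spec: Any = None,
        preprocessor: Optional[Callable] = None,
    ) -> None:
        self.data_config = data_config
        self.dataset = dataset
        self.val_dataset = val_dataset
        self.processor = processor
        self.task_spec = task_spec
        self.preprocessor = preprocessor  # defaults to None, never undefined
```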

In `@nemo_rl/data/datasets/response_datasets/daily_omni.py`:
- Around line 73-78: The tar extraction in the try block uses
tarfile.extractall(path=self.hf_cache_dir) which is vulnerable to
path-traversal; update the call to pass the filter parameter to block absolute
and parent-directory paths (use filter='data' on Python 3.12+ or equivalent safe
extraction logic) so extraction of archive_filename into self.hf_cache_dir is
validated before writing (affects the block referencing archive_filename,
self.hf_cache_dir and files_folder).

In `@nemo_rl/data/datasets/response_datasets/general_conversations_dataset.py`:
- Around line 38-67: The function convert_metadata currently uses return_inplace
(inverted naming) and returns None when return_inplace=True, causing a confusing
API; rename the parameter to inplace: bool = False (or keep name but invert
semantics) and ensure convert_metadata always returns the processed dict
(variable data) even when mutating the input, so callers get the result; update
logic that chooses data = metadata or metadata.copy(), keep the mapping loops
that reference multimodal_utils.MEDIA_TAGS_TO_ALLOWED and
multimodal_utils.MEDIA_TAGS unchanged, and update any callers to the new
parameter name if renamed.
- Line 26: The file defines an unused module-level variable named _DEBUG which
is never referenced; remove the _DEBUG = True declaration from
general_conversations_dataset (delete the unused symbol) to eliminate dead code
and avoid misleading debug flags.
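The renamed `convert_metadata` API suggested above could be sketched like this (the media-tag mapping body is elided; only the inplace/return contract is shown):

```python
from typing import Any


def convert_metadata(metadata: dict[str, Any], inplace: bool = False) -> dict[str, Any]:
    """Always return the processed dict, whether or not the input was mutated."""
    data = metadata if inplace else dict(metadata)
    # ... mapping over MEDIA_TAGS_TO_ALLOWED / MEDIA_TAGS would happen here ...
    return data
```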

In `@nemo_rl/data/llm_message_utils.py`:
- Around line 607-613: Replace the hardcoded if-chain that builds media_kwargs
from media_cur_message with an explicit mapping dict (e.g., MEDIA_KEY_TO_KWARG)
and iterate over its items to populate media_kwargs; locate the code block that
references media_cur_message and media_kwargs in nemo_rl.data.llm_message_utils
(the snippet that currently checks "image", "audio", "video") and change it to
consult the mapping so new modalities can be added by updating the dict rather
than editing conditional logic.
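The mapping-based approach could look like the sketch below; the kwarg names (`images`, `audio`, `videos`) are assumptions about what the processor expects, not confirmed from the source:

```python
from typing import Any

# Hypothetical mapping from media keys in the message to processor kwarg names.
MEDIA_KEY_TO_KWARG = {
    "image": "images",
    "audio": "audio",
    "video": "videos",
}


def build_media_kwargs(media_cur_message: dict[str, Any]) -> dict[str, Any]:
    """Populate media kwargs from a mapping instead of a hardcoded if-chain."""
    media_kwargs: dict[str, Any] = {}
    for media_key, kwarg_name in MEDIA_KEY_TO_KWARG.items():
        if media_cur_message.get(media_key):
            media_kwargs[kwarg_name] = media_cur_message[media_key]
    return media_kwargs
```

New modalities can then be supported by adding one entry to the dict.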

In `@nemo_rl/data/multimodal_utils.py`:
- Around line 321-325: Add a Google-style docstring to the public function
load_media_from_message describing parameters, return value, and fallback
behavior: document message (expected keys/structure), processor (type and when
used), multimodal_load_kwargs (shape: mapping of media type to kwargs and
defaults), and the returned dict[str, list[Any]] format; include behavior when
media is missing or when processor is None, examples of supported media types,
and any exceptions raised or swallowed. Place the docstring immediately under
the def load_media_from_message(...) signature so external modules like
llm_message_utils.py can rely on the documented API.
- Around line 219-261: The public function
get_multimodal_default_settings_from_processor lacks a docstring and uses unsafe
function-attribute caching and redundant list comprehensions; add a concise
docstring describing purpose, args, and return value, and replace the
function-attribute caches
get_multimodal_default_settings_from_processor.load_video_kwargs and
.load_audio_kwargs with a module-level cache (either simple module-level
variables or a `@functools.lru_cache-decorated` helper) that computes
list(inspect.signature(load_video).parameters) and
list(inspect.signature(load_audio).parameters) once, and update the list
comprehensions to use list(...) instead of [param for param in ...]; ensure the
rest of the function (video_settings_dict/feature_extractor usage and
default_settings keys) remains unchanged.
- Around line 315-317: Replace the ternary used for side effects with a clear
if/else: locate the statement that currently does
"media[tag].extend(list(item[tag])) if isinstance(item[tag], (list, tuple)) else
media[tag].append(item[tag])" and change it to an explicit if
isinstance(item[tag], (list, tuple)): media[tag].extend(list(item[tag])) else:
media[tag].append(item[tag]). This preserves the same behavior for
media[tag].extend/append and improves readability; keep the same variable names
(media, tag, item) and surrounding logic.

ℹ️ Review info

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 9148186 and ee953ca.

⛔ Files ignored due to path filters (1)
  • uv.lock is excluded by !**/*.lock
📒 Files selected for processing (20)
  • .github/workflows/cicd-main.yml
  • examples/configs/recipes/llm/performance/.grpo-deepseek-v3-32n4g.yaml.swp
  • examples/configs/recipes/llm/performance/.grpo-deepseek-v3-32n8g.yaml.swp
  • examples/configs/sft_avlm.yaml
  • examples/run_sft.py
  • nemo_rl/algorithms/utils.py
  • nemo_rl/data/datasets/processed_dataset.py
  • nemo_rl/data/datasets/raw_dataset.py
  • nemo_rl/data/datasets/response_datasets/__init__.py
  • nemo_rl/data/datasets/response_datasets/daily_omni.py
  • nemo_rl/data/datasets/response_datasets/general_conversations_dataset.py
  • nemo_rl/data/datasets/utils.py
  • nemo_rl/data/interfaces.py
  • nemo_rl/data/llm_message_utils.py
  • nemo_rl/data/multimodal_utils.py
  • nemo_rl/data/utils.py
  • nemo_rl/models/policy/__init__.py
  • pyproject.toml
  • tests/unit/data/datasets/test_general_conversations_dataset.py
  • tests/unit/data/datasets/test_response_dataset.py

@@ -0,0 +1,140 @@
## Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved.
⚠️ Potential issue | 🟡 Minor

Copyright header starts with ## instead of #.

All other files in the repo use # Copyright (c) 2025, NVIDIA CORPORATION. — this file has ## Copyright, which is inconsistent. As per coding guidelines, the NVIDIA copyright header should be added to all Python source files under nemo_rl/.

Proposed fix
-## Copyright (c) 2025, NVIDIA CORPORATION.  All rights reserved.
+# Copyright (c) 2025, NVIDIA CORPORATION.  All rights reserved.
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
-## Copyright (c) 2025, NVIDIA CORPORATION.  All rights reserved.
+# Copyright (c) 2025, NVIDIA CORPORATION.  All rights reserved.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@nemo_rl/data/datasets/response_datasets/daily_omni.py` at line 1, The file
daily_omni.py has an incorrect copyright header starting with "##" instead of
the repo-standard single "#" header; open
nemo_rl/data/datasets/response_datasets/daily_omni.py and replace the top header
so it matches the other Python files (use "# Copyright (c) 2025, NVIDIA
CORPORATION.  All rights reserved." with a single leading '#' and same
spacing/phrasing), ensuring the header sits at the very top of the file.

Comment on lines 27 to 32
class DailyOmniDataset(RawDataset):
    """Simple wrapper around the CLEVR-CoGenT dataset.

    Args:
        split: Split name for the dataset, default is "train"
    """
⚠️ Potential issue | 🟡 Minor

Docstring refers to "CLEVR-CoGenT" — this is the Daily Omni dataset.

Copy-paste artifact from another dataset class.

Proposed fix
 class DailyOmniDataset(RawDataset):
-    """Simple wrapper around the CLEVR-CoGenT dataset.
+    """Dataset wrapper for the Daily-Omni public video benchmarking dataset.
 
     Args:
         split: Split name for the dataset, default is "train"
📝 Committable suggestion


Suggested change
 class DailyOmniDataset(RawDataset):
-    """Simple wrapper around the CLEVR-CoGenT dataset.
+    """Dataset wrapper for the Daily-Omni public video benchmarking dataset.

     Args:
         split: Split name for the dataset, default is "train"
     """
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@nemo_rl/data/datasets/response_datasets/daily_omni.py` around lines 27 - 32,
The class docstring for DailyOmniDataset incorrectly mentions "CLEVR-CoGenT" due
to a copy-paste error; update the docstring in class DailyOmniDataset to
describe the Daily Omni dataset (e.g., replace "Simple wrapper around the
CLEVR-CoGenT dataset." with a concise description referencing the Daily Omni
dataset) and keep the Args section (split) unchanged.

Comment on lines 85 to 90
            except tarfile.ReadError:
                raise tarfile.ReadErro(
                    "Error: Could not read the tar file. It might be corrupted or not a tar file."
                )
            except Exception as e:
                raise Exception(f"An unexpected error occurred: {e}")

⚠️ Potential issue | 🔴 Critical

Critical typo: tarfile.ReadErro → tarfile.ReadError — will raise NameError at runtime.

Line 86 has a typo that turns this error-handling path into an unhandled NameError. Additionally, both except blocks should chain the original exception using from.

🐛 Proposed fix
             except tarfile.ReadError:
-                raise tarfile.ReadErro(
+                raise tarfile.ReadError(
                     "Error: Could not read the tar file. It might be corrupted or not a tar file."
-                )
-            except Exception as e:
-                raise Exception(f"An unexpected error occurred: {e}")
+                ) from None
+            except Exception as e:
+                raise RuntimeError(f"An unexpected error occurred: {e}") from e
🧰 Tools
🪛 Ruff (0.15.2)

[warning] 86-88: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling

(B904)


[warning] 86-88: Avoid specifying long messages outside the exception class

(TRY003)


[warning] 89-89: Do not catch blind exception: Exception

(BLE001)


[warning] 90-90: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling

(B904)


[warning] 90-90: Create your own exception

(TRY002)


[warning] 90-90: Avoid specifying long messages outside the exception class

(TRY003)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@nemo_rl/data/datasets/response_datasets/daily_omni.py` around lines 85 - 90,
Fix the typo and enable exception chaining in the tar handling block: change the
misspelled tarfile.ReadErro to tarfile.ReadError in the except block and
re-raise the ReadError with the new message using exception chaining (raise
tarfile.ReadError("...") from e). Likewise, in the generic except Exception as e
block, re-raise the new Exception with the formatted message using "from e" so
the original traceback is preserved; locate the tarfile handling where
tarfile.ReadError and the variable e are used in daily_omni.py.

Comment on lines 107 to 118
            ext = os.path.basename(metadata[tag][media_index[tag]]).split(".", 1)[1]
            if (
                raw
                and ext not in raw
                and tag not in tried_default_extensions
                and tag in multimodal_utils.DEFAULT_MEDIA_EXTENSIONS
            ):
                # try the default extension
                for ext in multimodal_utils.DEFAULT_MEDIA_EXTENSIONS[tag]:
                    if ext in raw:
                        tried_default_extensions.add(ext)
                        break
@coderabbitai coderabbitai bot Feb 23, 2026

⚠️ Potential issue | 🟠 Major

Bug: tried_default_extensions.add(ext) should be tried_default_extensions.add(tag) — the guard on line 111 checks tag, not ext.

Line 111 checks tag not in tried_default_extensions to avoid retrying default extensions for the same media type. But line 117 adds ext (the file extension string, e.g. "mp4") instead of tag (the media type, e.g. "video"). Since these are different values, the guard will never match, and default extension resolution will be retried on every occurrence.

Also, line 107 will raise IndexError if the media filename contains no . (e.g., a path or identifier with no extension).

🐛 Proposed fix
-            ext = os.path.basename(metadata[tag][media_index[tag]]).split(".", 1)[1]
+            basename = os.path.basename(metadata[tag][media_index[tag]])
+            ext = basename.split(".", 1)[1] if "." in basename else ""
             if (
                 raw
                 and ext not in raw
                 and tag not in tried_default_extensions
                 and tag in multimodal_utils.DEFAULT_MEDIA_EXTENSIONS
             ):
                 # try the default extension
                 for ext in multimodal_utils.DEFAULT_MEDIA_EXTENSIONS[tag]:
                     if ext in raw:
-                        tried_default_extensions.add(ext)
+                        tried_default_extensions.add(tag)
                         break
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@nemo_rl/data/datasets/response_datasets/general_conversations_dataset.py`
around lines 107 - 118, The code is adding the file extension string to
tried_default_extensions and also assumes filenames always contain a '.',
causing an IndexError; update the logic in general_conversations_dataset.py
where ext is computed from metadata[ tag ][ media_index[tag] ] to safely extract
the extension (use os.path.splitext or check for '.' and handle missing
extension by setting ext to an empty string) and change the set insertion to
tried_default_extensions.add(tag) (not ext) so the guard `tag not in
tried_default_extensions` works as intended; keep the loop over
multimodal_utils.DEFAULT_MEDIA_EXTENSIONS[tag] unchanged but ensure you only add
the media type tag to tried_default_extensions when you decide to skip trying
defaults.

Author

This is a false alarm. add(ext) is correct since we need to check the extension tried, not tag.

Contributor

@yuanhangsu1986, thank you for the clarification. I understand you want to track extensions rather than tags. However, I notice there's a mismatch in the current implementation:

  • Line 111 checks: tag not in tried_default_extensions (checking if the media type tag like "video" is in the set)
  • Line 117 adds: tried_default_extensions.add(ext) (adding the extension string like "mp4")

Since tag (e.g., "video") and ext (e.g., "mp4") are different values, the guard on line 111 will always pass even after adding ext to the set, which means default extensions will be retried on every occurrence of the same tag.

If the intention is to track which extensions have been tried (as you mentioned), would it make sense to also update line 111 to check ext not in tried_default_extensions instead of tag not in tried_default_extensions? This would make the guard consistent with what's being added to the set.

Let me know if there's a specific behavior I'm missing!

Comment on lines 148 to 153
class GeneralConversationsJsonlDataset(RawDataset):
    """Loads general conversation datasets that have the json (manifest) files and media files in separate files (jsonl datasets).

    Each sample can be single/multi-turn converstaions with multiple modalities.
    Each modality can have one or more number of media objects.
    There is no requiement of where the media tag (e.g. '<sound>') should appear in the conversations.
⚠️ Potential issue | 🟡 Minor

Docstring typos: "converstaions" → "conversations", "requiement" → "requirement".

Proposed fix
-    Each sample can be single/multi-turn converstaions with multiple modalities.
+    Each sample can be single/multi-turn conversations with multiple modalities.
     Each modality can have one or more number of media objects.
-    There is no requiement of where the media tag (e.g. '<sound>') should appear in the conversations.
+    There is no requirement of where the media tag (e.g. '<sound>') should appear in the conversations.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@nemo_rl/data/datasets/response_datasets/general_conversations_dataset.py`
around lines 148 - 153, Fix the typos in the class docstring for
GeneralConversationsJsonlDataset (subclass of RawDataset): change
"converstaions" to "conversations" and "requiement" to "requirement" so the
documentation reads correctly about jsonl datasets and media tag placement.

Comment on lines 218 to 234
    @classmethod
    def process_message_fragment(
        cls, tag: str, fragment: Any, media_directory: Optional[str] = None
    ) -> dict[str, Any]:
        if (
            media_directory is not None
            and tag in multimodal_utils.MEDIA_TAGS
            and isinstance(fragment, str)
            and not os.path.isfile(fragment)
        ):
            media_path = os.path.join(media_directory, fragment)
            if os.path.isfile(media_path):
                fragment = media_path
        ret = []
        for t in tag.split("-"):
            ret.append({"type": t, t: fragment})
        return ret

⚠️ Potential issue | 🟡 Minor

Return type annotation is incorrect: returns list[dict[str, Any]], not dict[str, Any].

process_message_fragment builds and returns ret (a list), but the annotation says -> dict[str, Any].

Proposed fix
     @classmethod
     def process_message_fragment(
         cls, tag: str, fragment: Any, media_directory: Optional[str] = None
-    ) -> dict[str, Any]:
+    ) -> list[dict[str, Any]]:
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@nemo_rl/data/datasets/response_datasets/general_conversations_dataset.py`
around lines 218 - 234, The return type annotation for process_message_fragment
is wrong: the function builds and returns a list (ret) of dicts, not a single
dict. Update the signature of process_message_fragment to return list[dict[str,
Any]] (or Sequence[Mapping[str, Any]] if you prefer an abstract type) and ensure
any callers expecting dict are adjusted; reference the function name
process_message_fragment and the local variable ret and the loop over
tag.split("-") to locate the code to change.

Comment on lines +18 to 24
from pathlib import Path
from typing import Optional, Union

import torch
from datasets import DatasetDict, load_dataset, load_from_disk
from huggingface_hub.utils._cache_manager import _scan_cached_repo
from PIL import Image

⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

# First, locate and read the file in question
find . -type f -name "utils.py" -path "*/nemo_rl/data/datasets/*" | head -5

Repository: NVIDIA-NeMo/RL

Length of output: 90


🏁 Script executed:

# Read the utils.py file to see the imports and get_huggingface_cache_path function
cat -n nemo_rl/data/datasets/utils.py | sed -n '1,50p'

Repository: NVIDIA-NeMo/RL

Length of output: 2233


🏁 Script executed:

# Read the function in question (lines 148-176)
cat -n nemo_rl/data/datasets/utils.py | sed -n '140,180p'

Repository: NVIDIA-NeMo/RL

Length of output: 1918


🌐 Web query:

huggingface_hub public API cache inspection alternatives to _scan_cached_repo

💡 Result:

_scan_cached_repo is a private helper. The public, supported ways to inspect the Hugging Face Hub cache are:

1) Scan the whole cache (recommended): scan_cache_dir()

Use the public API that returns a structured HFCacheInfo report (repos, revisions, sizes, timestamps, paths). (huggingface.co)

from huggingface_hub import scan_cache_dir

info = scan_cache_dir()          # or scan_cache_dir(cache_dir="...")

# list cached repos
for repo in sorted(info.repos, key=lambda r: r.size_on_disk, reverse=True):
    print(repo.repo_id, repo.repo_type, repo.size_on_disk, repo.repo_path)

# find one repo
repo = next(r for r in info.repos if r.repo_id == "google/fleurs" and r.repo_type == "dataset")
print(repo.revisions)  # commit hashes + snapshot paths + files

2) Check whether a specific file is already cached: try_to_load_from_cache()

If you don’t need a full scan, this is the public “is it cached, and where?” helper. (huggingface.co)

from huggingface_hub import try_to_load_from_cache

path = try_to_load_from_cache("bert-base-uncased", "config.json")
print(path)  # local path or None (or special sentinel)

3) CLI inspection (same cache system)

  • huggingface-cli scan-cache prints a cache report (repo id/type, disk usage, refs, local path). (huggingface.co)
  • Newer CLI also documents hf cache commands (e.g., list/remove) depending on your installed CLI version. (hugging-face.cn)

4) If you also want cleanup after inspection

scan_cache_dir() returns an HFCacheInfo that can prepare a deletion plan via delete_revisions(...) (dry-run style via the returned strategy, then execute()). (huggingface.co)


🌐 Web query:

huggingface_hub CachedRepoInfo cache directory public API

💡 Result:

In huggingface_hub (Python), CachedRepoInfo is part of the public cache-scanning API:

  • Use huggingface_hub.scan_cache_dir(cache_dir=...) to scan a cache root and get an HFCacheInfo, whose .repos contains CachedRepoInfo objects (with fields like repo_path, size_on_disk, revisions, etc.). [1]

To get / control the Hub cache directory (the “cache root” that scan_cache_dir() defaults to):

  • Configure via environment variables:

    • HF_HOME (root; default ~/.cache/huggingface)
    • HF_HUB_CACHE (Hub repos cache; default $HF_HOME/hub, i.e. typically ~/.cache/huggingface/hub) [2]
  • Programmatic (public) access to the resolved default path:

    • from huggingface_hub.constants import HF_HUB_CACHE (prints the effective Hub cache dir, honoring env overrides). [3]

Example

from huggingface_hub import scan_cache_dir
from huggingface_hub.constants import HF_HUB_CACHE

info = scan_cache_dir()              # defaults to HF_HUB_CACHE
print(HF_HUB_CACHE)                  # cache root (e.g. ~/.cache/huggingface/hub)

for repo in info.repos:
    print(repo.repo_id, repo.repo_path)

Sources: [1] [2] [3]


Add docstring, use public HF API, and narrow exception handling in get_huggingface_cache_path.

This function lacks a Google-style docstring and uses the private _scan_cached_repo API. Replace it with the public scan_cache_dir() from huggingface_hub, which is the supported, documented way to inspect the local cache. Additionally, the broad Exception catch on line 174 masks real errors; narrow it to specific exceptions (e.g., OSError, ValueError). Also add type hints to match Python 3.12+ best practices, and guard against max() on an empty sequence if the revs dict is empty after filtering.

Refactor sketch
+from huggingface_hub import scan_cache_dir
+
-def get_huggingface_cache_path(repo_id, branch="main", repo_type="datasets"):
+def get_huggingface_cache_path(
+    repo_id: str, branch: str | None = "main", repo_type: str = "datasets"
+) -> str | None:
+    """Return the latest cached snapshot path for a HF repo if present.
+
+    Args:
+        repo_id: Repository ID in "org/name" form.
+        branch: Optional branch/tag name to filter revisions.
+        repo_type: Repository type ("datasets", "models", etc.).
+
+    Returns:
+        Path to the snapshot directory if cached, else None.
+    """
     cache_path = None
     try:
-        # ... existing cache detection logic ...
-        hf_cache_info = _scan_cached_repo(repo_path=repo_path)
+        hf_cache_info = scan_cache_dir()
+        # Note: scan_cache_dir() reports repo_type in singular form
+        # ("dataset"/"model"), unlike the pluralized "datasets" default here.
+        repo = next((r for r in hf_cache_info.repos
+                     if r.repo_id == repo_id
+                     and r.repo_type == repo_type.rstrip("s")), None)
+        if not repo:
+            return None
-        revs = {r.refs: r for r in hf_cache_info.revisions}
+        revs = {r.refs: r for r in repo.revisions}
         if branch is not None:
             revs = {refs: r for refs, r in revs.items() if branch in refs}
+        if not revs:
+            return None
         rev2keep = max(revs.values(), key=lambda r: r.last_modified)
         return str(rev2keep.snapshot_path)
-    except Exception as e:
+    except (OSError, ValueError) as e:
         print(f"{type(e)}: {e}")
         return None
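For reference, the lookup-and-guard logic above can be exercised without touching a real cache by duck-typing the objects that scan_cache_dir() returns. A minimal, hypothetical sketch (the Fake* classes and latest_snapshot_path name are illustrative stand-ins, not part of huggingface_hub):

```python
from dataclasses import dataclass, field
from pathlib import Path

# Hypothetical stand-ins mirroring the shape of huggingface_hub's
# CachedRevisionInfo / CachedRepoInfo; real code would iterate
# scan_cache_dir().repos instead.
@dataclass(frozen=True)
class FakeRevision:
    refs: frozenset          # branch/tag names pointing at this revision
    last_modified: float     # unix timestamp
    snapshot_path: Path

@dataclass
class FakeRepo:
    repo_id: str
    repo_type: str           # scan_cache_dir() uses singular: "dataset"/"model"
    revisions: list = field(default_factory=list)

def latest_snapshot_path(repos, repo_id, repo_type="dataset", branch="main"):
    """Return the newest cached snapshot path for repo_id, else None."""
    repo = next(
        (r for r in repos if r.repo_id == repo_id and r.repo_type == repo_type),
        None,
    )
    if repo is None:
        return None
    revs = [
        rev for rev in repo.revisions
        if branch is None or any(branch in ref for ref in rev.refs)
    ]
    if not revs:  # guard: max() on an empty sequence raises ValueError
        return None
    return str(max(revs, key=lambda rev: rev.last_modified).snapshot_path)

repos = [
    FakeRepo(
        "org/demo-data",
        "dataset",
        [
            FakeRevision(frozenset({"main"}), 100.0, Path("/cache/snap-old")),
            FakeRevision(frozenset({"main"}), 200.0, Path("/cache/snap-new")),
            FakeRevision(frozenset({"dev"}), 300.0, Path("/cache/snap-dev")),
        ],
    )
]
print(latest_snapshot_path(repos, "org/demo-data"))  # /cache/snap-new
print(latest_snapshot_path(repos, "org/missing"))    # None
```

The empty-revisions guard is the part the original function was missing: filtering by branch can leave nothing to pass to max().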
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@nemo_rl/data/datasets/utils.py` around lines 18 - 24, The function
get_huggingface_cache_path should be updated to use the public API and improve
typing and error handling: replace the private _scan_cached_repo import/use with
huggingface_hub.scan_cache_dir (update the import), add a Google-style docstring
describing parameters/return and behavior, add precise type hints for parameters
and return (Python 3.12+ style), narrow the broad except Exception to specific
exceptions such as OSError and ValueError, and guard against calling max() on an
empty revs mapping by returning an appropriate fallback or raising a clear
ValueError; locate these changes around the get_huggingface_cache_path function
and any imports referencing _scan_cached_repo.

Comment on lines 344 to 347
assert (
    "audio" in multimodal_load_kwargs
    and "sampling_rate" in multimodal_load_kwargs["audio"]
)
Contributor

⚠️ Potential issue | 🟠 Major

assert used for runtime input validation — will silently pass under python -O.

Assertions are stripped when Python runs with optimizations enabled (-O / -OO). If sampling_rate is genuinely required, raise a ValueError instead.

Proposed fix
-                assert (
-                    "audio" in multimodal_load_kwargs
-                    and "sampling_rate" in multimodal_load_kwargs["audio"]
-                )
+                if (
+                    "audio" not in multimodal_load_kwargs
+                    or "sampling_rate" not in multimodal_load_kwargs.get("audio", {})
+                ):
+                    raise ValueError(
+                        "multimodal_load_kwargs must include 'audio' with a 'sampling_rate' "
+                        "key to load audio from file path."
+                    )
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
assert (
    "audio" in multimodal_load_kwargs
    and "sampling_rate" in multimodal_load_kwargs["audio"]
)
if (
    "audio" not in multimodal_load_kwargs
    or "sampling_rate" not in multimodal_load_kwargs.get("audio", {})
):
    raise ValueError(
        "multimodal_load_kwargs must include 'audio' with a 'sampling_rate' "
        "key to load audio from file path."
    )
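The validate-then-raise pattern can also be factored into a small helper so the check survives python -O and stays testable. A sketch (the require_audio_sampling_rate name is illustrative, not from the PR):

```python
def require_audio_sampling_rate(multimodal_load_kwargs: dict) -> int:
    """Validate nested audio kwargs with a real exception instead of assert.

    Unlike an assert statement, this check still runs when Python is
    started with -O / -OO.
    """
    audio_kwargs = multimodal_load_kwargs.get("audio")
    if not isinstance(audio_kwargs, dict) or "sampling_rate" not in audio_kwargs:
        raise ValueError(
            "multimodal_load_kwargs must include 'audio' with a 'sampling_rate' "
            "key to load audio from a file path."
        )
    return audio_kwargs["sampling_rate"]

print(require_audio_sampling_rate({"audio": {"sampling_rate": 16000}}))  # 16000
```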
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@nemo_rl/data/multimodal_utils.py` around lines 344 - 347, Replace the runtime
assertion that checks for "audio" and "sampling_rate" in multimodal_load_kwargs
with explicit validation and a clear exception: verify that "audio" is a key in
multimodal_load_kwargs and that "sampling_rate" is present in
multimodal_load_kwargs["audio"], and if either check fails raise a ValueError
with a descriptive message (e.g., referencing multimodal_load_kwargs, "audio",
and "sampling_rate") instead of using assert so the check cannot be stripped
under python -O; update the location where this change is made around the
multimodal_load_kwargs handling in multimodal_utils.py (the block containing the
current assert).

Comment on lines 348 to 364
try:
    loaded_media["audio"].append(
        load_audio(aud, **multimodal_load_kwargs["audio"])
    )
except Exception as e:
    print("audio loading failed. Fall back to decord.")
    # use decord
    loaded_audio = decord.AudioReader(
        aud,
        sample_rate=multimodal_load_kwargs["audio"]["sampling_rate"],
        mono=True,
    )
    loaded_media["audio"].append(
        loaded_audio[:].asnumpy()[
            get_dim_to_pack_along(processor, "audio")
        ]
    )
Contributor

⚠️ Potential issue | 🟠 Major

Bare Exception catch, unused variable, and print instead of logging.

Three issues in this block, two confirmed by static analysis (Ruff BLE001, F841):

  1. Bare Exception — in Python 3, KeyboardInterrupt and SystemExit derive from BaseException rather than Exception, so they still propagate, but every other error (typos, attribute errors, bad kwargs) is swallowed by this handler and masks real bugs. Narrow to expected failure types (e.g., RuntimeError, FileNotFoundError, OSError).
  2. Unused e — either log it or drop as e.
  3. print(...) in library code — use logging.warning(...) (or logger.warning) so callers can control log levels.

Additionally, on line 362, get_dim_to_pack_along(processor, "audio") is called, but processor can be None (parameter default on line 323). While get_dim_to_pack_along would not crash on None (it returns 0), it's semantically unclear — consider guarding or documenting.

Proposed fix
+import logging
+
+logger = logging.getLogger(__name__)
+
 ...
-                except Exception as e:
-                    print("audio loading failed. Fall back to decord.")
+                except (RuntimeError, FileNotFoundError, OSError) as e:
+                    logger.warning("Audio loading via transformers failed (%s). Falling back to decord.", e)
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
try:
    loaded_media["audio"].append(
        load_audio(aud, **multimodal_load_kwargs["audio"])
    )
except Exception as e:
    print("audio loading failed. Fall back to decord.")
    # use decord
    loaded_audio = decord.AudioReader(
        aud,
        sample_rate=multimodal_load_kwargs["audio"]["sampling_rate"],
        mono=True,
    )
    loaded_media["audio"].append(
        loaded_audio[:].asnumpy()[
            get_dim_to_pack_along(processor, "audio")
        ]
    )
try:
    loaded_media["audio"].append(
        load_audio(aud, **multimodal_load_kwargs["audio"])
    )
except (RuntimeError, FileNotFoundError, OSError) as e:
    logger.warning("Audio loading via transformers failed (%s). Falling back to decord.", e)
    # use decord
    loaded_audio = decord.AudioReader(
        aud,
        sample_rate=multimodal_load_kwargs["audio"]["sampling_rate"],
        mono=True,
    )
    loaded_media["audio"].append(
        loaded_audio[:].asnumpy()[
            get_dim_to_pack_along(processor, "audio")
        ]
    )
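The try-primary-then-fallback pattern generalizes; a minimal, self-contained sketch with a module-level logger and a narrowed except clause (load_with_fallback and the toy loaders are hypothetical, not NeMo-RL or decord APIs):

```python
import logging

logger = logging.getLogger(__name__)

def load_with_fallback(path, primary, fallback):
    """Try the primary loader; on expected I/O errors, warn and fall back.

    Only anticipated failure modes are caught, so genuine bugs
    (e.g., a TypeError from a bad call) still surface to the caller.
    """
    try:
        return primary(path)
    except (RuntimeError, FileNotFoundError, OSError) as e:
        logger.warning(
            "Primary audio load failed for %s (%s); falling back.", path, e
        )
        return fallback(path)

# Toy loaders standing in for transformers' load_audio and decord.AudioReader.
def flaky_loader(path):
    raise FileNotFoundError(path)

def decord_like_loader(path):
    return [0.0, 0.1]  # pretend waveform

print(load_with_fallback("clip.wav", flaky_loader, decord_like_loader))  # [0.0, 0.1]
```

Using logger.warning instead of print lets downstream callers route or silence the message through standard logging configuration.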
🧰 Tools
🪛 Ruff (0.15.2)

[warning] 352-352: Do not catch blind exception: Exception

(BLE001)


[error] 352-352: Local variable e is assigned to but never used

Remove assignment to unused variable e

(F841)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@nemo_rl/data/multimodal_utils.py` around lines 348 - 364, Narrow the broad
except in the load_audio fallback: catch only expected errors from load_audio
(e.g., RuntimeError, FileNotFoundError, ValueError) instead of bare Exception;
either log the caught error (use the module logger via logger.warning with the
exception info) or drop the unused "as e" binding; replace the print("audio
loading failed...") with logger.warning including the error details and context
(e.g., which file/path and that we are falling back to decord). Also ensure the
call to get_dim_to_pack_along(processor, "audio") is safe when processor is None
by adding an explicit guard or documenting that processor may be None (or
defaulting to 0) so the fallback code's slicing behavior remains clear.

@chtruong814 chtruong814 removed the needs-follow-up Issue needs follow-up label Feb 24, 2026
@yuanhangsu1986 yuanhangsu1986 changed the title Yuanhangs dev feat: Omni dataloader for HF models Feb 24, 2026
@yuanhangsu1986 yuanhangsu1986 added CI:L1 Run doctests, unit tests, and functional tests and removed CI:L1 Run doctests, unit tests, and functional tests labels Feb 24, 2026
Signed-off-by: Yuki Huang <yukih@nvidia.com>
Signed-off-by: Yuki Huang <yukih@nvidia.com>
@github-actions github-actions bot removed the CI Relating to CI label Feb 24, 2026
@yuki-97 yuki-97 added CI:L1 Run doctests, unit tests, and functional tests and removed CI:L1 Run doctests, unit tests, and functional tests labels Feb 24, 2026

Labels

CI:L1 Run doctests, unit tests, and functional tests

5 participants