Conversation
📝 Walkthrough

This PR introduces per-task preprocessing support and multimodal dataset infrastructure to NeMo RL. It adds new dataset classes (DailyOmniDataset, GeneralConversationsJsonlDataset), multimodal utilities for media loading, a TaskDataPreProcessFnCallable interface, and propagates preprocessor mappings through the training and validation pipelines. It also includes a Docker build optimization and comprehensive test coverage.

Changes
Sequence Diagram

```mermaid
sequenceDiagram
    participant DataLoader as Data Loader
    participant RawDataset as RawDataset<br/>(+ preprocessor)
    participant Preprocessor as Task Preprocessor
    participant AllTaskProcessedDataset as AllTaskProcessedDataset<br/>(+ task_data_preprocessors)
    participant Processor as Task Processor
    participant MessageUtils as Message Utils<br/>(+ Media Loading)

    DataLoader->>RawDataset: Load dataset<br/>(e.g., DailyOmni)
    RawDataset-->>DataLoader: Return data + preprocessor
    DataLoader->>AllTaskProcessedDataset: Pass task_data_preprocessors<br/>mapping
    loop For each data sample
        AllTaskProcessedDataset->>Preprocessor: __call__(raw_datum)
        Preprocessor-->>AllTaskProcessedDataset: preprocessed_datum
        AllTaskProcessedDataset->>Processor: __call__(preprocessed_datum)
        Processor->>MessageUtils: load_media_from_message()
        MessageUtils-->>Processor: extracted media dict<br/>(images, audio, video)
        Processor-->>AllTaskProcessedDataset: formatted output
        AllTaskProcessedDataset-->>DataLoader: processed item
    end
```
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related PRs
Suggested labels
Suggested reviewers
🚥 Pre-merge checks | ✅ 2 | ❌ 2

❌ Failed checks (2 warnings)
✅ Passed checks (2 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings.
Actionable comments posted: 9
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (4)
nemo_rl/models/policy/__init__.py (1)
201-208: ⚠️ Potential issue | 🟡 Minor

Document the new `audio`/`video` config keys in `TokenizerConfig`.

Please add a Google-style class docstring (or expand an existing one) to document the purpose, valid values/types, and recommended defaults for `audio` and `video`, and ensure exemplar YAMLs reflect those defaults.

As per coding guidelines: "Use Google style docstrings for classes and functions" and "When adding a new config key to a TypedDict subclass, document the key's purpose, valid values/types, recommended default, and reflect the default in exemplar YAMLs under examples/configs/*.yaml".
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@nemo_rl/models/policy/__init__.py` around lines 201 - 208, Add a Google-style class docstring to the TokenizerConfig TypedDict describing the purpose of the TypedDict and documenting the new audio and video keys: state that audio and video are optional multimodal config dicts, list valid value types (dict[str, Any] or None), describe recommended defaults (e.g., audio: {} or None to disable, video: {} or None to disable, and any recommended subkeys like sample_rate, channels for audio or frame_rate, resolution for video), and briefly mention chat_template_kwargs usage; then update the exemplar YAML files under examples/configs/*.yaml to include the audio and video keys with the recommended default values so the examples reflect the docstring defaults.
nemo_rl/data/utils.py (1)
221-248: ⚠️ Potential issue | 🟠 Major

Reset `val_task_data_preprocessors` per validation dataset.

In the `val_data_paths` loop, `val_task_data_preprocessors` is reused across iterations. If a later dataset has no preprocessor, the previous one leaks into its `AllTaskProcessedDataset`. Reinitialize it per iteration.

Suggested fix:

```diff
-    val_task_data_preprocessors = {}
     if "val_data_paths" in data_config and data_config["val_data_paths"]:
         ...
         for val_dataset_name, val_dataset_path in val_data_paths.items():
             ...
+            val_task_data_preprocessors = {}
             if hasattr(val_data, "preprocessor") and val_data.preprocessor is not None:
-                val_task_data_preprocessors = {
-                    val_data.task_name: val_data.preprocessor
-                }
+                val_task_data_preprocessors[val_data.task_name] = val_data.preprocessor
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@nemo_rl/data/utils.py` around lines 221 - 248, The bug is that val_task_data_preprocessors is declared once outside the val_data_paths loop and may leak preprocessors across datasets; fix it by moving/reinitializing val_task_data_preprocessors inside the for val_dataset_name, val_dataset_path in val_data_paths.items() loop (set to {} at start of each iteration), then populate it only when hasattr(val_data, "preprocessor") and val_data.preprocessor is not None before passing it to AllTaskProcessedDataset; ensure val_task_data_processors is also built per-iteration using load_preference_dataset and val_data.task_name so each AllTaskProcessedDataset gets only its own processors and preprocessors.
pyproject.toml (1)
18-53: ⚠️ Potential issue | 🔴 Critical

Pin `decord` to a specific version, with awareness of Python 3.12 compatibility.

The latest stable `decord` release is 0.6.0 (June 2021), which lacks pre-built wheels for Python 3.12. Installing `decord` on Python 3.12 will require building from source, which may fail depending on your FFmpeg setup and build toolchain. Pin `decord==0.6.0` explicitly, and document any required build steps for Python 3.12, or verify that source builds work reliably in your environment before merging.
Verify each finding against the current code and only fix it if needed. In `@pyproject.toml` around lines 18 - 53, Pin the decord dependency to a specific version by changing the "decord" entry to "decord==0.6.0" and add a short comment noting Python 3.12 has no prebuilt wheels (so building from source may require FFmpeg and a proper toolchain), or alternatively verify/source-build in CI and document required build steps for Python 3.12 to avoid breakage; update the dependency line for "decord" and include the explanatory comment nearby.
nemo_rl/data/multimodal_utils.py (1)
1-1: ⚠️ Potential issue | 🟡 Minor

Copyright year is 2025; the current year is 2026.
The file has substantial new code. Consider updating the copyright header to 2026.
As per coding guidelines, "Add the NVIDIA copyright header (with current year) to all Python files and shell scripts, excluding tests."
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@nemo_rl/data/multimodal_utils.py` at line 1, Update the copyright header at the top of the file to use the current year 2026: replace the 2025 year in the existing NVIDIA copyright header comment (the file-level header at the very top of nemo_rl/data/multimodal_utils.py) so the header reads 2026 instead of 2025, ensuring it matches the project's required header format for Python files.
🧹 Nitpick comments (12)
nemo_rl/data/datasets/raw_dataset.py (1)
26-35: Initialize `preprocessor` in the constructor to satisfy the class-member init guideline.

`RawDataset` exposes `preprocessor` publicly; per guidelines, it should be initialized in `__init__` alongside other members. Consider adding an explicit constructor (or a dataclass) to set these fields.

As per coding guidelines: "Initialize all externally visible members of a class in the constructor".
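One way to satisfy the guideline is a dataclass, sketched here with illustrative field names and types (the real `RawDataset` members may differ; only `preprocessor` and `task_spec`/`processor` are named in this review):

```python
from dataclasses import dataclass
from typing import Any, Callable, Optional


@dataclass
class RawDataset:
    """Base container for a raw dataset and its task wiring.

    All externally visible members are initialized at construction time,
    so `preprocessor` is always defined (None means "no preprocessing").
    """

    dataset: Any = None
    val_dataset: Any = None
    task_spec: Any = None
    processor: Optional[Callable] = None
    preprocessor: Optional[Callable] = None
```

Subclasses can then assign `self.preprocessor` in their own `__init__`, and callers never need `hasattr` checks, only a `is not None` test.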
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@nemo_rl/data/datasets/raw_dataset.py` around lines 26 - 35, Add an explicit constructor for class RawDataset that initializes all externally visible members (data_config, dataset, val_dataset, processor, task_spec) and sets preprocessor to None (or a provided value) so preprocessor is not left uninitialized; implement __init__ on RawDataset to accept and assign these fields (or convert the class to a dataclass with defaults) to satisfy the "initialize members in constructor" guideline.
nemo_rl/algorithms/utils.py (2)
323-355: Add `stacklevel=2` to `warnings.warn` calls so warnings point to the caller.

All three `warnings.warn` calls (lines 330, 341, 352) default to `stacklevel=1`, which causes the warning to reference this utility function rather than the calling code that configured the override.

Proposed fix (example for line 330; apply similarly to lines 341 and 352):

```diff
 warnings.warn(
-    f"Overriding audio sampling rate from {processor.feature_extractor.sampling_rate} to {new_sampling_rate}"
+    f"Overriding audio sampling rate from {processor.feature_extractor.sampling_rate} to {new_sampling_rate}",
+    stacklevel=2,
 )
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@nemo_rl/algorithms/utils.py` around lines 323 - 355, The three warnings.warn calls in the function that adjusts processor feature/video settings (the calls that override processor.feature_extractor.sampling_rate, processor.video_processor.fps, and processor.video_processor.num_frames) should include stacklevel=2 so the warning points at the caller; update each warnings.warn invocation (the ones emitting "Overriding audio sampling rate...", "Overriding video fps...", and "Overriding video num_frames...") to pass stacklevel=2 while keeping the existing message and variable usage (tokenizer_config checks and assignments to processor.feature_extractor.sampling_rate, processor.video_processor.fps, processor.video_processor.num_frames).
345-355: Consider validating mutual exclusivity of `fps` and `num_frames` upfront.

The comment on line 345 acknowledges the conflict but defers it. An explicit check with a clear error message here would be more user-friendly than a cryptic failure later in the video processor.
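A sketch of the suggested upfront check; `validate_video_overrides` is a hypothetical helper name, and the error wording mirrors the AI-agent prompt below:

```python
def validate_video_overrides(video_cfg: dict) -> None:
    """Fail fast on ambiguous video settings instead of deep inside the processor."""
    if "fps" in video_cfg and "num_frames" in video_cfg:
        raise ValueError(
            "Cannot set both 'fps' and 'num_frames' in tokenizer_config['video']; "
            "pick one frame-sampling strategy."
        )


# Only one sampling strategy given, so this passes silently.
validate_video_overrides({"fps": 2})
```

Calling this before mutating `processor.video_processor` turns a late, cryptic processor failure into an immediate config error.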
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@nemo_rl/algorithms/utils.py` around lines 345 - 355, Check for mutual exclusivity of fps and num_frames in tokenizer_config["video"] before mutating processor.video_processor.num_frames: if both "fps" and "num_frames" are present, validate they are not contradictory and raise a clear ValueError (or choose one deterministically) instead of letting the video processor fail later. Specifically, in the block handling tokenizer_config["video"] (referencing tokenizer_config, processor.video_processor.num_frames, "fps", and "num_frames"), add an upfront check that raises a descriptive error like "Cannot set both fps and num_frames in tokenizer_config['video']" (or compare values and only warn/override when consistent) before assigning processor.video_processor.num_frames.
nemo_rl/data/datasets/processed_dataset.py (1)
33-58: Document `task_data_preprocessors` in the class docstring.

The `Args` section (lines 36–44) documents `task_data_processors` and `max_seq_length` but omits the new `task_data_preprocessors` parameter. Adding a brief entry keeps the docstring consistent with the constructor signature.

📝 Proposed docstring addition:

```diff
     task_data_processors: Either a single TaskDataProcessFnCallable for single-task,
         or a dict mapping task names to (TaskDataSpec, TaskDataProcessFnCallable) for multi-task
+    task_data_preprocessors: Optional preprocessing hook applied before task-specific processing.
+        Either a single TaskDataPreProcessFnCallable for all tasks,
+        or a dict mapping task names to TaskDataPreProcessFnCallable.
     max_seq_length: Maximum sequence length for tokenized outputs
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@nemo_rl/data/datasets/processed_dataset.py` around lines 33 - 58, The class docstring for AllTaskProcessedDataset is missing documentation for the constructor parameter task_data_preprocessors; update the Args section to add a short entry describing task_data_preprocessors (type: Optional[Union[dict[str, TaskDataPreProcessFnCallable], TaskDataPreProcessFnCallable]], default: None), explaining it can be a single preprocessor applied to all examples or a dict mapping task names to task-specific preprocessors and that missing tasks fall back to default behavior; reference the parameter name task_data_preprocessors and keep the wording consistent with the existing entries for task_data_processors and default_task_data_spec.
nemo_rl/data/datasets/response_datasets/general_conversations_dataset.py (2)
38-67: `convert_metadata` returns `None` when `return_inplace=True`, a confusing API.

When `return_inplace=True`, the function mutates `metadata` in place and returns `None` implicitly. When `False`, it returns a new dict. The parameter name is also inverted from the typical `inplace` convention. Consider renaming it to `inplace` (default `False`), or always returning the result, to avoid caller confusion.
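A sketch of the always-return variant under the suggested `inplace` naming. The media-tag normalization itself is elided here, so the body only illustrates the copy-vs-mutate contract:

```python
def convert_metadata(metadata: dict, inplace: bool = False) -> dict:
    """Normalize media metadata and always return the result.

    When inplace=True the input dict is mutated and returned; otherwise a
    shallow copy is processed, leaving the caller's dict untouched.
    """
    data = metadata if inplace else metadata.copy()
    # ... tag normalization would happen here ...
    return data


m = {"image": ["a.png"]}
out = convert_metadata(m)          # copy by default: out is a new dict
same = convert_metadata(m, inplace=True)  # explicit opt-in: same is m
```

Always returning `data` lets call sites chain `convert_metadata(...)` regardless of the flag, which is the pandas-style convention the review alludes to.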
Verify each finding against the current code and only fix it if needed. In `@nemo_rl/data/datasets/response_datasets/general_conversations_dataset.py` around lines 38 - 67, The function convert_metadata currently uses return_inplace (inverted naming) and returns None when return_inplace=True, causing a confusing API; rename the parameter to inplace: bool = False (or keep name but invert semantics) and ensure convert_metadata always returns the processed dict (variable data) even when mutating the input, so callers get the result; update logic that chooses data = metadata or metadata.copy(), keep the mapping loops that reference multimodal_utils.MEDIA_TAGS_TO_ALLOWED and multimodal_utils.MEDIA_TAGS unchanged, and update any callers to the new parameter name if renamed.
26-26: Unused `_DEBUG` variable; remove before merging.

`_DEBUG = True` is never referenced in this file.
Verify each finding against the current code and only fix it if needed. In `@nemo_rl/data/datasets/response_datasets/general_conversations_dataset.py` at line 26, The file defines an unused module-level variable named _DEBUG which is never referenced; remove the _DEBUG = True declaration from general_conversations_dataset (delete the unused symbol) to eliminate dead code and avoid misleading debug flags.
nemo_rl/data/llm_message_utils.py (1)
607-613: Consider making the media-key-to-kwarg mapping explicit and extensible.

The hardcoded `if`/`elif` chain mapping `"image"` → `"images"`, `"audio"` → `"audio"`, and `"video"` → `"videos"` will need updating whenever a new modality is added. A small mapping dict would be more maintainable.

♻️ Example:

```diff
-    media_kwargs = {}
-    if "image" in media_cur_message:
-        media_kwargs["images"] = media_cur_message["image"]
-    if "audio" in media_cur_message:
-        media_kwargs["audio"] = media_cur_message["audio"]
-    if "video" in media_cur_message:
-        media_kwargs["videos"] = media_cur_message["video"]
+    MEDIA_KEY_TO_KWARG = {"image": "images", "audio": "audio", "video": "videos"}
+    media_kwargs = {
+        MEDIA_KEY_TO_KWARG[k]: v
+        for k, v in media_cur_message.items()
+        if k in MEDIA_KEY_TO_KWARG
+    }
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@nemo_rl/data/llm_message_utils.py` around lines 607 - 613, Replace the hardcoded if-chain that builds media_kwargs from media_cur_message with an explicit mapping dict (e.g., MEDIA_KEY_TO_KWARG) and iterate over its items to populate media_kwargs; locate the code block that references media_cur_message and media_kwargs in nemo_rl.data.llm_message_utils (the snippet that currently checks "image", "audio", "video") and change it to consult the mapping so new modalities can be added by updating the dict rather than editing conditional logic.
examples/run_sft.py (1)
104-153: Validation preprocessor wiring is correct.

Both paths (split-from-train and explicit validation config) properly collect and propagate preprocessors.

Minor note: since `RawDataset` declares `preprocessor` as a class attribute (line 33 of `raw_dataset.py`), the `hasattr` checks on lines 89 and 142 are always True for `RawDataset` subclasses. You could simplify to just `data.preprocessor is not None`, but the current form is safely defensive.
Verify each finding against the current code and only fix it if needed. In `@examples/run_sft.py` around lines 104 - 153, The hasattr(...) checks for "preprocessor" are unnecessary because RawDataset declares preprocessor as a class attribute; replace the two occurrences where you do hasattr(data, "preprocessor") and hasattr(val_data, "preprocessor") with direct checks that the attribute is not None (e.g., data.preprocessor is not None and val_data.preprocessor is not None) so the code uses the actual None test on the preprocessor before wiring it into val_task_data_preprocessors (references: RawDataset, preprocessor, variables data and val_data, and val_task_data_preprocessors).
nemo_rl/data/datasets/response_datasets/daily_omni.py (1)
73-78: Add an extraction filter to prevent path-traversal risks (Ruff S202).

`tarfile.extractall()` without a filter can be exploited for path-traversal attacks. Python 3.12+ supports `filter='data'` to block absolute paths and parent-directory traversals. Even though the tar is downloaded from a trusted HuggingFace repository, add the filter for defense in depth.

Apply the filter parameter:

```diff
 with tarfile.open(archive_filename, "r:*") as tar:
-    # Extract all contents to the specified path
-    tar.extractall(path=self.hf_cache_dir)
+    tar.extractall(path=self.hf_cache_dir, filter="data")
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@nemo_rl/data/datasets/response_datasets/daily_omni.py` around lines 73 - 78, The tar extraction in the try block uses tarfile.extractall(path=self.hf_cache_dir) which is vulnerable to path-traversal; update the call to pass the filter parameter to block absolute and parent-directory paths (use filter='data' on Python 3.12+ or equivalent safe extraction logic) so extraction of archive_filename into self.hf_cache_dir is validated before writing (affects the block referencing archive_filename, self.hf_cache_dir and files_folder).
nemo_rl/data/multimodal_utils.py (3)
321-325: Missing docstring for `load_media_from_message`.

This function is part of the public API consumed by `llm_message_utils.py`. A Google-style docstring explaining the parameters, return value, and fallback behavior would help maintainability.

As per coding guidelines: "For interfaces that may be used outside a file, prefer docstrings over comments" and "Use Google style docstrings for classes and functions, which can be parsed by Sphinx."
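A sketch of the requested Google-style docstring on a simplified stand-in. The signature is inferred from this review; the real function also decodes the media, whereas this stub only collects references:

```python
from typing import Any, Optional


def load_media_from_message(
    message: dict,
    processor: Any = None,
    multimodal_load_kwargs: Optional[dict] = None,
) -> dict[str, list[Any]]:
    """Extract media referenced by a single chat message.

    Args:
        message: Chat message whose ``content`` list may contain
            ``image``/``audio``/``video`` items.
        processor: Optional Hugging Face processor used to derive default
            loading settings; when None, library defaults apply.
        multimodal_load_kwargs: Optional per-media-type loader kwargs,
            e.g. ``{"audio": {"sampling_rate": 16000}}``.

    Returns:
        Mapping from media type to the list of media found in the message.
    """
    media: dict[str, list[Any]] = {}
    for item in message.get("content", []):
        if isinstance(item, dict) and item.get("type") in ("image", "audio", "video"):
            media.setdefault(item["type"], []).append(item[item["type"]])
    return media
```

The Args/Returns sections are what Sphinx's Napoleon extension parses, so keeping this shape satisfies the guideline cited above.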
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@nemo_rl/data/multimodal_utils.py` around lines 321 - 325, Add a Google-style docstring to the public function load_media_from_message describing parameters, return value, and fallback behavior: document message (expected keys/structure), processor (type and when used), multimodal_load_kwargs (shape: mapping of media type to kwargs and defaults), and the returned dict[str, list[Any]] format; include behavior when media is missing or when processor is None, examples of supported media types, and any exceptions raised or swallowed. Place the docstring immediately under the def load_media_from_message(...) signature so external modules like llm_message_utils.py can rely on the documented API.
219-261: Missing docstring for a public function used outside this file; consider replacing function-attribute caching.

`get_multimodal_default_settings_from_processor` is referenced by `llm_message_utils.py` per the summary, so it should have a docstring per coding guidelines.

Also, the function-attribute caching pattern (lines 237–242, 249–254) is unconventional and not thread-safe. A module-level `functools.lru_cache` or a simple module-level variable would be more idiomatic. Additionally, the list comprehensions `[param for param in ...]` are redundant; `list(...)` suffices.

♻️ Simplify signature caching:

```diff
+import functools
+
+
+@functools.lru_cache(maxsize=1)
+def _load_video_param_names() -> list[str]:
+    return list(inspect.signature(load_video).parameters)
+
+
+@functools.lru_cache(maxsize=1)
+def _load_audio_param_names() -> list[str]:
+    return list(inspect.signature(load_audio).parameters)
+
+
 def get_multimodal_default_settings_from_processor(
     processor,
 ) -> dict[str, dict[str, Any]]:
+    """Extract default video/audio loading kwargs from a processor's sub-components."""
     ...
-    if not hasattr(
-        get_multimodal_default_settings_from_processor, "load_video_kwargs"
-    ):
-        get_multimodal_default_settings_from_processor.load_video_kwargs = [
-            param for param in inspect.signature(load_video).parameters
-        ]
     default_settings["video"] = {
         arg: video_settings_dict[arg]
-        for arg in get_multimodal_default_settings_from_processor.load_video_kwargs
+        for arg in _load_video_param_names()
         if arg in video_settings_dict
     }
```

(Apply the analogous change for `load_audio_kwargs`.)

As per coding guidelines: "For interfaces that may be used outside a file, prefer docstrings over comments."
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@nemo_rl/data/multimodal_utils.py` around lines 219 - 261, The public function get_multimodal_default_settings_from_processor lacks a docstring and uses unsafe function-attribute caching and redundant list comprehensions; add a concise docstring describing purpose, args, and return value, and replace the function-attribute caches get_multimodal_default_settings_from_processor.load_video_kwargs and .load_audio_kwargs with a module-level cache (either simple module-level variables or a `@functools.lru_cache-decorated` helper) that computes list(inspect.signature(load_video).parameters) and list(inspect.signature(load_audio).parameters) once, and update the list comprehensions to use list(...) instead of [param for param in ...]; ensure the rest of the function (video_settings_dict/feature_extractor usage and default_settings keys) remains unchanged.
315-317: Conditional expression used solely for side effects hurts readability.

The ternary `media[tag].extend(...) if isinstance(...) else media[tag].append(...)` is used as a statement for its side effects only; the return value is discarded. This is a known anti-pattern in Python. A regular `if`/`else` block is clearer here.

♻️ Replace the ternary statement with an explicit if/else:

```diff
 tag = item["type"]
 if tag in MEDIA_TAGS:
-    media[tag].extend(list(item[tag])) if isinstance(
-        item[tag], (list, tuple)
-    ) else media[tag].append(item[tag])
+    value = item[tag]
+    if isinstance(value, (list, tuple)):
+        media[tag].extend(value)
+    else:
+        media[tag].append(value)
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@nemo_rl/data/multimodal_utils.py` around lines 315 - 317, Replace the ternary used for side effects with a clear if/else: locate the statement that currently does "media[tag].extend(list(item[tag])) if isinstance(item[tag], (list, tuple)) else media[tag].append(item[tag])" and change it to an explicit if isinstance(item[tag], (list, tuple)): media[tag].extend(list(item[tag])) else: media[tag].append(item[tag]). This preserves the same behavior for media[tag].extend/append and improves readability; keep the same variable names (media, tag, item) and surrounding logic.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@nemo_rl/data/datasets/response_datasets/daily_omni.py`:
- Line 1: The file daily_omni.py has an incorrect copyright header starting with
"##" instead of the repo-standard single "#" header; open
nemo_rl/data/datasets/response_datasets/daily_omni.py and replace the top header
so it matches the other Python files (use "# Copyright (c) 2025, NVIDIA
CORPORATION. All rights reserved." with a single leading '#' and same
spacing/phrasing), ensuring the header sits at the very top of the file.
- Around line 27-32: The class docstring for DailyOmniDataset incorrectly
mentions "CLEVR-CoGenT" due to a copy-paste error; update the docstring in class
DailyOmniDataset to describe the Daily Omni dataset (e.g., replace "Simple
wrapper around the CLEVR-CoGenT dataset." with a concise description referencing
the Daily Omni dataset) and keep the Args section (split) unchanged.
- Around line 85-90: Fix the typo and enable exception chaining in the tar
handling block: change the misspelled tarfile.ReadErro to tarfile.ReadError in
the except block and re-raise the ReadError with the new message using exception
chaining (raise tarfile.ReadError("...") from e). Likewise, in the generic
except Exception as e block, re-raise the new Exception with the formatted
message using "from e" so the original traceback is preserved; locate the
tarfile handling where tarfile.ReadError and the variable e are used in
daily_omni.py.
In `@nemo_rl/data/datasets/response_datasets/general_conversations_dataset.py`:
- Around line 218-234: The return type annotation for process_message_fragment
is wrong: the function builds and returns a list (ret) of dicts, not a single
dict. Update the signature of process_message_fragment to return list[dict[str,
Any]] (or Sequence[Mapping[str, Any]] if you prefer an abstract type) and ensure
any callers expecting dict are adjusted; reference the function name
process_message_fragment and the local variable ret and the loop over
tag.split("-") to locate the code to change.
- Around line 148-153: Fix the typos in the class docstring for
GeneralConversationsJsonlDataset (subclass of RawDataset): change
"converstaions" to "conversations" and "requiement" to "requirement" so the
documentation reads correctly about jsonl datasets and media tag placement.
- Around line 107-118: The code is adding the file extension string to
tried_default_extensions and also assumes filenames always contain a '.',
causing an IndexError; update the logic in general_conversations_dataset.py
where ext is computed from metadata[ tag ][ media_index[tag] ] to safely extract
the extension (use os.path.splitext or check for '.' and handle missing
extension by setting ext to an empty string) and change the set insertion to
tried_default_extensions.add(tag) (not ext) so the guard `tag not in
tried_default_extensions` works as intended; keep the loop over
multimodal_utils.DEFAULT_MEDIA_EXTENSIONS[tag] unchanged but ensure you only add
the media type tag to tried_default_extensions when you decide to skip trying
defaults.
In `@nemo_rl/data/datasets/utils.py`:
- Around line 18-24: The function get_huggingface_cache_path should be updated
to use the public API and improve typing and error handling: replace the private
_scan_cached_repo import/use with huggingface_hub.scan_cache_dir (update the
import), add a Google-style docstring describing parameters/return and behavior,
add precise type hints for parameters and return (Python 3.12+ style), narrow
the broad except Exception to specific exceptions such as OSError and
ValueError, and guard against calling max() on an empty revs mapping by
returning an appropriate fallback or raising a clear ValueError; locate these
changes around the get_huggingface_cache_path function and any imports
referencing _scan_cached_repo.
In `@nemo_rl/data/multimodal_utils.py`:
- Around line 344-347: Replace the runtime assertion that checks for "audio" and
"sampling_rate" in multimodal_load_kwargs with explicit validation and a clear
exception: verify that "audio" is a key in multimodal_load_kwargs and that
"sampling_rate" is present in multimodal_load_kwargs["audio"], and if either
check fails raise a ValueError with a descriptive message (e.g., referencing
multimodal_load_kwargs, "audio", and "sampling_rate") instead of using assert so
the check cannot be stripped under python -O; update the location where this
change is made around the multimodal_load_kwargs handling in multimodal_utils.py
(the block containing the current assert).
- Around line 348-364: Narrow the broad except in the load_audio fallback: catch
only expected errors from load_audio (e.g., RuntimeError, FileNotFoundError,
ValueError) instead of bare Exception; either log the caught error (use the
module logger via logger.warning with the exception info) or drop the unused "as
e" binding; replace the print("audio loading failed...") with logger.warning
including the error details and context (e.g., which file/path and that we are
falling back to decord). Also ensure the call to
get_dim_to_pack_along(processor, "audio") is safe when processor is None by
adding an explicit guard or documenting that processor may be None (or
defaulting to 0) so the fallback code's slicing behavior remains clear.
---
Outside diff comments:
In `@nemo_rl/data/multimodal_utils.py`:
- Line 1: Update the copyright header at the top of the file to use the current
year 2026: replace the 2025 year in the existing NVIDIA copyright header comment
(the file-level header at the very top of nemo_rl/data/multimodal_utils.py) so
the header reads 2026 instead of 2025, ensuring it matches the project's
required header format for Python files.
In `@nemo_rl/data/utils.py`:
- Around line 221-248: The bug is that val_task_data_preprocessors is declared
once outside the val_data_paths loop and may leak preprocessors across datasets;
fix it by moving/reinitializing val_task_data_preprocessors inside the for
val_dataset_name, val_dataset_path in val_data_paths.items() loop (set to {} at
start of each iteration), then populate it only when hasattr(val_data,
"preprocessor") and val_data.preprocessor is not None before passing it to
AllTaskProcessedDataset; ensure val_task_data_processors is also built
per-iteration using load_preference_dataset and val_data.task_name so each
AllTaskProcessedDataset gets only its own processors and preprocessors.
In `@nemo_rl/models/policy/__init__.py`:
- Around line 201-208: Add a Google-style class docstring to the TokenizerConfig
TypedDict describing the purpose of the TypedDict and documenting the new audio
and video keys: state that audio and video are optional multimodal config dicts,
list valid value types (dict[str, Any] or None), describe recommended defaults
(e.g., audio: {} or None to disable, video: {} or None to disable, and any
recommended subkeys like sample_rate, channels for audio or frame_rate,
resolution for video), and briefly mention chat_template_kwargs usage; then
update the exemplar YAML files under examples/configs/*.yaml to include the
audio and video keys with the recommended default values so the examples reflect
the docstring defaults.
In `@pyproject.toml`:
- Around line 18-53: Pin the decord dependency to a specific version by changing
the "decord" entry to "decord==0.6.0" and add a short comment noting Python 3.12
has no prebuilt wheels (so building from source may require FFmpeg and a proper
toolchain), or alternatively verify/source-build in CI and document required
build steps for Python 3.12 to avoid breakage; update the dependency line for
"decord" and include the explanatory comment nearby.
---
Nitpick comments:
In `@examples/run_sft.py`:
- Around line 104-153: The hasattr(...) checks for "preprocessor" are
unnecessary because RawDataset declares preprocessor as a class attribute;
replace the two occurrences where you do hasattr(data, "preprocessor") and
hasattr(val_data, "preprocessor") with direct checks that the attribute is not
None (e.g., data.preprocessor is not None and val_data.preprocessor is not None)
so the code uses the actual None test on the preprocessor before wiring it into
val_task_data_preprocessors (references: RawDataset, preprocessor, variables
data and val_data, and val_task_data_preprocessors).
In `@nemo_rl/algorithms/utils.py`:
- Around line 323-355: The three warnings.warn calls in the function that
adjusts processor feature/video settings (the calls that override
processor.feature_extractor.sampling_rate, processor.video_processor.fps, and
processor.video_processor.num_frames) should include stacklevel=2 so the warning
points at the caller; update each warnings.warn invocation (the ones emitting
"Overriding audio sampling rate...", "Overriding video fps...", and "Overriding
video num_frames...") to pass stacklevel=2 while keeping the existing message
and variable usage (tokenizer_config checks and assignments to
processor.feature_extractor.sampling_rate, processor.video_processor.fps,
processor.video_processor.num_frames).
- Around line 345-355: Check for mutual exclusivity of fps and num_frames in
tokenizer_config["video"] before mutating processor.video_processor.num_frames:
if both "fps" and "num_frames" are present, validate they are not contradictory
and raise a clear ValueError (or choose one deterministically) instead of
letting the video processor fail later. Specifically, in the block handling
tokenizer_config["video"] (referencing tokenizer_config,
processor.video_processor.num_frames, "fps", and "num_frames"), add an upfront
check that raises a descriptive error like "Cannot set both fps and num_frames
in tokenizer_config['video']" (or compare values and only warn/override when
consistent) before assigning processor.video_processor.num_frames.
In `@nemo_rl/data/datasets/processed_dataset.py`:
- Around line 33-58: The class docstring for AllTaskProcessedDataset is missing
documentation for the constructor parameter task_data_preprocessors; update the
Args section to add a short entry describing task_data_preprocessors (type:
Optional[Union[dict[str, TaskDataPreProcessFnCallable],
TaskDataPreProcessFnCallable]], default: None), explaining it can be a single
preprocessor applied to all examples or a dict mapping task names to
task-specific preprocessors and that missing tasks fall back to default
behavior; reference the parameter name task_data_preprocessors and keep the
wording consistent with the existing entries for task_data_processors and
default_task_data_spec.
In `@nemo_rl/data/datasets/raw_dataset.py`:
- Around line 26-35: Add an explicit constructor for class RawDataset that
initializes all externally visible members (data_config, dataset, val_dataset,
processor, task_spec) and sets preprocessor to None (or a provided value) so
preprocessor is not left uninitialized; implement __init__ on RawDataset to
accept and assign these fields (or convert the class to a dataclass with
defaults) to satisfy the "initialize members in constructor" guideline.
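One way to satisfy the guideline is the dataclass variant mentioned above. Field types here are guesses, since only the member names appear in the review:

```python
from dataclasses import dataclass
from typing import Any, Callable, Optional


@dataclass
class RawDataset:
    """Sketch: every externally visible member is initialized in the
    constructor, with preprocessor defaulting to None."""

    data_config: Optional[dict[str, Any]] = None
    dataset: Any = None
    val_dataset: Any = None
    processor: Any = None
    task_spec: Any = None
    preprocessor: Optional[Callable] = None
```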
In `@nemo_rl/data/datasets/response_datasets/daily_omni.py`:
- Around line 73-78: The tar extraction in the try block uses
tarfile.extractall(path=self.hf_cache_dir) which is vulnerable to
path-traversal; update the call to pass the filter parameter to block absolute
and parent-directory paths (use filter='data' on Python 3.12+ or equivalent safe
extraction logic) so extraction of archive_filename into self.hf_cache_dir is
validated before writing (affects the block referencing archive_filename,
self.hf_cache_dir and files_folder).
In `@nemo_rl/data/datasets/response_datasets/general_conversations_dataset.py`:
- Around line 38-67: The function convert_metadata currently uses return_inplace
(inverted naming) and returns None when return_inplace=True, causing a confusing
API; rename the parameter to inplace: bool = False (or keep name but invert
semantics) and ensure convert_metadata always returns the processed dict
(variable data) even when mutating the input, so callers get the result; update
logic that chooses data = metadata or metadata.copy(), keep the mapping loops
that reference multimodal_utils.MEDIA_TAGS_TO_ALLOWED and
multimodal_utils.MEDIA_TAGS unchanged, and update any callers to the new
parameter name if renamed.
- Line 26: The file defines an unused module-level variable named _DEBUG which
is never referenced; remove the _DEBUG = True declaration from
general_conversations_dataset (delete the unused symbol) to eliminate dead code
and avoid misleading debug flags.
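The always-return contract suggested for `convert_metadata` could look like this sketch; the real tag-mapping loops are elided and replaced by a placeholder transformation:

```python
from typing import Any


def convert_metadata(metadata: dict[str, Any], inplace: bool = False) -> dict[str, Any]:
    """Sketch of the suggested API: always return the processed dict,
    whether or not the input is mutated in place."""
    data = metadata if inplace else dict(metadata)
    # ... the real tag-mapping loops over MEDIA_TAGS would go here ...
    data["converted"] = True
    return data
```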
In `@nemo_rl/data/llm_message_utils.py`:
- Around line 607-613: Replace the hardcoded if-chain that builds media_kwargs
from media_cur_message with an explicit mapping dict (e.g., MEDIA_KEY_TO_KWARG)
and iterate over its items to populate media_kwargs; locate the code block that
references media_cur_message and media_kwargs in nemo_rl.data.llm_message_utils
(the snippet that currently checks "image", "audio", "video") and change it to
consult the mapping so new modalities can be added by updating the dict rather
than editing conditional logic.
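A sketch of the mapping-driven approach; the kwarg names (`"images"`, `"videos"`) are assumptions about the processor API, not taken from the source:

```python
from typing import Any

# Hypothetical mapping mirroring the reviewer's MEDIA_KEY_TO_KWARG idea:
# adding a modality becomes a one-line dict edit instead of a new branch.
MEDIA_KEY_TO_KWARG: dict[str, str] = {
    "image": "images",
    "audio": "audio",
    "video": "videos",
}


def build_media_kwargs(media_cur_message: dict[str, Any]) -> dict[str, Any]:
    """Populate media_kwargs by consulting the mapping rather than an if-chain."""
    media_kwargs: dict[str, Any] = {}
    for media_key, kwarg_name in MEDIA_KEY_TO_KWARG.items():
        if media_cur_message.get(media_key):
            media_kwargs[kwarg_name] = media_cur_message[media_key]
    return media_kwargs
```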
In `@nemo_rl/data/multimodal_utils.py`:
- Around line 321-325: Add a Google-style docstring to the public function
load_media_from_message describing parameters, return value, and fallback
behavior: document message (expected keys/structure), processor (type and when
used), multimodal_load_kwargs (shape: mapping of media type to kwargs and
defaults), and the returned dict[str, list[Any]] format; include behavior when
media is missing or when processor is None, examples of supported media types,
and any exceptions raised or swallowed. Place the docstring immediately under
the def load_media_from_message(...) signature so external modules like
llm_message_utils.py can rely on the documented API.
- Around line 219-261: The public function
get_multimodal_default_settings_from_processor lacks a docstring and uses unsafe
function-attribute caching and redundant list comprehensions; add a concise
docstring describing purpose, args, and return value, and replace the
function-attribute caches
get_multimodal_default_settings_from_processor.load_video_kwargs and
.load_audio_kwargs with a module-level cache (either simple module-level
variables or a `@functools.lru_cache-decorated` helper) that computes
list(inspect.signature(load_video).parameters) and
list(inspect.signature(load_audio).parameters) once, and update the list
comprehensions to use list(...) instead of [param for param in ...]; ensure the
rest of the function (video_settings_dict/feature_extractor usage and
default_settings keys) remains unchanged.
- Around line 315-317: Replace the ternary used for side effects with a clear
if/else: locate the statement that currently does
"media[tag].extend(list(item[tag])) if isinstance(item[tag], (list, tuple)) else
media[tag].append(item[tag])" and change it to an explicit if
isinstance(item[tag], (list, tuple)): media[tag].extend(list(item[tag])) else:
media[tag].append(item[tag]). This preserves the same behavior for
media[tag].extend/append and improves readability; keep the same variable names
(media, tag, item) and surrounding logic.
ℹ️ Review info
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
⛔ Files ignored due to path filters (1)
`uv.lock` is excluded by `!**/*.lock`
📒 Files selected for processing (20)
- .github/workflows/cicd-main.yml
- examples/configs/recipes/llm/performance/.grpo-deepseek-v3-32n4g.yaml.swp
- examples/configs/recipes/llm/performance/.grpo-deepseek-v3-32n8g.yaml.swp
- examples/configs/sft_avlm.yaml
- examples/run_sft.py
- nemo_rl/algorithms/utils.py
- nemo_rl/data/datasets/processed_dataset.py
- nemo_rl/data/datasets/raw_dataset.py
- nemo_rl/data/datasets/response_datasets/__init__.py
- nemo_rl/data/datasets/response_datasets/daily_omni.py
- nemo_rl/data/datasets/response_datasets/general_conversations_dataset.py
- nemo_rl/data/datasets/utils.py
- nemo_rl/data/interfaces.py
- nemo_rl/data/llm_message_utils.py
- nemo_rl/data/multimodal_utils.py
- nemo_rl/data/utils.py
- nemo_rl/models/policy/__init__.py
- pyproject.toml
- tests/unit/data/datasets/test_general_conversations_dataset.py
- tests/unit/data/datasets/test_response_dataset.py
```diff
@@ -0,0 +1,140 @@
+## Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved.
```
Copyright header starts with ## instead of #.
All other files in the repo use # Copyright (c) 2025, NVIDIA CORPORATION. — this file has ## Copyright, which is inconsistent. As per coding guidelines, the NVIDIA copyright header should be added to all Python source files under nemo_rl/.
Proposed fix
```diff
-## Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved.
+# Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved.
```

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```suggestion
# Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved.
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@nemo_rl/data/datasets/response_datasets/daily_omni.py` at line 1, The file
daily_omni.py has an incorrect copyright header starting with "##" instead of
the repo-standard single "#" header; open
nemo_rl/data/datasets/response_datasets/daily_omni.py and replace the top header
so it matches the other Python files (use "# Copyright (c) 2025, NVIDIA
CORPORATION. All rights reserved." with a single leading '#' and same
spacing/phrasing), ensuring the header sits at the very top of the file.
```python
class DailyOmniDataset(RawDataset):
    """Simple wrapper around the CLEVR-CoGenT dataset.

    Args:
        split: Split name for the dataset, default is "train"
    """
```
Docstring refers to "CLEVR-CoGenT" — this is the Daily Omni dataset.
Copy-paste artifact from another dataset class.
Proposed fix
```diff
 class DailyOmniDataset(RawDataset):
-    """Simple wrapper around the CLEVR-CoGenT dataset.
+    """Dataset wrapper for the Daily-Omni public video benchmarking dataset.

     Args:
         split: Split name for the dataset, default is "train"
```

📝 Committable suggestion
```suggestion
class DailyOmniDataset(RawDataset):
    """Dataset wrapper for the Daily-Omni public video benchmarking dataset.

    Args:
        split: Split name for the dataset, default is "train"
    """
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@nemo_rl/data/datasets/response_datasets/daily_omni.py` around lines 27 - 32,
The class docstring for DailyOmniDataset incorrectly mentions "CLEVR-CoGenT" due
to a copy-paste error; update the docstring in class DailyOmniDataset to
describe the Daily Omni dataset (e.g., replace "Simple wrapper around the
CLEVR-CoGenT dataset." with a concise description referencing the Daily Omni
dataset) and keep the Args section (split) unchanged.
```python
        except tarfile.ReadError:
            raise tarfile.ReadErro(
                "Error: Could not read the tar file. It might be corrupted or not a tar file."
            )
        except Exception as e:
            raise Exception(f"An unexpected error occurred: {e}")
```
Critical typo: tarfile.ReadErro → tarfile.ReadError — will raise NameError at runtime.
Line 86 has a typo that turns this error-handling path into an unhandled NameError. Additionally, both except blocks should chain the original exception using from.
🐛 Proposed fix

```diff
     except tarfile.ReadError:
-        raise tarfile.ReadErro(
+        raise tarfile.ReadError(
             "Error: Could not read the tar file. It might be corrupted or not a tar file."
-        )
-    except Exception as e:
-        raise Exception(f"An unexpected error occurred: {e}")
+        ) from None
+    except Exception as e:
+        raise RuntimeError(f"An unexpected error occurred: {e}") from e
```

🧰 Tools
🪛 Ruff (0.15.2)
[warning] 86-88: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling
(B904)
[warning] 86-88: Avoid specifying long messages outside the exception class
(TRY003)
[warning] 89-89: Do not catch blind exception: Exception
(BLE001)
[warning] 90-90: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling
(B904)
[warning] 90-90: Create your own exception
(TRY002)
[warning] 90-90: Avoid specifying long messages outside the exception class
(TRY003)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@nemo_rl/data/datasets/response_datasets/daily_omni.py` around lines 85 - 90,
Fix the typo and enable exception chaining in the tar handling block: change the
misspelled tarfile.ReadErro to tarfile.ReadError in the except block and
re-raise the ReadError with the new message using exception chaining (raise
tarfile.ReadError("...") from e). Likewise, in the generic except Exception as e
block, re-raise the new Exception with the formatted message using "from e" so
the original traceback is preserved; locate the tarfile handling where
tarfile.ReadError and the variable e are used in daily_omni.py.
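The chaining pattern that the Ruff B904 finding asks for can be shown in isolation (the function name here is illustrative):

```python
import tarfile


def read_archive(path: str) -> None:
    """Sketch of the B904 fix: re-raise with explicit chaining so the
    original traceback is preserved (from err) or suppressed (from None)."""
    try:
        tarfile.open(path).close()
    except tarfile.ReadError as err:
        raise tarfile.ReadError(
            "Could not read the tar file. It might be corrupted or not a tar file."
        ) from err
```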
```python
                ext = os.path.basename(metadata[tag][media_index[tag]]).split(".", 1)[1]
                if (
                    raw
                    and ext not in raw
                    and tag not in tried_default_extensions
                    and tag in multimodal_utils.DEFAULT_MEDIA_EXTENSIONS
                ):
                    # try the default extension
                    for ext in multimodal_utils.DEFAULT_MEDIA_EXTENSIONS[tag]:
                        if ext in raw:
                            tried_default_extensions.add(ext)
                            break
```
Bug: tried_default_extensions.add(ext) should be tried_default_extensions.add(tag) — the guard on line 111 checks tag, not ext.
Line 111 checks tag not in tried_default_extensions to avoid retrying default extensions for the same media type. But line 117 adds ext (the file extension string, e.g. "mp4") instead of tag (the media type, e.g. "video"). Since these are different values, the guard will never match, and default extension resolution will be retried on every occurrence.
Also, line 107 will raise IndexError if the media filename contains no . (e.g., a path or identifier with no extension).
🐛 Proposed fix

```diff
-            ext = os.path.basename(metadata[tag][media_index[tag]]).split(".", 1)[1]
+            basename = os.path.basename(metadata[tag][media_index[tag]])
+            ext = basename.split(".", 1)[1] if "." in basename else ""
             if (
                 raw
                 and ext not in raw
                 and tag not in tried_default_extensions
                 and tag in multimodal_utils.DEFAULT_MEDIA_EXTENSIONS
             ):
                 # try the default extension
                 for ext in multimodal_utils.DEFAULT_MEDIA_EXTENSIONS[tag]:
                     if ext in raw:
-                        tried_default_extensions.add(ext)
+                        tried_default_extensions.add(tag)
                         break
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@nemo_rl/data/datasets/response_datasets/general_conversations_dataset.py`
around lines 107 - 118, The code is adding the file extension string to
tried_default_extensions and also assumes filenames always contain a '.',
causing an IndexError; update the logic in general_conversations_dataset.py
where ext is computed from metadata[ tag ][ media_index[tag] ] to safely extract
the extension (use os.path.splitext or check for '.' and handle missing
extension by setting ext to an empty string) and change the set insertion to
tried_default_extensions.add(tag) (not ext) so the guard `tag not in
tried_default_extensions` works as intended; keep the loop over
multimodal_utils.DEFAULT_MEDIA_EXTENSIONS[tag] unchanged but ensure you only add
the media type tag to tried_default_extensions when you decide to skip trying
defaults.
This is a false alarm. add(ext) is correct since we need to check the extension tried, not tag.
@yuanhangsu1986, thank you for the clarification. I understand you want to track extensions rather than tags. However, I notice there's a mismatch in the current implementation:
- Line 111 checks: `tag not in tried_default_extensions` (checking whether the media type tag like `"video"` is in the set)
- Line 117 adds: `tried_default_extensions.add(ext)` (adding the extension string like `"mp4"`)

Since `tag` (e.g., `"video"`) and `ext` (e.g., `"mp4"`) are different values, the guard on line 111 will always pass even after adding `ext` to the set, which means default extensions will be retried on every occurrence of the same tag.
If the intention is to track which extensions have been tried (as you mentioned), would it make sense to also update line 111 to check ext not in tried_default_extensions instead of tag not in tried_default_extensions? This would make the guard consistent with what's being added to the set.
Let me know if there's a specific behavior I'm missing!
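Independent of the guard question discussed above, the IndexError risk from `split(".", 1)[1]` can be removed with `os.path.splitext`, e.g. (the helper name is hypothetical):

```python
import os


def media_extension(path: str) -> str:
    """Safely pull an extension from a media path; empty string when absent.

    Avoids the IndexError that split(".", 1)[1] raises on extensionless names.
    Note splitext returns only the last suffix, unlike the original split.
    """
    ext = os.path.splitext(os.path.basename(path))[1]
    return ext[1:] if ext else ""
```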
```python
class GeneralConversationsJsonlDataset(RawDataset):
    """Loads general conversation datasets that have the json (manifest) files and media files in separate files (jsonl datasets).

    Each sample can be single/multi-turn converstaions with multiple modalities.
    Each modality can have one or more number of media objects.
    There is no requiement of where the media tag (e.g. '<sound>') should appear in the conversations.
```
Docstring typos: "converstaions" → "conversations", "requiement" → "requirement".
Proposed fix

```diff
-    Each sample can be single/multi-turn converstaions with multiple modalities.
+    Each sample can be single/multi-turn conversations with multiple modalities.
     Each modality can have one or more number of media objects.
-    There is no requiement of where the media tag (e.g. '<sound>') should appear in the conversations.
+    There is no requirement of where the media tag (e.g. '<sound>') should appear in the conversations.
```
+ There is no requirement of where the media tag (e.g. '<sound>') should appear in the conversations.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@nemo_rl/data/datasets/response_datasets/general_conversations_dataset.py`
around lines 148 - 153, Fix the typos in the class docstring for
GeneralConversationsJsonlDataset (subclass of RawDataset): change
"converstaions" to "conversations" and "requiement" to "requirement" so the
documentation reads correctly about jsonl datasets and media tag placement.
```python
    @classmethod
    def process_message_fragment(
        cls, tag: str, fragment: Any, media_directory: Optional[str] = None
    ) -> dict[str, Any]:
        if (
            media_directory is not None
            and tag in multimodal_utils.MEDIA_TAGS
            and isinstance(fragment, str)
            and not os.path.isfile(fragment)
        ):
            media_path = os.path.join(media_directory, fragment)
            if os.path.isfile(media_path):
                fragment = media_path
        ret = []
        for t in tag.split("-"):
            ret.append({"type": t, t: fragment})
        return ret
```
Return type annotation is incorrect: returns list[dict[str, Any]], not dict[str, Any].
process_message_fragment builds and returns ret (a list), but the annotation says -> dict[str, Any].
Proposed fix

```diff
 @classmethod
 def process_message_fragment(
     cls, tag: str, fragment: Any, media_directory: Optional[str] = None
-) -> dict[str, Any]:
+) -> list[dict[str, Any]]:
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@nemo_rl/data/datasets/response_datasets/general_conversations_dataset.py`
around lines 218 - 234, The return type annotation for process_message_fragment
is wrong: the function builds and returns a list (ret) of dicts, not a single
dict. Update the signature of process_message_fragment to return list[dict[str,
Any]] (or Sequence[Mapping[str, Any]] if you prefer an abstract type) and ensure
any callers expecting dict are adjusted; reference the function name
process_message_fragment and the local variable ret and the loop over
tag.split("-") to locate the code to change.
```python
from pathlib import Path
from typing import Optional, Union

import torch
from datasets import DatasetDict, load_dataset, load_from_disk
from huggingface_hub.utils._cache_manager import _scan_cached_repo
from PIL import Image
```
🧩 Analysis chain
🏁 Script executed:

```shell
# First, locate and read the file in question
find . -type f -name "utils.py" -path "*/nemo_rl/data/datasets/*" | head -5
```

Repository: NVIDIA-NeMo/RL
Length of output: 90
🏁 Script executed:

```shell
# Read the utils.py file to see the imports and get_huggingface_cache_path function
cat -n nemo_rl/data/datasets/utils.py | sed -n '1,50p'
```

Repository: NVIDIA-NeMo/RL
Length of output: 2233
🏁 Script executed:

```shell
# Read the function in question (lines 148-176)
cat -n nemo_rl/data/datasets/utils.py | sed -n '140,180p'
```

Repository: NVIDIA-NeMo/RL
Length of output: 1918
🌐 Web query:
huggingface_hub public API cache inspection alternatives to _scan_cached_repo
💡 Result:
_scan_cached_repo is a private helper. The public, supported ways to inspect the Hugging Face Hub cache are:

1) Scan the whole cache (recommended): `scan_cache_dir()`
Use the public API that returns a structured HFCacheInfo report (repos, revisions, sizes, timestamps, paths). (huggingface.co)

```python
from huggingface_hub import scan_cache_dir

info = scan_cache_dir()  # or scan_cache_dir(cache_dir="...")
# list cached repos
for repo in sorted(info.repos, key=lambda r: r.size_on_disk, reverse=True):
    print(repo.repo_id, repo.repo_type, repo.size_on_disk, repo.repo_path)
# find one repo
repo = next(r for r in info.repos if r.repo_id == "google/fleurs" and r.repo_type == "dataset")
print(repo.revisions)  # commit hashes + snapshot paths + files
```

2) Check whether a specific file is already cached: `try_to_load_from_cache()`
If you don't need a full scan, this is the public "is it cached, and where?" helper. (huggingface.co)

```python
from huggingface_hub import try_to_load_from_cache

path = try_to_load_from_cache("bert-base-uncased", "config.json")
print(path)  # local path or None (or special sentinel)
```

3) CLI inspection (same cache system)
- `huggingface-cli scan-cache` prints a cache report (repo id/type, disk usage, refs, local path). (huggingface.co)
- Newer CLI also documents `hf cache` commands (e.g., list/remove) depending on your installed CLI version. (hugging-face.cn)
4) If you also want cleanup after inspection
scan_cache_dir() returns an HFCacheInfo that can prepare a deletion plan via delete_revisions(...) (dry-run style via the returned strategy, then execute()). (huggingface.co)
Citations:
- 1: https://huggingface.co/docs/huggingface_hub/en/package_reference/cache?utm_source=openai
- 2: https://huggingface.co/docs/huggingface_hub/package_reference/cache?utm_source=openai
- 3: https://huggingface.co/docs/huggingface_hub/v0.31.0/en/guides/manage-cache?utm_source=openai
- 4: https://hugging-face.cn/docs/huggingface_hub/guides/manage-cache?utm_source=openai
- 5: https://huggingface.co/docs/huggingface_hub/en/package_reference/cache?utm_source=openai
🌐 Web query:
huggingface_hub CachedRepoInfo cache directory public API
💡 Result:
In huggingface_hub (Python), CachedRepoInfo is part of the public cache-scanning API:

- Use `huggingface_hub.scan_cache_dir(cache_dir=...)` to scan a cache root and get an `HFCacheInfo`, whose `.repos` contains `CachedRepoInfo` objects (with fields like `repo_path`, `size_on_disk`, `revisions`, etc.). [1]

To get / control the Hub cache directory (the "cache root" that scan_cache_dir() defaults to):

- Configure via environment variables: `HF_HOME` (root; default `~/.cache/huggingface`) and `HF_HUB_CACHE` (Hub repos cache; default `$HF_HOME/hub`, i.e. typically `~/.cache/huggingface/hub`). [2]
- Programmatic (public) access to the resolved default path: `from huggingface_hub.constants import HF_HUB_CACHE` (the effective Hub cache dir, honoring env overrides). [3]

Example

```python
from huggingface_hub import scan_cache_dir
from huggingface_hub.constants import HF_HUB_CACHE

info = scan_cache_dir()  # defaults to HF_HUB_CACHE
print(HF_HUB_CACHE)  # cache root (e.g. ~/.cache/huggingface/hub)
for repo in info.repos:
    print(repo.repo_id, repo.repo_path)
```

Sources: [1] [2] [3]
Add docstring, use public HF API, and narrow exception handling in get_huggingface_cache_path.
This function lacks a Google-style docstring and uses the private _scan_cached_repo API. Replace it with the public scan_cache_dir() from huggingface_hub, which is the supported, documented way to inspect the local cache. Additionally, the broad Exception catch on line 174 masks real errors; narrow it to specific exceptions (e.g., OSError, ValueError). Also add type hints to match Python 3.12+ best practices, and guard against max() on an empty sequence if the revs dict is empty after filtering.
Refactor sketch

```diff
+from huggingface_hub import scan_cache_dir
+
-def get_huggingface_cache_path(repo_id, branch="main", repo_type="datasets"):
+def get_huggingface_cache_path(
+    repo_id: str, branch: str | None = "main", repo_type: str = "datasets"
+) -> str | None:
+    """Return the latest cached snapshot path for a HF repo if present.
+
+    Args:
+        repo_id: Repository ID in "org/name" form.
+        branch: Optional branch/tag name to filter revisions.
+        repo_type: Repository type ("datasets", "models", etc.).
+
+    Returns:
+        Path to the snapshot directory if cached, else None.
+    """
     cache_path = None
     try:
-        # ... existing cache detection logic ...
-        hf_cache_info = _scan_cached_repo(repo_path=repo_path)
+        hf_cache_info = scan_cache_dir()
+        repo = next((r for r in hf_cache_info.repos
+                     if r.repo_id == repo_id and r.repo_type == repo_type), None)
+        if not repo:
+            return None
+        revs = {r.refs: r for r in repo.revisions}
-        revs = {r.refs: r for r in hf_cache_info.revisions}
         if branch is not None:
             revs = {refs: r for refs, r in revs.items() if branch in refs}
+        if not revs:
+            return None
         rev2keep = max(revs.values(), key=lambda r: r.last_modified)
         return str(rev2keep.snapshot_path)
-    except Exception as e:
+    except (OSError, ValueError) as e:
         print(f"{type(e)}: {e}")
         return None
```

🤖 Prompt for AI Agents
return None🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@nemo_rl/data/datasets/utils.py` around lines 18 - 24, The function
get_huggingface_cache_path should be updated to use the public API and improve
typing and error handling: replace the private _scan_cached_repo import/use with
huggingface_hub.scan_cache_dir (update the import), add a Google-style docstring
describing parameters/return and behavior, add precise type hints for parameters
and return (Python 3.12+ style), narrow the broad except Exception to specific
exceptions such as OSError and ValueError, and guard against calling max() on an
empty revs mapping by returning an appropriate fallback or raising a clear
ValueError; locate these changes around the get_huggingface_cache_path function
and any imports referencing _scan_cached_repo.
nemo_rl/data/multimodal_utils.py
```python
        assert (
            "audio" in multimodal_load_kwargs
            and "sampling_rate" in multimodal_load_kwargs["audio"]
        )
```
assert used for runtime input validation — will silently pass under python -O.
Assertions are stripped when Python runs with optimizations enabled (-O / -OO). If sampling_rate is genuinely required, raise a ValueError instead.
Proposed fix

```diff
-    assert (
-        "audio" in multimodal_load_kwargs
-        and "sampling_rate" in multimodal_load_kwargs["audio"]
-    )
+    if (
+        "audio" not in multimodal_load_kwargs
+        or "sampling_rate" not in multimodal_load_kwargs.get("audio", {})
+    ):
+        raise ValueError(
+            "multimodal_load_kwargs must include 'audio' with a 'sampling_rate' "
+            "key to load audio from file path."
+        )
```

📝 Committable suggestion
```suggestion
if (
    "audio" not in multimodal_load_kwargs
    or "sampling_rate" not in multimodal_load_kwargs.get("audio", {})
):
    raise ValueError(
        "multimodal_load_kwargs must include 'audio' with a 'sampling_rate' "
        "key to load audio from file path."
    )
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@nemo_rl/data/multimodal_utils.py` around lines 344 - 347, Replace the runtime
assertion that checks for "audio" and "sampling_rate" in multimodal_load_kwargs
with explicit validation and a clear exception: verify that "audio" is a key in
multimodal_load_kwargs and that "sampling_rate" is present in
multimodal_load_kwargs["audio"], and if either check fails raise a ValueError
with a descriptive message (e.g., referencing multimodal_load_kwargs, "audio",
and "sampling_rate") instead of using assert so the check cannot be stripped
under python -O; update the location where this change is made around the
multimodal_load_kwargs handling in multimodal_utils.py (the block containing the
current assert).
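Extracted as a standalone sketch of the assert-to-ValueError change (the helper name is hypothetical):

```python
from typing import Any


def require_audio_sampling_rate(multimodal_load_kwargs: dict[str, Any]) -> None:
    """Explicit validation that survives `python -O`, per the review note."""
    if (
        "audio" not in multimodal_load_kwargs
        or "sampling_rate" not in multimodal_load_kwargs.get("audio", {})
    ):
        raise ValueError(
            "multimodal_load_kwargs must include 'audio' with a 'sampling_rate' key"
        )
```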
```python
                try:
                    loaded_media["audio"].append(
                        load_audio(aud, **multimodal_load_kwargs["audio"])
                    )
                except Exception as e:
                    print("audio loading failed. Fall back to decord.")
                    # use decord
                    loaded_audio = decord.AudioReader(
                        aud,
                        sample_rate=multimodal_load_kwargs["audio"]["sampling_rate"],
                        mono=True,
                    )
                    loaded_media["audio"].append(
                        loaded_audio[:].asnumpy()[
                            get_dim_to_pack_along(processor, "audio")
                        ]
                    )
```
Bare Exception catch, unused variable, and print instead of logging.
Three issues in this block, two confirmed by static analysis (Ruff BLE001, F841):
- Bare `Exception` catches everything and masks bugs (in Python 3, `KeyboardInterrupt` does not inherit from `Exception`, so that much is fine, but `SystemExit` edge cases aside, a blanket catch still hides real failures). Narrow to expected failure types (e.g., `RuntimeError`, `FileNotFoundError`, `ValueError`).
- Unused `e`: either log it or drop `as e`.
- `print(...)` in library code: use `logging.warning(...)` (or `logger.warning`) so callers can control log levels.
Additionally, on line 362, get_dim_to_pack_along(processor, "audio") is called, but processor can be None (parameter default on line 323). While get_dim_to_pack_along would not crash on None (it returns 0), it's semantically unclear — consider guarding or documenting.
Proposed fix

```diff
+import logging
+
+logger = logging.getLogger(__name__)
+
 ...
-            except Exception as e:
-                print("audio loading failed. Fall back to decord.")
+            except (RuntimeError, FileNotFoundError, OSError) as e:
+                logger.warning("Audio loading via transformers failed (%s). Falling back to decord.", e)
```

📝 Committable suggestion
```suggestion
try:
    loaded_media["audio"].append(
        load_audio(aud, **multimodal_load_kwargs["audio"])
    )
except (RuntimeError, FileNotFoundError, OSError) as e:
    logger.warning("Audio loading via transformers failed (%s). Falling back to decord.", e)
    # use decord
    loaded_audio = decord.AudioReader(
        aud,
        sample_rate=multimodal_load_kwargs["audio"]["sampling_rate"],
        mono=True,
    )
    loaded_media["audio"].append(
        loaded_audio[:].asnumpy()[
            get_dim_to_pack_along(processor, "audio")
        ]
    )
```
🧰 Tools
🪛 Ruff (0.15.2)
[warning] 352-352: Do not catch blind exception: Exception
(BLE001)
[error] 352-352: Local variable e is assigned to but never used
Remove assignment to unused variable e
(F841)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@nemo_rl/data/multimodal_utils.py` around lines 348 - 364, Narrow the broad
except in the load_audio fallback: catch only expected errors from load_audio
(e.g., RuntimeError, FileNotFoundError, ValueError) instead of bare Exception;
either log the caught error (use the module logger via logger.warning with the
exception info) or drop the unused "as e" binding; replace the print("audio
loading failed...") with logger.warning including the error details and context
(e.g., which file/path and that we are falling back to decord). Also ensure the
call to get_dim_to_pack_along(processor, "audio") is safe when processor is None
by adding an explicit guard or documenting that processor may be None (or
defaulting to 0) so the fallback code's slicing behavior remains clear.
Signed-off-by: Yuki Huang <yukih@nvidia.com>
What does this PR do ?
Add video and audio dataloading support for HF models
Usage
Additional Information
Major changes:
This is a resubmission of the following PR:
#1639
Dependencies
Summary by CodeRabbit
Release Notes
New Features
Tests
Chores