Conversation
@vmendelev vmendelev commented Dec 19, 2025

This model class is needed to parse the OpenAI API server output, extract non-text (in this case, audio) data, and save it into separate files so that output.jsonl stays small.

Summary by CodeRabbit

Release Notes

  • New Features

    • Audio response extraction with automatic WAV file storage capability added
    • New multimodal model variant now available
  • Improvements

    • Model configuration now supports data directory and output directory parameters for enhanced workflow management


coderabbitai bot commented Dec 19, 2025

📝 Walkthrough

The PR extends the inference system to support audio responses. It propagates data_dir and output_dir parameters through model initialization paths and introduces a new VLLMMultimodalModel class that processes and persists audio data from chat completion responses.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| **Model initialization parameter propagation**<br>`nemo_skills/inference/generate.py`, `nemo_skills/inference/model/base.py` | Extract `data_dir` and `output_dir` from config and propagate them through `get_code_execution_model()`, `get_tool_calling_model()`, and `get_model()`. Add corresponding instance attributes to `BaseModel.__init__` with defaults. |
| **Multimodal model support**<br>`nemo_skills/inference/model/vllm_multimodal.py` | Introduce `VLLMMultimodalModel`, a subclass of `VLLMModel`, with methods to parse chat completion responses, extract audio data, decode base64 audio, and save it as WAV files to the output directory. |
| **Model registry update**<br>`nemo_skills/inference/model/__init__.py` | Import and register `VLLMMultimodalModel` in the models mapping so `get_model()` supports the new server type. |
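The audio-persistence flow described in the table can be sketched as a standalone helper. This is an illustrative sketch only: the function name, parameters, and metadata shape are assumptions, and the actual PR implements this inside `VLLMMultimodalModel._parse_chat_completion_response` rather than as a free function.

```python
import base64
import os


def save_audio_response(response_id, audio_base64, output_dir, audio_format="wav"):
    """Decode base64 audio from a chat completion response and persist it
    under {output_dir}/audio/, returning small metadata that can replace
    the bulky base64 payload in the serialized output.

    Hypothetical helper mirroring the flow described above; not the real
    nemo_skills API.
    """
    audio_dir = os.path.join(output_dir, "audio")
    os.makedirs(audio_dir, exist_ok=True)  # safe if the directory already exists
    filepath = os.path.join(audio_dir, f"{response_id}.{audio_format}")
    audio_bytes = base64.b64decode(audio_base64)
    with open(filepath, "wb") as f:
        f.write(audio_bytes)
    # Return a pointer to the file instead of the base64 blob,
    # keeping output.jsonl compact.
    return {"audio_path": filepath, "format": audio_format}
```

Using the response ID as the filename (as the PR does) keeps one file per completion and makes the mapping from output.jsonl records to audio files trivial.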

Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant gen as generate.py
    participant model as BaseModel
    participant mm as VLLMMultimodalModel
    participant api as vLLM API
    participant fs as File System

    gen->>gen: Extract data_dir, output_dir<br/>from eval_config
    gen->>mm: Instantiate with data_dir,<br/>output_dir
    mm->>mm: Create output_dir/audio
    mm->>api: Get chat completion
    api-->>mm: Return response with audio
    mm->>mm: _parse_chat_completion_response()
    mm->>mm: Call base _parse_chat_completion_response()
    mm->>mm: Check for audio in response
    alt Audio present
        mm->>mm: _process_audio_response()
        mm->>mm: Decode base64 audio
        mm->>fs: Save as WAV file
        fs-->>mm: File saved
        mm->>mm: Return audio metadata
    else No audio or no output_dir
        mm->>mm: Include base64 in metadata
    end
    mm-->>gen: Return parsed response<br/>with audio dict
```

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

  • Parameter propagation consistency: Verify data_dir and output_dir are correctly threaded through all three model instantiation paths (get_code_execution_model, get_tool_calling_model, get_model)
  • Audio processing logic: Review base64 decoding, WAV file writing, error handling, and graceful fallback when save fails
  • Registry integration: Confirm VLLMMultimodalModel is properly imported and registered, and that existing model instantiation logic is preserved

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)
| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 40.00%, which is below the required threshold of 80.00%. | Run `@coderabbitai generate docstrings` to improve docstring coverage. |
✅ Passed checks (2 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped: CodeRabbit's high-level summary is enabled. |
| Title check | ✅ Passed | The title accurately and concisely summarizes the main change: introducing a vLLM multimodal model whose primary purpose is saving multimodal outputs, which aligns with the PR objectives of extracting and saving audio data separately. |


@coderabbitai coderabbitai bot left a comment
Actionable comments posted: 0

🧹 Nitpick comments (2)
nemo_skills/inference/model/vllm_multimodal.py (1)

52-82: Consider using dynamic file extension based on format.

The filename is hardcoded as {response_id}.wav (line 67), but the format is extracted from audio_data.format (line 55) which may not always be "wav". This creates a mismatch between the file extension and the actual audio format.

🔎 Proposed fix

```diff
         if self.output_audio_dir:
             try:
                 audio_bytes = base64.b64decode(audio_base64)
-                filename = f"{response_id}.wav"
+                audio_format = audio_info["format"]
+                filename = f"{response_id}.{audio_format}"
                 filepath = os.path.join(self.output_audio_dir, filename)
```
nemo_skills/inference/generate.py (1)

402-406: Simplify the None check.

The condition on line 403 uses isinstance(self.cfg.eval_config.get("data_dir"), type(None)), which is verbose and unconventional. Consider using a more idiomatic approach.

🔎 Proposed refactor

```diff
         self.data_dir = None
-        if "data_dir" in self.cfg.eval_config and not isinstance(self.cfg.eval_config.get("data_dir"), type(None)):
+        if self.cfg.eval_config.get("data_dir") is not None:
             self.data_dir = self.cfg.eval_config["data_dir"]
```
📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 7205c43 and 246fa01.

📒 Files selected for processing (4)
  • nemo_skills/inference/generate.py (2 hunks)
  • nemo_skills/inference/model/__init__.py (2 hunks)
  • nemo_skills/inference/model/base.py (1 hunks)
  • nemo_skills/inference/model/vllm_multimodal.py (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (3)
nemo_skills/inference/generate.py (1)
nemo_skills/inference/model/__init__.py (2)
  • get_code_execution_model (85-91)
  • get_model (62-82)
nemo_skills/inference/model/vllm_multimodal.py (3)
nemo_skills/utils.py (1)
  • get_logger_name (39-43)
nemo_skills/inference/model/vllm.py (1)
  • VLLMModel (27-148)
nemo_skills/inference/model/base.py (1)
  • _parse_chat_completion_response (356-391)
nemo_skills/inference/model/__init__.py (1)
nemo_skills/inference/model/vllm_multimodal.py (1)
  • VLLMMultimodalModel (26-82)
🪛 Ruff (0.14.8)
nemo_skills/inference/model/vllm_multimodal.py

76-76: Do not catch blind exception: Exception

(BLE001)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
  • GitHub Check: pre-commit
  • GitHub Check: unit-tests
🔇 Additional comments (6)
nemo_skills/inference/model/vllm_multimodal.py (3)

34-40: LGTM!

The initialization logic correctly creates the audio output directory when output_dir is provided. The use of exist_ok=True ensures safe directory creation.


76-78: Broad exception handling is appropriate here.

While static analysis flags the broad Exception catch, this is intentional to ensure graceful degradation. The code logs a warning and falls back to including base64 data, which preserves functionality when file operations fail. This pattern is acceptable for non-critical operations with a safe fallback.


42-50: Add defensive programming for response.id access.

The code assumes response.id exists without checking. While vLLM is OpenAI-compatible and should provide this field as standard, defensive programming would prevent failures if the field is missing in edge cases or non-standard configurations. Consider using getattr(response, 'id', fallback_value) or add error handling to gracefully handle cases where response.id is unavailable.
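A minimal sketch of the defensive access suggested here. The helper name and the uuid-based fallback are assumptions for illustration, not part of the PR:

```python
import uuid


def get_response_id(response):
    """Return response.id when present, otherwise a generated fallback,
    so downstream filenames never fail on a missing field.

    Illustrative helper; the fallback naming scheme is an assumption.
    """
    response_id = getattr(response, "id", None)
    if not response_id:
        # Generate a unique stand-in so the audio file can still be saved.
        response_id = f"resp-{uuid.uuid4().hex}"
    return response_id
```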

nemo_skills/inference/model/base.py (1)

78-85: LGTM!

The addition of data_dir and output_dir parameters is clean and follows Python conventions. The defaults ensure backward compatibility, and the type hints are clear.

nemo_skills/inference/model/__init__.py (1)

42-42: LGTM!

The import and registration of VLLMMultimodalModel follow the established pattern for other model types. This enables instantiation via server_type="vllm_multimodal" through the standard model loading pathway.

Also applies to: 55-55
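The registry pattern this comment refers to can be sketched as follows. The model classes here are stubs standing in for the real nemo_skills classes, so the snippet only illustrates the lookup mechanism, not the actual constructors:

```python
class VLLMModel:
    """Stub standing in for nemo_skills' VLLMModel."""

    def __init__(self, **kwargs):
        self.kwargs = kwargs


class VLLMMultimodalModel(VLLMModel):
    """Stub standing in for the new multimodal subclass."""


# Mapping of server_type strings to model classes, mirroring the
# registration pattern in nemo_skills/inference/model/__init__.py.
models = {
    "vllm": VLLMModel,
    "vllm_multimodal": VLLMMultimodalModel,  # new entry added by this PR
}


def get_model(server_type, **kwargs):
    """Look up the class for a server type and instantiate it."""
    return models[server_type](**kwargs)
```

With this mapping in place, passing `server_type="vllm_multimodal"` routes instantiation to the new subclass while every existing server type is untouched.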

nemo_skills/inference/generate.py (1)

408-430: LGTM!

The propagation of data_dir and output_dir to all model creation paths is consistent and correct. The pattern self.data_dir or "" ensures the default empty string is used when data_dir is None, aligning with the BaseModel signature.

@vmendelev vmendelev force-pushed the vmendelev/2512_vllm_multimodal branch from 246fa01 to 7bd17f2 on December 21, 2025 at 09:35
@karpnv karpnv left a comment
LGTM


greptile-apps bot commented Jan 6, 2026

Greptile Summary

Added a new VLLMMultimodalModel class that extends VLLMModel to handle multimodal outputs, specifically audio responses from OpenAI API servers. The implementation extracts audio data from responses, saves it to separate WAV files, and replaces base64 data with file paths to keep output.jsonl small.

Key changes:

  • Added data_dir and output_dir parameters to BaseModel constructor and propagated them through model initialization
  • Created VLLMMultimodalModel that overrides _parse_chat_completion_response() to extract and save audio responses
  • Implemented extraction of debug_info from response content using regex pattern matching
  • Audio files are saved to {output_dir}/audio/ directory with response ID as filename
  • Strips audio base64 data and debug_info from serialized output to reduce file size
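The stripping step in the last bullet can be sketched roughly as below. The record field names (`audio`, `data`, `path`) are assumptions for illustration; the real logic lives inside `_parse_chat_completion_response`:

```python
def compact_serialized_output(record, audio_path):
    """Replace the bulky base64 audio payload in a serialized response
    record with the path of the saved file, so output.jsonl stays small.

    Field names here are assumptions, not the real schema.
    """
    audio = record.get("audio")
    if audio and "data" in audio:
        audio = dict(audio)         # avoid mutating the caller's dict
        audio.pop("data")           # drop the base64 blob
        audio["path"] = audio_path  # keep a pointer to the saved WAV file
        record = {**record, "audio": audio}
    return record
```

The key point is that only a short file path remains in the JSONL record, while the full audio bytes live on disk next to it.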

Confidence Score: 4/5

  • This PR is safe to merge with low risk - adds new functionality without breaking existing code
  • Score reflects clean implementation of a new model class that extends existing functionality. Two minor style issues found: unusual null check pattern and missing audio data validation. Core logic is sound - properly inherits from VLLMModel, handles errors gracefully, and implements the multimodal output extraction correctly. No breaking changes to existing code paths.
  • No files require special attention - issues found are minor style improvements

Important Files Changed

| Filename | Overview |
| --- | --- |
| `nemo_skills/inference/generate.py` | Added `data_dir` and `output_dir` parameters to model initialization calls. Minor concern with the `isinstance(x, type(None))` check. |
| `nemo_skills/inference/model/__init__.py` | Simple registration of the new `VLLMMultimodalModel` class in the models dictionary. No issues. |
| `nemo_skills/inference/model/base.py` | Added `data_dir` and `output_dir` parameters to the `BaseModel` constructor. Clean implementation. |
| `nemo_skills/inference/model/vllm_multimodal.py` | New model class for saving audio responses to disk. Extracts `debug_info` and processes audio from responses. Minor concern with base64 decode error handling. |

Sequence Diagram

```mermaid
sequenceDiagram
    participant User
    participant GenerationTask
    participant VLLMMultimodalModel
    participant OpenAI_API
    participant FileSystem

    User->>GenerationTask: Initialize with config
    GenerationTask->>GenerationTask: setup_llm()
    GenerationTask->>GenerationTask: Extract data_dir from eval_config
    GenerationTask->>GenerationTask: Determine output_dir from output_file
    GenerationTask->>VLLMMultimodalModel: get_model(data_dir, output_dir, ...)
    VLLMMultimodalModel->>FileSystem: Create audio output directory

    User->>VLLMMultimodalModel: generate_async(prompt)
    VLLMMultimodalModel->>OpenAI_API: Send request
    OpenAI_API-->>VLLMMultimodalModel: Response with audio + debug_info
    VLLMMultimodalModel->>VLLMMultimodalModel: _parse_chat_completion_response()
    VLLMMultimodalModel->>VLLMMultimodalModel: Extract debug_info from content
    VLLMMultimodalModel->>VLLMMultimodalModel: _process_audio_response()
    VLLMMultimodalModel->>VLLMMultimodalModel: Base64 decode audio
    VLLMMultimodalModel->>FileSystem: Save audio as WAV file
    VLLMMultimodalModel->>VLLMMultimodalModel: Strip audio data from serialized_output
    VLLMMultimodalModel->>VLLMMultimodalModel: Strip debug_info from content
    VLLMMultimodalModel-->>User: Return result with audio path
```

@greptile-apps greptile-apps bot left a comment

4 files reviewed, 2 comments

```python
self.sandbox = get_sandbox(**self.cfg.sandbox) if self.cfg.sandbox is not None else None

self.data_dir = None
if "data_dir" in self.cfg.eval_config and not isinstance(self.cfg.eval_config.get("data_dir"), type(None)):
```

style: Use `is not None` instead of `isinstance(x, type(None))` for null checks

Suggested change:

```diff
-if "data_dir" in self.cfg.eval_config and not isinstance(self.cfg.eval_config.get("data_dir"), type(None)):
+if "data_dir" in self.cfg.eval_config and self.cfg.eval_config.get("data_dir") is not None:
```

Comment on lines +93 to +99:

```python
try:
    audio_bytes = base64.b64decode(audio_base64)
    filename = f"{response_id}.wav"
    filepath = os.path.join(self.output_audio_dir, filename)

    with open(filepath, "wb") as f:
        f.write(audio_bytes)
```

style: The base64 decode is wrapped in try/except at line 94, but the decoded data isn't validated before being written to the file, so malformed audio data could be written.
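One way to add the validation this comment asks for is to check the RIFF/WAVE magic bytes before writing. This is a sketch under the assumption that the audio really is WAV-encoded; it is not part of the PR:

```python
import base64


def decode_wav_audio(audio_base64):
    """Decode base64 audio and verify it looks like a WAV container
    before it is written to disk. Raises ValueError on malformed input.

    Illustrative check only, not the PR's implementation.
    """
    # validate=True rejects non-alphabet characters instead of ignoring them.
    audio_bytes = base64.b64decode(audio_base64, validate=True)
    # A WAV file starts with 'RIFF', a 4-byte little-endian chunk size,
    # then the form type 'WAVE'.
    if len(audio_bytes) < 12 or audio_bytes[:4] != b"RIFF" or audio_bytes[8:12] != b"WAVE":
        raise ValueError("decoded payload is not a RIFF/WAVE stream")
    return audio_bytes
```

Callers can catch `ValueError` and fall back to keeping the base64 payload inline, matching the graceful-degradation behavior already in the PR.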

@karpnv karpnv enabled auto-merge (squash) January 6, 2026 00:34
@karpnv karpnv merged commit 7c039f5 into main Jan 6, 2026
6 checks passed
@karpnv karpnv deleted the vmendelev/2512_vllm_multimodal branch January 6, 2026 00:37
blahblahasdf pushed a commit to blahblahasdf/Skills that referenced this pull request Jan 8, 2026
…eMo#1136)

Signed-off-by: Valentin Mendelev <[email protected]>
Co-authored-by: Nikolay Karpov <[email protected]>
Signed-off-by: dlord <[email protected]>
hsiehjackson pushed a commit that referenced this pull request Jan 13, 2026
Signed-off-by: Valentin Mendelev <[email protected]>
Co-authored-by: Nikolay Karpov <[email protected]>
Signed-off-by: Cheng-Ping Hsieh <[email protected]>
