Conversation
@vmendelev vmendelev commented Dec 19, 2025

This model class is needed to parse the OpenAI API server output, extract non-text (in this case, audio) data, and save it into separate files so that output.jsonl stays small.

Summary by CodeRabbit

Release Notes

  • New Features

    • Audio response extraction with automatic WAV file storage capability added
    • New multimodal model variant now available
  • Improvements

    • Model configuration now supports data directory and output directory parameters for enhanced workflow management


coderabbitai bot commented Dec 19, 2025

📝 Walkthrough

The PR extends the inference system to support audio responses. It propagates data_dir and output_dir parameters through model initialization paths and introduces a new VLLMMultimodalModel class that processes and persists audio data from chat completion responses.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| **Model initialization parameter propagation**<br>`nemo_skills/inference/generate.py`, `nemo_skills/inference/model/base.py` | Extract `data_dir` and `output_dir` from config and propagate them through `get_code_execution_model()`, `get_tool_calling_model()`, and `get_model()`. Add corresponding instance attributes to `BaseModel.__init__` with defaults. |
| **Multimodal model support**<br>`nemo_skills/inference/model/vllm_multimodal.py` | Introduce `VLLMMultimodalModel`, a subclass of `VLLMModel`, with methods to parse chat completion responses, extract audio data, decode base64 audio, and save it as WAV files to the output directory. |
| **Model registry update**<br>`nemo_skills/inference/model/__init__.py` | Import and register `VLLMMultimodalModel` in the models mapping so `get_model()` supports the new server type. |
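The audio-persistence flow described in the table can be sketched as a standalone helper. This is an illustrative sketch only: the function name, parameters, and metadata shape are assumptions, and the actual PR implements this inside `VLLMMultimodalModel._parse_chat_completion_response` rather than as a free function.

```python
import base64
import os


def save_audio_response(response_id, audio_base64, output_dir, audio_format="wav"):
    """Decode base64 audio from a chat completion response and persist it
    under {output_dir}/audio/, returning small metadata that can replace
    the bulky base64 payload in the serialized output.

    Hypothetical helper mirroring the flow described above; not the real
    nemo_skills API.
    """
    audio_dir = os.path.join(output_dir, "audio")
    os.makedirs(audio_dir, exist_ok=True)  # safe if the directory already exists
    filepath = os.path.join(audio_dir, f"{response_id}.{audio_format}")
    audio_bytes = base64.b64decode(audio_base64)
    with open(filepath, "wb") as f:
        f.write(audio_bytes)
    # Return a pointer to the file instead of the base64 blob,
    # keeping output.jsonl compact.
    return {"audio_path": filepath, "format": audio_format}
```

Using the response ID as the filename (as the PR does) keeps one file per completion and makes the mapping from output.jsonl records to audio files trivial.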

Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant gen as generate.py
    participant model as BaseModel
    participant mm as VLLMMultimodalModel
    participant api as vLLM API
    participant fs as File System

    gen->>gen: Extract data_dir, output_dir<br/>from eval_config
    gen->>mm: Instantiate with data_dir,<br/>output_dir
    mm->>mm: Create output_dir/audio
    mm->>api: Get chat completion
    api-->>mm: Return response with audio
    mm->>mm: _parse_chat_completion_response()
    mm->>mm: Call base _parse_chat_completion_response()
    mm->>mm: Check for audio in response
    alt Audio present
        mm->>mm: _process_audio_response()
        mm->>mm: Decode base64 audio
        mm->>fs: Save as WAV file
        fs-->>mm: File saved
        mm->>mm: Return audio metadata
    else No audio or no output_dir
        mm->>mm: Include base64 in metadata
    end
    mm-->>gen: Return parsed response<br/>with audio dict
```

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

  • Parameter propagation consistency: Verify data_dir and output_dir are correctly threaded through all three model instantiation paths (get_code_execution_model, get_tool_calling_model, get_model)
  • Audio processing logic: Review base64 decoding, WAV file writing, error handling, and graceful fallback when save fails
  • Registry integration: Confirm VLLMMultimodalModel is properly imported and registered, and that existing model instantiation logic is preserved

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)
| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 40.00%, which is below the required threshold of 80.00%. | Run `@coderabbitai generate docstrings` to improve docstring coverage. |
✅ Passed checks (2 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped: CodeRabbit's high-level summary is enabled. |
| Title check | ✅ Passed | The title accurately and concisely summarizes the main change: introducing a vLLM multimodal model whose primary purpose is saving multimodal outputs, which aligns with the PR objectives of extracting and saving audio data separately. |


@coderabbitai coderabbitai bot left a comment
Actionable comments posted: 0

🧹 Nitpick comments (2)
nemo_skills/inference/model/vllm_multimodal.py (1)

52-82: Consider using dynamic file extension based on format.

The filename is hardcoded as {response_id}.wav (line 67), but the format is extracted from audio_data.format (line 55) which may not always be "wav". This creates a mismatch between the file extension and the actual audio format.

🔎 Proposed fix

```diff
         if self.output_audio_dir:
             try:
                 audio_bytes = base64.b64decode(audio_base64)
-                filename = f"{response_id}.wav"
+                audio_format = audio_info["format"]
+                filename = f"{response_id}.{audio_format}"
                 filepath = os.path.join(self.output_audio_dir, filename)
```
nemo_skills/inference/generate.py (1)

402-406: Simplify the None check.

The condition on line 403 uses isinstance(self.cfg.eval_config.get("data_dir"), type(None)), which is verbose and unconventional. Consider using a more idiomatic approach.

🔎 Proposed refactor

```diff
         self.data_dir = None
-        if "data_dir" in self.cfg.eval_config and not isinstance(self.cfg.eval_config.get("data_dir"), type(None)):
+        if self.cfg.eval_config.get("data_dir") is not None:
             self.data_dir = self.cfg.eval_config["data_dir"]
```
📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 7205c43 and 246fa01.

📒 Files selected for processing (4)
  • nemo_skills/inference/generate.py (2 hunks)
  • nemo_skills/inference/model/__init__.py (2 hunks)
  • nemo_skills/inference/model/base.py (1 hunks)
  • nemo_skills/inference/model/vllm_multimodal.py (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (3)
nemo_skills/inference/generate.py (1)
nemo_skills/inference/model/__init__.py (2)
  • get_code_execution_model (85-91)
  • get_model (62-82)
nemo_skills/inference/model/vllm_multimodal.py (3)
nemo_skills/utils.py (1)
  • get_logger_name (39-43)
nemo_skills/inference/model/vllm.py (1)
  • VLLMModel (27-148)
nemo_skills/inference/model/base.py (1)
  • _parse_chat_completion_response (356-391)
nemo_skills/inference/model/__init__.py (1)
nemo_skills/inference/model/vllm_multimodal.py (1)
  • VLLMMultimodalModel (26-82)
🪛 Ruff (0.14.8)
nemo_skills/inference/model/vllm_multimodal.py

76-76: Do not catch blind exception: Exception

(BLE001)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
  • GitHub Check: pre-commit
  • GitHub Check: unit-tests
🔇 Additional comments (6)
nemo_skills/inference/model/vllm_multimodal.py (3)

34-40: LGTM!

The initialization logic correctly creates the audio output directory when output_dir is provided. The use of exist_ok=True ensures safe directory creation.


76-78: Broad exception handling is appropriate here.

While static analysis flags the broad Exception catch, this is intentional to ensure graceful degradation. The code logs a warning and falls back to including base64 data, which preserves functionality when file operations fail. This pattern is acceptable for non-critical operations with a safe fallback.


42-50: Add defensive programming for response.id access.

The code assumes response.id exists without checking. While vLLM is OpenAI-compatible and should provide this field as standard, defensive programming would prevent failures if the field is missing in edge cases or non-standard configurations. Consider using getattr(response, 'id', fallback_value) or add error handling to gracefully handle cases where response.id is unavailable.
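A minimal sketch of the defensive access suggested here. The helper name and the uuid-based fallback are assumptions for illustration, not part of the PR:

```python
import uuid


def get_response_id(response):
    """Return response.id when present, otherwise a generated fallback,
    so downstream filenames never fail on a missing field.

    Illustrative helper; the fallback naming scheme is an assumption.
    """
    response_id = getattr(response, "id", None)
    if not response_id:
        # Generate a unique stand-in so the audio file can still be saved.
        response_id = f"resp-{uuid.uuid4().hex}"
    return response_id
```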

nemo_skills/inference/model/base.py (1)

78-85: LGTM!

The addition of data_dir and output_dir parameters is clean and follows Python conventions. The defaults ensure backward compatibility, and the type hints are clear.

nemo_skills/inference/model/__init__.py (1)

42-42: LGTM!

The import and registration of VLLMMultimodalModel follow the established pattern for other model types. This enables instantiation via server_type="vllm_multimodal" through the standard model loading pathway.

Also applies to: 55-55
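The registry pattern this comment refers to can be sketched as follows. The model classes here are stubs standing in for the real nemo_skills classes, so the snippet only illustrates the lookup mechanism, not the actual constructors:

```python
class VLLMModel:
    """Stub standing in for nemo_skills' VLLMModel."""

    def __init__(self, **kwargs):
        self.kwargs = kwargs


class VLLMMultimodalModel(VLLMModel):
    """Stub standing in for the new multimodal subclass."""


# Mapping of server_type strings to model classes, mirroring the
# registration pattern in nemo_skills/inference/model/__init__.py.
models = {
    "vllm": VLLMModel,
    "vllm_multimodal": VLLMMultimodalModel,  # new entry added by this PR
}


def get_model(server_type, **kwargs):
    """Look up the class for a server type and instantiate it."""
    return models[server_type](**kwargs)
```

With this mapping in place, passing `server_type="vllm_multimodal"` routes instantiation to the new subclass while every existing server type is untouched.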

nemo_skills/inference/generate.py (1)

408-430: LGTM!

The propagation of data_dir and output_dir to all model creation paths is consistent and correct. The pattern self.data_dir or "" ensures the default empty string is used when data_dir is None, aligning with the BaseModel signature.

@vmendelev vmendelev force-pushed the vmendelev/2512_vllm_multimodal branch from 246fa01 to 7bd17f2 on December 21, 2025 at 09:35
@karpnv karpnv left a comment
LGTM


greptile-apps bot commented Jan 6, 2026

Greptile Summary

Added a new VLLMMultimodalModel class that extends VLLMModel to handle multimodal outputs, specifically audio responses from OpenAI API servers. The implementation extracts audio data from responses, saves it to separate WAV files, and replaces base64 data with file paths to keep output.jsonl small.

Key changes:

  • Added data_dir and output_dir parameters to BaseModel constructor and propagated them through model initialization
  • Created VLLMMultimodalModel that overrides _parse_chat_completion_response() to extract and save audio responses
  • Implemented extraction of debug_info from response content using regex pattern matching
  • Audio files are saved to {output_dir}/audio/ directory with response ID as filename
  • Strips audio base64 data and debug_info from serialized output to reduce file size
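The stripping step in the last bullet can be sketched roughly as below. The record field names (`audio`, `data`, `path`) are assumptions for illustration; the real logic lives inside `_parse_chat_completion_response`:

```python
def compact_serialized_output(record, audio_path):
    """Replace the bulky base64 audio payload in a serialized response
    record with the path of the saved file, so output.jsonl stays small.

    Field names here are assumptions, not the real schema.
    """
    audio = record.get("audio")
    if audio and "data" in audio:
        audio = dict(audio)         # avoid mutating the caller's dict
        audio.pop("data")           # drop the base64 blob
        audio["path"] = audio_path  # keep a pointer to the saved WAV file
        record = {**record, "audio": audio}
    return record
```

The key point is that only a short file path remains in the JSONL record, while the full audio bytes live on disk next to it.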

Confidence Score: 4/5

  • This PR is safe to merge with low risk - adds new functionality without breaking existing code
  • Score reflects clean implementation of a new model class that extends existing functionality. Two minor style issues found: unusual null check pattern and missing audio data validation. Core logic is sound - properly inherits from VLLMModel, handles errors gracefully, and implements the multimodal output extraction correctly. No breaking changes to existing code paths.
  • No files require special attention - issues found are minor style improvements

Important Files Changed

| Filename | Overview |
| --- | --- |
| `nemo_skills/inference/generate.py` | Added `data_dir` and `output_dir` parameters to model initialization calls. Minor concern with the `isinstance(x, type(None))` check. |
| `nemo_skills/inference/model/__init__.py` | Simple registration of the new `VLLMMultimodalModel` class in the models dictionary. No issues. |
| `nemo_skills/inference/model/base.py` | Added `data_dir` and `output_dir` parameters to the `BaseModel` constructor. Clean implementation. |
| `nemo_skills/inference/model/vllm_multimodal.py` | New model class for saving audio responses to disk. Extracts `debug_info` and processes audio from responses. Minor concern with base64 decode error handling. |

Sequence Diagram

```mermaid
sequenceDiagram
    participant User
    participant GenerationTask
    participant VLLMMultimodalModel
    participant OpenAI_API
    participant FileSystem

    User->>GenerationTask: Initialize with config
    GenerationTask->>GenerationTask: setup_llm()
    GenerationTask->>GenerationTask: Extract data_dir from eval_config
    GenerationTask->>GenerationTask: Determine output_dir from output_file
    GenerationTask->>VLLMMultimodalModel: get_model(data_dir, output_dir, ...)
    VLLMMultimodalModel->>FileSystem: Create audio output directory

    User->>VLLMMultimodalModel: generate_async(prompt)
    VLLMMultimodalModel->>OpenAI_API: Send request
    OpenAI_API-->>VLLMMultimodalModel: Response with audio + debug_info
    VLLMMultimodalModel->>VLLMMultimodalModel: _parse_chat_completion_response()
    VLLMMultimodalModel->>VLLMMultimodalModel: Extract debug_info from content
    VLLMMultimodalModel->>VLLMMultimodalModel: _process_audio_response()
    VLLMMultimodalModel->>VLLMMultimodalModel: Base64 decode audio
    VLLMMultimodalModel->>FileSystem: Save audio as WAV file
    VLLMMultimodalModel->>VLLMMultimodalModel: Strip audio data from serialized_output
    VLLMMultimodalModel->>VLLMMultimodalModel: Strip debug_info from content
    VLLMMultimodalModel-->>User: Return result with audio path
```

@greptile-apps greptile-apps bot left a comment

4 files reviewed, 2 comments

```python
self.sandbox = get_sandbox(**self.cfg.sandbox) if self.cfg.sandbox is not None else None

self.data_dir = None
if "data_dir" in self.cfg.eval_config and not isinstance(self.cfg.eval_config.get("data_dir"), type(None)):
```

style: Use `is not None` instead of `isinstance(x, type(None))` for null checks

Suggested change:

```diff
-if "data_dir" in self.cfg.eval_config and not isinstance(self.cfg.eval_config.get("data_dir"), type(None)):
+if "data_dir" in self.cfg.eval_config and self.cfg.eval_config.get("data_dir") is not None:
```

Comment on lines +93 to +99:

```python
try:
    audio_bytes = base64.b64decode(audio_base64)
    filename = f"{response_id}.wav"
    filepath = os.path.join(self.output_audio_dir, filename)

    with open(filepath, "wb") as f:
        f.write(audio_bytes)
```

style: The base64 decode is wrapped in try/except at line 94, but the decoded data isn't validated before being written to the file, so malformed audio data could be written.
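One way to add the validation this comment asks for is to check the RIFF/WAVE magic bytes before writing. This is a sketch under the assumption that the audio really is WAV-encoded; it is not part of the PR:

```python
import base64


def decode_wav_audio(audio_base64):
    """Decode base64 audio and verify it looks like a WAV container
    before it is written to disk. Raises ValueError on malformed input.

    Illustrative check only, not the PR's implementation.
    """
    # validate=True rejects non-alphabet characters instead of ignoring them.
    audio_bytes = base64.b64decode(audio_base64, validate=True)
    # A WAV file starts with 'RIFF', a 4-byte little-endian chunk size,
    # then the form type 'WAVE'.
    if len(audio_bytes) < 12 or audio_bytes[:4] != b"RIFF" or audio_bytes[8:12] != b"WAVE":
        raise ValueError("decoded payload is not a RIFF/WAVE stream")
    return audio_bytes
```

Callers can catch `ValueError` and fall back to keeping the base64 payload inline, matching the graceful-degradation behavior already in the PR.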

@karpnv karpnv enabled auto-merge (squash) January 6, 2026 00:34
@karpnv karpnv merged commit 7c039f5 into main Jan 6, 2026
6 checks passed
@karpnv karpnv deleted the vmendelev/2512_vllm_multimodal branch January 6, 2026 00:37
blahblahasdf pushed a commit to blahblahasdf/Skills that referenced this pull request Jan 8, 2026
…eMo#1136)

Signed-off-by: Valentin Mendelev <[email protected]>
Co-authored-by: Nikolay Karpov <[email protected]>
Signed-off-by: dlord <[email protected]>
hsiehjackson pushed a commit that referenced this pull request Jan 13, 2026
Signed-off-by: Valentin Mendelev <[email protected]>
Co-authored-by: Nikolay Karpov <[email protected]>
Signed-off-by: Cheng-Ping Hsieh <[email protected]>
