
feat: support qwen-omni grpo training recipe #2073

Open
yuekaizhang wants to merge 3 commits into NVIDIA-NeMo:main from yuekaizhang:qwen_omni

Conversation


@yuekaizhang yuekaizhang commented Mar 6, 2026

Conditional PR: NVIDIA-NeMo/Megatron-Bridge#2634, NVIDIA-NeMo/Megatron-Bridge#2342

What does this PR do ?

Add a one line overview of what this PR aims to accomplish.

Issues

List issues that this PR closes (syntax):

Usage

  • You can potentially add a usage example below
# Add a code snippet demonstrating how to use this

Before your PR is "Ready for review"

Pre checks:

  • Make sure you read and followed Contributor guidelines
  • Did you write any new necessary tests?
  • Did you run the unit tests and functional tests locally? Visit our Testing Guide for how to run tests
  • Did you add or update any necessary documentation? Visit our Document Development Guide for how to write, build and test the docs.

Additional Information

  • ...

Summary by CodeRabbit

  • New Features
    • Added audio support for multimodal environments and data processing pipelines.
    • Introduced AISHELL dataset for automatic speech recognition training.
    • Introduced AVQA dataset for audio question-answering fine-tuning.
    • Added example configurations for audio GRPO and audio language model training with Megatron backend.
    • Enhanced multimodal content handling to process audio alongside images and videos.

Signed-off-by: root <zhangyuekai@foxmail.com>
@yuekaizhang yuekaizhang requested review from a team as code owners March 6, 2026 04:41

copy-pr-bot bot commented Mar 6, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.


coderabbitai bot commented Mar 6, 2026

📝 Walkthrough

Adds audio training support by introducing AISHELL and AVQA dataset wrappers with audio preprocessing, audio-enabled configuration files for GRPO and SFT training, and extends multimodal data handling across collation, processing, and generation pipelines to support audio modality alongside images and text.

Changes

Cohort / File(s) Summary
Audio Dataset Implementations
nemo_rl/data/datasets/response_datasets/aishell.py, nemo_rl/data/datasets/response_datasets/avqa.py, nemo_rl/data/datasets/response_datasets/__init__.py
Adds AishellDataset and AVQADataset classes with audio resampling, question parsing, and OpenAI-style message formatting. Registers both datasets in DATASET_REGISTRY and exports them via __all__.
Audio Training Configurations
examples/configs/audio_grpo_3B_megatron.yaml, examples/configs/sft_audio_lm_megatron.yaml, examples/configs/sft_openmathinstruct2.yaml
Introduces comprehensive GRPO and SFT configuration files for audio-based training with Megatron backend (Qwen2.5Omni and Qwen2-Audio), plus minor processor specification update to OpenMathInstruct config.
Audio Data Pipeline
nemo_rl/data/collate_fn.py, nemo_rl/data/processors.py, nemo_rl/data/multimodal_utils.py
Extends collation and processing logic to collect and forward vllm_audios; adds audio content handling in vlm_hf_data_processor alongside images/text; includes processor.model_input_names in multimodal key aggregation.
Audio in Generation & Rollouts
nemo_rl/experience/rollouts.py, nemo_rl/models/generation/vllm/utils.py
Propagates vllm_audios through rollout generation and generalizes vLLM multimodal data handling to support both images and audios in a unified multi_modal_data dictionary.
Infrastructure & Utilities
nemo_rl/environments/utils.py, nemo_rl/models/megatron/setup.py, nemo_rl/utils/logger.py, examples/prompts/avqa_cot.txt
Registers "avqa" environment in ENV_REGISTRY; adds VLM wrapper unwrapping for thinker module access in MoE router setup; improves numpy array serialization in JSONL logging; adds empty AVQA prompt template file.
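
The "numpy array serialization in JSONL logging" item can be pictured with a small sketch. The helper below is hypothetical (the actual nemo_rl/utils/logger.py change may differ); it shows the general shape of making numpy values safe for json.dumps:

```python
import json

import numpy as np


def to_jsonable(obj):
    """Recursively convert numpy arrays/scalars into plain Python types.

    Hypothetical helper illustrating the JSONL-logging fix described
    above; not the actual nemo_rl/utils/logger.py implementation.
    """
    if isinstance(obj, np.ndarray):
        return obj.tolist()
    if isinstance(obj, np.generic):  # covers np.float32, np.int64, ...
        return obj.item()
    if isinstance(obj, dict):
        return {k: to_jsonable(v) for k, v in obj.items()}
    if isinstance(obj, (list, tuple)):
        return [to_jsonable(v) for v in obj]
    return obj


record = {"reward": np.float32(0.5), "token_ids": np.array([1, 2, 3])}
# A plain json.dumps(record) would raise TypeError on the numpy values.
line = json.dumps(to_jsonable(record))
```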

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~35 minutes

Possibly related PRs

  • PR #2016 — Modifies the same multimodal data-loading and vLLM audio handling codepaths (processors, multimodal_utils, vLLM generation).
  • PR #1649 — Refactors dataset registry and loader interfaces in response_datasets, directly affected by new dataset registrations in this PR.
  • PR #1334 — Both modify vLLM integration code for multimodal handling (generation/vllm modules).
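
The "unified multi_modal_data dictionary" mentioned in the walkthrough has roughly this shape. The sketch is a hypothetical helper, not the actual nemo_rl/models/generation/vllm/utils.py code; the key names follow vLLM's multi-modal input convention:

```python
def build_multi_modal_data(images=None, audios=None):
    # Sketch of a unified vLLM multi_modal_data dict that carries
    # whichever modalities are present; hypothetical helper, not the
    # actual nemo_rl implementation.
    multi_modal_data = {}
    if images:
        multi_modal_data["image"] = images
    if audios:
        multi_modal_data["audio"] = audios  # e.g. (waveform, sample_rate) tuples
    return multi_modal_data
```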

Suggested labels

CI:L1

Suggested reviewers

  • yuki-97
  • terrykong
  • cuichenx
🚥 Pre-merge checks | ✅ 2 | ❌ 2

❌ Failed checks (2 warnings)

  • Docstring Coverage (⚠️ Warning): Docstring coverage is 57.14%, below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them to satisfy the coverage threshold.
  • Test Results For Major Changes (⚠️ Warning): PR contains major changes (~666 lines) with new datasets and audio processing, but lacks experiment results, logs, and documentation despite being marked WIP with incomplete TODOs. Resolution: complete comprehensive testing of new datasets and training recipes, document test results in the PR description, attach experiment logs as planned, and fix identified bugs before merging.
✅ Passed checks (2 passed)
  • Title check (✅ Passed): The title accurately reflects the main purpose of the pull request: adding support for the Qwen-Omni GRPO training recipe with new audio datasets, configurations, and processors.
  • Description Check (✅ Passed): Check skipped because CodeRabbit’s high-level summary is enabled.



@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 6

🧹 Nitpick comments (3)
examples/prompts/avqa_cot.txt (1)

1-1: Clarify the intent of the empty prompt template.

The file contains only {} which provides no prompt formatting. If this is intentional (e.g., AVQA dataset already contains formatted prompts), consider adding a comment explaining this. If it's a placeholder, the TODO in the PR checklist should track completing it.

📝 Proposed documentation
-{}
+{
+  // Empty template: AVQA dataset messages are pre-formatted.
+  // The user message content is passed through without additional prompt wrapping.
+}

Or if JSON comments aren't supported, create a companion README or use the prompt file itself:

-{}
+{question}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@examples/prompts/avqa_cot.txt` at line 1, The file
examples/prompts/avqa_cot.txt currently contains only "{}", which is ambiguous;
update the file to clarify intent by either replacing "{}" with the intended
prompt template for AVQA chain-of-thought (or a clear placeholder template) or
add a top-line comment explaining that "{}" is intentional because prompts are
provided externally by the AVQA dataset and link to the dataset/source; if this
is a temporary placeholder, add a TODO with an issue/PR reference in the file
(or create a companion README) to indicate who will complete the template and
when.
nemo_rl/models/megatron/setup.py (1)

696-700: Consider adding thinker unwrapping to MoEFloat16Module.re_enable_float32_expert_bias() for consistency.

The freeze_moe_router function now unwraps models with a thinker attribute (line 696-697) before accessing language_model. However, MoEFloat16Module.re_enable_float32_expert_bias() (lines 1051-1054) only checks for language_model:

# Line 1051-1054
if hasattr(module, "language_model"):
    module = module.language_model

If this wrapper is used with Qwen2.5-Omni models, it may fail to properly access the decoder layers.

♻️ Proposed fix for consistency
 def re_enable_float32_expert_bias(self) -> None:
     ...
     module = self.module
+    # Handle VLM models where thinker wraps the language model
+    if hasattr(module, "thinker"):
+        module = module.thinker
     # Handle VLM models where language model is nested
     if hasattr(module, "language_model"):
         module = module.language_model
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@nemo_rl/models/megatron/setup.py` around lines 696 - 700, The method
MoEFloat16Module.re_enable_float32_expert_bias currently only unwraps modules
via the language_model attribute but freeze_moe_router also unwraps a thinker
wrapper first; update re_enable_float32_expert_bias to mirror that logic by
checking hasattr(module, "thinker") and setting module = module.thinker before
the existing hasattr(module, "language_model") unwrap so it reliably reaches
module.decoder.layers for wrapped models (e.g., Qwen2.5-Omni).
nemo_rl/data/datasets/response_datasets/avqa.py (1)

103-107: Verify that list rendering for choices is intentional.

_parse_question returns choices as a list (e.g., ["3", "One", "4", "2"]), and DEFAULT_TEMPLATE.format(choices=choices) will render it as "['3', 'One', '4', '2']" in the prompt. This might produce awkward prompts like:

"How many animals...? Please choose from: ['3', 'One', '4', '2']."

Consider formatting choices explicitly:

Suggested fix
+        choices_str = ", ".join(choices) if choices else ""
-        prompt_text = DEFAULT_TEMPLATE.format(question=question, choices=choices)
+        prompt_text = DEFAULT_TEMPLATE.format(question=question, choices=choices_str)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@nemo_rl/data/datasets/response_datasets/avqa.py` around lines 103 - 107, The
prompt currently inserts the raw list returned by _parse_question into
DEFAULT_TEMPLATE, producing Python-list style output (e.g., "['3','One',...]");
before formatting the template convert choices into a human-friendly string
(e.g., choices_str = ", ".join(choices) or another desired separator/labeling)
and use that string when building prompt_text (i.e., pass choices=choices_str to
DEFAULT_TEMPLATE.format), keeping the rest of the logic (question replacement
and prompt_text creation) unchanged.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@examples/configs/audio_grpo_3B_megatron.yaml`:
- Around line 63-66: The config hardcodes a local path for policy.model_name
which is user-specific; update policy.model_name to a HuggingFace model
identifier or a clear placeholder (e.g., "qwen/qwen-2.5-omni" or
"<HF_MODEL_ID>") and ensure tokenizer.name references the same identifier
(tokenizer.name: ${policy.model_name}) so others can run the example without the
local filesystem path.
- Line 140: The YAML sets converter_type: Qwen2_5OmniForConditionalGeneration
which is unsupported by Megatron-Bridge; update the converter_type entry to a
supported converter (e.g., Qwen2, Qwen2.5, Qwen2.5-VL, or a Qwen3 variant) or
remove the converter_type line and wire in a custom bridge implementation if
Omni (audio/video/speech) support is required; look for the converter_type key
in the file and replace Qwen2_5OmniForConditionalGeneration with the appropriate
supported converter name or add a note to implement a custom Megatron-Bridge
converter for Omni models.

In `@examples/configs/sft_audio_lm_megatron.yaml`:
- Around line 24-26: The config's policy.model_name is set to a user-local path
(/workspace_yuekai/HF/Qwen2-Audio-7B); replace it with a reproducible
HuggingFace model identifier or a clear placeholder (e.g., "Qwen2-Audio-7B" or
"<HF_MODEL_ID>") so other users can run the example, and ensure the
corresponding tokenizer field under policy is set to a matching tokenizer ID or
placeholder as well.

In `@nemo_rl/data/datasets/response_datasets/aishell.py`:
- Line 42: The load_dataset invocation in the constructor incorrectly hardcodes
split="test" and passes the validated split as a positional arg, causing the
user-provided split to be ignored; update the load_dataset call referenced by
self.dataset to use the split variable (e.g., pass split as the keyword
split=split or as the single positional split) and remove the hardcoded
split="test" so the requested split parameter is honored.
- Line 33: vlm_hf_data_processor is missing a handler for task_name "aishell",
causing a ValueError; update the dispatcher in vlm_hf_data_processor (in
nemo_rl/data/processors.py) to add a branch for task_name == "aishell" that
mirrors the AVQA pass-through behavior (i.e., return the input examples/records
unchanged or call the same helper used by AVQA), referencing the task_name
"aishell" string and the vlm_hf_data_processor function name so the aishell
dataset in nemo_rl/data/datasets/response_datasets/aishell.py is processed
without error.

In `@nemo_rl/data/datasets/response_datasets/avqa.py`:
- Line 84: Replace the hardcoded path passed to load_dataset with a configurable
parameter: accept a data_path (or dataset_id) from the constructor kwargs or
config, default to a public HuggingFace dataset identifier if not provided, and
use that value when calling load_dataset to set self.dataset; update the
constructor signature and any callers to forward data_path and ensure the code
uses load_dataset(data_path_or_id, split=split) instead of the
developer-specific "/workspace_yuekai/HF/avqa-processed".
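
Taken together, the aishell fixes above (honor the split argument, add a dispatcher branch) reduce to a pattern like the following sketch. Function and variable names here are assumptions for illustration, not the actual processors.py / aishell.py code:

```python
def load_split(load_dataset_fn, path: str, split: str):
    # Pass the validated split through instead of hardcoding split="test".
    return load_dataset_fn(path, split=split)


def process_example(task_name: str, example: dict) -> dict:
    # Hypothetical stand-in for the task_name dispatch inside
    # vlm_hf_data_processor: "aishell" mirrors the existing AVQA
    # pass-through instead of raising ValueError.
    if task_name in ("avqa", "aishell"):
        return example  # messages are already formatted by the dataset class
    raise ValueError(f"Unsupported task_name: {task_name}")
```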

---
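
The two config-path comments above amount to a change along these lines. The model identifier below is a placeholder, and the exact schema should be checked against the repo's other example configs:

```yaml
policy:
  model_name: "Qwen/Qwen2.5-Omni-3B"   # HF id or <HF_MODEL_ID>, not a user-local path
tokenizer:
  name: ${policy.model_name}           # keep the tokenizer in sync with the model
```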

Nitpick comments:
In `@examples/prompts/avqa_cot.txt`:
- Line 1: The file examples/prompts/avqa_cot.txt currently contains only "{}",
which is ambiguous; update the file to clarify intent by either replacing "{}"
with the intended prompt template for AVQA chain-of-thought (or a clear
placeholder template) or add a top-line comment explaining that "{}" is
intentional because prompts are provided externally by the AVQA dataset and link
to the dataset/source; if this is a temporary placeholder, add a TODO with an
issue/PR reference in the file (or create a companion README) to indicate who
will complete the template and when.

In `@nemo_rl/data/datasets/response_datasets/avqa.py`:
- Around line 103-107: The prompt currently inserts the raw list returned by
_parse_question into DEFAULT_TEMPLATE, producing Python-list style output (e.g.,
"['3','One',...]"); before formatting the template convert choices into a
human-friendly string (e.g., choices_str = ", ".join(choices) or another desired
separator/labeling) and use that string when building prompt_text (i.e., pass
choices=choices_str to DEFAULT_TEMPLATE.format), keeping the rest of the logic
(question replacement and prompt_text creation) unchanged.

In `@nemo_rl/models/megatron/setup.py`:
- Around line 696-700: The method MoEFloat16Module.re_enable_float32_expert_bias
currently only unwraps modules via the language_model attribute but
freeze_moe_router also unwraps a thinker wrapper first; update
re_enable_float32_expert_bias to mirror that logic by checking hasattr(module,
"thinker") and setting module = module.thinker before the existing
hasattr(module, "language_model") unwrap so it reliably reaches
module.decoder.layers for wrapped models (e.g., Qwen2.5-Omni).

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: b678d307-04c5-4bc1-8c8e-4d1cd5f8e056

📥 Commits

Reviewing files that changed from the base of the PR and between c4f8e1c and ad1c0b6.

📒 Files selected for processing (15)
  • examples/configs/audio_grpo_3B_megatron.yaml
  • examples/configs/sft_audio_lm_megatron.yaml
  • examples/configs/sft_openmathinstruct2.yaml
  • examples/prompts/avqa_cot.txt
  • nemo_rl/data/collate_fn.py
  • nemo_rl/data/datasets/response_datasets/__init__.py
  • nemo_rl/data/datasets/response_datasets/aishell.py
  • nemo_rl/data/datasets/response_datasets/avqa.py
  • nemo_rl/data/multimodal_utils.py
  • nemo_rl/data/processors.py
  • nemo_rl/environments/utils.py
  • nemo_rl/experience/rollouts.py
  • nemo_rl/models/generation/vllm/utils.py
  • nemo_rl/models/megatron/setup.py
  • nemo_rl/utils/logger.py

Signed-off-by: root <zhangyuekai@foxmail.com>
Signed-off-by: root <zhangyuekai@foxmail.com>
@yuekaizhang yuekaizhang changed the title from "[WIP] support qwen-omni grpo training recipe" to "feat: support qwen-omni grpo training recipe" Mar 10, 2026
@yuekaizhang
Copy link
Author

yuekaizhang commented Mar 10, 2026

@snowmanwwg Hi, I was wondering if you know someone who could help review this PR. Many thanks.

I have verified the PR with the below training results:

Model                       MMAU (v05.15.25)
Qwen2.5-Omni-3B             69.8
+ HF GRPO                   71.6
+ Nemo-RL GRPO (This PR)    72.1
