feat: support qwen-omni grpo training recipe #2073
yuekaizhang wants to merge 3 commits into NVIDIA-NeMo:main from
Conversation
Signed-off-by: root <zhangyuekai@foxmail.com>
📝 Walkthrough

Adds audio training support by introducing AISHELL and AVQA dataset wrappers with audio preprocessing, adding audio-enabled configuration files for GRPO and SFT training, and extending multimodal data handling across the collation, processing, and generation pipelines to support the audio modality alongside images and text.
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~35 minutes
🚥 Pre-merge checks | ✅ 2 | ❌ 2

❌ Failed checks (2 warnings)
✅ Passed checks (2 passed)
Actionable comments posted: 6
🧹 Nitpick comments (3)
examples/prompts/avqa_cot.txt (1)
1-1: Clarify the intent of the empty prompt template.

The file contains only `{}`, which provides no prompt formatting. If this is intentional (e.g., the AVQA dataset already contains formatted prompts), consider adding a comment explaining this. If it's a placeholder, the TODO in the PR checklist should track completing it.

📝 Proposed documentation

```diff
-{}
+{
+  // Empty template: AVQA dataset messages are pre-formatted.
+  // The user message content is passed through without additional prompt wrapping.
+}
```

Or if JSON comments aren't supported, create a companion README or use the prompt file itself:

```diff
-{}
+{question}
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@examples/prompts/avqa_cot.txt` at line 1: The file examples/prompts/avqa_cot.txt currently contains only "{}", which is ambiguous; update the file to clarify intent by either replacing "{}" with the intended prompt template for AVQA chain-of-thought (or a clear placeholder template) or adding a top-line comment explaining that "{}" is intentional because prompts are provided externally by the AVQA dataset, with a link to the dataset/source; if this is a temporary placeholder, add a TODO with an issue/PR reference in the file (or create a companion README) to indicate who will complete the template and when.

nemo_rl/models/megatron/setup.py (1)
696-700: Consider adding `thinker` unwrapping to `MoEFloat16Module.re_enable_float32_expert_bias()` for consistency.

The `freeze_moe_router` function now unwraps models with a `thinker` attribute (lines 696-697) before accessing `language_model`. However, `MoEFloat16Module.re_enable_float32_expert_bias()` (lines 1051-1054) only checks for `language_model`:

```python
# Lines 1051-1054
if hasattr(module, "language_model"):
    module = module.language_model
```

If this wrapper is used with Qwen2.5-Omni models, it may fail to properly access the decoder layers.
♻️ Proposed fix for consistency

```diff
 def re_enable_float32_expert_bias(self) -> None:
     ...
     module = self.module
+    # Handle VLM models where thinker wraps the language model
+    if hasattr(module, "thinker"):
+        module = module.thinker
     # Handle VLM models where language model is nested
     if hasattr(module, "language_model"):
         module = module.language_model
```

🤖 Prompt for AI Agents
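The proposed unwrap order can be exercised in isolation. A minimal standalone sketch (the helper name `unwrap_language_module` and the dummy classes are illustrative, not from the codebase):

```python
def unwrap_language_module(module):
    """Sketch of the unwrap order proposed above.

    Mirrors freeze_moe_router: peel off a thinker wrapper first
    (e.g. for Qwen2.5-Omni), then a nested language_model, so the
    caller reliably reaches the underlying decoder module.
    """
    if hasattr(module, "thinker"):
        module = module.thinker
    if hasattr(module, "language_model"):
        module = module.language_model
    return module
```

Checking `thinker` before `language_model` matters: on Omni-style models the `language_model` attribute lives inside the `thinker` wrapper, so the reverse order would miss it.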
Verify each finding against the current code and only fix it if needed. In `@nemo_rl/models/megatron/setup.py` around lines 696-700: The method MoEFloat16Module.re_enable_float32_expert_bias currently only unwraps modules via the language_model attribute, but freeze_moe_router also unwraps a thinker wrapper first; update re_enable_float32_expert_bias to mirror that logic by checking hasattr(module, "thinker") and setting module = module.thinker before the existing hasattr(module, "language_model") unwrap, so it reliably reaches module.decoder.layers for wrapped models (e.g., Qwen2.5-Omni).

nemo_rl/data/datasets/response_datasets/avqa.py (1)
103-107: Verify that list rendering for `choices` is intentional.

`_parse_question` returns `choices` as a list (e.g., `["3", "One", "4", "2"]`), and `DEFAULT_TEMPLATE.format(choices=choices)` will render it as `"['3', 'One', '4', '2']"` in the prompt. This might produce awkward prompts like: `"How many animals...? Please choose from: ['3', 'One', '4', '2']."`
Consider formatting choices explicitly:
Suggested fix

```diff
-    prompt_text = DEFAULT_TEMPLATE.format(question=question, choices=choices)
+    choices_str = ", ".join(choices) if choices else ""
+    prompt_text = DEFAULT_TEMPLATE.format(question=question, choices=choices_str)
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@nemo_rl/data/datasets/response_datasets/avqa.py` around lines 103 - 107, The prompt currently inserts the raw list returned by _parse_question into DEFAULT_TEMPLATE, producing Python-list style output (e.g., "['3','One',...]"); before formatting the template convert choices into a human-friendly string (e.g., choices_str = ", ".join(choices) or another desired separator/labeling) and use that string when building prompt_text (i.e., pass choices=choices_str to DEFAULT_TEMPLATE.format), keeping the rest of the logic (question replacement and prompt_text creation) unchanged.
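The difference is easy to demonstrate standalone (the `TEMPLATE` below is a stand-in for the real `DEFAULT_TEMPLATE` in avqa.py, not its actual text):

```python
# Stand-in for the real DEFAULT_TEMPLATE in avqa.py
TEMPLATE = "{question} Please choose from: {choices}."
choices = ["3", "One", "4", "2"]

# Passing the raw list renders Python repr syntax into the prompt:
raw = TEMPLATE.format(question="How many animals are there?", choices=choices)
# contains: Please choose from: ['3', 'One', '4', '2'].

# Joining first yields a natural-language prompt:
joined = TEMPLATE.format(
    question="How many animals are there?", choices=", ".join(choices)
)
# contains: Please choose from: 3, One, 4, 2.
```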
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@examples/configs/audio_grpo_3B_megatron.yaml`:
- Around line 63-66: The config hardcodes a local path for policy.model_name
which is user-specific; update policy.model_name to a HuggingFace model
identifier or a clear placeholder (e.g., "qwen/qwen-2.5-omni" or
"<HF_MODEL_ID>") and ensure tokenizer.name references the same identifier
(tokenizer.name: ${policy.model_name}) so others can run the example without the
local filesystem path.
- Line 140: The YAML sets converter_type: Qwen2_5OmniForConditionalGeneration
which is unsupported by Megatron-Bridge; update the converter_type entry to a
supported converter (e.g., Qwen2, Qwen2.5, Qwen2.5-VL, or a Qwen3 variant) or
remove the converter_type line and wire in a custom bridge implementation if
Omni (audio/video/speech) support is required; look for the converter_type key
in the file and replace Qwen2_5OmniForConditionalGeneration with the appropriate
supported converter name or add a note to implement a custom Megatron-Bridge
converter for Omni models.
In `@examples/configs/sft_audio_lm_megatron.yaml`:
- Around line 24-26: The config's policy.model_name is set to a user-local path
(/workspace_yuekai/HF/Qwen2-Audio-7B); replace it with a reproducible
HuggingFace model identifier or a clear placeholder (e.g., "Qwen2-Audio-7B" or
"<HF_MODEL_ID>") so other users can run the example, and ensure the
corresponding tokenizer field under policy is set to a matching tokenizer ID or
placeholder as well.
In `@nemo_rl/data/datasets/response_datasets/aishell.py`:
- Line 42: The load_dataset invocation in the constructor incorrectly hardcodes
split="test" and passes the validated split as a positional arg, causing the
user-provided split to be ignored; update the load_dataset call referenced by
self.dataset to use the split variable (e.g., pass split as the keyword
split=split or as the single positional split) and remove the hardcoded
split="test" so the requested split parameter is honored.
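The fix amounts to validating the requested split and then forwarding it, rather than hardcoding `split="test"`. A minimal sketch (the helper name `build_load_kwargs` is illustrative, not from the PR):

```python
VALID_SPLITS = ("train", "validation", "test")

def build_load_kwargs(path: str, split: str) -> dict:
    """Validate the requested split and build kwargs for load_dataset,
    instead of hardcoding split="test" and ignoring the argument."""
    if split not in VALID_SPLITS:
        raise ValueError(f"split must be one of {VALID_SPLITS}, got {split!r}")
    return {"path": path, "split": split}

# Usage in the constructor (datasets is a third-party dependency):
# from datasets import load_dataset
# self.dataset = load_dataset(**build_load_kwargs(data_path, split))
```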
- Line 33: vlm_hf_data_processor is missing a handler for task_name "aishell",
causing a ValueError; update the dispatcher in vlm_hf_data_processor (in
nemo_rl/data/processors.py) to add a branch for task_name == "aishell" that
mirrors the AVQA pass-through behavior (i.e., return the input examples/records
unchanged or call the same helper used by AVQA), referencing the task_name
"aishell" string and the vlm_hf_data_processor function name so the aishell
dataset in nemo_rl/data/datasets/response_datasets/aishell.py is processed
without error.
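Schematically, the dispatcher change looks like the following (a simplified sketch: the real `vlm_hf_data_processor` takes more arguments, and only the branching shape is shown):

```python
def vlm_hf_data_processor(example: dict, task_name: str) -> dict:
    """Simplified dispatcher sketch: route records by task_name.

    Adds an "aishell" branch mirroring the AVQA pass-through, so
    aishell records are no longer rejected with a ValueError.
    """
    if task_name in ("avqa", "aishell"):
        # Pass-through: these datasets arrive pre-formatted.
        return example
    raise ValueError(f"Unknown task_name: {task_name!r}")
```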
In `@nemo_rl/data/datasets/response_datasets/avqa.py`:
- Line 84: Replace the hardcoded path passed to load_dataset with a configurable
parameter: accept a data_path (or dataset_id) from the constructor kwargs or
config, default to a public HuggingFace dataset identifier if not provided, and
use that value when calling load_dataset to set self.dataset; update the
constructor signature and any callers to forward data_path and ensure the code
uses load_dataset(data_path_or_id, split=split) instead of the
developer-specific "/workspace_yuekai/HF/avqa-processed".
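In sketch form (the default identifier is a deliberate placeholder since the PR must choose a public dataset id, and loading is deferred so no network access happens at construction):

```python
from typing import Optional

DEFAULT_AVQA_ID = "<HF_DATASET_ID>"  # placeholder: substitute a public dataset id

class AVQADataset:
    """Sketch of a constructor that accepts a configurable data_path."""

    def __init__(self, data_path: Optional[str] = None, split: str = "train"):
        # Fall back to the public identifier instead of a developer-local path.
        self.data_path = data_path or DEFAULT_AVQA_ID
        self.split = split
        self._dataset = None  # loaded lazily in load()

    def load(self):
        if self._dataset is None:
            from datasets import load_dataset  # third-party dependency
            self._dataset = load_dataset(self.data_path, split=self.split)
        return self._dataset
```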
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: b678d307-04c5-4bc1-8c8e-4d1cd5f8e056
📒 Files selected for processing (15)
examples/configs/audio_grpo_3B_megatron.yaml
examples/configs/sft_audio_lm_megatron.yaml
examples/configs/sft_openmathinstruct2.yaml
examples/prompts/avqa_cot.txt
nemo_rl/data/collate_fn.py
nemo_rl/data/datasets/response_datasets/__init__.py
nemo_rl/data/datasets/response_datasets/aishell.py
nemo_rl/data/datasets/response_datasets/avqa.py
nemo_rl/data/multimodal_utils.py
nemo_rl/data/processors.py
nemo_rl/environments/utils.py
nemo_rl/experience/rollouts.py
nemo_rl/models/generation/vllm/utils.py
nemo_rl/models/megatron/setup.py
nemo_rl/utils/logger.py
Signed-off-by: root <zhangyuekai@foxmail.com>
@snowmanwwg Hi, I was wondering if you know someone who could help review this PR; many thanks. I have verified the PR with the training results below:
Conditional PRs: NVIDIA-NeMo/Megatron-Bridge#2634, NVIDIA-NeMo/Megatron-Bridge#2342
What does this PR do ?
Add a one line overview of what this PR aims to accomplish.
Issues
List issues that this PR closes (syntax):
Usage
# Add a code snippet demonstrating how to use this

Before your PR is "Ready for review"
Pre checks:
Additional Information
Summary by CodeRabbit