
add support for qwen3.5 vl model #70

Merged
RobotSail merged 14 commits into Red-Hat-AI-Innovation-Team:main from RobotSail:add-qwen3-vl-support
Mar 6, 2026

Conversation


@RobotSail RobotSail commented Mar 3, 2026

This PR allows the mini trainer backend to be compatible with the Qwen3.5 VL model.

Summary by CodeRabbit

  • New Features

    • VLM utilities to detect VLMs, extract CausalLM text backbones, or load VLMs directly for text-only training.
    • Added Qwen 3.5 and Ministral3 variants (and VLM entries like Qwen3VL, Gemma3n) to supported fine-tuning targets.
  • Improvements

    • Expanded model-name mappings and multi-step fallback resolution for VLM-wrapped models.
    • More robust SFT/OSFT loading for VLM cases, improved attention/kernel handling (M‑RoPE), vision-tower tweaks, and clearer fallback logging.


coderabbitai bot commented Mar 3, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.


Walkthrough

Add VLM-aware model loading and extraction to obtain CausalLM text backbones from Vision‑Language Models, detect M‑RoPE to adjust attention implementation, introduce Qwen3_5 model config/name mappings, expand supported OSFT/SFT model lists, and expose a classmethod on the dynamic OSFT wrapper.

Changes

Cohort / File(s) Summary
VLM utilities
src/mini_trainer/vlm_utils.py
New module implementing is_vlm_with_causal_lm, is_vlm_for_direct_loading, _find_text_backbone, extract_causal_lm_from_vlm, load_vlm_for_text_training, and has_mrope — detect VLMs, load/extract text backbones, filter quantization_config=None, and log progress.
Model setup & loading
src/mini_trainer/setup_model_for_training.py
Integrated VLM-aware loading flows for SFT/OSFT/standard loads: conditional full VLM load + extraction or direct VLM-for-text loading; added imports for VLM helpers; expanded model lists (Ministral3ForCausalLM, Qwen3_5ForCausalLM, Qwen3VLForConditionalGeneration, Gemma3nForConditionalGeneration); M‑RoPE, SDPA, timm-vision, and hybrid kernel handling added.
OSFT utilities & config
src/mini_trainer/osft_utils.py
Adjusted OSFT loading to detect/extract CausalLM backbones from VLMs for memory-efficient loading; added qwen3_5 config patterns and a name mapping (qwen3.5 → qwen3_5); added a @classmethod _can_set_experts_implementation on the dynamic ModelWithOSFT.
Model-class resolution
src/mini_trainer/utils.py
get_model_class_from_config now falls back to config.text_config mapping and then MODEL_FOR_IMAGE_TEXT_TO_TEXT_MAPPING if direct CausalLM mapping is missing, to support VLM-wrapped or direct-load VLM classes.
Misc / logging / kernels
src/mini_trainer/...
Broadened logging and guards around VLM extraction paths, filtered quantization_config=None when calling AutoModel loaders, patched local hybrid kernel mappings with fallback logging when imports fail.
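
The VLM detection described for vlm_utils.py can be sketched as follows. This is a hypothetical illustration of the described behavior, not the module's actual code: FakeConfig stands in for a transformers PretrainedConfig, and the "ConditionalGeneration plus nested text_config" heuristic is an assumption about how such a check could work.

```python
# Hypothetical sketch of the VLM detection summarized above; the real
# is_vlm_with_causal_lm in vlm_utils.py may use different heuristics.

class FakeConfig:
    """Stand-in for a transformers PretrainedConfig (assumption for this demo)."""
    def __init__(self, architectures, text_config=None):
        self.architectures = architectures
        self.text_config = text_config

def is_vlm_with_causal_lm(config) -> bool:
    # A VLM wrapping a text backbone exposes a nested text_config and a
    # ConditionalGeneration-style architecture name.
    archs = getattr(config, "architectures", None) or []
    has_text_config = getattr(config, "text_config", None) is not None
    return has_text_config and any("ConditionalGeneration" in a for a in archs)

vlm_cfg = FakeConfig(["Qwen3VLForConditionalGeneration"], text_config=object())
llm_cfg = FakeConfig(["Qwen3_5ForCausalLM"])
print(is_vlm_with_causal_lm(vlm_cfg), is_vlm_with_causal_lm(llm_cfg))  # True False
```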

Sequence Diagram(s)

sequenceDiagram
    participant Loader as Loader
    participant Hub as Model Hub
    participant AutoVLM as AutoModelForImageTextToText
    participant Base as CausalLM (reconstructed)
    participant Cache as Cache/Device

    Loader->>Hub: request model_path + load_kwargs
    alt config indicates VLM wrapping a CausalLM
        Loader->>AutoVLM: load full VLM
        AutoVLM-->>Loader: VLM instance
        Loader->>AutoVLM: locate text backbone (language_model / text_model / llm)
        Loader->>Base: build CausalLM from backbone config
        Loader->>Base: transfer weights and lm_head
        Loader->>Cache: free VLM objects, clear device/cache
        Loader-->>Base: return standalone CausalLM
    else VLM direct-loadable (no CausalLM)
        Loader->>AutoVLM: load VLM via AutoModelForImageTextToText
        AutoVLM-->>Loader: VLM as text model
        Loader-->>Base: return VLM-based text model
    else direct CausalLM
        Loader->>Hub: load via AutoModelForCausalLM
        Hub-->>Loader: CausalLM
        Loader-->>Base: return CausalLM
    end
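
The diagram's three branches can be sketched as a runnable dispatch. The dict-based model stubs and the wraps_causal_lm/is_vlm flags are invented for illustration and do not reflect the trainer's actual API; nothing here touches transformers.

```python
# Toy dispatch mirroring the diagram's three branches; loaders are stubbed
# with plain dicts and lambdas.

def load_text_model(config: dict, loaders: dict) -> dict:
    if config.get("wraps_causal_lm"):
        vlm = loaders["vlm"]()                 # load full VLM
        backbone = vlm["language_model"]       # locate text backbone
        return {"kind": "extracted_causal_lm", "weights": backbone}
    if config.get("is_vlm"):
        # direct-loadable VLM with no separate CausalLM class
        return {"kind": "vlm_as_text", "weights": loaders["vlm"]()}
    return {"kind": "causal_lm", "weights": loaders["causal_lm"]()}

loaders = {
    "vlm": lambda: {"language_model": "text-weights", "vision_tower": "vision"},
    "causal_lm": lambda: "text-weights",
}
print(load_text_model({"wraps_causal_lm": True}, loaders)["kind"])  # extracted_causal_lm
print(load_text_model({}, loaders)["kind"])                         # causal_lm
```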

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Poem

🐰 I hopped through configs, threads, and beam,
Pulled text from a VLM, stitched it like a seam,
Qwen3_5 hopped in with patterns brand new,
M‑RoPE nudged attention — SDPA too,
Joyful hops for models, tidy and true.

🚥 Pre-merge checks | ✅ 3
✅ Passed checks (3 passed)
  • Description Check: ✅ Passed. Check skipped: CodeRabbit's high-level summary is enabled.
  • Title Check: ✅ Passed. The title 'add support for qwen3.5 vl model' is directly related to the PR's primary objective of adding Qwen3.5 VL model compatibility, though the changeset also adds support for other VLMs (Gemma3N, Ministral3) and includes broader VLM infrastructure.
  • Docstring Coverage: ✅ Passed. Docstring coverage is 80.00%, which meets the required threshold of 80.00%.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


Comment @coderabbitai help to get the list of available commands and usage tips.


codecov bot commented Mar 3, 2026

Codecov Report

❌ Patch coverage is 39.35185% with 131 lines in your changes missing coverage. Please review.

Files with missing lines Patch % Lines
src/mini_trainer/vlm_utils.py 37.62% 63 Missing ⚠️
src/mini_trainer/setup_model_for_training.py 36.58% 52 Missing ⚠️
src/mini_trainer/utils.py 0.00% 9 Missing ⚠️
src/mini_trainer/osft_utils.py 65.00% 7 Missing ⚠️



@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 3

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/mini_trainer/osft_utils.py`:
- Around line 361-362: The "qwen3.5" entry is being shadowed by the broader
"qwen" matcher in the mapping used by _get_model_patterns_from_name(), so ensure
the more specific "qwen3.5" key (mapping to qwen3_5) is checked before the
general "qwen" key; reorder the model-name->pattern mapping (or adjust the
matching logic in _get_model_patterns_from_name() to prefer exact/specific
matches like "qwen3.5" before substring matches such as "qwen") so names like
"qwen3.5-*" resolve to qwen3_5 instead of the generic qwen pattern.

In `@src/mini_trainer/setup_model_for_training.py`:
- Around line 912-917: The M-RoPE SDPA fallback is incorrectly placed inside the
flash_attn import try block; move the _uses_mrope check so it executes before
any attempt to import flash_attn. Specifically, ensure the branch that sets
base_model_args["attn_implementation"] = "sdpa" and calls log_rank_0(f"Using
SDPA for {model_name_or_path} ...") runs unconditionally when _uses_mrope is
True (and before the flash_attn import/try block) so SDPA can be selected even
if the flash_attn package is absent.
- Around line 899-907: The code assumes rope_parameters is a dict and uses
`"mrope_section" in _rope_params`, which can raise TypeError when
rope_parameters is a RopeParameters object; update the check to handle both dict
and object types by normalizing or testing appropriately: retrieve
model_config.text_config (as _text_config), get rope_parameters into
_rope_params, then if _rope_params is a dict check membership with
"mrope_section" in _rope_params, else if it's an object check for attribute
presence (hasattr(_rope_params, "mrope_section") or getattr(_rope_params,
"mrope_section", None) is not None); set _uses_mrope accordingly so the
subsequent logic that uses _uses_mrope, model_config, and _text_config behaves
correctly.
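
Both findings above can be sketched together: normalize rope_parameters across dict and object shapes, then choose the attention implementation with the M-RoPE branch evaluated before any flash_attn import. The RopeParams class and the flash_attn_available flag are stand-ins for this demo; the real code works against transformers config objects and a real import.

```python
class RopeParams:
    """Stand-in for transformers' object-shaped RopeParameters."""
    def __init__(self, mrope_section=None):
        self.mrope_section = mrope_section

def uses_mrope(rope_params) -> bool:
    # Handle both dict-shaped and object-shaped rope parameters.
    if isinstance(rope_params, dict):
        return "mrope_section" in rope_params
    return getattr(rope_params, "mrope_section", None) is not None

def pick_attn_implementation(rope_params, flash_attn_available: bool,
                             testing: bool = False) -> str:
    # M-RoPE is decided first, so SDPA is selected even without flash_attn.
    if uses_mrope(rope_params):
        return "sdpa"
    if flash_attn_available:
        return "flash_attention_2"
    if testing:
        return "sdpa"
    raise ImportError("flash_attn required outside of testing")

print(pick_attn_implementation(RopeParams([16, 24, 24]), flash_attn_available=False))
print(pick_attn_implementation({}, flash_attn_available=True))
```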

ℹ️ Review info

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 4d6dc87 and 03aa99c.

📒 Files selected for processing (2)
  • src/mini_trainer/osft_utils.py
  • src/mini_trainer/setup_model_for_training.py


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

♻️ Duplicate comments (3)
src/mini_trainer/setup_model_for_training.py (2)

1008-1010: ⚠️ Potential issue | 🟠 Major

Handle rope_parameters as object-or-dict before checking mrope_section.

Line 1010 assumes dict membership; for object-shaped rope_parameters, this can throw TypeError and skip M-RoPE handling.

💡 Proposed fix
-    _rope_params = getattr(_text_config, "rope_parameters", {}) or {}
-    _uses_mrope = "mrope_section" in _rope_params
+    _rope_params = getattr(_text_config, "rope_parameters", None)
+    if isinstance(_rope_params, dict):
+        _uses_mrope = "mrope_section" in _rope_params
+    else:
+        _uses_mrope = getattr(_rope_params, "mrope_section", None) is not None
In transformers 4.57.1, what types can Qwen3_5TextConfig.rope_parameters take (dict vs object), and what is the recommended way to detect mrope_section?
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/mini_trainer/setup_model_for_training.py` around lines 1008 - 1010, The
code assumes rope_parameters is a dict and does "mrope_section" in _rope_params
which can raise TypeError for object-shaped rope_parameters; change the check to
handle both types by testing the type and using attribute access for objects:
for example, after _rope_params = getattr(_text_config, "rope_parameters", {})
or {}, set _uses_mrope = ("mrope_section" in _rope_params) if
isinstance(_rope_params, dict) else bool(getattr(_rope_params, "mrope_section",
False)); update references to model_config, _text_config, _rope_params, and
_uses_mrope accordingly.

1013-1041: ⚠️ Potential issue | 🟠 Major

M-RoPE SDPA fallback should be evaluated before flash_attn import.

Right now, if flash_attn is missing, control hits ImportError before the M-RoPE SDPA branch runs, which blocks intended compatibility for M-RoPE models.

💡 Proposed fix
-    try:
-        import flash_attn as _  # noqa: F401
-
-        if _uses_mrope:
-            base_model_args["attn_implementation"] = "sdpa"
-            log_rank_0(
-                f"Using SDPA for {model_name_or_path} (M-RoPE model incompatible with Flash Attention 2 varlen path)"
-            )
-        elif is_gpt_oss:
+    if _uses_mrope:
+        base_model_args["attn_implementation"] = "sdpa"
+        log_rank_0(
+            f"Using SDPA for {model_name_or_path} (M-RoPE model incompatible with Flash Attention 2 varlen path)"
+        )
+    else:
+        try:
+            import flash_attn as _  # noqa: F401
+            if is_gpt_oss:
                 major, _ = torch.cuda.get_device_capability(0)
                 if major >= 9:
                     base_model_args["attn_implementation"] = "kernels-community/vllm-flash-attn3"
                     log_rank_0("Set attention implementation to vllm-flash-attn3 for GPT-OSS")
                 else:
                     base_model_args["attn_implementation"] = "eager"
                     log_rank_0(
                         f"GPT-OSS: flash-attn3 requires Hopper (SM 9.0+) GPUs, "
                         f"but found SM {major}.x. Using eager attention instead."
                     )
-        else:
-            base_model_args["attn_implementation"] = "flash_attention_2"
-
-    except ImportError as e:
-        if os.environ.get("TESTING", "false").lower() == "true":
-            base_model_args["attn_implementation"] = "sdpa"
-        else:
-            raise e
+            else:
+                base_model_args["attn_implementation"] = "flash_attention_2"
+        except ImportError as e:
+            if os.environ.get("TESTING", "false").lower() == "true":
+                base_model_args["attn_implementation"] = "sdpa"
+            else:
+                raise e
Does Hugging Face Transformers require the flash_attn package when attn_implementation is set to "sdpa"?

Based on learnings: in this repo, flash-attention is optional (cuda extras), so M-RoPE SDPA selection should not depend on importing flash_attn.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/mini_trainer/setup_model_for_training.py` around lines 1013 - 1041, The
M-RoPE SDPA fallback is currently gated by the try import of flash_attn so an
ImportError prevents selecting "sdpa" for _uses_mrope models; to fix, evaluate
the _uses_mrope branch before attempting to import flash_attn: if _uses_mrope is
True, set base_model_args["attn_implementation"] = "sdpa" and log for
model_name_or_path, then skip the flash_attn import/handling entirely; otherwise
proceed with the existing try/except import flash_attn block that handles
is_gpt_oss and sets "vllm-flash-attn3", "eager", or "flash_attention_2" and
preserves the ImportError behavior for TESTING env.
src/mini_trainer/osft_utils.py (1)

361-362: ⚠️ Potential issue | 🟠 Major

qwen3.5 mapping is still shadowed by the generic qwen matcher.

Because matching is first-hit substring order, qwen wins before qwen3.5, so the new qwen3_5 patterns won’t be selected for names like qwen3.5-*.

💡 Proposed fix (prefer more specific identifiers first)
 def _get_model_patterns_from_name(name: str) -> list:
@@
-    for identifier, config_key in MODEL_NAME_MAPPINGS.items():
+    for identifier, config_key in sorted(
+        MODEL_NAME_MAPPINGS.items(),
+        key=lambda item: len(item[0]),
+        reverse=True,
+    ):
         if identifier in name.lower():
             return MODEL_CONFIGS[config_key]["patterns"]

Also applies to: 615-617

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/mini_trainer/osft_utils.py` around lines 361 - 362, The current
model-name mapping places the generic "qwen" entry before the more specific
"qwen3.5"/"qwen3_5" entries so substring-first matching picks "qwen" for names
like "qwen3.5-*"; reorder the mapping so "qwen3.5" and "qwen3_5" appear before
the generic "qwen" key (and apply the same reordering for the other occurrence
around the lines with keys at 615-617) so the specific matcher wins; locate and
update the dictionary that contains the "qwen", "qwen3.5", and "qwen3_5" keys
and move the specific entries earlier in that dict.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/mini_trainer/setup_model_for_training.py`:
- Around line 77-82: The vlm_kwargs dictionary includes
"pretrained_model_name_or_path" from load_kwargs/base_model_args which conflicts
with passing model_path positionally to
AutoModelForImageTextToText.from_pretrained; remove that key before calling
from_pretrained. Update the construction of vlm_kwargs (derived from
load_kwargs) to pop or filter out "pretrained_model_name_or_path" (and any
duplicate arg) so the call
AutoModelForImageTextToText.from_pretrained(model_path, **vlm_kwargs) does not
receive the same parameter twice; apply the same fix to the other VLM load sites
that use vlm_kwargs/load_kwargs in the distributed and non-distributed load
paths.
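
The fix described above boils down to filtering the kwargs before the from_pretrained call. A sketch with illustrative keys (the actual load_kwargs contents vary by call site):

```python
def build_vlm_kwargs(load_kwargs: dict) -> dict:
    # Drop the key that would duplicate the positional model_path argument,
    # and a quantization_config explicitly set to None.
    return {
        k: v for k, v in load_kwargs.items()
        if k != "pretrained_model_name_or_path"
        and not (k == "quantization_config" and v is None)
    }

vlm_kwargs = build_vlm_kwargs({
    "pretrained_model_name_or_path": "org/some-vlm",
    "quantization_config": None,
    "torch_dtype": "bfloat16",
})
print(vlm_kwargs)  # {'torch_dtype': 'bfloat16'}
```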

---

Duplicate comments:
In `@src/mini_trainer/osft_utils.py`:
- Around line 361-362: The current model-name mapping places the generic "qwen"
entry before the more specific "qwen3.5"/"qwen3_5" entries so substring-first
matching picks "qwen" for names like "qwen3.5-*"; reorder the mapping so
"qwen3.5" and "qwen3_5" appear before the generic "qwen" key (and apply the same
reordering for the other occurrence around the lines with keys at 615-617) so
the specific matcher wins; locate and update the dictionary that contains the
"qwen", "qwen3.5", and "qwen3_5" keys and move the specific entries earlier in
that dict.

In `@src/mini_trainer/setup_model_for_training.py`:
- Around line 1008-1010: The code assumes rope_parameters is a dict and does
"mrope_section" in _rope_params which can raise TypeError for object-shaped
rope_parameters; change the check to handle both types by testing the type and
using attribute access for objects: for example, after _rope_params =
getattr(_text_config, "rope_parameters", {}) or {}, set _uses_mrope =
("mrope_section" in _rope_params) if isinstance(_rope_params, dict) else
bool(getattr(_rope_params, "mrope_section", False)); update references to
model_config, _text_config, _rope_params, and _uses_mrope accordingly.
- Around line 1013-1041: The M-RoPE SDPA fallback is currently gated by the try
import of flash_attn so an ImportError prevents selecting "sdpa" for _uses_mrope
models; to fix, evaluate the _uses_mrope branch before attempting to import
flash_attn: if _uses_mrope is True, set base_model_args["attn_implementation"] =
"sdpa" and log for model_name_or_path, then skip the flash_attn import/handling
entirely; otherwise proceed with the existing try/except import flash_attn block
that handles is_gpt_oss and sets "vllm-flash-attn3", "eager", or
"flash_attention_2" and preserves the ImportError behavior for TESTING env.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 2978983a-ebca-4679-8ce0-1ca4a89690df

📥 Commits

Reviewing files that changed from the base of the PR and between 03aa99c and eab4097.

📒 Files selected for processing (3)
  • src/mini_trainer/osft_utils.py
  • src/mini_trainer/setup_model_for_training.py
  • src/mini_trainer/utils.py


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 4

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
src/mini_trainer/setup_model_for_training.py (1)

946-974: ⚠️ Potential issue | 🟠 Major

M-RoPE SDPA selection still depends on flash_attn import succeeding.

The past review flagged that M-RoPE models should be able to use SDPA without requiring flash_attn to be installed. Currently, _uses_mrope check (line 949) is inside the try block. If flash_attn import fails, the except block only sets SDPA when TESTING=true, not when M-RoPE is detected.

💡 Proposed fix (handle M-RoPE before flash-attn import)
+    # Handle M-RoPE models first - they need SDPA regardless of flash_attn availability
+    if _uses_mrope:
+        base_model_args["attn_implementation"] = "sdpa"
+        log_rank_0(
+            f"Using SDPA for {model_name_or_path} (M-RoPE model incompatible with Flash Attention 2 varlen path)"
+        )
     # Check if flash_attn is available and set appropriate attention implementation
-    try:
+    elif not _uses_mrope:
+      try:
         import flash_attn as _  # noqa: F401
 
-        if _uses_mrope:
-            base_model_args["attn_implementation"] = "sdpa"
-            log_rank_0(
-                f"Using SDPA for {model_name_or_path} (M-RoPE model incompatible with Flash Attention 2 varlen path)"
-            )
-        elif is_gpt_oss:
+        if is_gpt_oss:
             # ... rest of GPT-OSS handling
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/mini_trainer/setup_model_for_training.py` around lines 946 - 974, The
M-RoPE branch (_uses_mrope) must be evaluated and set before attempting to
import flash_attn so SDPA is chosen even if flash_attn is absent: move the
_uses_mrope check out of the try block (or evaluate it immediately before the
try) and, if True, set base_model_args["attn_implementation"]="sdpa" and call
log_rank_0(...) for the M-RoPE message without relying on the flash_attn import;
keep the existing flash_attn import/other branches for non-M-RoPE models and
ensure the ImportError except block does not override an already-set SDPA choice
for _uses_mrope.
♻️ Duplicate comments (1)
src/mini_trainer/osft_utils.py (1)

362-363: ⚠️ Potential issue | 🟠 Major

qwen3.5 mapping is shadowed by the existing qwen matcher.

This issue was flagged in a previous review. The _get_model_patterns_from_name() function performs substring matching in dict iteration order. Since "qwen" appears earlier (line 353), model names like "qwen3.5-72b" will match "qwen" first and never reach the new "qwen3.5" entry.

💡 Proposed fix (prefer longer/more-specific identifiers first)
 def _get_model_patterns_from_name(name: str) -> list:
@@
-    for identifier, config_key in MODEL_NAME_MAPPINGS.items():
+    # Sort by identifier length descending to match more specific patterns first
+    for identifier, config_key in sorted(
+        MODEL_NAME_MAPPINGS.items(),
+        key=lambda item: len(item[0]),
+        reverse=True,
+    ):
         if identifier in name.lower():
             return MODEL_CONFIGS[config_key]["patterns"]
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/mini_trainer/osft_utils.py` around lines 362 - 363, The
"_get_model_patterns_from_name()" matching logic currently iterates the
model-key mapping in insertion order so the short key "qwen" shadows "qwen3.5"
and prevents names like "qwen3.5-72b" from matching the more specific pattern;
update the function to prefer longer/more-specific keys first (e.g., sort the
mapping keys by descending length before matching or iterate a pre-sorted list
of keys), or adjust the match logic to use full-token/regex boundaries for keys
"qwen" and "qwen3.5" so "qwen3.5" is tested before "qwen" and properly selected.
Ensure changes reference the mapping entries "qwen" and "qwen3.5" and the
function "_get_model_patterns_from_name()".
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/mini_trainer/osft_utils.py`:
- Around line 783-801: The VLM extraction branch may pass a duplicate
pretrained_model_name_or_path via final_base_kwargs; before calling
extract_causal_lm_from_vlm, remove or pop the key
"pretrained_model_name_or_path" (and any alias like "model_name_or_path" if
present) from final_base_kwargs to avoid duplicate-argument errors, then call
extract_causal_lm_from_vlm(pretrained_model_name_or_path, final_base_kwargs);
keep the existing base_model_class.from_pretrained(...) usage unchanged for the
non-VLM path and ensure the mutation is local (copy final_base_kwargs if needed)
to not affect callers.
- Around line 918-921: The current classmethod _can_set_experts_implementation
calls base_cls._can_set_experts_implementation() unguarded, which raises
AttributeError for non-MoE base classes; change it to check for the attribute on
base_cls (e.g., using hasattr or getattr with a default) and return False when
the method is missing so non-MoE bases are handled safely; update the
implementation that references base_cls and _can_set_experts_implementation
accordingly.
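
The guarded delegation suggested above can be sketched like this; MoEBase and DenseBase are toy stand-ins for the real base classes, and the wrapper factory is an assumption about how the dynamic class is built:

```python
class MoEBase:
    @classmethod
    def _can_set_experts_implementation(cls) -> bool:
        return True

class DenseBase:
    pass  # non-MoE bases have no such method

def make_osft_wrapper(base_cls):
    class ModelWithOSFT(base_cls):
        @classmethod
        def _can_set_experts_implementation(cls) -> bool:
            # Delegate only when the base actually defines the hook,
            # so non-MoE bases return False instead of raising.
            parent = getattr(base_cls, "_can_set_experts_implementation", None)
            return parent() if callable(parent) else False
    return ModelWithOSFT

print(make_osft_wrapper(MoEBase)._can_set_experts_implementation())   # True
print(make_osft_wrapper(DenseBase)._can_set_experts_implementation()) # False
```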

In `@src/mini_trainer/vlm_utils.py`:
- Around line 93-97: vlm_kwargs currently filters only None quantization_config
but may still include pretrained_model_name_or_path from load_kwargs, causing a
duplicate-argument TypeError when calling
AutoModelForImageTextToText.from_pretrained(model_path, **vlm_kwargs); update
the vlm_kwargs comprehension to also exclude the "pretrained_model_name_or_path"
key (i.e., drop any k == "pretrained_model_name_or_path") so that
from_pretrained is called with model_path positionally and no conflicting
keyword in vlm_kwargs.
- Around line 143-146: The loop that checks for M-RoPE only treats
rope_scaling/rope_parameters as dicts, missing cases where rope_parameters is a
RopeParameters object; update the check in the for-loop that uses rope_dict
(from getattr(cfg, attr, None)) so it returns True if either (a) rope_dict is a
dict and contains the "mrope_section" key or (b) rope_dict is an object that has
an attribute named "mrope_section" (use hasattr or getattr(rope_dict,
"mrope_section", None)). Keep the existing return True behavior when
mrope_section is found.

---

Outside diff comments:
In `@src/mini_trainer/setup_model_for_training.py`:
- Around line 946-974: The M-RoPE branch (_uses_mrope) must be evaluated and set
before attempting to import flash_attn so SDPA is chosen even if flash_attn is
absent: move the _uses_mrope check out of the try block (or evaluate it
immediately before the try) and, if True, set
base_model_args["attn_implementation"]="sdpa" and call log_rank_0(...) for the
M-RoPE message without relying on the flash_attn import; keep the existing
flash_attn import/other branches for non-M-RoPE models and ensure the
ImportError except block does not override an already-set SDPA choice for
_uses_mrope.

---

Duplicate comments:
In `@src/mini_trainer/osft_utils.py`:
- Around line 362-363: The "_get_model_patterns_from_name()" matching logic
currently iterates the model-key mapping in insertion order so the short key
"qwen" shadows "qwen3.5" and prevents names like "qwen3.5-72b" from matching the
more specific pattern; update the function to prefer longer/more-specific keys
first (e.g., sort the mapping keys by descending length before matching or
iterate a pre-sorted list of keys), or adjust the match logic to use
full-token/regex boundaries for keys "qwen" and "qwen3.5" so "qwen3.5" is tested
before "qwen" and properly selected. Ensure changes reference the mapping
entries "qwen" and "qwen3.5" and the function "_get_model_patterns_from_name()".

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 0f4a3fb4-f91e-4d9d-bc7f-e49f98a722aa

📥 Commits

Reviewing files that changed from the base of the PR and between eab4097 and f6fadc8.

📒 Files selected for processing (4)
  • src/mini_trainer/osft_utils.py
  • src/mini_trainer/setup_model_for_training.py
  • src/mini_trainer/utils.py
  • src/mini_trainer/vlm_utils.py
🚧 Files skipped from review as they are similar to previous changes (1)
  • src/mini_trainer/utils.py


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

♻️ Duplicate comments (3)
src/mini_trainer/vlm_utils.py (3)

92-96: ⚠️ Potential issue | 🟠 Major

Filter out pretrained_model_name_or_path to prevent duplicate argument error.

When load_kwargs originates from base_model_args (which contains pretrained_model_name_or_path), passing it to from_pretrained(model_path, **filtered_kwargs) causes a TypeError for duplicate keyword argument since model_path is already passed positionally.

🐛 Proposed fix
     filtered_kwargs = {
         k: v for k, v in load_kwargs.items()
-        if not (k == "quantization_config" and v is None)
+        if k != "pretrained_model_name_or_path"
+        and not (k == "quantization_config" and v is None)
     }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/mini_trainer/vlm_utils.py` around lines 92 - 96, The call to
AutoModelForImageTextToText.from_pretrained(model_path, **filtered_kwargs) can
receive a duplicate pretrained_model_name_or_path from load_kwargs, causing a
TypeError; update the filtered_kwargs construction (used before the
from_pretrained call) to also remove any "pretrained_model_name_or_path" key (in
addition to the existing quantization_config None filter) so filtered_kwargs
does not contain that key when calling
AutoModelForImageTextToText.from_pretrained with model_path.

159-162: ⚠️ Potential issue | 🟠 Major

Filter out pretrained_model_name_or_path to prevent duplicate argument error.

Same issue as load_vlm_for_text_training: load_kwargs from base_model_args contains pretrained_model_name_or_path, causing a TypeError when passed to from_pretrained(model_path, **vlm_kwargs).

🐛 Proposed fix
     vlm_kwargs = {
         k: v for k, v in load_kwargs.items()
-        if not (k == "quantization_config" and v is None)
+        if k != "pretrained_model_name_or_path"
+        and not (k == "quantization_config" and v is None)
     }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/mini_trainer/vlm_utils.py` around lines 159 - 162, The vlm_kwargs dict
comprehension in vlm_utils.py is passing through pretrained_model_name_or_path
from load_kwargs, causing a duplicate-argument TypeError when calling
from_pretrained; update the comprehension that builds vlm_kwargs (the current
expression creating vlm_kwargs) to also exclude the key
"pretrained_model_name_or_path" in addition to the existing exclusion for
("quantization_config" and None) so that from_pretrained(model_path,
**vlm_kwargs) no longer receives a duplicate pretrained_model_name_or_path.

209-212: ⚠️ Potential issue | 🟠 Major

Handle RopeParameters objects in addition to dicts.

Per Transformers 4.57.1, rope_parameters can be a RopeParameters object (not just a dict). The current isinstance(rope_dict, dict) check will miss M-RoPE detection when rope_parameters is an object with mrope_section as an attribute.

🐛 Proposed fix
         for attr in ("rope_scaling", "rope_parameters"):
             rope_dict = getattr(cfg, attr, None)
-            if isinstance(rope_dict, dict) and "mrope_section" in rope_dict:
-                return True
+            if rope_dict is not None:
+                if isinstance(rope_dict, dict) and "mrope_section" in rope_dict:
+                    return True
+                elif hasattr(rope_dict, "mrope_section") and getattr(rope_dict, "mrope_section", None) is not None:
+                    return True
     return False
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/mini_trainer/vlm_utils.py` around lines 209 - 212, The loop that checks
for M-RoPE currently only treats rope_scaling/rope_parameters as dicts and
misses when rope_parameters is a Transformers RopeParameters object; update the
check in the loop (iterating attrs "rope_scaling" and "rope_parameters" and
using the local variable rope_dict) to detect M-RoPE for both dicts and objects
by: if rope_dict is a dict, check "mrope_section" in rope_dict; otherwise check
hasattr(rope_dict, "mrope_section") or getattr(rope_dict, "mrope_section", None)
is not None (optionally import or reference the RopeParameters type for an
isinstance check). Ensure the code still returns True when an M-RoPE section is
found.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/mini_trainer/setup_model_for_training.py`:
- Around line 915-939: The comment is misleading and the try/except only catches
ImportError, risking partial patching of _KERNEL_MODULE_MAPPING if other errors
occur; update the comment to accurately describe the real compatibility concern
(e.g., build/PyTorch/CUDA mismatches) and make the patching defensive by
separating imports from mapping/attribute assignment: first import
causal_conv1d, mamba_ssm and the triton symbols (selective_state_update,
mamba_chunk_scan_combined, mamba_split_conv1d_scan_combined) inside a try that
catches ImportError and logs the error; then perform the assignments to
mamba_ssm and updates to _KERNEL_MODULE_MAPPING inside a second try that catches
Exception, logs the full exception via log_rank_0 (including stack/str), and on
failure undoes any partial changes to _KERNEL_MODULE_MAPPING and any attributes
set on mamba_ssm to keep state consistent (or only update the mapping atomically
after all attributes succeed).


ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 3ab2fc72-5a5c-40b5-8347-b4e8242b04e0

📥 Commits

Reviewing files that changed from the base of the PR and between f6fadc8 and 2ffdc1a.

📒 Files selected for processing (3)
  • src/mini_trainer/setup_model_for_training.py
  • src/mini_trainer/utils.py
  • src/mini_trainer/vlm_utils.py
🚧 Files skipped from review as they are similar to previous changes (1)
  • src/mini_trainer/utils.py

@coderabbitai coderabbitai bot left a comment

♻️ Duplicate comments (3)
src/mini_trainer/vlm_utils.py (2)

92-96: ⚠️ Potential issue | 🔴 Critical

Exclude pretrained_model_name_or_path before forwarding these kwargs.

setup_model() always seeds base_model_args with that key, so both helpers pass the model id twice: once positionally and once via **load_kwargs. That raises a TypeError before either the direct-load or the extract path can start, which blocks the new VLM support entirely.

🐛 Proposed fix
     filtered_kwargs = {
         k: v for k, v in load_kwargs.items()
-        if not (k == "quantization_config" and v is None)
+        if k != "pretrained_model_name_or_path"
+        and not (k == "quantization_config" and v is None)
     }
     vlm_kwargs = {
         k: v for k, v in load_kwargs.items()
-        if not (k == "quantization_config" and v is None)
+        if k != "pretrained_model_name_or_path"
+        and not (k == "quantization_config" and v is None)
     }
#!/bin/bash
set -euo pipefail

echo "Relevant helper code:"
sed -n '90,97p;157,164p' src/mini_trainer/vlm_utils.py

echo
echo "Caller populating pretrained_model_name_or_path:"
sed -n '907,910p' src/mini_trainer/setup_model_for_training.py

echo
echo "Python raises on positional + duplicate keyword:"
python - <<'PY'
def from_pretrained(pretrained_model_name_or_path=None, **kwargs):
    return pretrained_model_name_or_path, kwargs

try:
    from_pretrained("model-id", pretrained_model_name_or_path="model-id")
except TypeError as exc:
    print(type(exc).__name__ + ":", exc)
PY

Also applies to: 159-163

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/mini_trainer/vlm_utils.py` around lines 92 - 96, The call to
AutoModelForImageTextToText.from_pretrained receives duplicate model id because
setup_model() seeds base_model_args with pretrained_model_name_or_path and
load_kwargs also contains it; update the kwargs filtering (the filtered_kwargs
built from load_kwargs) to drop the "pretrained_model_name_or_path" key (in
addition to the existing quantization_config check) before calling
AutoModelForImageTextToText.from_pretrained, and apply the same removal in the
other helper block that builds filtered_kwargs (the second region around the
existing filtered_kwargs usage) so the model id is only passed positionally.
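
The filtering both helpers need can be factored into one function. This is a sketch mirroring the key names in the diff above; the function name `filter_vlm_kwargs` is hypothetical, not something in the repo:

```python
def filter_vlm_kwargs(load_kwargs: dict) -> dict:
    """Drop keys that would collide with the positional model id or carry a
    meaningless None quantization_config."""
    return {
        k: v
        for k, v in load_kwargs.items()
        if k != "pretrained_model_name_or_path"
        and not (k == "quantization_config" and v is None)
    }
```

With this in place, `from_pretrained(model_path, **filter_vlm_kwargs(load_kwargs))` passes the model id only once, positionally.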

209-212: ⚠️ Potential issue | 🟠 Major

has_mrope() still misses non-dict rope_parameters.

For Qwen3.5 configs, rope_parameters is not guaranteed to stay a plain dict. If it comes through as a structured object, this check returns False, so needs_sdpa() misses the M-RoPE case and the model can still be pushed onto Flash Attention 2.

In transformers 4.57.1, can `Qwen3_5TextConfig.rope_parameters` be a `RopeParameters` object or other non-plain-dict type, and where is `mrope_section` stored for M-RoPE configs?
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/mini_trainer/vlm_utils.py` around lines 209 - 212, has_mrope() currently
only detects M-RoPE when rope_parameters/rope_scaling are plain dicts; update it
to also detect structured objects (like transformers' RopeParameters) by
checking for an attribute named "mrope_section" on non-dict values: for each
attr in ("rope_scaling","rope_parameters") get the value via getattr(cfg, attr,
None), if it's a dict check "mrope_section" in the dict, otherwise check
hasattr(value, "mrope_section") or getattr(value, "mrope_section", None) is not
None and return True; ensure needs_sdpa() (or callers relying on has_mrope())
will then correctly detect M-RoPE from both dicts and structured objects.
src/mini_trainer/setup_model_for_training.py (1)

956-990: ⚠️ Potential issue | 🟠 Major

Pick the SDPA fallback before importing flash_attn.

_needs_sdpa is computed first, but the branch that uses it still lives inside the flash_attn import block. On a machine without flash_attn, Qwen3.5-VL / timm-vision models raise ImportError instead of taking the SDPA/eager path this PR adds.

💡 Suggested control-flow change
-    try:
-        import flash_attn as _  # noqa: F401
-
-        if _needs_sdpa:
-            base_model_args["attn_implementation"] = "sdpa"
-            log_rank_0(
-                f"Using SDPA for {model_name_or_path} (model incompatible with Flash Attention 2)"
-            )
-        elif is_gpt_oss:
+    if _needs_sdpa:
+        base_model_args["attn_implementation"] = "sdpa"
+        log_rank_0(
+            f"Using SDPA for {model_name_or_path} (model incompatible with Flash Attention 2)"
+        )
+    else:
+        try:
+            import flash_attn as _  # noqa: F401
+            if is_gpt_oss:
                 ...
-        else:
-            base_model_args["attn_implementation"] = "flash_attention_2"
-
-    except ImportError as e:
-        if os.environ.get("TESTING", "false").lower() == "true":
-            base_model_args["attn_implementation"] = "sdpa"
-        else:
-            raise e
+            else:
+                base_model_args["attn_implementation"] = "flash_attention_2"
+        except ImportError as e:
+            if os.environ.get("TESTING", "false").lower() == "true":
+                base_model_args["attn_implementation"] = "sdpa"
+            else:
+                raise e
Does Hugging Face Transformers support `attn_implementation="sdpa"` without the `flash_attn` package installed, using PyTorch SDPA alone?
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/mini_trainer/setup_model_for_training.py` around lines 956 - 990, Compute
and apply the SDPA fallback before trying to import flash_attn: if _needs_sdpa
is True, set base_model_args["attn_implementation"]="sdpa" and log via
log_rank_0, and skip the flash_attn import/branches entirely; otherwise proceed
to the try: import flash_attn block and keep the existing is_gpt_oss /
flash_attention_2 logic there, and keep the existing ImportError TESTING
fallback that sets attn_implementation to "sdpa".
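
The reordered control flow can be reduced to a pure decision function, which makes the point testable in isolation. This is a sketch, not the PR's code: flash_attn availability is injected as a flag here (the real code attempts `import flash_attn`), and the `testing` parameter stands in for the `TESTING` env-var check.

```python
def choose_attn_implementation(
    needs_sdpa: bool, flash_attn_available: bool, testing: bool = False
) -> str:
    if needs_sdpa:
        # M-RoPE / vision-tower models take SDPA without touching flash_attn.
        return "sdpa"
    if flash_attn_available:
        return "flash_attention_2"
    if testing:
        # CI fallback mirrors the TESTING env-var branch.
        return "sdpa"
    raise ImportError("flash_attn is required outside of testing")
```

The key property is that the first branch never depends on the import succeeding, so SDPA-only machines can still load Qwen3.5-VL.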

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: da8d3b88-93ba-4439-bc1e-81f89bce14ee

📥 Commits

Reviewing files that changed from the base of the PR and between 2ffdc1a and e572a22.

📒 Files selected for processing (2)
  • src/mini_trainer/setup_model_for_training.py
  • src/mini_trainer/vlm_utils.py

RobotSail and others added 8 commits March 6, 2026 05:58
- Reorder MODEL_NAME_MAPPINGS for correct substring matching
- Filter pretrained_model_name_or_path from VLM load kwargs
- Move SDPA decision outside flash_attn import block

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Use spec=[] on mock model.model to prevent hasattr from
falsely detecting language_model attribute in wrap_fsdp2 tests.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- osft_utils.py: filter pretrained_model_name_or_path from VLM kwargs
  in OSFT path to prevent duplicate argument error
- osft_utils.py: add hasattr guard for _can_set_experts_implementation
  on non-MoE base classes
- vlm_utils.py: handle RopeParameters objects (not just dicts) in
  mrope detection via hasattr fallback
- Fix isort ordering

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Wrap AutoConfig.from_pretrained in try/except in
_load_model_memory_efficient so mock/dummy model paths
don't crash the VLM detection check.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Catch AttributeError in addition to ImportError to prevent partial
  patching of _KERNEL_MODULE_MAPPING
- Update comment to accurately describe the compatibility concern
  (PyTorch/CUDA ABI mismatch, not C API incompatibility)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Reformat 8 test/source files to match CI ruff version
- Fix UP038: use X | Y in isinstance call

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Remove unnecessary OSFT guard for direct VLMs (patterns match fine)
- Add _get_text_config() helper for VLM config fallback in
  align_model_and_tokenizer (vocab_size, pad/bos/eos_token_id)
- Fix model.config.pad_token_id access in train.py for VLM configs
- Skip activation checkpointing for direct VLM models (M-RoPE layers
  produce non-deterministic tensor counts during reentrant recomputation)
- Use dynamic ports (_get_free_port) in model_validation.py to prevent
  port conflicts between sequential tests

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Contributor

@Maxusmusti Maxusmusti left a comment
LGTM

@RobotSail RobotSail merged commit 3c1e9d7 into Red-Hat-AI-Innovation-Team:main Mar 6, 2026
10 of 11 checks passed
