add support for qwen3.5 vl model #70
RobotSail merged 14 commits into Red-Hat-AI-Innovation-Team:main
Conversation
Note: Reviews paused. It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior in the review settings.
📝 Walkthrough

Add VLM-aware model loading and extraction to obtain CausalLM text backbones from Vision-Language Models, detect M-RoPE to adjust the attention implementation, introduce Qwen3_5 model config/name mappings, expand the supported OSFT/SFT model lists, and expose a classmethod on the dynamic OSFT wrapper.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Loader as Loader
    participant Hub as Model Hub
    participant AutoVLM as AutoModelForImageTextToText
    participant Base as CausalLM (reconstructed)
    participant Cache as Cache/Device

    Loader->>Hub: request model_path + load_kwargs
    alt config indicates VLM wrapping a CausalLM
        Loader->>AutoVLM: load full VLM
        AutoVLM-->>Loader: VLM instance
        Loader->>AutoVLM: locate text backbone (language_model / text_model / llm)
        Loader->>Base: build CausalLM from backbone config
        Loader->>Base: transfer weights and lm_head
        Loader->>Cache: free VLM objects, clear device/cache
        Loader-->>Base: return standalone CausalLM
    else VLM direct-loadable (no CausalLM)
        Loader->>AutoVLM: load VLM via AutoModelForImageTextToText
        AutoVLM-->>Loader: VLM as text model
        Loader-->>Base: return VLM-based text model
    else direct CausalLM
        Loader->>Hub: load via AutoModelForCausalLM
        Hub-->>Loader: CausalLM
        Loader-->>Base: return CausalLM
    end
```
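For orientation, here is a minimal Python sketch of the "VLM wrapping a CausalLM" branch the diagram describes. The attribute names (`language_model`, `text_model`, `llm`) come from the walkthrough, but the helper itself, its signature, and the weight-transfer details are illustrative assumptions, not the PR's actual code.

```python
# Illustrative sketch only: the real implementation lives in src/mini_trainer/vlm_utils.py.
import gc

import torch
from transformers import AutoModelForCausalLM, AutoModelForImageTextToText


def extract_text_backbone(model_path: str, **load_kwargs):
    """Load a VLM, rebuild its text backbone as a standalone CausalLM, free the VLM."""
    vlm = AutoModelForImageTextToText.from_pretrained(model_path, **load_kwargs)

    # Locate the text backbone under one of the attribute names from the walkthrough.
    backbone = None
    for owner in (vlm, getattr(vlm, "model", None)):
        if owner is None:
            continue
        for attr in ("language_model", "text_model", "llm"):
            backbone = getattr(owner, attr, None)
            if backbone is not None:
                break
        if backbone is not None:
            break
    if backbone is None:
        raise ValueError(f"no text backbone found on {type(vlm).__name__}")

    # Rebuild a CausalLM from the backbone's config, then transfer the weights.
    text_lm = AutoModelForCausalLM.from_config(backbone.config)
    text_lm.model.load_state_dict(backbone.state_dict())
    if hasattr(vlm, "lm_head"):
        text_lm.lm_head.load_state_dict(vlm.lm_head.state_dict())

    # Free the full VLM and clear device caches, mirroring the diagram's last step.
    del vlm, backbone
    gc.collect()
    if torch.cuda.is_available():
        torch.cuda.empty_cache()
    return text_lm
```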
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes
🚥 Pre-merge checks: ✅ Passed checks (3 passed)
Codecov Report: ❌ Patch coverage is …
Actionable comments posted: 3
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@src/mini_trainer/osft_utils.py`:
- Around line 361-362: The "qwen3.5" entry is being shadowed by the broader
"qwen" matcher in the mapping used by _get_model_patterns_from_name(), so ensure
the more specific "qwen3.5" key (mapping to qwen3_5) is checked before the
general "qwen" key; reorder the model-name->pattern mapping (or adjust the
matching logic in _get_model_patterns_from_name() to prefer exact/specific
matches like "qwen3.5" before substring matches such as "qwen") so names like
"qwen3.5-*" resolve to qwen3_5 instead of the generic qwen pattern.
In `@src/mini_trainer/setup_model_for_training.py`:
- Around line 912-917: The M-RoPE SDPA fallback is incorrectly placed inside the
flash_attn import try block; move the _uses_mrope check so it executes before
any attempt to import flash_attn. Specifically, ensure the branch that sets
base_model_args["attn_implementation"] = "sdpa" and calls log_rank_0(f"Using
SDPA for {model_name_or_path} ...") runs unconditionally when _uses_mrope is
True (and before the flash_attn import/try block) so SDPA can be selected even
if the flash_attn package is absent.
- Around line 899-907: The code assumes rope_parameters is a dict and uses
`"mrope_section" in _rope_params`, which can raise TypeError when
rope_parameters is a RopeParameters object; update the check to handle both dict
and object types by normalizing or testing appropriately: retrieve
model_config.text_config (as _text_config), get rope_parameters into
_rope_params, then if _rope_params is a dict check membership with
"mrope_section" in _rope_params, else if it's an object check for attribute
presence (hasattr(_rope_params, "mrope_section") or getattr(_rope_params,
"mrope_section", None) is not None); set _uses_mrope accordingly so the
subsequent logic that uses _uses_mrope, model_config, and _text_config behaves
correctly.
Actionable comments posted: 1
♻️ Duplicate comments (3)
src/mini_trainer/setup_model_for_training.py (2)
1008-1010: ⚠️ Potential issue | 🟠 Major — Handle `rope_parameters` as object-or-dict before checking `mrope_section`.

Line 1010 assumes dict membership; for object-shaped `rope_parameters`, this can throw `TypeError` and skip M-RoPE handling.

💡 Proposed fix

```diff
-    _rope_params = getattr(_text_config, "rope_parameters", {}) or {}
-    _uses_mrope = "mrope_section" in _rope_params
+    _rope_params = getattr(_text_config, "rope_parameters", None)
+    if isinstance(_rope_params, dict):
+        _uses_mrope = "mrope_section" in _rope_params
+    else:
+        _uses_mrope = getattr(_rope_params, "mrope_section", None) is not None
```

In transformers 4.57.1, what types can Qwen3_5TextConfig.rope_parameters take (dict vs object), and what is the recommended way to detect mrope_section?

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/mini_trainer/setup_model_for_training.py` around lines 1008 - 1010, The code assumes rope_parameters is a dict and does "mrope_section" in _rope_params which can raise TypeError for object-shaped rope_parameters; change the check to handle both types by testing the type and using attribute access for objects: for example, after _rope_params = getattr(_text_config, "rope_parameters", {}) or {}, set _uses_mrope = ("mrope_section" in _rope_params) if isinstance(_rope_params, dict) else bool(getattr(_rope_params, "mrope_section", False)); update references to model_config, _text_config, _rope_params, and _uses_mrope accordingly.
1013-1041: ⚠️ Potential issue | 🟠 Major — M-RoPE SDPA fallback should be evaluated before the `flash_attn` import.

Right now, if `flash_attn` is missing, control hits `ImportError` before the M-RoPE SDPA branch runs, which blocks intended compatibility for M-RoPE models.

💡 Proposed fix

```diff
-    try:
-        import flash_attn as _  # noqa: F401
-
-        if _uses_mrope:
-            base_model_args["attn_implementation"] = "sdpa"
-            log_rank_0(
-                f"Using SDPA for {model_name_or_path} (M-RoPE model incompatible with Flash Attention 2 varlen path)"
-            )
-        elif is_gpt_oss:
+    if _uses_mrope:
+        base_model_args["attn_implementation"] = "sdpa"
+        log_rank_0(
+            f"Using SDPA for {model_name_or_path} (M-RoPE model incompatible with Flash Attention 2 varlen path)"
+        )
+    else:
+        try:
+            import flash_attn as _  # noqa: F401
+            if is_gpt_oss:
                 major, _ = torch.cuda.get_device_capability(0)
                 if major >= 9:
                     base_model_args["attn_implementation"] = "kernels-community/vllm-flash-attn3"
                     log_rank_0("Set attention implementation to vllm-flash-attn3 for GPT-OSS")
                 else:
                     base_model_args["attn_implementation"] = "eager"
                     log_rank_0(
                         f"GPT-OSS: flash-attn3 requires Hopper (SM 9.0+) GPUs, "
                         f"but found SM {major}.x. Using eager attention instead."
                     )
-        else:
-            base_model_args["attn_implementation"] = "flash_attention_2"
-
-    except ImportError as e:
-        if os.environ.get("TESTING", "false").lower() == "true":
-            base_model_args["attn_implementation"] = "sdpa"
-        else:
-            raise e
+            else:
+                base_model_args["attn_implementation"] = "flash_attention_2"
+        except ImportError as e:
+            if os.environ.get("TESTING", "false").lower() == "true":
+                base_model_args["attn_implementation"] = "sdpa"
+            else:
+                raise e
```

Does Hugging Face Transformers require the flash_attn package when attn_implementation is set to "sdpa"?

Based on learnings: in this repo, flash-attention is optional (cuda extras), so M-RoPE SDPA selection should not depend on importing `flash_attn`.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/mini_trainer/setup_model_for_training.py` around lines 1013 - 1041, The M-RoPE SDPA fallback is currently gated by the try import of flash_attn so an ImportError prevents selecting "sdpa" for _uses_mrope models; to fix, evaluate the _uses_mrope branch before attempting to import flash_attn: if _uses_mrope is True, set base_model_args["attn_implementation"] = "sdpa" and log for model_name_or_path, then skip the flash_attn import/handling entirely; otherwise proceed with the existing try/except import flash_attn block that handles is_gpt_oss and sets "vllm-flash-attn3", "eager", or "flash_attention_2" and preserves the ImportError behavior for TESTING env.
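Condensed, the ordering argued for above looks like the following self-contained sketch; `pick_attn_implementation` is a hypothetical helper that compresses the branch logic, not a function in this repository.

```python
# Sketch of the proposed ordering: decide on SDPA for M-RoPE models *before*
# touching flash_attn, so a missing package no longer raises for those models.
import importlib.util


def pick_attn_implementation(uses_mrope: bool, testing: bool = False) -> str:
    if uses_mrope:
        # M-RoPE models use PyTorch SDPA; no flash_attn import is needed for this.
        return "sdpa"
    if importlib.util.find_spec("flash_attn") is not None:
        return "flash_attention_2"
    if testing:
        return "sdpa"
    raise ImportError("flash_attn is required outside of testing")


assert pick_attn_implementation(uses_mrope=True) == "sdpa"  # works without flash_attn
```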
src/mini_trainer/osft_utils.py (1)

361-362: ⚠️ Potential issue | 🟠 Major — `qwen3.5` mapping is still shadowed by the generic `qwen` matcher.

Because matching is first-hit substring order, `qwen` wins before `qwen3.5`, so the new `qwen3_5` patterns won't be selected for names like `qwen3.5-*`.

💡 Proposed fix (prefer more specific identifiers first)

```diff
 def _get_model_patterns_from_name(name: str) -> list:
 @@
-    for identifier, config_key in MODEL_NAME_MAPPINGS.items():
+    for identifier, config_key in sorted(
+        MODEL_NAME_MAPPINGS.items(),
+        key=lambda item: len(item[0]),
+        reverse=True,
+    ):
         if identifier in name.lower():
             return MODEL_CONFIGS[config_key]["patterns"]
```

Also applies to: 615-617
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/mini_trainer/osft_utils.py` around lines 361 - 362, The current model-name mapping places the generic "qwen" entry before the more specific "qwen3.5"/"qwen3_5" entries so substring-first matching picks "qwen" for names like "qwen3.5-*"; reorder the mapping so "qwen3.5" and "qwen3_5" appear before the generic "qwen" key (and apply the same reordering for the other occurrence around the lines with keys at 615-617) so the specific matcher wins; locate and update the dictionary that contains the "qwen", "qwen3.5", and "qwen3_5" keys and move the specific entries earlier in that dict.
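As a sanity check on the ordering fix, this toy snippet reproduces the shadowing and shows longest-identifier-first matching resolving it; `MAPPINGS` is a stand-in for the real `MODEL_NAME_MAPPINGS` table.

```python
# Toy reproduction: insertion-order substring matching vs. longest-identifier-first.
MAPPINGS = {"qwen": "qwen", "qwen3.5": "qwen3_5"}  # generic key listed first


def match_first_hit(name: str) -> str:
    for identifier, key in MAPPINGS.items():
        if identifier in name.lower():
            return key


def match_longest_first(name: str) -> str:
    for identifier, key in sorted(MAPPINGS.items(), key=lambda i: len(i[0]), reverse=True):
        if identifier in name.lower():
            return key


assert match_first_hit("Qwen3.5-72B") == "qwen"         # shadowed: wrong bucket
assert match_longest_first("Qwen3.5-72B") == "qwen3_5"  # specific key wins
```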
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@src/mini_trainer/setup_model_for_training.py`:
- Around line 77-82: The vlm_kwargs dictionary includes
"pretrained_model_name_or_path" from load_kwargs/base_model_args which conflicts
with passing model_path positionally to
AutoModelForImageTextToText.from_pretrained; remove that key before calling
from_pretrained. Update the construction of vlm_kwargs (derived from
load_kwargs) to pop or filter out "pretrained_model_name_or_path" (and any
duplicate arg) so the call
AutoModelForImageTextToText.from_pretrained(model_path, **vlm_kwargs) does not
receive the same parameter twice; apply the same fix to the other VLM load sites
that use vlm_kwargs/load_kwargs in the distributed and non-distributed load
paths.
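The underlying failure is plain Python and easy to reproduce; the sketch below shows the `TypeError` and the key-filtering fix (`from_pretrained` here is a dummy stand-in, not the Transformers method).

```python
# Minimal reproduction of the duplicate-argument error, plus the filtering fix.
def from_pretrained(pretrained_model_name_or_path=None, **kwargs):
    return pretrained_model_name_or_path, kwargs


load_kwargs = {"pretrained_model_name_or_path": "m", "torch_dtype": "bfloat16"}

try:
    from_pretrained("m", **load_kwargs)  # same argument passed twice -> TypeError
except TypeError as exc:
    print(exc)

# Fix: drop the conflicting key before forwarding the kwargs.
vlm_kwargs = {k: v for k, v in load_kwargs.items() if k != "pretrained_model_name_or_path"}
from_pretrained("m", **vlm_kwargs)  # ok
```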
---
Duplicate comments:
In `@src/mini_trainer/osft_utils.py`:
- Around line 361-362: The current model-name mapping places the generic "qwen"
entry before the more specific "qwen3.5"/"qwen3_5" entries so substring-first
matching picks "qwen" for names like "qwen3.5-*"; reorder the mapping so
"qwen3.5" and "qwen3_5" appear before the generic "qwen" key (and apply the same
reordering for the other occurrence around the lines with keys at 615-617) so
the specific matcher wins; locate and update the dictionary that contains the
"qwen", "qwen3.5", and "qwen3_5" keys and move the specific entries earlier in
that dict.
In `@src/mini_trainer/setup_model_for_training.py`:
- Around line 1008-1010: The code assumes rope_parameters is a dict and does
"mrope_section" in _rope_params which can raise TypeError for object-shaped
rope_parameters; change the check to handle both types by testing the type and
using attribute access for objects: for example, after _rope_params =
getattr(_text_config, "rope_parameters", {}) or {}, set _uses_mrope =
("mrope_section" in _rope_params) if isinstance(_rope_params, dict) else
bool(getattr(_rope_params, "mrope_section", False)); update references to
model_config, _text_config, _rope_params, and _uses_mrope accordingly.
- Around line 1013-1041: The M-RoPE SDPA fallback is currently gated by the try
import of flash_attn so an ImportError prevents selecting "sdpa" for _uses_mrope
models; to fix, evaluate the _uses_mrope branch before attempting to import
flash_attn: if _uses_mrope is True, set base_model_args["attn_implementation"] =
"sdpa" and log for model_name_or_path, then skip the flash_attn import/handling
entirely; otherwise proceed with the existing try/except import flash_attn block
that handles is_gpt_oss and sets "vllm-flash-attn3", "eager", or
"flash_attention_2" and preserves the ImportError behavior for TESTING env.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 2978983a-ebca-4679-8ce0-1ca4a89690df
📒 Files selected for processing (3)
- src/mini_trainer/osft_utils.py
- src/mini_trainer/setup_model_for_training.py
- src/mini_trainer/utils.py
Actionable comments posted: 4
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
src/mini_trainer/setup_model_for_training.py (1)
946-974: ⚠️ Potential issue | 🟠 Major — M-RoPE SDPA selection still depends on the `flash_attn` import succeeding.

The past review flagged that M-RoPE models should be able to use SDPA without requiring `flash_attn` to be installed. Currently, the `_uses_mrope` check (line 949) is inside the `try` block. If the `flash_attn` import fails, the `except` block only sets SDPA when `TESTING=true`, not when M-RoPE is detected.

💡 Proposed fix (handle M-RoPE before flash-attn import)

```diff
+    # Handle M-RoPE models first - they need SDPA regardless of flash_attn availability
+    if _uses_mrope:
+        base_model_args["attn_implementation"] = "sdpa"
+        log_rank_0(
+            f"Using SDPA for {model_name_or_path} (M-RoPE model incompatible with Flash Attention 2 varlen path)"
+        )
     # Check if flash_attn is available and set appropriate attention implementation
-    try:
+    elif not _uses_mrope:
+        try:
             import flash_attn as _  # noqa: F401
-        if _uses_mrope:
-            base_model_args["attn_implementation"] = "sdpa"
-            log_rank_0(
-                f"Using SDPA for {model_name_or_path} (M-RoPE model incompatible with Flash Attention 2 varlen path)"
-            )
-        elif is_gpt_oss:
+            if is_gpt_oss:
                 # ... rest of GPT-OSS handling
```
Verify each finding against the current code and only fix it if needed. In `@src/mini_trainer/setup_model_for_training.py` around lines 946 - 974, The M-RoPE branch (_uses_mrope) must be evaluated and set before attempting to import flash_attn so SDPA is chosen even if flash_attn is absent: move the _uses_mrope check out of the try block (or evaluate it immediately before the try) and, if True, set base_model_args["attn_implementation"]="sdpa" and call log_rank_0(...) for the M-RoPE message without relying on the flash_attn import; keep the existing flash_attn import/other branches for non-M-RoPE models and ensure the ImportError except block does not override an already-set SDPA choice for _uses_mrope.
♻️ Duplicate comments (1)
src/mini_trainer/osft_utils.py (1)
362-363: ⚠️ Potential issue | 🟠 Major — `qwen3.5` mapping is shadowed by the existing `qwen` matcher.

This issue was flagged in a previous review. The `_get_model_patterns_from_name()` function performs substring matching in dict iteration order. Since `"qwen"` appears earlier (line 353), model names like `"qwen3.5-72b"` will match `"qwen"` first and never reach the new `"qwen3.5"` entry.

💡 Proposed fix (prefer longer/more-specific identifiers first)

```diff
 def _get_model_patterns_from_name(name: str) -> list:
 @@
-    for identifier, config_key in MODEL_NAME_MAPPINGS.items():
+    # Sort by identifier length descending to match more specific patterns first
+    for identifier, config_key in sorted(
+        MODEL_NAME_MAPPINGS.items(),
+        key=lambda item: len(item[0]),
+        reverse=True,
+    ):
         if identifier in name.lower():
             return MODEL_CONFIGS[config_key]["patterns"]
```
Verify each finding against the current code and only fix it if needed. In `@src/mini_trainer/osft_utils.py` around lines 362 - 363, The "_get_model_patterns_from_name()" matching logic currently iterates the model-key mapping in insertion order so the short key "qwen" shadows "qwen3.5" and prevents names like "qwen3.5-72b" from matching the more specific pattern; update the function to prefer longer/more-specific keys first (e.g., sort the mapping keys by descending length before matching or iterate a pre-sorted list of keys), or adjust the match logic to use full-token/regex boundaries for keys "qwen" and "qwen3.5" so "qwen3.5" is tested before "qwen" and properly selected. Ensure changes reference the mapping entries "qwen" and "qwen3.5" and the function "_get_model_patterns_from_name()".
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@src/mini_trainer/osft_utils.py`:
- Around line 783-801: The VLM extraction branch may pass a duplicate
pretrained_model_name_or_path via final_base_kwargs; before calling
extract_causal_lm_from_vlm, remove or pop the key
"pretrained_model_name_or_path" (and any alias like "model_name_or_path" if
present) from final_base_kwargs to avoid duplicate-argument errors, then call
extract_causal_lm_from_vlm(pretrained_model_name_or_path, final_base_kwargs);
keep the existing base_model_class.from_pretrained(...) usage unchanged for the
non-VLM path and ensure the mutation is local (copy final_base_kwargs if needed)
to not affect callers.
- Around line 918-921: The current classmethod _can_set_experts_implementation
calls base_cls._can_set_experts_implementation() unguarded, which raises
AttributeError for non-MoE base classes; change it to check for the attribute on
base_cls (e.g., using hasattr or getattr with a default) and return False when
the method is missing so non-MoE bases are handled safely; update the
implementation that references base_cls and _can_set_experts_implementation
accordingly.
In `@src/mini_trainer/vlm_utils.py`:
- Around line 93-97: vlm_kwargs currently filters only None quantization_config
but may still include pretrained_model_name_or_path from load_kwargs, causing a
duplicate-argument TypeError when calling
AutoModelForImageTextToText.from_pretrained(model_path, **vlm_kwargs); update
the vlm_kwargs comprehension to also exclude the "pretrained_model_name_or_path"
key (i.e., drop any k == "pretrained_model_name_or_path") so that
from_pretrained is called with model_path positionally and no conflicting
keyword in vlm_kwargs.
- Around line 143-146: The loop that checks for M-RoPE only treats
rope_scaling/rope_parameters as dicts, missing cases where rope_parameters is a
RopeParameters object; update the check in the for-loop that uses rope_dict
(from getattr(cfg, attr, None)) so it returns True if either (a) rope_dict is a
dict and contains the "mrope_section" key or (b) rope_dict is an object that has
an attribute named "mrope_section" (use hasattr or getattr(rope_dict,
"mrope_section", None)). Keep the existing return True behavior when
mrope_section is found.
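For the `_can_set_experts_implementation` item above, the guard amounts to probing the base class before delegating. A minimal runnable sketch follows; the base classes are toys standing in for real MoE and non-MoE model classes.

```python
# Sketch of the guarded delegation: return False when the base class has no
# _can_set_experts_implementation, instead of raising AttributeError.
class MoEBase:
    @classmethod
    def _can_set_experts_implementation(cls) -> bool:
        return True


class DenseBase:
    pass  # non-MoE: no _can_set_experts_implementation at all


def make_osft_wrapper(base_cls):
    class OSFTWrapper(base_cls):
        @classmethod
        def _can_set_experts_implementation(cls) -> bool:
            parent = getattr(base_cls, "_can_set_experts_implementation", None)
            return parent() if callable(parent) else False

    return OSFTWrapper


assert make_osft_wrapper(MoEBase)._can_set_experts_implementation() is True
assert make_osft_wrapper(DenseBase)._can_set_experts_implementation() is False
```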
---
Outside diff comments:
In `@src/mini_trainer/setup_model_for_training.py`:
- Around line 946-974: The M-RoPE branch (_uses_mrope) must be evaluated and set
before attempting to import flash_attn so SDPA is chosen even if flash_attn is
absent: move the _uses_mrope check out of the try block (or evaluate it
immediately before the try) and, if True, set
base_model_args["attn_implementation"]="sdpa" and call log_rank_0(...) for the
M-RoPE message without relying on the flash_attn import; keep the existing
flash_attn import/other branches for non-M-RoPE models and ensure the
ImportError except block does not override an already-set SDPA choice for
_uses_mrope.
---
Duplicate comments:
In `@src/mini_trainer/osft_utils.py`:
- Around line 362-363: The "_get_model_patterns_from_name()" matching logic
currently iterates the model-key mapping in insertion order so the short key
"qwen" shadows "qwen3.5" and prevents names like "qwen3.5-72b" from matching the
more specific pattern; update the function to prefer longer/more-specific keys
first (e.g., sort the mapping keys by descending length before matching or
iterate a pre-sorted list of keys), or adjust the match logic to use
full-token/regex boundaries for keys "qwen" and "qwen3.5" so "qwen3.5" is tested
before "qwen" and properly selected. Ensure changes reference the mapping
entries "qwen" and "qwen3.5" and the function "_get_model_patterns_from_name()".
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 0f4a3fb4-f91e-4d9d-bc7f-e49f98a722aa
📒 Files selected for processing (4)
- src/mini_trainer/osft_utils.py
- src/mini_trainer/setup_model_for_training.py
- src/mini_trainer/utils.py
- src/mini_trainer/vlm_utils.py
🚧 Files skipped from review as they are similar to previous changes (1)
- src/mini_trainer/utils.py
Actionable comments posted: 1
♻️ Duplicate comments (3)
src/mini_trainer/vlm_utils.py (3)
92-96: ⚠️ Potential issue | 🟠 Major — Filter out `pretrained_model_name_or_path` to prevent a duplicate argument error.

When `load_kwargs` originates from `base_model_args` (which contains `pretrained_model_name_or_path`), passing it to `from_pretrained(model_path, **filtered_kwargs)` causes a `TypeError` for a duplicate keyword argument, since `model_path` is already passed positionally.

🐛 Proposed fix

```diff
     filtered_kwargs = {
         k: v
         for k, v in load_kwargs.items()
-        if not (k == "quantization_config" and v is None)
+        if k != "pretrained_model_name_or_path"
+        and not (k == "quantization_config" and v is None)
     }
```
Verify each finding against the current code and only fix it if needed. In `@src/mini_trainer/vlm_utils.py` around lines 92 - 96, The call to AutoModelForImageTextToText.from_pretrained(model_path, **filtered_kwargs) can receive a duplicate pretrained_model_name_or_path from load_kwargs, causing a TypeError; update the filtered_kwargs construction (used before the from_pretrained call) to also remove any "pretrained_model_name_or_path" key (in addition to the existing quantization_config None filter) so filtered_kwargs does not contain that key when calling AutoModelForImageTextToText.from_pretrained with model_path.
159-162: ⚠️ Potential issue | 🟠 Major — Filter out `pretrained_model_name_or_path` to prevent a duplicate argument error.

Same issue as `load_vlm_for_text_training`: `load_kwargs` from `base_model_args` contains `pretrained_model_name_or_path`, causing a `TypeError` when passed to `from_pretrained(model_path, **vlm_kwargs)`.

🐛 Proposed fix

```diff
     vlm_kwargs = {
         k: v
         for k, v in load_kwargs.items()
-        if not (k == "quantization_config" and v is None)
+        if k != "pretrained_model_name_or_path"
+        and not (k == "quantization_config" and v is None)
     }
```
Verify each finding against the current code and only fix it if needed. In `@src/mini_trainer/vlm_utils.py` around lines 159 - 162, The vlm_kwargs dict comprehension in vlm_utils.py is passing through pretrained_model_name_or_path from load_kwargs, causing a duplicate-argument TypeError when calling from_pretrained; update the comprehension that builds vlm_kwargs (the current expression creating vlm_kwargs) to also exclude the key "pretrained_model_name_or_path" in addition to the existing exclusion for ("quantization_config" and None) so that from_pretrained(model_path, **vlm_kwargs) no longer receives a duplicate pretrained_model_name_or_path.
209-212: ⚠️ Potential issue | 🟠 Major — Handle `RopeParameters` objects in addition to dicts.

Per Transformers 4.57.1, `rope_parameters` can be a `RopeParameters` object (not just a dict). The current `isinstance(rope_dict, dict)` check will miss M-RoPE detection when `rope_parameters` is an object with `mrope_section` as an attribute.

🐛 Proposed fix

```diff
     for attr in ("rope_scaling", "rope_parameters"):
         rope_dict = getattr(cfg, attr, None)
-        if isinstance(rope_dict, dict) and "mrope_section" in rope_dict:
-            return True
+        if rope_dict is not None:
+            if isinstance(rope_dict, dict) and "mrope_section" in rope_dict:
+                return True
+            elif hasattr(rope_dict, "mrope_section") and getattr(rope_dict, "mrope_section", None) is not None:
+                return True
     return False
```
Verify each finding against the current code and only fix it if needed. In `@src/mini_trainer/vlm_utils.py` around lines 209 - 212, The loop that checks for M-RoPE currently only treats rope_scaling/rope_parameters as dicts and misses when rope_parameters is a Transformers RopeParameters object; update the check in the loop (iterating attrs "rope_scaling" and "rope_parameters" and using the local variable rope_dict) to detect M-RoPE for both dicts and objects by: if rope_dict is a dict, check "mrope_section" in rope_dict; otherwise check hasattr(rope_dict, "mrope_section") or getattr(rope_dict, "mrope_section", None) is not None (optionally import or reference the RopeParameters type for an isinstance check). Ensure the code still returns True when an M-RoPE section is found.
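The shape-agnostic detection reduces to a small predicate. A runnable sketch, with `FakeRopeParameters` standing in for the Transformers `RopeParameters` object discussed above:

```python
# Sketch: detect mrope_section on both dict-shaped and object-shaped rope parameters.
from dataclasses import dataclass


@dataclass
class FakeRopeParameters:  # stand-in for transformers' RopeParameters
    mrope_section: list | None = None


def has_mrope_value(rope) -> bool:
    if isinstance(rope, dict):
        return "mrope_section" in rope
    return getattr(rope, "mrope_section", None) is not None


assert has_mrope_value({"mrope_section": [16, 24, 24]}) is True
assert has_mrope_value(FakeRopeParameters(mrope_section=[16, 24, 24])) is True
assert has_mrope_value(FakeRopeParameters()) is False
assert has_mrope_value({}) is False
```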
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@src/mini_trainer/setup_model_for_training.py`:
- Around line 915-939: The comment is misleading and the try/except only catches
ImportError, risking partial patching of _KERNEL_MODULE_MAPPING if other errors
occur; update the comment to accurately describe the real compatibility concern
(e.g., build/PyTorch/CUDA mismatches) and make the patching defensive by
separating imports from mapping/attribute assignment: first import
causal_conv1d, mamba_ssm and the triton symbols (selective_state_update,
mamba_chunk_scan_combined, mamba_split_conv1d_scan_combined) inside a try that
catches ImportError and logs the error; then perform the assignments to
mamba_ssm and updates to _KERNEL_MODULE_MAPPING inside a second try that catches
Exception, logs the full exception via log_rank_0 (including stack/str), and on
failure undoes any partial changes to _KERNEL_MODULE_MAPPING and any attributes
set on mamba_ssm to keep state consistent (or only update the mapping atomically
after all attributes succeed).
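The two-stage, rollback-on-failure pattern this prompt asks for can be sketched generically; every name below (the mapping, the placeholder import) is illustrative, not the repository's actual kernel-patching code.

```python
# Generic sketch of defensive monkey-patching: stage 1 imports only, stage 2 mutates,
# and any failure in stage 2 rolls the mapping back to its previous state.
import logging

log = logging.getLogger(__name__)
_KERNEL_MODULE_MAPPING: dict[str, object] = {}  # placeholder for the real mapping


def patch_kernels() -> bool:
    try:
        # Stage 1: imports only; ImportError here means the optional deps are absent.
        import math as causal_conv1d  # placeholder import standing in for causal_conv1d
    except ImportError as exc:
        log.info("optional kernels unavailable: %s", exc)
        return False

    snapshot = dict(_KERNEL_MODULE_MAPPING)
    try:
        # Stage 2: mutate the mapping only after all imports succeeded.
        _KERNEL_MODULE_MAPPING["causal_conv1d"] = causal_conv1d
    except Exception:
        # Roll back so callers never observe a half-patched mapping.
        _KERNEL_MODULE_MAPPING.clear()
        _KERNEL_MODULE_MAPPING.update(snapshot)
        log.exception("kernel patching failed; mapping restored")
        return False
    return True
```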
---
Duplicate comments:
In `@src/mini_trainer/vlm_utils.py`:
- Around line 92-96: The call to
AutoModelForImageTextToText.from_pretrained(model_path, **filtered_kwargs) can
receive a duplicate pretrained_model_name_or_path from load_kwargs, causing a
TypeError; update the filtered_kwargs construction (used before the
from_pretrained call) to also remove any "pretrained_model_name_or_path" key (in
addition to the existing quantization_config None filter) so filtered_kwargs
does not contain that key when calling
AutoModelForImageTextToText.from_pretrained with model_path.
- Around line 159-162: The vlm_kwargs dict comprehension in vlm_utils.py is
passing through pretrained_model_name_or_path from load_kwargs, causing a
duplicate-argument TypeError when calling from_pretrained; update the
comprehension that builds vlm_kwargs (the current expression creating
vlm_kwargs) to also exclude the key "pretrained_model_name_or_path" in addition
to the existing exclusion for ("quantization_config" and None) so that
from_pretrained(model_path, **vlm_kwargs) no longer receives a duplicate
pretrained_model_name_or_path.
- Around line 209-212: The loop that checks for M-RoPE currently only treats
rope_scaling/rope_parameters as dicts and misses when rope_parameters is a
Transformers RopeParameters object; update the check in the loop (iterating
attrs "rope_scaling" and "rope_parameters" and using the local variable
rope_dict) to detect M-RoPE for both dicts and objects by: if rope_dict is a
dict, check "mrope_section" in rope_dict; otherwise check hasattr(rope_dict,
"mrope_section") or getattr(rope_dict, "mrope_section", None) is not None
(optionally import or reference the RopeParameters type for an isinstance
check). Ensure the code still returns True when an M-RoPE section is found.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 3ab2fc72-5a5c-40b5-8347-b4e8242b04e0
📒 Files selected for processing (3)
- src/mini_trainer/setup_model_for_training.py
- src/mini_trainer/utils.py
- src/mini_trainer/vlm_utils.py
🚧 Files skipped from review as they are similar to previous changes (1)
- src/mini_trainer/utils.py
♻️ Duplicate comments (3)
src/mini_trainer/vlm_utils.py (2)
92-96: ⚠️ Potential issue | 🔴 Critical — Exclude `pretrained_model_name_or_path` before forwarding these kwargs.

`setup_model()` always seeds `base_model_args` with that key, so both helpers pass the model id twice: once positionally and once via `**load_kwargs`. That raises `TypeError` before any direct-load or extract path can start, which blocks the new VLM support entirely.

🐛 Proposed fix

```diff
     filtered_kwargs = {
         k: v
         for k, v in load_kwargs.items()
-        if not (k == "quantization_config" and v is None)
+        if k != "pretrained_model_name_or_path"
+        and not (k == "quantization_config" and v is None)
     }
```

```diff
     vlm_kwargs = {
         k: v
         for k, v in load_kwargs.items()
-        if not (k == "quantization_config" and v is None)
+        if k != "pretrained_model_name_or_path"
+        and not (k == "quantization_config" and v is None)
     }
```

```bash
#!/bin/bash
set -euo pipefail
echo "Relevant helper code:"
sed -n '90,97p;157,164p' src/mini_trainer/vlm_utils.py
echo
echo "Caller populating pretrained_model_name_or_path:"
sed -n '907,910p' src/mini_trainer/setup_model_for_training.py
echo
echo "Python raises on positional + duplicate keyword:"
python - <<'PY'
def from_pretrained(pretrained_model_name_or_path=None, **kwargs):
    return pretrained_model_name_or_path, kwargs

try:
    from_pretrained("model-id", pretrained_model_name_or_path="model-id")
except TypeError as exc:
    print(type(exc).__name__ + ":", exc)
PY
```

Also applies to: 159-163
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/mini_trainer/vlm_utils.py` around lines 92 - 96, The call to AutoModelForImageTextToText.from_pretrained receives duplicate model id because setup_model() seeds base_model_args with pretrained_model_name_or_path and load_kwargs also contains it; update the kwargs filtering (the filtered_kwargs built from load_kwargs) to drop the "pretrained_model_name_or_path" key (in addition to the existing quantization_config check) before calling AutoModelForImageTextToText.from_pretrained, and apply the same removal in the other helper block that builds filtered_kwargs (the second region around the existing filtered_kwargs usage) so the model id is only passed positionally.
209-212: ⚠️ Potential issue | 🟠 Major — `has_mrope()` still misses non-dict `rope_parameters`.

For Qwen3.5 configs, `rope_parameters` is not guaranteed to stay a plain dict. If it comes through as a structured object, this returns `False`, `needs_sdpa()` misses the M-RoPE case, and the model can still be pushed onto Flash Attention 2.

In transformers 4.57.1, can `Qwen3_5TextConfig.rope_parameters` be a `RopeParameters` object or other non-plain-dict type, and where is `mrope_section` stored for M-RoPE configs?

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/mini_trainer/vlm_utils.py` around lines 209 - 212, has_mrope() currently only detects M-RoPE when rope_parameters/rope_scaling are plain dicts; update it to also detect structured objects (like transformers' RopeParameters) by checking for an attribute named "mrope_section" on non-dict values: for each attr in ("rope_scaling","rope_parameters") get the value via getattr(cfg, attr, None), if it's a dict check "mrope_section" in the dict, otherwise check hasattr(value, "mrope_section") or getattr(value, "mrope_section", None) is not None and return True; ensure needs_sdpa() (or callers relying on has_mrope()) will then correctly detect M-RoPE from both dicts and structured objects.

src/mini_trainer/setup_model_for_training.py (1)
956-990: ⚠️ Potential issue | 🟠 Major — Pick the SDPA fallback before importing `flash_attn`.

`_needs_sdpa` is computed first, but the branch that uses it still lives inside the `flash_attn` import block. On a machine without `flash_attn`, Qwen3.5-VL / timm-vision models raise `ImportError` instead of taking the SDPA/eager path this PR adds.

💡 Suggested control-flow change

```diff
-    try:
-        import flash_attn as _  # noqa: F401
-
-        if _needs_sdpa:
-            base_model_args["attn_implementation"] = "sdpa"
-            log_rank_0(
-                f"Using SDPA for {model_name_or_path} (model incompatible with Flash Attention 2)"
-            )
-        elif is_gpt_oss:
+    if _needs_sdpa:
+        base_model_args["attn_implementation"] = "sdpa"
+        log_rank_0(
+            f"Using SDPA for {model_name_or_path} (model incompatible with Flash Attention 2)"
+        )
+    else:
+        try:
+            import flash_attn as _  # noqa: F401
+            if is_gpt_oss:
                 ...
-        else:
-            base_model_args["attn_implementation"] = "flash_attention_2"
-
-    except ImportError as e:
-        if os.environ.get("TESTING", "false").lower() == "true":
-            base_model_args["attn_implementation"] = "sdpa"
-        else:
-            raise e
+            else:
+                base_model_args["attn_implementation"] = "flash_attention_2"
+        except ImportError as e:
+            if os.environ.get("TESTING", "false").lower() == "true":
+                base_model_args["attn_implementation"] = "sdpa"
+            else:
+                raise e
```

Does Hugging Face Transformers support `attn_implementation="sdpa"` without the `flash_attn` package installed, using PyTorch SDPA alone?
Verify each finding against the current code and only fix it if needed. In `@src/mini_trainer/setup_model_for_training.py` around lines 956 - 990, Compute and apply the SDPA fallback before trying to import flash_attn: if _needs_sdpa is True, set base_model_args["attn_implementation"]="sdpa" and log via log_rank_0, and skip the flash_attn import/branches entirely; otherwise proceed to the try: import flash_attn block and keep the existing is_gpt_oss / flash_attention_2 logic there, and keep the existing ImportError TESTING fallback that sets attn_implementation to "sdpa".
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Duplicate comments:
In `@src/mini_trainer/setup_model_for_training.py`:
- Around line 956-990: Compute and apply the SDPA fallback before trying to
import flash_attn: if _needs_sdpa is True, set
base_model_args["attn_implementation"]="sdpa" and log via log_rank_0, and skip
the flash_attn import/branches entirely; otherwise proceed to the try: import
flash_attn block and keep the existing is_gpt_oss / flash_attention_2 logic
there, and keep the existing ImportError TESTING fallback that sets
attn_implementation to "sdpa".
In `@src/mini_trainer/vlm_utils.py`:
- Around line 92-96: The call to AutoModelForImageTextToText.from_pretrained
receives duplicate model id because setup_model() seeds base_model_args with
pretrained_model_name_or_path and load_kwargs also contains it; update the
kwargs filtering (the filtered_kwargs built from load_kwargs) to drop the
"pretrained_model_name_or_path" key (in addition to the existing
quantization_config check) before calling
AutoModelForImageTextToText.from_pretrained, and apply the same removal in the
other helper block that builds filtered_kwargs (the second region around the
existing filtered_kwargs usage) so the model id is only passed positionally.
- Around line 209-212: has_mrope() currently only detects M-RoPE when
rope_parameters/rope_scaling are plain dicts; update it to also detect
structured objects (like transformers' RopeParameters) by checking for an
attribute named "mrope_section" on non-dict values: for each attr in
("rope_scaling","rope_parameters") get the value via getattr(cfg, attr, None),
if it's a dict check "mrope_section" in the dict, otherwise check hasattr(value,
"mrope_section") or getattr(value, "mrope_section", None) is not None and return
True; ensure needs_sdpa() (or callers relying on has_mrope()) will then
correctly detect M-RoPE from both dicts and structured objects.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: da8d3b88-93ba-4439-bc1e-81f89bce14ee
📒 Files selected for processing (2)
- src/mini_trainer/setup_model_for_training.py
- src/mini_trainer/vlm_utils.py
- Reorder MODEL_NAME_MAPPINGS for correct substring matching
- Filter pretrained_model_name_or_path from VLM load kwargs
- Move SDPA decision outside flash_attn import block

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Use spec=[] on mock model.model to prevent hasattr from falsely detecting language_model attribute in wrap_fsdp2 tests. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
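This test fix leans on standard `unittest.mock` behavior: a bare `MagicMock` fabricates attributes on access, so `hasattr(mock, "language_model")` always succeeds, while `spec=[]` restricts the mock to an empty attribute set. A minimal demonstration:

```python
# Why spec=[] matters: MagicMock auto-creates attributes, defeating hasattr checks.
from unittest.mock import MagicMock

loose = MagicMock()
assert hasattr(loose, "language_model")  # fabricated on access -> VLM falsely detected

strict = MagicMock(spec=[])
assert not hasattr(strict, "language_model")  # empty spec -> attribute lookup fails
```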
- osft_utils.py: filter pretrained_model_name_or_path from VLM kwargs in OSFT path to prevent duplicate argument error
- osft_utils.py: add hasattr guard for _can_set_experts_implementation on non-MoE base classes
- vlm_utils.py: handle RopeParameters objects (not just dicts) in mrope detection via hasattr fallback
- Fix isort ordering

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Wrap AutoConfig.from_pretrained in try/except in _load_model_memory_efficient so mock/dummy model paths don't crash the VLM detection check. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
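A minimal sketch of that guard; `looks_like_vlm` and its heuristic are illustrative, not the PR's exact detection code.

```python
# Sketch: treat unreadable configs (mock/dummy paths in tests) as "not a VLM"
# instead of letting AutoConfig.from_pretrained crash the detection step.
from transformers import AutoConfig


def looks_like_vlm(model_path: str) -> bool:
    try:
        config = AutoConfig.from_pretrained(model_path)
    except Exception:
        return False  # dummy/mock paths fall through to the plain CausalLM loader
    return hasattr(config, "text_config") or hasattr(config, "vision_config")
```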
- Catch AttributeError in addition to ImportError to prevent partial patching of _KERNEL_MODULE_MAPPING
- Update comment to accurately describe the compatibility concern (PyTorch/CUDA ABI mismatch, not C API incompatibility)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Reformat 8 test/source files to match CI ruff version
- Fix UP038: use X | Y in isinstance call

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Remove unnecessary OSFT guard for direct VLMs (patterns match fine)
- Add _get_text_config() helper for VLM config fallback in align_model_and_tokenizer (vocab_size, pad/bos/eos_token_id)
- Fix model.config.pad_token_id access in train.py for VLM configs
- Skip activation checkpointing for direct VLM models (M-RoPE layers produce non-deterministic tensor counts during reentrant recomputation)
- Use dynamic ports (_get_free_port) in model_validation.py to prevent port conflicts between sequential tests

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
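A plausible shape for the _get_text_config() fallback mentioned above; the thread does not show the actual implementation, so treat this as an assumption-laden sketch.

```python
# Sketch: VLM configs nest vocab/token-id fields under text_config; fall back to the
# top-level config for plain CausalLM models.
def _get_text_config(config):
    return getattr(config, "text_config", None) or config


# Usage: read pad_token_id from whichever config level actually carries it.
def read_pad_token_id(config):
    return getattr(_get_text_config(config), "pad_token_id", None)
```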
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
RobotSail merged commit 3c1e9d7 into Red-Hat-AI-Innovation-Team:main
This PR allows the mini trainer backend to be compatible with the Qwen3.5 VL model.
Summary by CodeRabbit
New Features
Improvements