
Conversation

@Edwardf0t1 (Contributor) commented Sep 19, 2025

What does this PR do?

Type of change: New model support

Overview: This PR enables the Nemotron Nano VLM v1 & v2 NVFP4 PTQ flow.

  • Added support for Nemotron Nano VL v1 (.chat) and v2 (.generate) model inference with image inputs in the PTQ flow.
  • Added a utility script for VLMs in the llm_ptq example.
  • Added utility functions to handle VLMs properly in hf_ptq and export.
  • TODO:
    • Consolidate with the vlm_ptq example.
    • Enable FP8 KV cache in export.
    • Calibrate with image data.

Usage

python hf_ptq.py --pyt_ckpt_path /home/scratch.omniml_data_2/models/Nemotron-Nano-VL-12B-V2 --qformat nvfp4 --export_path /home/scratch.omniml_data_2/zhiyuc/checkpoints/Nemotron-Nano-VL-12B-V2-nvfp4-1017 --trust_remote_code --kv_cache_qformat none --attn_implementation eager

Testing

Before your PR is "Ready for review"

  • Make sure you read and follow Contributor guidelines and your commits are signed.
  • Is this change backward compatible?: Yes
  • Did you write any new necessary tests?: Yes
  • Did you add or update any necessary documentation?: No
  • Did you update Changelog?: Yes

Additional Information

Summary by CodeRabbit

  • New Features

    • Added Vision‑Language (VL) model support across quantization, export and preview generation, including image preview and text‑only preview flows for multimodal models.
    • Improved multimodal detection and explicit device placement for better inference behavior.
  • Chores

    • Export now preserves image/text processor assets for multimodal models.
    • CLI default attention implementation changed to "eager".
  • Documentation

    • Changelog updated to note multimodal (VL) model support.

@copy-pr-bot bot commented Sep 19, 2025

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

@Edwardf0t1 Edwardf0t1 self-assigned this Sep 19, 2025
@coderabbitai bot (Contributor) commented Sep 19, 2025

Walkthrough

Adds multimodal (vision‑language) support across PTQ/export/runtime: multimodal config detection and device handling, Nemotron VL detection and VL‑specific quantization/export flows, VLM preview and text‑only generation helpers, language_model extraction for exports, and changelog/CLI default update.

Changes

  • Multimodal detection & device handling (examples/llm_ptq/example_utils.py):
    Added _is_multimodal_config(config); preload config in get_model() to detect VL models, disable automatic device_map for multimodal configs (set to None on load failure), warn on config load errors, and explicitly move constructed models to the target device when device_map was disabled.
  • PTQ pipeline with VLM support (examples/llm_ptq/hf_ptq.py):
    Added Nemotron VL detection (is_nemotron_vl) and VL branches: adjust quant_cfg to skip vision/visual components during calibration/quantization, run run_text_only_generation and run_vl_preview_generation (v1/v2 aware) for previews, rewire quantized language_model back into parent VL models post‑PTQ, preserve/save VL processors during export, change CLI default --attn_implementation to "eager", and modify output/printing for VL text responses and diagnostics.
  • VLM utility functions, new module (examples/llm_ptq/vlm_utils.py):
    New module adding run_vl_preview_generation(model, tokenizer, model_path, stage_name) and run_text_only_generation(model, tokenizer, question, generation_config, model_path) supporting both v1 (chat) and v2 (generate) APIs, loading processors with trust_remote_code, moving tensors to model.device, handling errors, and returning generated text or None.
  • Export optimization for VLMs (modelopt/torch/export/unified_export_hf.py):
    Added VL detection (is_multimodal_model) and a branch that attempts to locate and run a language_model subcomponent for multimodal (non‑encoder‑decoder) models during export optimization; runs the subcomponent with fake inputs, logs success/failure, and continues on exceptions.
  • Language model extraction utility (modelopt/torch/export/model_utils.py):
    Added public get_language_model_from_vl(model) to extract a VL model's language_model and its parent; updated __all__ to include the new function.
  • Changelog (CHANGELOG.rst):
    Documented support for Nemotron Nano VL v1 & v2 models in FP8/NVFP4 PTQ workflow.

Sequence Diagram(s)

sequenceDiagram
    participant CLI
    participant hf_ptq
    participant vlm_utils
    participant Model

    Note over CLI,hf_ptq: PTQ start — detect multimodal / Nemotron VL
    CLI->>hf_ptq: parse args, load config
    hf_ptq->>hf_ptq: detect _is_multimodal_config / is_nemotron_vl
    alt Nemotron VL
        hf_ptq->>hf_ptq: adjust quant_cfg (disable vision/image components)
        hf_ptq->>vlm_utils: run_text_only_generation(...)
        vlm_utils->>Model: chat() or generate() (text-only)
        vlm_utils-->>hf_ptq: text preview (or None)
        hf_ptq->>vlm_utils: run_vl_preview_generation(...)
        vlm_utils->>Model: generate() with image
        vlm_utils-->>hf_ptq: vl preview (or None)
        hf_ptq->>hf_ptq: quantize language_model only
        hf_ptq->>hf_ptq: rewire quantized language_model into parent VL model
    else Non-VL
        hf_ptq->>hf_ptq: standard PTQ quantize whole model
    end
    hf_ptq->>CLI: export model + processors (save image processor if VL)
sequenceDiagram
    participant vlm_utils
    participant Processor
    participant Model

    Note over vlm_utils,Model: run_text_only_generation (v1 vs v2)
    vlm_utils->>vlm_utils: check if model has chat
    alt v1 (chat)
        vlm_utils->>Model: model.chat(question, gen_config)
        Model-->>vlm_utils: text response
    else v2 (generate)
        vlm_utils->>Processor: AutoProcessor.from_pretrained(model_path)
        Processor-->>vlm_utils: processed inputs (text tensors)
        vlm_utils->>Model: model.generate(inputs, gen_config)
        Model-->>vlm_utils: token ids
        vlm_utils->>Processor: processor.decode(ids)
        Processor-->>vlm_utils: text response
    end
    vlm_utils-->>caller: return text or None

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Poem

🐰 I found a scene of sight and speech,

Words and pixels in one small reach.
I nudged the language, gave it a spin,
Quantized the heart and rewired the grin.
Nemotron VL — hop, let's begin!

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)
  • Docstring Coverage: ⚠️ Warning. Docstring coverage is 66.67%, which is insufficient. The required threshold is 80.00%. You can run @coderabbitai generate docstrings to improve docstring coverage.
✅ Passed checks (2 passed)
  • Title Check: ✅ Passed. The PR title "Enable Nemotron nano vlm v1&v2 nvfp4 PTQ workflow" directly and accurately reflects the main objective of the changeset. According to the PR objectives and changelog, the primary change is to add support for Nemotron Nano Vision-Language Models (v1 and v2) with NVFP4 quantization in the PTQ workflow. The title is concise (8 words, 49 characters), avoids vague language or noise, and clearly communicates the specific models, quantization format, and workflow being enabled. A teammate reviewing the repository history would immediately understand that this PR introduces support for quantizing specific Nemotron Nano VLM variants using NVFP4.
  • Description Check: ✅ Passed. Check skipped - CodeRabbit’s high-level summary is enabled.

@codecov bot commented Sep 19, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 73.40%. Comparing base (b8dbfc0) to head (4352ab6).
⚠️ Report is 4 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #347      +/-   ##
==========================================
- Coverage   73.41%   73.40%   -0.01%     
==========================================
  Files         180      180              
  Lines       18053    18047       -6     
==========================================
- Hits        13253    13248       -5     
+ Misses       4800     4799       -1     


@Edwardf0t1 Edwardf0t1 changed the title Support nemotron nano vlm v1 nvfp4 quantize + export Enable Nemotron nano vlm v1&v2 nvfp4 quantization and export Oct 23, 2025
@Edwardf0t1 Edwardf0t1 force-pushed the zhiyu/support-nemotron-nano-vlm-v1-nvfp4 branch from d5597e2 to 16b9f8f on October 23, 2025 00:43
@Edwardf0t1 Edwardf0t1 changed the title Enable Nemotron nano vlm v1&v2 nvfp4 quantization and export Enable Nemotron nano vlm v1&v2 nvfp4 quantization and export workflow Oct 23, 2025
@Edwardf0t1 Edwardf0t1 force-pushed the zhiyu/support-nemotron-nano-vlm-v1-nvfp4 branch from 4b6b388 to 1f31103 on October 23, 2025 01:51
@Edwardf0t1 Edwardf0t1 marked this pull request as ready for review October 23, 2025 01:51
@Edwardf0t1 Edwardf0t1 requested review from a team as code owners October 23, 2025 01:51
@coderabbitai bot (Contributor) left a comment

Actionable comments posted: 2

🧹 Nitpick comments (10)
examples/llm_ptq/example_utils.py (3)

42-53: Avoid drift: centralize multimodal detection criteria

This config-only detector partially duplicates logic in modelopt.torch.export.model_utils.is_multimodal_model. Consider defining a single source of truth (shared helper or shared predicates) to prevent divergence across files. Also consider covering common flags like vision_tower/mm_projector seen in LLaVA-like configs.
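
A minimal sketch of one such shared predicate (the helper name and exact flag list are illustrative, not part of this PR):

def config_looks_multimodal(config) -> bool:
    # Config-only heuristic shared by example_utils and the export path (sketch).
    candidate_attrs = ("vision_config", "vision_tower", "mm_projector")
    return any(getattr(config, attr, None) is not None for attr in candidate_attrs)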


348-417: Hugging Face cache resolution: use HF Hub constants and harden fallbacks

TRANSFORMERS_CACHE may be deprecated or not set; prefer huggingface_hub constants and broaden cache discovery.

Apply this diff to select cache root more robustly:

-        from transformers.utils import TRANSFORMERS_CACHE
-
-        # Look for the model in the cache directory
-        cache_pattern = os.path.join(TRANSFORMERS_CACHE, "models--*")
+        try:
+            from huggingface_hub.constants import HF_HUB_CACHE as _CACHE_ROOT
+        except Exception:
+            from transformers.utils import TRANSFORMERS_CACHE as _CACHE_ROOT
+
+        # Look for the model in the cache directory
+        cache_pattern = os.path.join(_CACHE_ROOT, "models--*")

Optionally, prefer snapshot_download() when available to avoid brittle cache walking (already attempted above).

Test with both HF Hub env var and default cache to ensure path resolution is correct.


419-483: Copy custom files recursively; avoid missing subpackages

Current glob() only matches the top level. Some repos place custom modeling code in subfolders.

Apply this diff:

-    for pattern in custom_file_patterns:
-        for file_path in source_dir.glob(pattern):
+    for pattern in custom_file_patterns:
+        for file_path in source_dir.rglob(pattern):
             if file_path.is_file():
                 # Skip config.json and model.safetensors.index.json as they're handled separately
                 if file_path.name in ["config.json", "model.safetensors.index.json"]:
                     continue
                 dest_path = export_dir / file_path.name

If name collisions are possible across subdirs, consider mirroring relative paths under export_dir instead of flattening.

Run a quick dry-run listing on a VL repo with nested modeling_*.py to confirm coverage.

modelopt/torch/export/unified_export_hf.py (2)

138-144: Use shared multimodal detector; avoid string heuristics

This re-implements VL detection and special-cases "nemotron" by name. Prefer modelopt.torch.export.model_utils.is_multimodal_model(model) to keep behavior consistent and reduce false positives/negatives.

Apply this diff:

-        is_vl_model = (
-            hasattr(model.config, "vision_config")
-            or hasattr(model, "vision_model")
-            or "nemotron" in getattr(model, "name_or_path", "").lower()
-        )
+        from modelopt.torch.export.model_utils import is_multimodal_model
+        is_vl_model = is_multimodal_model(model)

160-189: Call language_model with explicit kwargs and basic mask

Passing fake_input positionally may break on some architectures. Use explicit input_ids and a minimal attention_mask.

Apply this diff:

-                    try:
-                        language_model(fake_input)
+                    try:
+                        attn_mask = torch.ones_like(fake_input, dtype=torch.long, device=fake_input.device)
+                        language_model(input_ids=fake_input, attention_mask=attn_mask)

Keeps hooks effective while reducing accidental signature mismatches.

Spot-check with a Nemotron VL and a LLaVA-style model.

examples/llm_ptq/hf_ptq.py (3)

289-290: Nemotron VL detection looks reasonable; add model.name_or_path fallback

If users pass a resolved local path, args.pyt_ckpt_path may not contain "nemotron". You can also check getattr(full_model, "name_or_path", "") to improve detection.


481-523: Preview path: robust, but guard missing tokenizer earlier

The path already raises if tokenizer is None. All good. Consider truncating decoded question to a shorter seed to reduce prompt length variability across datasets.


682-710: Image processor export: add return_tensors safety and avoid re-imports

Minor: os/shutil are available globally in this module; import once at top for style. No functional issue.

examples/llm_ptq/vlm_utils.py (2)

75-83: Ensure tensors from image processor

Some AutoImageProcessor variants return PIL or lists if return_tensors is omitted. Add return_tensors="pt" for consistency.

Apply this diff:

-            image_features = image_processor([image])  # Pass as list with single image
+            image_features = image_processor(images=[image], return_tensors="pt")

204-211: Processor text-only path: prefer images=[] over None

Some processors validate field presence; using images=[] improves compatibility.

Apply this diff:

-            inputs = processor(
-                text=[prompt],
-                images=None,  # No images for text-only
-                return_tensors="pt",
-            )
+            inputs = processor(text=[prompt], images=[], return_tensors="pt")

Validate against both v1/v2 Nemotron pipelines.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 2fd67cc and 1f31103.

📒 Files selected for processing (4)
  • examples/llm_ptq/example_utils.py (3 hunks)
  • examples/llm_ptq/hf_ptq.py (6 hunks)
  • examples/llm_ptq/vlm_utils.py (1 hunks)
  • modelopt/torch/export/unified_export_hf.py (2 hunks)
🧰 Additional context used
🧬 Code graph analysis (3)
examples/llm_ptq/vlm_utils.py (2)
modelopt/torch/quantization/qtensor/base_qtensor.py (1)
  • to (114-122)
modelopt/torch/speculative/utils.py (1)
  • tokenize (213-230)
examples/llm_ptq/example_utils.py (1)
modelopt/torch/quantization/qtensor/base_qtensor.py (1)
  • to (114-122)
examples/llm_ptq/hf_ptq.py (3)
examples/llm_ptq/vlm_utils.py (2)
  • run_text_only_generation (161-233)
  • run_vl_preview_generation (24-158)
modelopt/torch/export/model_utils.py (1)
  • is_multimodal_model (74-111)
modelopt/torch/export/unified_export_megatron.py (1)
  • save_pretrained (298-462)
🔇 Additional comments (7)
examples/llm_ptq/example_utils.py (2)

313-317: Manual model.to(device) after device_map=None looks OK

This is compatible with QTensorWrapper.to and custom quant tensors. No change needed.

If any submodules remain on CPU after .to(), consider logging a per-parameter summary in debug mode to catch stragglers during calibration.
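
A minimal sketch of such a debug pass (the logger object is an assumption; any module-level logger would do):

cpu_stragglers = [name for name, param in model.named_parameters() if param.device.type == "cpu"]
if cpu_stragglers:
    logger.debug("Parameters still on CPU after .to(): %s", cpu_stragglers)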


204-216: Respect sequential device mapping for vision-language models to reduce OOM risk

The review identifies a valid issue: unconditionally setting device_map=None for vision-language models prevents the use_seq_device_map flag from being respected. The current code sets device_map=None at line 212 for all VL models, but the check for use_seq_device_map (line 242) only applies in the non-VILA branch, leaving VILA+VL models without the optimization.

The proposed fix correctly moves the logic inside the VL detection block to conditionally apply "sequential" device mapping when requested, which would benefit both VILA and non-VILA vision-language models.

Apply the suggested diff at lines 204-212.
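
The referenced diff is not reproduced here; as a rough sketch of the intent (is_multimodal, use_seq_device_map, and device_map follow the surrounding discussion, and the exact branch placement is an assumption):

if is_multimodal:
    # Honor the --use_seq_device_map flag for VL models instead of always disabling device_map.
    device_map = "sequential" if use_seq_device_map else None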

examples/llm_ptq/hf_ptq.py (4)

661-666: Saving AutoConfig with attn_implementation: good addition

This captures runtime-attention choice in export.


669-680: Skip AutoProcessor for Nemotron VL: acceptable

Nemotron uses separate image processor. Logs are helpful.


897-901: Default --attn_implementation='eager' is a behavior change

Good for stability, but it changes defaults; document in changelog and confirm it doesn’t regress throughput on common models.

Please benchmark a small LLM before/after to ensure acceptable perf, and mention this change in the PR notes.


540-544: Review comment is unnecessary and based on unfounded assumptions

The proposed property detection fix appears to address a non-existent issue. language_model is not a read-only property in HuggingFace Transformers — it's a regular module attribute. Nemotron VL models on HuggingFace expose the underlying text LM as the language_model attribute you can inspect or replace.

The current code pattern with hasattr(full_model, "language_model") already safely handles cases where the attribute doesn't exist. The theoretical pattern of checking type().__dict__.get("language_model") for property detection introduces unnecessary complexity for a scenario not documented in the Nemotron VL model implementations.

If the assignment fails, the root cause would more likely be that full_model lacks the attribute entirely (caught by hasattr) rather than it being a read-only property.

Likely an incorrect or invalid review comment.

examples/llm_ptq/vlm_utils.py (1)

155-159: Graceful failure path is good

Clear error helps during preview smoke tests.

Comment on lines 466 to 467
        # For Nemotron VL models, disable quantization of vision components
        if is_nemotron_vl:
            print("Disabling quantization for vision components in Nemotron VL model")
            quant_cfg["quant_cfg"]["*vision*"] = {"enable": False}
            quant_cfg["quant_cfg"]["*image*"] = {"enable": False}
            # Also disable radio model components specifically
            quant_cfg["quant_cfg"]["*radio*"] = {"enable": False}
            quant_cfg["quant_cfg"]["*visual*"] = {"enable": False}


⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

Bug: mutating quant_cfg when auto_quantize_bits is set (quant_cfg={})

build_quant_cfg returns {} for auto-quant mode, so writing quant_cfg["quant_cfg"][...] raises KeyError and vision modules aren’t disabled. In auto-quant, disable modules via disabled_layers.

Apply this diff to cover both code paths:

-        if is_nemotron_vl:
-            print("Disabling quantization for vision components in Nemotron VL model")
-            quant_cfg["quant_cfg"]["*vision*"] = {"enable": False}
-            quant_cfg["quant_cfg"]["*image*"] = {"enable": False}
-            # Also disable radio model components specifically
-            quant_cfg["quant_cfg"]["*radio*"] = {"enable": False}
-            quant_cfg["quant_cfg"]["*visual*"] = {"enable": False}
+        if is_nemotron_vl:
+            print("Disabling quantization for vision components in Nemotron VL model")
+            nemotron_disabled = {
+                "*vision*": {"enable": False},
+                "*image*": {"enable": False},
+                "*radio*": {"enable": False},
+                "*visual*": {"enable": False},
+            }
+            if not args.auto_quantize_bits:
+                quant_cfg.setdefault("quant_cfg", {}).update(nemotron_disabled)
+            else:
+                # Pass patterns to auto_quant to skip these modules during search
+                extra_disabled_patterns = list(nemotron_disabled.keys())

And extend auto_quantize() to accept these patterns:
(addition outside this hunk)

 def auto_quantize(
-    model, qformat, auto_quantize_bits, calib_dataloader, calibrate_loop, batch_size=1
+    model, qformat, auto_quantize_bits, calib_dataloader, calibrate_loop, batch_size=1,
+    extra_disabled_layers=None,
 ):
@@
-        disabled_layers=["*lm_head*"],
+        disabled_layers=(["*lm_head*"] + (extra_disabled_layers or [])),

Finally, pass extra_disabled_patterns when calling auto_quantize():
(addition near the call site)

-        model = auto_quantize(
+        model = auto_quantize(
             model,
             args.qformat,
             args.auto_quantize_bits,
             calib_dataloader,
             calibrate_loop,
-            args.batch_size,
+            args.batch_size,
+            extra_disabled_layers=(extra_disabled_patterns if is_nemotron_vl else None),
         )

Please run a short auto-quant session on a Nemotron VL to confirm vision blocks remain unquantized.


I'll verify the concerns in this review comment by examining the codebase structure and the specific functions involved.

I'll search the codebase to verify the claims about the build_quant_cfg return value and the auto_quantize function signature.


🏁 Script executed:

#!/bin/bash
# Find and examine the relevant function definitions and code
echo "=== Checking file structure and examining hf_ptq.py ==="
fd "hf_ptq.py" --type f

echo -e "\n=== Examining lines 460-480 to see context ==="
sed -n '460,480p' examples/llm_ptq/hf_ptq.py

echo -e "\n=== Finding build_quant_cfg function ==="
rg -n "def build_quant_cfg" --type py

echo -e "\n=== Finding auto_quantize function signature ==="
rg -n "def auto_quantize" --type py -A 5

echo -e "\n=== Finding calls to auto_quantize ==="
rg -n "auto_quantize\(" --type py -B 2 -A 5

echo -e "\n=== Searching for args.auto_quantize_bits usage ==="
rg -n "auto_quantize_bits" examples/llm_ptq/hf_ptq.py --type py -B 2 -A 2

Length of output: 19066


🏁 Script executed:

#!/bin/bash
# Examine the key functions and usage patterns

echo "=== Lines 460-480 in hf_ptq.py ==="
sed -n '460,480p' examples/llm_ptq/hf_ptq.py

echo -e "\n=== Finding build_quant_cfg function definition ==="
rg -n "def build_quant_cfg" --type py -A 20

echo -e "\n=== Finding auto_quantize function definition ==="
rg -n "def auto_quantize" --type py -A 10

echo -e "\n=== Finding calls to auto_quantize ==="
rg -n "model = auto_quantize" --type py -B 3 -A 8

echo -e "\n=== Checking quant_cfg initialization and conditional branches ==="
rg -n "if.*auto_quantize_bits" examples/llm_ptq/hf_ptq.py --type py -B 3 -A 8

Length of output: 10706



🏁 Script executed:

#!/bin/bash
# Get complete build_quant_cfg function
echo "=== Full build_quant_cfg function ==="
sed -n '55,100p' examples/llm_ptq/example_utils.py

echo -e "\n=== Check control flow around Nemotron VL code (lines 450-480) ==="
sed -n '450,480p' examples/llm_ptq/hf_ptq.py

echo -e "\n=== Full auto_quantize function signature in hf_ptq.py ==="
sed -n '96,120p' examples/llm_ptq/hf_ptq.py

echo -e "\n=== Check disabled_layers usage in auto_quantize ==="
rg -n "disabled_layers" examples/llm_ptq/hf_ptq.py -B 2 -A 2

Length of output: 4324


Bug confirmed: KeyError when accessing quant_cfg["quant_cfg"] in auto-quantize mode

The issue is real. When args.auto_quantize_bits is set, build_quant_cfg() returns an empty dict {} (examples/llm_ptq/example_utils.py:64-65). The Nemotron VL code at lines 466-474 unconditionally accesses quant_cfg["quant_cfg"][...] without checking if the key exists, causing a KeyError. Additionally, the auto_quantize() function (line 96) has no mechanism to accept and pass through extra disabled layer patterns—disabled_layers is hardcoded to ["*lm_head*"] at line 135.

The proposed diff correctly addresses this by:

  1. Adding conditional logic to handle both auto-quant and non-auto-quant paths
  2. Extending auto_quantize() to accept extra_disabled_layers parameter
  3. Passing the Nemotron VL patterns when calling auto_quantize()
🤖 Prompt for AI Agents
In examples/llm_ptq/hf_ptq.py around lines 466 to 474, the code unconditionally
indexes quant_cfg["quant_cfg"] which raises KeyError in auto-quantize mode
(build_quant_cfg returns {}); update the Nemotron VL branch to first detect if
args.auto_quantize_bits is set and, if so, call auto_quantize(model, bits,
extra_disabled_layers=patterns) passing the vision/image/radio/visual patterns;
otherwise ensure quant_cfg has a "quant_cfg" dict (create one if missing) before
assigning the pattern keys and values. Also modify the auto_quantize(...)
function signature to accept an extra_disabled_layers param that merges with the
existing ["*lm_head*"] list and uses it when building disabled_layers so the
Nemotron VL disables those extra patterns during auto quantization.

@Edwardf0t1 Edwardf0t1 changed the title Enable Nemotron nano vlm v1&v2 nvfp4 quantization and export workflow Enable Nemotron nano vlm v1&v2 nvfp4 PTQ workflow Oct 23, 2025
@coderabbitai bot (Contributor) left a comment

Actionable comments posted: 1

♻️ Duplicate comments (3)
examples/llm_ptq/vlm_utils.py (1)

24-35: Do not hardcode trust_remote_code=True; plumb from caller

Both helpers bypass the user’s CLI choice by forcing trust_remote_code=True. Accept a flag and pass it through. Also fix v1 image processing and image resource handling.

-def run_vl_preview_generation(model, tokenizer, model_path, stage_name):
+def run_vl_preview_generation(model, tokenizer, model_path, stage_name, trust_remote_code: bool = False):
@@
-        image = Image.open(img_path)
+        # Open and normalize image safely
+        with Image.open(img_path) as img:
+            image = img.convert("RGB")
@@
-            image_processor = AutoImageProcessor.from_pretrained(model_path, trust_remote_code=True)
+            image_processor = AutoImageProcessor.from_pretrained(
+                model_path, trust_remote_code=trust_remote_code
+            )
@@
-            image_features = image_processor([image])  # Pass as list with single image
+            image_features = image_processor(
+                images=[image], return_tensors="pt"
+            )  # single image as list
@@
-            processor = AutoProcessor.from_pretrained(model_path, trust_remote_code=True)
+            processor = AutoProcessor.from_pretrained(
+                model_path, trust_remote_code=trust_remote_code
+            )
-def run_text_only_generation(model, tokenizer, question, generation_config, model_path):
+def run_text_only_generation(
+    model, tokenizer, question, generation_config, model_path, trust_remote_code: bool = False
+):
@@
-            processor = AutoProcessor.from_pretrained(model_path, trust_remote_code=True)
+            processor = AutoProcessor.from_pretrained(
+                model_path, trust_remote_code=trust_remote_code
+            )

Please update call sites in examples/llm_ptq/hf_ptq.py to pass args.trust_remote_code. I can provide those diffs below.

Also applies to: 73-84, 94-125, 183-209

examples/llm_ptq/hf_ptq.py (2)

490-515: Plumb trust_remote_code to VL helpers (security/compliance)

Forward the CLI flag to avoid forcing trust in remote code.

-                    text_response = run_text_only_generation(
-                        full_model, tokenizer, question, generation_config, args.pyt_ckpt_path
-                    )
+                    text_response = run_text_only_generation(
+                        full_model, tokenizer, question, generation_config, args.pyt_ckpt_path, args.trust_remote_code
+                    )
@@
-                run_vl_preview_generation(
-                    full_model, tokenizer, args.pyt_ckpt_path, "before quantization (VL test)"
-                )
+                run_vl_preview_generation(
+                    full_model, tokenizer, args.pyt_ckpt_path, "before quantization (VL test)", args.trust_remote_code
+                )
@@
-                    text_response = run_text_only_generation(
-                        full_model, tokenizer, question, generation_config, args.pyt_ckpt_path
-                    )
+                    text_response = run_text_only_generation(
+                        full_model, tokenizer, question, generation_config, args.pyt_ckpt_path, args.trust_remote_code
+                    )
@@
-                run_vl_preview_generation(
-                    full_model, tokenizer, args.pyt_ckpt_path, "after quantization (VL test)"
-                )
+                run_vl_preview_generation(
+                    full_model, tokenizer, args.pyt_ckpt_path, "after quantization (VL test)", args.trust_remote_code
+                )
#!/bin/bash
# Ensure no hardcoded trust_remote_code=True remains in vl utilities
rg -n "trust_remote_code\s*=\s*True" examples/llm_ptq/vlm_utils.py

Also applies to: 563-582


459-467: Fix KeyError in auto-quant mode and ensure vision blocks are disabled during search

When args.auto_quantize_bits is set, build_quant_cfg returns {}, so quant_cfg["quant_cfg"][...] raises KeyError and vision isn’t excluded. Handle both code paths and plumb disabled patterns to auto_quantize().

@@
-        if is_nemotron_vl:
-            print("Disabling quantization for vision components in Nemotron VL model")
-            quant_cfg["quant_cfg"]["*vision*"] = {"enable": False}
-            quant_cfg["quant_cfg"]["*image*"] = {"enable": False}
-            # Also disable radio model components specifically
-            quant_cfg["quant_cfg"]["*radio*"] = {"enable": False}
-            quant_cfg["quant_cfg"]["*visual*"] = {"enable": False}
+        if is_nemotron_vl:
+            print("Disabling quantization for vision components in Nemotron VL model")
+            nemotron_disabled = {
+                "*vision*": {"enable": False},
+                "*image*": {"enable": False},
+                "*radio*": {"enable": False},
+                "*visual*": {"enable": False},
+            }
+            if not args.auto_quantize_bits:
+                quant_cfg.setdefault("quant_cfg", {}).update(nemotron_disabled)
+            else:
+                extra_disabled_patterns = list(nemotron_disabled.keys())

Extend auto_quantize to accept these patterns and merge with existing disabled layers:

-def auto_quantize(
-    model, qformat, auto_quantize_bits, calib_dataloader, calibrate_loop, batch_size=1
-):
+def auto_quantize(
+    model, qformat, auto_quantize_bits, calib_dataloader, calibrate_loop, batch_size=1,
+    extra_disabled_layers=None,
+):
@@
-        disabled_layers=["*lm_head*"],
+        disabled_layers=(["*lm_head*"] + (extra_disabled_layers or [])),

Pass the patterns at the call site:

-        model = auto_quantize(
+        model = auto_quantize(
             model,
             args.qformat,
             args.auto_quantize_bits,
             calib_dataloader,
             calibrate_loop,
-            args.batch_size,
+            args.batch_size,
+            extra_disabled_layers=(extra_disabled_patterns if is_nemotron_vl else None),
         )
🧹 Nitpick comments (4)
modelopt/torch/export/model_utils.py (1)

114-157: Setter-less property edge case when reattaching the quantized language model

Function correctly locates language_model across common layouts. Downstream code may fail if parent.language_model is a @property without a setter. Recommend either:

  • Document that callers should handle setter-less properties, or
  • Return an additional hint (e.g., the attribute path to set), or
  • Provide a small helper set_language_model(parent, lm) that falls back to parent.model.language_model if needed.

If preferred, I can add the helper and update callers.
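
A minimal sketch of that helper, mirroring the fallback path suggested elsewhere in this review (the name set_language_model is hypothetical):

def set_language_model(parent, language_model) -> None:
    # Reattach a (quantized) language model onto its VL parent (sketch).
    prop = type(parent).__dict__.get("language_model")
    if isinstance(prop, property) and prop.fset is None:
        # Getter-only property: fall back to the nested layout common in VL models.
        if hasattr(parent, "model") and hasattr(parent.model, "language_model"):
            parent.model.language_model = language_model
            return
        raise AttributeError("language_model is read-only and no nested fallback was found")
    parent.language_model = language_model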

modelopt/torch/export/unified_export_hf.py (1)

139-180: Use logging instead of prints in export path

Replace prints with a module logger to avoid noisy stdout in libraries and to enable level control.

+import logging
+logger = logging.getLogger(__name__)
@@
-        is_vl_model = is_multimodal_model(model)
+        is_vl_model = is_multimodal_model(model)
@@
-                    print(
-                        "Found language_model component - running optimization on language model only"
-                    )
+                    logger.info("Found language_model – optimizing language model only")
@@
-                    print(
-                        f"Running optimization on language model with fake_input shape: {fake_input.shape}"
-                    )
+                    logger.debug("Optimizing language model with fake_input shape: %s", tuple(fake_input.shape))
@@
-                        language_model(fake_input)
-                        print("✅ Language model optimization completed successfully")
+                        language_model(fake_input)
+                        logger.info("Language model optimization completed")
@@
-                        print(f"Language model optimization failed: {e}")
-                        print("Continuing with export...")
+                        logger.warning("Language model optimization failed: %s. Continuing export…", e)
@@
-                    print("Warning: No language_model found in VL model - skipping optimization")
-                    print("This is unexpected for most VL models")
+                    logger.warning("No language_model found in VL model – skipping optimization (unexpected for most VLs)")

If desired, I can propagate this pattern to nearby messages as well.

examples/llm_ptq/hf_ptq.py (2)

533-539: Reattaching language_model: guard for getter-only properties

If parent_model.language_model is a @property without a setter, assignment will fail. Add a fallback to parent_model.model.language_model when present, else raise a clear error.

-                if parent_model is not None:
-                    print("Updating full_model with quantized language_model...")
-                    parent_model.language_model = model
+                if parent_model is not None:
+                    print("Updating full_model with quantized language_model...")
+                    prop = type(parent_model).__dict__.get("language_model")
+                    if isinstance(prop, property) and prop.fset is None:
+                        # common VLM layout: nested language_model under .model
+                        if hasattr(parent_model, "model") and hasattr(parent_model.model, "language_model"):
+                            parent_model.model.language_model = model
+                        else:
+                            raise AttributeError("language_model is read-only on parent; no fallback path found")
+                    else:
+                        parent_model.language_model = model

96-137: Avoid hidden dependency on global args inside auto_quantize()

auto_quantize() reads args.kv_cache_qformat from outer scope. Prefer passing needed values explicitly to keep the helper reusable and testable.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 1f31103 and f87a026.

📒 Files selected for processing (5)
  • CHANGELOG.rst (1 hunks)
  • examples/llm_ptq/hf_ptq.py (9 hunks)
  • examples/llm_ptq/vlm_utils.py (1 hunks)
  • modelopt/torch/export/model_utils.py (2 hunks)
  • modelopt/torch/export/unified_export_hf.py (3 hunks)
🧰 Additional context used
🧬 Code graph analysis (3)
modelopt/torch/export/unified_export_hf.py (1)
modelopt/torch/export/model_utils.py (2)
  • get_language_model_from_vl (114-157)
  • is_multimodal_model (74-111)
examples/llm_ptq/vlm_utils.py (2)
modelopt/torch/quantization/qtensor/base_qtensor.py (1)
  • to (114-122)
modelopt/torch/speculative/utils.py (1)
  • tokenize (213-230)
examples/llm_ptq/hf_ptq.py (2)
examples/llm_ptq/vlm_utils.py (2)
  • run_text_only_generation (161-233)
  • run_vl_preview_generation (24-158)
modelopt/torch/export/model_utils.py (2)
  • get_language_model_from_vl (114-157)
  • is_multimodal_model (74-111)
🔇 Additional comments (1)
modelopt/torch/export/model_utils.py (1)

63-63: Public API surface — looks good

Adding get_language_model_from_vl to __all__ is appropriate and consistent.

- Allow specifying ``calib_seq`` in ``examples/llm_ptq`` to set the maximum sequence length for calibration.
- Add support for MCore MoE PTQ/QAT/QAD.
- Add support for multi-node PTQ and export with FSDP2 in ``examples/llm_ptq/multinode_ptq.py``. See `examples/llm_ptq/README.md <https://github.com/NVIDIA/TensorRT-Model-Optimizer/tree/main/examples/llm_ptq#multi-node-post-training-quantization-with-fsdp2>`_ for more details.
- Add support for Nemotron Nano VL v1 & v2 models in FP8/NVFP4 PTQ workflow.

⚠️ Potential issue | 🟡 Minor

Forward-dated release entry

0.39 (2025-11-07) is in the future (today is 2025-10-23). Please mark this as Unreleased/TBD to avoid confusion until the release is cut.

-0.39 (2025-11-07)
+0.39 (Unreleased)

Committable suggestion skipped: line range outside the PR's diff.

🤖 Prompt for AI Agents
In CHANGELOG.rst around line 16, the release entry "0.39 (2025-11-07)" is
forward-dated; change the header to indicate it is not yet released (e.g., "0.39
(Unreleased)" or "0.39 (TBD)") and leave the content line "Add support for
Nemotron Nano VL v1 & v2 models in FP8/NVFP4 PTQ workflow." under that
Unreleased/TBD heading so the changelog does not show a future date.

@coderabbitai bot (Contributor) left a comment

Actionable comments posted: 1

♻️ Duplicate comments (2)
examples/llm_ptq/vlm_utils.py (2)

73-73: Security issue remains unaddressed: hardcoded trust_remote_code=True bypasses user's CLI choice.

This critical issue was flagged in a previous review but has not been resolved. Both AutoImageProcessor.from_pretrained (line 73) and AutoProcessor.from_pretrained (line 94) hardcode trust_remote_code=True, forcing this security-sensitive setting regardless of the user's args.trust_remote_code CLI flag.

Apply the diff from the previous review to add a trust_remote_code parameter to the function signature and pass it through to these calls. Callers in hf_ptq.py must also be updated to pass args.trust_remote_code.

Also applies to: 94-94


183-183: Security issue remains unaddressed: hardcoded trust_remote_code=True bypasses user's CLI choice.

This critical issue was flagged in a previous review but has not been resolved. AutoProcessor.from_pretrained hardcodes trust_remote_code=True, forcing this security-sensitive setting regardless of the user's args.trust_remote_code CLI flag.

Apply the diff from the previous review to add a trust_remote_code parameter to the function signature and pass it through to this call. Callers in hf_ptq.py must also be updated to pass args.trust_remote_code.

🧹 Nitpick comments (2)
modelopt/torch/export/unified_export_hf.py (1)

161-179: Consider structured logging for production environments.

The VL optimization path uses multiple print() statements for status updates. While consistent with the current file style, structured logging would provide better observability in production.

Additionally, the emoji (✅) on line 173 may not render correctly in all terminal environments or log aggregation systems.

examples/llm_ptq/vlm_utils.py (1)

180-229: Consider refactoring: v2 path logic is duplicated across both functions.

The v2 generation flow (lines 180-229) shares substantial logic with run_vl_preview_generation lines 92-146: creating messages, applying chat template, processing inputs, moving to device, generating, and decoding. This duplication increases maintenance burden and the risk of inconsistencies.

Consider extracting a helper function like _generate_v2(model, processor, tokenizer, messages, generation_config, images=None) to consolidate the v2 flow and reduce code duplication. However, this can be deferred if the current structure is sufficient for immediate needs.
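
One possible shape for such a helper, assuming the processor exposes apply_chat_template and that generation_config is a plain kwargs dict (a sketch, not this PR's code; tokenizer is kept only to match the suggested signature):

def _generate_v2(model, processor, tokenizer, messages, generation_config, images=None):
    # Shared v2 (generate-API) path for text-only and image prompts (sketch).
    prompt = processor.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)
    inputs = processor(text=[prompt], images=images, return_tensors="pt")
    inputs = {k: (v.to(model.device) if hasattr(v, "to") else v) for k, v in inputs.items()}
    output_ids = model.generate(**inputs, **generation_config)
    return processor.decode(output_ids[0], skip_special_tokens=True)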

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between f87a026 and f25a2eb.

📒 Files selected for processing (3)
  • examples/llm_ptq/vlm_utils.py (1 hunks)
  • modelopt/torch/export/model_utils.py (2 hunks)
  • modelopt/torch/export/unified_export_hf.py (3 hunks)
🧰 Additional context used
🧬 Code graph analysis (2)
modelopt/torch/export/unified_export_hf.py (1)
modelopt/torch/export/model_utils.py (2)
  • get_language_model_from_vl (114-157)
  • is_multimodal_model (74-111)
examples/llm_ptq/vlm_utils.py (2)
modelopt/torch/quantization/qtensor/base_qtensor.py (1)
  • to (114-122)
modelopt/torch/speculative/utils.py (1)
  • tokenize (213-230)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
  • GitHub Check: linux
  • GitHub Check: wait-checks / wait
  • GitHub Check: wait-checks / wait
🔇 Additional comments (5)
modelopt/torch/export/model_utils.py (2)

63-63: LGTM: Public API updated correctly.

The addition of "get_language_model_from_vl" to __all__ properly exposes the new public function.


139-150: Code properly handles property delegation with appropriate fallbacks; documentation would clarify intent.

The property detection logic in lines 141-147 uses valid Python introspection to check for properties and handles the case defensively. The assumption that a language_model property points to model.model.language_model is speculative but reasonable for models with property-based delegation patterns. The code correctly falls back to direct attribute access if the nested structure doesn't exist, and Pattern 2 (lines 153-154) properly handles remaining cases where no property exists.

The three-pattern design ensures complete coverage: Pattern 1 handles properties with potential delegation, Pattern 2 catches direct nested access, and Pattern 3 returns None when not found. No changes needed—the implementation is defensive and correct.

modelopt/torch/export/unified_export_hf.py (2)

58-58: LGTM: Import statement is correct.

The added imports for VL model detection and language model extraction are properly sourced from the model_utils module.


138-141: LGTM: VL model detection is well-placed.

The multimodal detection is appropriately positioned after input setup and before model-specific handling branches.

examples/llm_ptq/vlm_utils.py (1)

85-90: Confirm whether the different parameter passing for history is intentional or should be harmonized for consistency.

The review comment correctly identifies a real inconsistency in the codebase:

  • Line 85-90 (VL generation mode): model.chat() is called with keyword arguments and omits the history parameter entirely
  • Line 178 (text-only generation mode): model.chat() is called with positional arguments and explicitly passes history=None

Both calls target the same model.chat() method but use different parameter passing conventions. The question of whether omitting history is equivalent to passing history=None depends on the method's signature and default parameter values, which are defined in the externally loaded model code (via trust_remote_code=True) and cannot be verified locally in this repository.

The developer should confirm:

  1. Whether this difference is intentional (e.g., different modes require different defaults)
  2. If equivalent, whether both calls should use the same explicit style for consistency

Comment on lines 157 to 179
        elif is_vl_model:
            # For VL models, try to run optimization on just the language model part
            language_model, _ = get_language_model_from_vl(model)
            if language_model is not None:
                print(
                    "Found language_model component - running optimization on language model only"
                )

            if language_model is not None:
                # Run optimization on just the language model with the same input format as regular LLMs
                # Use the same fake_input tensor that regular LLMs use
                print(
                    f"Running optimization on language model with fake_input shape: {fake_input.shape}"
                )
                try:
                    language_model(fake_input)
                    print("✅ Language model optimization completed successfully")
                except Exception as e:
                    print(f"Language model optimization failed: {e}")
                    print("Continuing with export...")
            else:
                print("Warning: No language_model found in VL model - skipping optimization")
                print("This is unexpected for most VL models")

🛠️ Refactor suggestion | 🟠 Major

Consolidate redundant if statements.

Lines 160-163 and 165-176 contain consecutive if language_model is not None: checks. The first block only prints a message before the second block performs the actual work, making the first check redundant.

Apply this diff to consolidate the logic:

 elif is_vl_model:
     # For VL models, try to run optimization on just the language model part
     language_model, _ = get_language_model_from_vl(model)
     if language_model is not None:
         print(
             "Found language_model component - running optimization on language model only"
         )
-
-    if language_model is not None:
         # Run optimization on just the language model with the same input format as regular LLMs
         # Use the same fake_input tensor that regular LLMs use
         print(
             f"Running optimization on language model with fake_input shape: {fake_input.shape}"
         )
         try:
             language_model(fake_input)
             print("✅ Language model optimization completed successfully")
         except Exception as e:
             print(f"Language model optimization failed: {e}")
             print("Continuing with export...")
     else:
         print("Warning: No language_model found in VL model - skipping optimization")
         print("This is unexpected for most VL models")
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
Original:

        elif is_vl_model:
            # For VL models, try to run optimization on just the language model part
            language_model, _ = get_language_model_from_vl(model)
            if language_model is not None:
                print(
                    "Found language_model component - running optimization on language model only"
                )
            if language_model is not None:
                # Run optimization on just the language model with the same input format as regular LLMs
                # Use the same fake_input tensor that regular LLMs use
                print(
                    f"Running optimization on language model with fake_input shape: {fake_input.shape}"
                )
                try:
                    language_model(fake_input)
                    print("✅ Language model optimization completed successfully")
                except Exception as e:
                    print(f"Language model optimization failed: {e}")
                    print("Continuing with export...")
            else:
                print("Warning: No language_model found in VL model - skipping optimization")
                print("This is unexpected for most VL models")

Suggested:

        elif is_vl_model:
            # For VL models, try to run optimization on just the language model part
            language_model, _ = get_language_model_from_vl(model)
            if language_model is not None:
                print(
                    "Found language_model component - running optimization on language model only"
                )
                # Run optimization on just the language model with the same input format as regular LLMs
                # Use the same fake_input tensor that regular LLMs use
                print(
                    f"Running optimization on language model with fake_input shape: {fake_input.shape}"
                )
                try:
                    language_model(fake_input)
                    print("✅ Language model optimization completed successfully")
                except Exception as e:
                    print(f"Language model optimization failed: {e}")
                    print("Continuing with export...")
            else:
                print("Warning: No language_model found in VL model - skipping optimization")
                print("This is unexpected for most VL models")
🤖 Prompt for AI Agents
In modelopt/torch/export/unified_export_hf.py around lines 157 to 179, there are
two consecutive checks for "if language_model is not None" where the first only
prints a message and the second does the work; consolidate them into a single
check: perform the initial print about finding the language model, then
immediately print the fake_input shape and run the try/except block inside the
same "if language_model is not None" branch; remove the redundant earlier check
and its duplicated None test so behavior remains identical but with a single
conditional guarding the prints and optimization attempt (keep the warning
branch unchanged when language_model is None).

@coderabbitai bot (Contributor) left a comment

Actionable comments posted: 0

♻️ Duplicate comments (4)
examples/llm_ptq/hf_ptq.py (2)

458-466: Auto‑quant crash + fragile globals: guard quant_cfg in VL path, pass disabled patterns into auto_quantize, and stop using global args inside auto_quantize.

  • In auto‑quant mode, build_quant_cfg() returns {}, so quant_cfg["quant_cfg"][...] raises KeyError (Lines 458-466).
  • auto_quantize() accesses global args (Lines 137-143), which is brittle and breaks when imported.

Fix by:

  • Guarding the quant_cfg updates when not auto‑quant.
  • Passing Nemotron VL disable patterns to auto_quantize via a new extra_disabled_layers param.
  • Passing kv_cache_qformat into auto_quantize instead of reading global args.

Apply these diffs:

  1. Guard VL quant_cfg edits and collect patterns for auto‑quant:
@@
-        # For Nemotron VL models, disable quantization of vision components
-        if is_nemotron_vl:
-            print("Disabling quantization for vision components in Nemotron VL model")
-            quant_cfg["quant_cfg"]["*vision*"] = {"enable": False}
-            quant_cfg["quant_cfg"]["*image*"] = {"enable": False}
-            # Also disable radio model components specifically
-            quant_cfg["quant_cfg"]["*radio*"] = {"enable": False}
-            quant_cfg["quant_cfg"]["*visual*"] = {"enable": False}
+        # For Nemotron VL models, disable quantization of vision components
+        extra_disabled_patterns = None
+        if is_nemotron_vl:
+            print("Disabling quantization for vision components in Nemotron VL model")
+            nemotron_disabled = {
+                "*vision*": {"enable": False},
+                "*image*": {"enable": False},
+                "*radio*": {"enable": False},
+                "*visual*": {"enable": False},
+            }
+            if not args.auto_quantize_bits:
+                # Non auto-quant: mutate quant_cfg safely
+                quant_cfg.setdefault("quant_cfg", {}).update(nemotron_disabled)
+            else:
+                # Auto-quant: pass patterns down to auto_quantize()
+                extra_disabled_patterns = list(nemotron_disabled.keys())
  2. Plumb patterns and kv_cache_qformat into auto_quantize():
@@ def auto_quantize(
-def auto_quantize(
-    model, qformat, auto_quantize_bits, calib_dataloader, calibrate_loop, batch_size=1
-):
+def auto_quantize(
+    model,
+    qformat,
+    auto_quantize_bits,
+    calib_dataloader,
+    calibrate_loop,
+    batch_size=1,
+    *,
+    extra_disabled_layers=None,
+    kv_cache_qformat="none",
+):
@@
-    model, _ = mtq.auto_quantize(
+    model, _ = mtq.auto_quantize(
         model,
         constraints={"effective_bits": auto_quantize_bits},
         data_loader=calib_dataloader,
         forward_step=lambda model, batch: model(**batch),
         loss_func=loss_func,
@@
-        disabled_layers=["*lm_head*"],
+        disabled_layers=(["*lm_head*"] + (extra_disabled_layers or [])),
     )
@@
-    enable_quant_kv_cache = args.kv_cache_qformat != "none"
+    enable_quant_kv_cache = kv_cache_qformat != "none"
@@
-        kv_cache_quant_cfg = getattr(mtq, KV_QUANT_CFG_CHOICES[args.kv_cache_qformat])["quant_cfg"]
+        kv_cache_quant_cfg = getattr(mtq, KV_QUANT_CFG_CHOICES[kv_cache_qformat])["quant_cfg"]
  3. Pass the new params at the call site:
@@
-        model = auto_quantize(
+        model = auto_quantize(
             model,
             args.qformat,
             args.auto_quantize_bits,
             calib_dataloader,
             calibrate_loop,
             args.batch_size,
+            extra_disabled_layers=(extra_disabled_patterns if is_nemotron_vl else None),
+            kv_cache_qformat=args.kv_cache_qformat,
         )

Also applies to: 95-136, 137-153, 186-197


488-491: Plumb trust_remote_code flag to VLM helpers (don’t hardcode True).

Pass args.trust_remote_code through to helper calls to honor CLI and avoid unexpected code execution.

@@
-                    text_response = run_text_only_generation(
-                        full_model, tokenizer, question, generation_config, args.pyt_ckpt_path
-                    )
+                    text_response = run_text_only_generation(
+                        full_model, tokenizer, question, generation_config, args.pyt_ckpt_path, args.trust_remote_code
+                    )
@@
-                run_vl_preview_generation(
-                    full_model, tokenizer, args.pyt_ckpt_path, "before quantization (VL test)"
-                )
+                run_vl_preview_generation(
+                    full_model, tokenizer, args.pyt_ckpt_path, "before quantization (VL test)", args.trust_remote_code
+                )
@@
-                    text_response = run_text_only_generation(
-                        full_model, tokenizer, question, generation_config, args.pyt_ckpt_path
-                    )
+                    text_response = run_text_only_generation(
+                        full_model, tokenizer, question, generation_config, args.pyt_ckpt_path, args.trust_remote_code
+                    )
@@
-                run_vl_preview_generation(
-                    full_model, tokenizer, args.pyt_ckpt_path, "after quantization (VL test)"
-                )
+                run_vl_preview_generation(
+                    full_model, tokenizer, args.pyt_ckpt_path, "after quantization (VL test)", args.trust_remote_code
+                )
#!/bin/bash
# Verify no hardcoded trust_remote_code=True remains in helpers or call sites
rg -n "trust_remote_code=True|run_vl_preview_generation\(|run_text_only_generation\(" examples/llm_ptq -n -C1

Also applies to: 512-514, 563-565, 579-581

examples/llm_ptq/vlm_utils.py (2)

86-107: Do not force trust_remote_code=True; accept a flag.

Plumb a trust_remote_code: bool parameter and forward it to from_pretrained calls.

-def run_vl_preview_generation(model, tokenizer, model_path, stage_name):
+def run_vl_preview_generation(model, tokenizer, model_path, stage_name, trust_remote_code: bool = False):
@@
-            image_processor = AutoImageProcessor.from_pretrained(model_path, trust_remote_code=True)
+            image_processor = AutoImageProcessor.from_pretrained(
+                model_path, trust_remote_code=trust_remote_code
+            )
@@
-            processor = AutoProcessor.from_pretrained(model_path, trust_remote_code=True)
+            processor = AutoProcessor.from_pretrained(model_path, trust_remote_code=trust_remote_code)
@@
-def run_text_only_generation(model, tokenizer, question, generation_config, model_path):
+def run_text_only_generation(
+    model, tokenizer, question, generation_config, model_path, trust_remote_code: bool = False
+):
@@
-            processor = AutoProcessor.from_pretrained(model_path, trust_remote_code=True)
+            processor = AutoProcessor.from_pretrained(model_path, trust_remote_code=trust_remote_code)
#!/bin/bash
# After applying, confirm no hardcoded True remains here
rg -n "trust_remote_code=True" examples/llm_ptq/vlm_utils.py

Also applies to: 191-191


86-96: Ensure tensors are created and moved to device in v1 path.

AutoImageProcessor(...) without return_tensors produces non‑tensor data; your .to() loop won’t execute, and model.chat may error. Request tensors and move in one step.

-            image_processor = AutoImageProcessor.from_pretrained(model_path, trust_remote_code=True)
-
-            image_features = image_processor([image])  # Pass as list with single image
-
-            # Move image features to the same device as the model
-            model_device = model.device
-            for key, value in image_features.items():
-                if hasattr(value, "to"):  # Check if it's a tensor
-                    image_features[key] = value.to(model_device)
-                    print(f"    Moved {key} to {model_device}")
+            image_processor = AutoImageProcessor.from_pretrained(
+                model_path, trust_remote_code=trust_remote_code
+            )
+            # Ensure PT tensors are returned and move everything in one go
+            image_features = image_processor([image], return_tensors="pt")
+            model_device = model.device
+            image_features = {
+                k: (v.to(model_device) if hasattr(v, "to") else v)
+                for k, v in image_features.items()
+            }
+            for k, v in image_features.items():
+                if hasattr(v, "device"):
+                    print(f"    Moved {k} to {model_device}")
🧹 Nitpick comments (2)
examples/llm_ptq/hf_ptq.py (2)

287-289: VLM detection: consider model metadata in addition to path.

Path substring checks can misfire. Optionally include config- or arch‑based checks (e.g., getattr(model.config, "architectures", []) or model.config.model_type) for robustness.
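A minimal sketch of the suggested metadata-based check (the heuristics and attribute names are assumptions; tune them to the models you actually target):

def looks_like_nemotron_vl(model, model_path: str) -> bool:
    """Combine path hints with config/architecture metadata instead of the path alone."""
    arch_names = " ".join(getattr(model.config, "architectures", None) or []).lower()
    model_type = str(getattr(model.config, "model_type", "")).lower()
    path_hint = "nemotron" in model_path.lower() and "vl" in model_path.lower()
    config_hint = ("nemotron" in arch_names or "nemotron" in model_type) and (
        "vl" in arch_names or hasattr(model.config, "vision_config")
    )
    return path_hint or config_hint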


319-333: Quantizing only language_model: verify export expectations; consider pattern-based disable.

You quantize children with a disabled config to exclude non‑LM blocks. Confirm unified export relies on quantizer flags, not module identity. Alternatively, disable by pattern to avoid touching modules:

# Option: Use set_quantizer_by_cfg_context around export or quant to disable non-LM modules by pattern.
# e.g., with mtq.set_quantizer_by_cfg_context(model, {"*": {"enable": False}, "language_model*": {"enable": True}}): ...
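Spelled out, that pattern-based option might look like the sketch below, built around the context manager named in the comment (the wildcard keys are assumptions; verify the exact pattern syntax against the ModelOpt docs):

import modelopt.torch.quantization as mtq

# Disable every quantizer, then re-enable only the language tower by name pattern,
# for the duration of calibration (or export), instead of rewiring module configs.
with mtq.set_quantizer_by_cfg_context(
    model, {"*": {"enable": False}, "*language_model*": {"enable": True}}
):
    calibrate_loop(model)  # same calibration loop used elsewhere in hf_ptq.py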
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between f25a2eb and 613ee6c.

📒 Files selected for processing (2)
  • examples/llm_ptq/hf_ptq.py (9 hunks)
  • examples/llm_ptq/vlm_utils.py (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (2)
examples/llm_ptq/hf_ptq.py (2)
examples/llm_ptq/vlm_utils.py (2)
  • run_text_only_generation (172-241)
  • run_vl_preview_generation (24-169)
modelopt/torch/export/model_utils.py (2)
  • get_language_model_from_vl (114-157)
  • is_multimodal_model (74-111)
examples/llm_ptq/vlm_utils.py (2)
modelopt/torch/quantization/qtensor/base_qtensor.py (1)
  • to (114-122)
modelopt/torch/speculative/utils.py (1)
  • tokenize (213-230)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (5)
  • GitHub Check: linux
  • GitHub Check: wait-checks / wait
  • GitHub Check: wait-checks / wait
  • GitHub Check: code-quality
  • GitHub Check: build-docs
🔇 Additional comments (2)
examples/llm_ptq/hf_ptq.py (2)

655-660: Saving AutoConfig with attn_implementation: confirm Transformers accepts this kwarg.

Some configs may ignore/warn on unknown kwargs. Validate that passing attn_implementation through from_pretrained is supported for your target models.
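One quick way to validate this, sketched below (the checkpoint path is a placeholder and the internal attribute name may differ across Transformers versions):

from transformers import AutoConfig

cfg = AutoConfig.from_pretrained(
    "path/to/Nemotron-Nano-VL-12B-V2",  # placeholder path
    attn_implementation="eager",
    trust_remote_code=True,
)
# If the kwarg is honored, one of these should report "eager" rather than None.
print(getattr(cfg, "_attn_implementation", None), getattr(cfg, "attn_implementation", None))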


856-863: Behavior change: default --attn_implementation to "eager".

Previously None; now always "eager". Ensure this is intentional for all supported models and update docs accordingly.

@kevalmorabia97
Collaborator

@Edwardf0t1 please make sure the internal CI/CD in GitLab passes for llm_ptq and vlm_ptq before merging.

@kevalmorabia97 kevalmorabia97 requested a review from a team as a code owner October 23, 2025 17:26
Collaborator

@kevalmorabia97 kevalmorabia97 left a comment

Adding my codeowner approval. Didn't review example changes.

if getattr(model.config, "is_encoder_decoder", False):
    # For encoder-decoder models, we need to pass both the encoder and decoder input ids
    model(fake_input, decoder_input_ids=decoder_fake_input)
elif is_vl_model:
Collaborator

@cjluo-nv cjluo-nv Oct 23, 2025

What do we want to achieve with this code change?

Contributor Author

Handle VLMs properly when feeding fake inputs to the model, so we can run the forward pass without issues.

Collaborator

Why didn't we need this change before the Nemotron Nano VLM?

Contributor Author

This is related to the model forward pass signatures. Nemotron VLM has a tightly integrated multimodal architecture where the forward pass signature requires both vision and text inputs.

Collaborator

Then is this Nemotron VL specific? How about we just make it explicit instead of applying it to all multimodal models?
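For context, one hypothetical shape of the branch under discussion (not necessarily the PR's exact code): route the text-only fake input through the language tower so the warm-up forward pass does not require image inputs.

if getattr(model.config, "is_encoder_decoder", False):
    # Encoder-decoder models need both encoder and decoder input ids.
    model(fake_input, decoder_input_ids=decoder_fake_input)
elif is_vl_model:
    # Hypothetical handling: VL models whose forward signature expects vision inputs
    # are exercised through their language submodule with text-only fake inputs.
    getattr(model, "language_model", model)(fake_input)
else:
    model(fake_input)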

@Edwardf0t1 Edwardf0t1 force-pushed the zhiyu/support-nemotron-nano-vlm-v1-nvfp4 branch from f8d5bc8 to 57d388e on October 24, 2025 03:09
Signed-off-by: Zhiyu Cheng <[email protected]>
model.eval()

# If device_map was disabled (None), manually move model to target device
if device_map is None and device != "cpu":
Collaborator

what if device == "cpu"

Contributor Author

That was handled by HF's device_map="cpu" in L210.
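Putting both cases together, the placement logic being discussed amounts to roughly the following sketch (the helper name is hypothetical):

def finalize_device_placement(model, device_map, device):
    # Multimodal configs force device_map=None; plain CPU runs are already covered by
    # loading with device_map="cpu", so only the remaining case needs an explicit move.
    model.eval()
    if device_map is None and device != "cpu":
        model = model.to(device)
    return model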


# Generate preview before quantization
if is_nemotron_vl and tokenizer is not None:
    print("Running text-only preview generation for Nemotron VL model...")
Collaborator

Can you abstract lines 476-499 into a helper function? It can be re-used at 527-559.
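A hypothetical shape for that helper (names, prompt, and generation settings are illustrative, not the PR's final code):

from vlm_utils import run_text_only_generation, run_vl_preview_generation

def run_nemotron_vl_previews(full_model, tokenizer, model_path, stage_name):
    """Run a text-only preview followed by an image preview at the given stage."""
    print(f"Running text-only preview generation for Nemotron VL model ({stage_name})...")
    question = "What is the capital of France?"  # illustrative prompt
    generation_config = {"max_new_tokens": 64, "do_sample": False}  # assumed format
    text_response = run_text_only_generation(
        full_model, tokenizer, question, generation_config, model_path
    )
    if text_response:
        print(f"Text-only response: {text_response}")
    run_vl_preview_generation(full_model, tokenizer, model_path, stage_name)

The before/after call sites would then each reduce to a single call, e.g. run_nemotron_vl_previews(full_model, tokenizer, args.pyt_ckpt_path, "before quantization").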

Signed-off-by: Zhiyu Cheng <[email protected]>
Signed-off-by: Zhiyu Cheng <[email protected]>
Signed-off-by: Zhiyu Cheng <[email protected]>
Signed-off-by: Zhiyu Cheng <[email protected]>
Signed-off-by: Zhiyu Cheng <[email protected]>
Collaborator

@cjluo-nv cjluo-nv left a comment

Thanks for addressing the comments!

@kevalmorabia97 kevalmorabia97 merged commit 8745a3c into main Oct 24, 2025
26 checks passed
@kevalmorabia97 kevalmorabia97 deleted the zhiyu/support-nemotron-nano-vlm-v1-nvfp4 branch October 24, 2025 09:23