Enable Nemotron nano vlm v1&v2 nvfp4 PTQ workflow #347
Conversation
Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually. Contributors can view more details about this message here.
Walkthrough

Adds multimodal (vision-language) support across PTQ/export/runtime: multimodal config detection and device handling, Nemotron VL detection and VL-specific quantization/export flows, VLM preview and text-only generation helpers, language_model extraction for exports, and a changelog/CLI default update.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant CLI
    participant hf_ptq
    participant vlm_utils
    participant Model
    Note over CLI,hf_ptq: PTQ start — detect multimodal / Nemotron VL
    CLI->>hf_ptq: parse args, load config
    hf_ptq->>hf_ptq: detect _is_multimodal_config / is_nemotron_vl
    alt Nemotron VL
        hf_ptq->>hf_ptq: adjust quant_cfg (disable vision/image components)
        hf_ptq->>vlm_utils: run_text_only_generation(...)
        vlm_utils->>Model: chat() or generate() (text-only)
        vlm_utils-->>hf_ptq: text preview (or None)
        hf_ptq->>vlm_utils: run_vl_preview_generation(...)
        vlm_utils->>Model: generate() with image
        vlm_utils-->>hf_ptq: vl preview (or None)
        hf_ptq->>hf_ptq: quantize language_model only
        hf_ptq->>hf_ptq: rewire quantized language_model into parent VL model
    else Non-VL
        hf_ptq->>hf_ptq: standard PTQ quantize whole model
    end
    hf_ptq->>CLI: export model + processors (save image processor if VL)
```
```mermaid
sequenceDiagram
    participant vlm_utils
    participant Processor
    participant Model
    Note over vlm_utils,Model: run_text_only_generation (v1 vs v2)
    vlm_utils->>vlm_utils: check if model has chat
    alt v1 (chat)
        vlm_utils->>Model: model.chat(question, gen_config)
        Model-->>vlm_utils: text response
    else v2 (generate)
        vlm_utils->>Processor: AutoProcessor.from_pretrained(model_path)
        Processor-->>vlm_utils: processed inputs (text tensors)
        vlm_utils->>Model: model.generate(inputs, gen_config)
        Model-->>vlm_utils: token ids
        vlm_utils->>Processor: processor.decode(ids)
        Processor-->>vlm_utils: text response
    end
    vlm_utils-->>caller: return text or None
```
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~60 minutes
Pre-merge checks and finishing touches: ❌ Failed checks (1 warning), ✅ Passed checks (2 passed)
Codecov Report

✅ All modified and coverable lines are covered by tests.

Additional details and impacted files:

```
@@            Coverage Diff             @@
##             main     #347      +/-   ##
==========================================
- Coverage   73.41%   73.40%   -0.01%
==========================================
  Files         180      180
  Lines       18053    18047       -6
==========================================
- Hits        13253    13248       -5
+ Misses       4800     4799       -1
```

☔ View full report in Codecov by Sentry.
Force-pushed d5597e2 to 16b9f8f
Force-pushed 4b6b388 to 1f31103
Actionable comments posted: 2
🧹 Nitpick comments (10)
examples/llm_ptq/example_utils.py (3)
42-53: Avoid drift: centralize multimodal detection criteria

This config-only detector partially duplicates logic in modelopt.torch.export.model_utils.is_multimodal_model. Consider defining a single source of truth (shared helper or shared predicates) to prevent divergence across files. Also consider covering common flags like vision_tower/mm_projector seen in LLaVA-like configs.
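A minimal sketch of what such a shared predicate could look like (the helper name and flag list are assumptions, not existing ModelOpt APIs):

```python
# Hypothetical shared predicate; both the config-only detector and is_multimodal_model
# could delegate to it so the criteria live in one place.
_MULTIMODAL_CONFIG_ATTRS = ("vision_config", "vision_tower", "mm_projector", "image_token_id")


def config_looks_multimodal(config) -> bool:
    """Return True if the HF config exposes any common vision/multimodal attribute."""
    return any(getattr(config, attr, None) is not None for attr in _MULTIMODAL_CONFIG_ATTRS)
```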
348-417: Hugging Face cache resolution: use HF Hub constants and harden fallbacks

TRANSFORMERS_CACHE may be deprecated or not set; prefer huggingface_hub constants and broaden cache discovery.
Apply this diff to select cache root more robustly:
```diff
-    from transformers.utils import TRANSFORMERS_CACHE
-
-    # Look for the model in the cache directory
-    cache_pattern = os.path.join(TRANSFORMERS_CACHE, "models--*")
+    try:
+        from huggingface_hub.constants import HF_HUB_CACHE as _CACHE_ROOT
+    except Exception:
+        from transformers.utils import TRANSFORMERS_CACHE as _CACHE_ROOT
+
+    # Look for the model in the cache directory
+    cache_pattern = os.path.join(_CACHE_ROOT, "models--*")
```

Optionally, prefer snapshot_download() when available to avoid brittle cache walking (already attempted above).
Test with both HF Hub env var and default cache to ensure path resolution is correct.
419-483: Copy custom files recursively; avoid missing subpackages

Current glob() only matches the top level. Some repos place custom modeling code in subfolders.
Apply this diff:
```diff
-    for pattern in custom_file_patterns:
-        for file_path in source_dir.glob(pattern):
+    for pattern in custom_file_patterns:
+        for file_path in source_dir.rglob(pattern):
             if file_path.is_file():
                 # Skip config.json and model.safetensors.index.json as they're handled separately
                 if file_path.name in ["config.json", "model.safetensors.index.json"]:
                     continue
                 dest_path = export_dir / file_path.name
```

If name collisions are possible across subdirs, consider mirroring relative paths under export_dir instead of flattening.
Run a quick dry-run listing on a VL repo with nested modeling_*.py to confirm coverage.
modelopt/torch/export/unified_export_hf.py (2)
138-144: Use shared multimodal detector; avoid string heuristics

This re-implements VL detection and special-cases "nemotron" by name. Prefer modelopt.torch.export.model_utils.is_multimodal_model(model) to keep behavior consistent and reduce false positives/negatives.
Apply this diff:
```diff
-    is_vl_model = (
-        hasattr(model.config, "vision_config")
-        or hasattr(model, "vision_model")
-        or "nemotron" in getattr(model, "name_or_path", "").lower()
-    )
+    from modelopt.torch.export.model_utils import is_multimodal_model
+    is_vl_model = is_multimodal_model(model)
```
160-189: Call language_model with explicit kwargs and basic mask

Passing fake_input positionally may break on some architectures. Use explicit input_ids and a minimal attention_mask.
Apply this diff:
```diff
-    try:
-        language_model(fake_input)
+    try:
+        attn_mask = torch.ones_like(fake_input, dtype=torch.long, device=fake_input.device)
+        language_model(input_ids=fake_input, attention_mask=attn_mask)
```

Keeps hooks effective while reducing accidental signature mismatches.
Spot-check with a Nemotron VL and a LLaVA-style model.
examples/llm_ptq/hf_ptq.py (3)
289-290: Nemotron VL detection looks reasonable; add model.name_or_path fallback

If users pass a resolved local path, args.pyt_ckpt_path may not contain "nemotron". You can also check getattr(full_model, "name_or_path", "") to improve detection.
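A rough sketch of the suggested fallback (the substring heuristic mirrors the check described above and is an assumption about the current implementation):

```python
def detect_nemotron_vl(ckpt_path: str, model) -> bool:
    """Match against both the CLI checkpoint path and the model's own name_or_path."""
    candidates = (ckpt_path or "", getattr(model, "name_or_path", "") or "")
    return any("nemotron" in c.lower() and "vl" in c.lower() for c in candidates)
```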
481-523: Preview path: robust, but guard missing tokenizer earlier

The path already raises if tokenizer is None. All good. Consider truncating decoded question to a shorter seed to reduce prompt length variability across datasets.
682-710: Image processor export: add return_tensors safety and avoid re-imports

Minor: os/shutil are available globally in this module; import once at top for style. No functional issue.
examples/llm_ptq/vlm_utils.py (2)
75-83: Ensure tensors from image processor

Some AutoImageProcessor variants return PIL or lists if return_tensors is omitted. Add return_tensors="pt" for consistency.
Apply this diff:
```diff
-    image_features = image_processor([image])  # Pass as list with single image
+    image_features = image_processor(images=[image], return_tensors="pt")
```
204-211: Processor text-only path: prefer images=[] over None

Some processors validate field presence; using images=[] improves compatibility.
Apply this diff:
```diff
-    inputs = processor(
-        text=[prompt],
-        images=None,  # No images for text-only
-        return_tensors="pt",
-    )
+    inputs = processor(text=[prompt], images=[], return_tensors="pt")
```

Validate against both v1/v2 Nemotron pipelines.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (4)
- examples/llm_ptq/example_utils.py (3 hunks)
- examples/llm_ptq/hf_ptq.py (6 hunks)
- examples/llm_ptq/vlm_utils.py (1 hunks)
- modelopt/torch/export/unified_export_hf.py (2 hunks)
🧰 Additional context used
🧬 Code graph analysis (3)
examples/llm_ptq/vlm_utils.py (2)
  - modelopt/torch/quantization/qtensor/base_qtensor.py (1): to (114-122)
  - modelopt/torch/speculative/utils.py (1): tokenize (213-230)
examples/llm_ptq/example_utils.py (1)
  - modelopt/torch/quantization/qtensor/base_qtensor.py (1): to (114-122)
examples/llm_ptq/hf_ptq.py (3)
  - examples/llm_ptq/vlm_utils.py (2): run_text_only_generation (161-233), run_vl_preview_generation (24-158)
  - modelopt/torch/export/model_utils.py (1): is_multimodal_model (74-111)
  - modelopt/torch/export/unified_export_megatron.py (1): save_pretrained (298-462)
🔇 Additional comments (7)
examples/llm_ptq/example_utils.py (2)
313-317: Manual model.to(device) after device_map=None looks OK

This is compatible with QTensorWrapper.to and custom quant tensors. No change needed.
If any submodules remain on CPU after .to(), consider logging a per-parameter summary in debug mode to catch stragglers during calibration.
204-216: Respect sequential device mapping for vision-language models to reduce OOM risk

The review identifies a valid issue: unconditionally setting device_map=None for vision-language models prevents the use_seq_device_map flag from being respected. The current code sets device_map=None at line 212 for all VL models, but the check for use_seq_device_map (line 242) only applies in the non-VILA branch, leaving VILA+VL models without the optimization.

The proposed fix correctly moves the logic inside the VL detection block to conditionally apply "sequential" device mapping when requested, which would benefit both VILA and non-VILA vision-language models.
Apply the suggested diff at lines 204-212.
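A sketch of the intended behavior (function and argument names follow the review narrative; the actual branch layout in example_utils.py may differ):

```python
def pick_device_map(is_multimodal: bool, use_seq_device_map: bool, default_map="auto"):
    """Respect --use_seq_device_map for VL models instead of always forcing device_map=None."""
    if is_multimodal:
        return "sequential" if use_seq_device_map else None
    return "sequential" if use_seq_device_map else default_map
```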
examples/llm_ptq/hf_ptq.py (4)
661-666: Saving AutoConfig with attn_implementation: good addition

This captures runtime-attention choice in export.
669-680: Skip AutoProcessor for Nemotron VL: acceptable

Nemotron uses separate image processor. Logs are helpful.
897-901: Default --attn_implementation='eager' is a behavior change

Good for stability, but it changes defaults; document in changelog and confirm it doesn't regress throughput on common models.
Please benchmark a small LLM before/after to ensure acceptable perf, and mention this change in the PR notes.
540-544: Review comment is unnecessary and based on unfounded assumptions

The proposed property detection fix appears to address a non-existent issue. language_model is not a read-only property in HuggingFace Transformers — it's a regular module attribute. Nemotron VL models on HuggingFace expose the underlying text LM as the language_model attribute you can inspect or replace.

The current code pattern with hasattr(full_model, "language_model") already safely handles cases where the attribute doesn't exist. The theoretical pattern of checking type().__dict__.get("language_model") for property detection introduces unnecessary complexity for a scenario not documented in the Nemotron VL model implementations.

If the assignment fails, the root cause would more likely be that full_model lacks the attribute entirely (caught by hasattr) rather than it being a read-only property.

Likely an incorrect or invalid review comment.
examples/llm_ptq/vlm_utils.py (1)
155-159: Graceful failure path is good

Clear error helps during preview smoke tests.
```python
# For Nemotron VL models, disable quantization of vision components
if is_nemotron_vl:
    print("Disabling quantization for vision components in Nemotron VL model")
    quant_cfg["quant_cfg"]["*vision*"] = {"enable": False}
    quant_cfg["quant_cfg"]["*image*"] = {"enable": False}
    # Also disable radio model components specifically
    quant_cfg["quant_cfg"]["*radio*"] = {"enable": False}
    quant_cfg["quant_cfg"]["*visual*"] = {"enable": False}
```
🧩 Analysis chain
Bug: mutating quant_cfg when auto_quantize_bits is set (quant_cfg={})
build_quant_cfg returns {} for auto-quant mode, so writing quant_cfg["quant_cfg"][...] raises KeyError and vision modules aren’t disabled. In auto-quant, disable modules via disabled_layers.
Apply this diff to cover both code paths:
- if is_nemotron_vl:
- print("Disabling quantization for vision components in Nemotron VL model")
- quant_cfg["quant_cfg"]["*vision*"] = {"enable": False}
- quant_cfg["quant_cfg"]["*image*"] = {"enable": False}
- # Also disable radio model components specifically
- quant_cfg["quant_cfg"]["*radio*"] = {"enable": False}
- quant_cfg["quant_cfg"]["*visual*"] = {"enable": False}
+ if is_nemotron_vl:
+ print("Disabling quantization for vision components in Nemotron VL model")
+ nemotron_disabled = {
+ "*vision*": {"enable": False},
+ "*image*": {"enable": False},
+ "*radio*": {"enable": False},
+ "*visual*": {"enable": False},
+ }
+ if not args.auto_quantize_bits:
+ quant_cfg.setdefault("quant_cfg", {}).update(nemotron_disabled)
+ else:
+ # Pass patterns to auto_quant to skip these modules during search
+        extra_disabled_patterns = list(nemotron_disabled.keys())

And extend auto_quantize() to accept these patterns:
(addition outside this hunk)
def auto_quantize(
- model, qformat, auto_quantize_bits, calib_dataloader, calibrate_loop, batch_size=1
+ model, qformat, auto_quantize_bits, calib_dataloader, calibrate_loop, batch_size=1,
+ extra_disabled_layers=None,
):
@@
- disabled_layers=["*lm_head*"],
+        disabled_layers=(["*lm_head*"] + (extra_disabled_layers or [])),

Finally, pass extra_disabled_patterns when calling auto_quantize():
(addition near the call site)
- model = auto_quantize(
+ model = auto_quantize(
model,
args.qformat,
args.auto_quantize_bits,
calib_dataloader,
calibrate_loop,
- args.batch_size,
+ args.batch_size,
+ extra_disabled_layers=(extra_disabled_patterns if is_nemotron_vl else None),
)

Please run a short auto-quant session on a Nemotron VL to confirm vision blocks remain unquantized.
I'll verify the concerns in this review comment by examining the codebase structure and the specific functions involved.
I'll search the codebase to verify the claims about the build_quant_cfg return value and the auto_quantize function signature.
🏁 Script executed:
#!/bin/bash
# Find and examine the relevant function definitions and code
echo "=== Checking file structure and examining hf_ptq.py ==="
fd "hf_ptq.py" --type f
echo -e "\n=== Examining lines 460-480 to see context ==="
sed -n '460,480p' examples/llm_ptq/hf_ptq.py
echo -e "\n=== Finding build_quant_cfg function ==="
rg -n "def build_quant_cfg" --type py
echo -e "\n=== Finding auto_quantize function signature ==="
rg -n "def auto_quantize" --type py -A 5
echo -e "\n=== Finding calls to auto_quantize ==="
rg -n "auto_quantize\(" --type py -B 2 -A 5
echo -e "\n=== Searching for args.auto_quantize_bits usage ==="
rg -n "auto_quantize_bits" examples/llm_ptq/hf_ptq.py --type py -B 2 -A 2Length of output: 19066
🏁 Script executed:
#!/bin/bash
# Examine the key functions and usage patterns
echo "=== Lines 460-480 in hf_ptq.py ==="
sed -n '460,480p' examples/llm_ptq/hf_ptq.py
echo -e "\n=== Finding build_quant_cfg function definition ==="
rg -n "def build_quant_cfg" --type py -A 20
echo -e "\n=== Finding auto_quantize function definition ==="
rg -n "def auto_quantize" --type py -A 10
echo -e "\n=== Finding calls to auto_quantize ==="
rg -n "model = auto_quantize" --type py -B 3 -A 8
echo -e "\n=== Checking quant_cfg initialization and conditional branches ==="
rg -n "if.*auto_quantize_bits" examples/llm_ptq/hf_ptq.py --type py -B 3 -A 8Length of output: 10706
🏁 Script executed:
#!/bin/bash
# Get complete build_quant_cfg function
echo "=== Full build_quant_cfg function ==="
sed -n '55,100p' examples/llm_ptq/example_utils.py
echo -e "\n=== Check control flow around Nemotron VL code (lines 450-480) ==="
sed -n '450,480p' examples/llm_ptq/hf_ptq.py
echo -e "\n=== Full auto_quantize function signature in hf_ptq.py ==="
sed -n '96,120p' examples/llm_ptq/hf_ptq.py
echo -e "\n=== Check disabled_layers usage in auto_quantize ==="
rg -n "disabled_layers" examples/llm_ptq/hf_ptq.py -B 2 -A 2Length of output: 4324
Bug confirmed: KeyError when accessing quant_cfg["quant_cfg"] in auto-quantize mode
The issue is real. When args.auto_quantize_bits is set, build_quant_cfg() returns an empty dict {} (examples/llm_ptq/example_utils.py:64-65). The Nemotron VL code at lines 466-474 unconditionally accesses quant_cfg["quant_cfg"][...] without checking if the key exists, causing a KeyError. Additionally, the auto_quantize() function (line 96) has no mechanism to accept and pass through extra disabled layer patterns—disabled_layers is hardcoded to ["*lm_head*"] at line 135.
The proposed diff correctly addresses this by:

- Adding conditional logic to handle both auto-quant and non-auto-quant paths
- Extending auto_quantize() to accept an extra_disabled_layers parameter
- Passing the Nemotron VL patterns when calling auto_quantize()
🤖 Prompt for AI Agents
In examples/llm_ptq/hf_ptq.py around lines 466 to 474, the code unconditionally
indexes quant_cfg["quant_cfg"] which raises KeyError in auto-quantize mode
(build_quant_cfg returns {}); update the Nemotron VL branch to first detect if
args.auto_quantize_bits is set and, if so, call auto_quantize(model, bits,
extra_disabled_layers=patterns) passing the vision/image/radio/visual patterns;
otherwise ensure quant_cfg has a "quant_cfg" dict (create one if missing) before
assigning the pattern keys and values. Also modify the auto_quantize(...)
function signature to accept an extra_disabled_layers param that merges with the
existing ["*lm_head*"] list and uses it when building disabled_layers so the
Nemotron VL disables those extra patterns during auto quantization.
Actionable comments posted: 1
♻️ Duplicate comments (3)
examples/llm_ptq/vlm_utils.py (1)
24-35: Do not hardcode trust_remote_code=True; plumb from caller

Both helpers bypass the user's CLI choice by forcing trust_remote_code=True. Accept a flag and pass it through. Also fix v1 image processing and image resource handling.
```diff
-def run_vl_preview_generation(model, tokenizer, model_path, stage_name):
+def run_vl_preview_generation(model, tokenizer, model_path, stage_name, trust_remote_code: bool = False):
@@
-    image = Image.open(img_path)
+    # Open and normalize image safely
+    with Image.open(img_path) as img:
+        image = img.convert("RGB")
@@
-    image_processor = AutoImageProcessor.from_pretrained(model_path, trust_remote_code=True)
+    image_processor = AutoImageProcessor.from_pretrained(
+        model_path, trust_remote_code=trust_remote_code
+    )
@@
-    image_features = image_processor([image])  # Pass as list with single image
+    image_features = image_processor(
+        images=[image], return_tensors="pt"
+    )  # single image as list
@@
-    processor = AutoProcessor.from_pretrained(model_path, trust_remote_code=True)
+    processor = AutoProcessor.from_pretrained(
+        model_path, trust_remote_code=trust_remote_code
+    )
```

```diff
-def run_text_only_generation(model, tokenizer, question, generation_config, model_path):
+def run_text_only_generation(
+    model, tokenizer, question, generation_config, model_path, trust_remote_code: bool = False
+):
@@
-    processor = AutoProcessor.from_pretrained(model_path, trust_remote_code=True)
+    processor = AutoProcessor.from_pretrained(
+        model_path, trust_remote_code=trust_remote_code
+    )
```

Please update call sites in examples/llm_ptq/hf_ptq.py to pass args.trust_remote_code. I can provide those diffs below.
Also applies to: 73-84, 94-125, 183-209
examples/llm_ptq/hf_ptq.py (2)
490-515: Plumb trust_remote_code to VL helpers (security/compliance)

Forward the CLI flag to avoid forcing trust in remote code.
```diff
-        text_response = run_text_only_generation(
-            full_model, tokenizer, question, generation_config, args.pyt_ckpt_path
-        )
+        text_response = run_text_only_generation(
+            full_model, tokenizer, question, generation_config, args.pyt_ckpt_path, args.trust_remote_code
+        )
@@
-        run_vl_preview_generation(
-            full_model, tokenizer, args.pyt_ckpt_path, "before quantization (VL test)"
-        )
+        run_vl_preview_generation(
+            full_model, tokenizer, args.pyt_ckpt_path, "before quantization (VL test)", args.trust_remote_code
+        )
@@
-        text_response = run_text_only_generation(
-            full_model, tokenizer, question, generation_config, args.pyt_ckpt_path
-        )
+        text_response = run_text_only_generation(
+            full_model, tokenizer, question, generation_config, args.pyt_ckpt_path, args.trust_remote_code
+        )
@@
-        run_vl_preview_generation(
-            full_model, tokenizer, args.pyt_ckpt_path, "after quantization (VL test)"
-        )
+        run_vl_preview_generation(
+            full_model, tokenizer, args.pyt_ckpt_path, "after quantization (VL test)", args.trust_remote_code
+        )
```

```bash
#!/bin/bash
# Ensure no hardcoded trust_remote_code=True remains in vl utilities
rg -n "trust_remote_code\s*=\s*True" examples/llm_ptq/vlm_utils.py
```

Also applies to: 563-582
459-467: Fix KeyError in auto-quant mode and ensure vision blocks are disabled during search

When args.auto_quantize_bits is set, build_quant_cfg returns {}, so quant_cfg["quant_cfg"][...] raises KeyError and vision isn't excluded. Handle both code paths and plumb disabled patterns to auto_quantize().
```diff
@@
-    if is_nemotron_vl:
-        print("Disabling quantization for vision components in Nemotron VL model")
-        quant_cfg["quant_cfg"]["*vision*"] = {"enable": False}
-        quant_cfg["quant_cfg"]["*image*"] = {"enable": False}
-        # Also disable radio model components specifically
-        quant_cfg["quant_cfg"]["*radio*"] = {"enable": False}
-        quant_cfg["quant_cfg"]["*visual*"] = {"enable": False}
+    if is_nemotron_vl:
+        print("Disabling quantization for vision components in Nemotron VL model")
+        nemotron_disabled = {
+            "*vision*": {"enable": False},
+            "*image*": {"enable": False},
+            "*radio*": {"enable": False},
+            "*visual*": {"enable": False},
+        }
+        if not args.auto_quantize_bits:
+            quant_cfg.setdefault("quant_cfg", {}).update(nemotron_disabled)
+        else:
+            extra_disabled_patterns = list(nemotron_disabled.keys())
```

Extend auto_quantize to accept these patterns and merge with existing disabled layers:
```diff
-def auto_quantize(
-    model, qformat, auto_quantize_bits, calib_dataloader, calibrate_loop, batch_size=1
-):
+def auto_quantize(
+    model, qformat, auto_quantize_bits, calib_dataloader, calibrate_loop, batch_size=1,
+    extra_disabled_layers=None,
+):
@@
-        disabled_layers=["*lm_head*"],
+        disabled_layers=(["*lm_head*"] + (extra_disabled_layers or [])),
```

Pass the patterns at the call site:
```diff
-        model = auto_quantize(
+        model = auto_quantize(
             model,
             args.qformat,
             args.auto_quantize_bits,
             calib_dataloader,
             calibrate_loop,
-            args.batch_size,
+            args.batch_size,
+            extra_disabled_layers=(extra_disabled_patterns if is_nemotron_vl else None),
         )
```
🧹 Nitpick comments (4)
modelopt/torch/export/model_utils.py (1)
114-157: Setter-less property edge case when reattaching the quantized language model

Function correctly locates language_model across common layouts. Downstream code may fail if parent.language_model is a @property without a setter. Recommend either:
- Document that callers should handle setter-less properties, or
- Return an additional hint (e.g., the attribute path to set), or
- Provide a small helper set_language_model(parent, lm) that falls back to parent.model.language_model if needed.
If preferred, I can add the helper and update callers.
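For reference, a sketch of such a helper (attribute layouts are the common HF VLM patterns discussed above, not verified against every model):

```python
def set_language_model(parent, lm):
    """Reattach a (quantized) language model onto its VL parent.

    Falls back to the nested parent.model.language_model slot when language_model
    is a getter-only property on the parent class.
    """
    prop = type(parent).__dict__.get("language_model")
    if isinstance(prop, property) and prop.fset is None:
        inner = getattr(parent, "model", None)
        if inner is not None and hasattr(inner, "language_model"):
            inner.language_model = lm
            return
        raise AttributeError("language_model is read-only and no nested fallback slot was found")
    parent.language_model = lm
```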
modelopt/torch/export/unified_export_hf.py (1)
139-180: Use logging instead of prints in export path

Replace prints with a module logger to avoid noisy stdout in libraries and to enable level control.
```diff
+import logging
+logger = logging.getLogger(__name__)
@@
-    is_vl_model = is_multimodal_model(model)
+    is_vl_model = is_multimodal_model(model)
@@
-            print(
-                "Found language_model component - running optimization on language model only"
-            )
+            logger.info("Found language_model – optimizing language model only")
@@
-            print(
-                f"Running optimization on language model with fake_input shape: {fake_input.shape}"
-            )
+            logger.debug("Optimizing language model with fake_input shape: %s", tuple(fake_input.shape))
@@
-                language_model(fake_input)
-                print("✅ Language model optimization completed successfully")
+                language_model(fake_input)
+                logger.info("Language model optimization completed")
@@
-                print(f"Language model optimization failed: {e}")
-                print("Continuing with export...")
+                logger.warning("Language model optimization failed: %s. Continuing export…", e)
@@
-            print("Warning: No language_model found in VL model - skipping optimization")
-            print("This is unexpected for most VL models")
+            logger.warning("No language_model found in VL model – skipping optimization (unexpected for most VLs)")
```

If desired, I can propagate this pattern to nearby messages as well.
examples/llm_ptq/hf_ptq.py (2)
533-539: Reattaching language_model: guard for getter-only properties

If parent_model.language_model is a @property without a setter, assignment will fail. Add a fallback to parent_model.model.language_model when present, else raise a clear error.
```diff
-    if parent_model is not None:
-        print("Updating full_model with quantized language_model...")
-        parent_model.language_model = model
+    if parent_model is not None:
+        print("Updating full_model with quantized language_model...")
+        prop = type(parent_model).__dict__.get("language_model")
+        if isinstance(prop, property) and prop.fset is None:
+            # common VLM layout: nested language_model under .model
+            if hasattr(parent_model, "model") and hasattr(parent_model.model, "language_model"):
+                parent_model.model.language_model = model
+            else:
+                raise AttributeError("language_model is read-only on parent; no fallback path found")
+        else:
+            parent_model.language_model = model
```
96-137: Avoid hidden dependency on global args inside auto_quantize()

auto_quantize() reads args.kv_cache_qformat from outer scope. Prefer passing needed values explicitly to keep the helper reusable and testable.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (5)
- CHANGELOG.rst (1 hunks)
- examples/llm_ptq/hf_ptq.py (9 hunks)
- examples/llm_ptq/vlm_utils.py (1 hunks)
- modelopt/torch/export/model_utils.py (2 hunks)
- modelopt/torch/export/unified_export_hf.py (3 hunks)
🧰 Additional context used
🧬 Code graph analysis (3)
modelopt/torch/export/unified_export_hf.py (1)
  - modelopt/torch/export/model_utils.py (2): get_language_model_from_vl (114-157), is_multimodal_model (74-111)
examples/llm_ptq/vlm_utils.py (2)
  - modelopt/torch/quantization/qtensor/base_qtensor.py (1): to (114-122)
  - modelopt/torch/speculative/utils.py (1): tokenize (213-230)
examples/llm_ptq/hf_ptq.py (2)
  - examples/llm_ptq/vlm_utils.py (2): run_text_only_generation (161-233), run_vl_preview_generation (24-158)
  - modelopt/torch/export/model_utils.py (2): get_language_model_from_vl (114-157), is_multimodal_model (74-111)
🔇 Additional comments (1)
modelopt/torch/export/model_utils.py (1)
63-63: Public API surface — looks good

Adding get_language_model_from_vl to __all__ is appropriate and consistent.
- Allow specifying ``calib_seq`` in ``examples/llm_ptq`` to set the maximum sequence length for calibration.
- Add support for MCore MoE PTQ/QAT/QAD.
- Add support for multi-node PTQ and export with FSDP2 in ``examples/llm_ptq/multinode_ptq.py``. See `examples/llm_ptq/README.md <https://github.com/NVIDIA/TensorRT-Model-Optimizer/tree/main/examples/llm_ptq#multi-node-post-training-quantization-with-fsdp2>`_ for more details.
- Add support for Nemotron Nano VL v1 & v2 models in FP8/NVFP4 PTQ workflow.
Forward-dated release entry
0.39 (2025-11-07) is in the future (today is 2025-10-23). Please mark this as Unreleased/TBD to avoid confusion until the release is cut.
-0.39 (2025-11-07)
+0.39 (Unreleased)

Committable suggestion skipped: line range outside the PR's diff.
🤖 Prompt for AI Agents
In CHANGELOG.rst around line 16, the release entry "0.39 (2025-11-07)" is
forward-dated; change the header to indicate it is not yet released (e.g., "0.39
(Unreleased)" or "0.39 (TBD)") and leave the content line "Add support for
Nemotron Nano VL v1 & v2 models in FP8/NVFP4 PTQ workflow." under that
Unreleased/TBD heading so the changelog does not show a future date.
Actionable comments posted: 1
♻️ Duplicate comments (2)
examples/llm_ptq/vlm_utils.py (2)
73-73: Security issue remains unaddressed: hardcoded trust_remote_code=True bypasses user's CLI choice.

This critical issue was flagged in a previous review but has not been resolved. Both AutoImageProcessor.from_pretrained (line 73) and AutoProcessor.from_pretrained (line 94) hardcode trust_remote_code=True, forcing this security-sensitive setting regardless of the user's args.trust_remote_code CLI flag.

Apply the diff from the previous review to add a trust_remote_code parameter to the function signature and pass it through to these calls. Callers in hf_ptq.py must also be updated to pass args.trust_remote_code.

Also applies to: 94-94
183-183: Security issue remains unaddressed: hardcoded trust_remote_code=True bypasses user's CLI choice.

This critical issue was flagged in a previous review but has not been resolved. AutoProcessor.from_pretrained hardcodes trust_remote_code=True, forcing this security-sensitive setting regardless of the user's args.trust_remote_code CLI flag.

Apply the diff from the previous review to add a trust_remote_code parameter to the function signature and pass it through to this call. Callers in hf_ptq.py must also be updated to pass args.trust_remote_code.
🧹 Nitpick comments (2)
modelopt/torch/export/unified_export_hf.py (1)
161-179: Consider structured logging for production environments.

The VL optimization path uses multiple print() statements for status updates. While consistent with the current file style, structured logging would provide better observability in production.

Additionally, the emoji (✅) on line 173 may not render correctly in all terminal environments or log aggregation systems.
examples/llm_ptq/vlm_utils.py (1)
180-229: Consider refactoring: v2 path logic is duplicated across both functions.

The v2 generation flow (lines 180-229) shares substantial logic with run_vl_preview_generation lines 92-146: creating messages, applying chat template, processing inputs, moving to device, generating, and decoding. This duplication increases maintenance burden and the risk of inconsistencies.

Consider extracting a helper function like _generate_v2(model, processor, tokenizer, messages, generation_config, images=None) to consolidate the v2 flow and reduce code duplication. However, this can be deferred if the current structure is sufficient for immediate needs.
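One possible shape for that helper, assuming generation_config is a plain dict of generate() kwargs and that the chat template lives on the processor (for some checkpoints it lives on the tokenizer instead):

```python
def _generate_v2(model, processor, tokenizer, messages, generation_config, images=None):
    """Shared v2 flow sketch: chat template -> processor -> generate -> decode."""
    prompt = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    inputs = processor(text=[prompt], images=images, return_tensors="pt").to(model.device)
    output_ids = model.generate(**inputs, **generation_config)
    # processor.decode delegates to the tokenizer; tokenizer is kept for parity with callers.
    return processor.decode(output_ids[0], skip_special_tokens=True)
```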
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (3)
- examples/llm_ptq/vlm_utils.py (1 hunks)
- modelopt/torch/export/model_utils.py (2 hunks)
- modelopt/torch/export/unified_export_hf.py (3 hunks)
🧰 Additional context used
🧬 Code graph analysis (2)
modelopt/torch/export/unified_export_hf.py (1)
  - modelopt/torch/export/model_utils.py (2): get_language_model_from_vl (114-157), is_multimodal_model (74-111)
examples/llm_ptq/vlm_utils.py (2)
  - modelopt/torch/quantization/qtensor/base_qtensor.py (1): to (114-122)
  - modelopt/torch/speculative/utils.py (1): tokenize (213-230)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
- GitHub Check: linux
- GitHub Check: wait-checks / wait
- GitHub Check: wait-checks / wait
🔇 Additional comments (5)
modelopt/torch/export/model_utils.py (2)
63-63: LGTM: Public API updated correctly.

The addition of "get_language_model_from_vl" to __all__ properly exposes the new public function.
139-150: Code properly handles property delegation with appropriate fallbacks; documentation would clarify intent.

The property detection logic in lines 141-147 uses valid Python introspection to check for properties and handles the case defensively. The assumption that a language_model property points to model.model.language_model is speculative but reasonable for models with property-based delegation patterns. The code correctly falls back to direct attribute access if the nested structure doesn't exist, and Pattern 2 (lines 153-154) properly handles remaining cases where no property exists.

The three-pattern design ensures complete coverage: Pattern 1 handles properties with potential delegation, Pattern 2 catches direct nested access, and Pattern 3 returns None when not found. No changes needed — the implementation is defensive and correct.
modelopt/torch/export/unified_export_hf.py (2)
58-58: LGTM: Import statement is correct.

The added imports for VL model detection and language model extraction are properly sourced from the model_utils module.
138-141: LGTM: VL model detection is well-placed.

The multimodal detection is appropriately positioned after input setup and before model-specific handling branches.
examples/llm_ptq/vlm_utils.py (1)
85-90: Confirm whether the different parameter passing for history is intentional or should be harmonized for consistency.

The review comment correctly identifies a real inconsistency in the codebase:

- Lines 85-90 (VL generation mode): model.chat() is called with keyword arguments and omits the history parameter entirely
- Line 178 (text-only generation mode): model.chat() is called with positional arguments and explicitly passes history=None

Both calls target the same model.chat() method but use different parameter passing conventions. The question of whether omitting history is equivalent to passing history=None depends on the method's signature and default parameter values, which are defined in the externally loaded model code (via trust_remote_code=True) and cannot be verified locally in this repository.

The developer should confirm:
- Whether this difference is intentional (e.g., different modes require different defaults)
- If equivalent, whether both calls should use the same explicit style for consistency
```python
        elif is_vl_model:
            # For VL models, try to run optimization on just the language model part
            language_model, _ = get_language_model_from_vl(model)
            if language_model is not None:
                print(
                    "Found language_model component - running optimization on language model only"
                )

            if language_model is not None:
                # Run optimization on just the language model with the same input format as regular LLMs
                # Use the same fake_input tensor that regular LLMs use
                print(
                    f"Running optimization on language model with fake_input shape: {fake_input.shape}"
                )
                try:
                    language_model(fake_input)
                    print("✅ Language model optimization completed successfully")
                except Exception as e:
                    print(f"Language model optimization failed: {e}")
                    print("Continuing with export...")
            else:
                print("Warning: No language_model found in VL model - skipping optimization")
                print("This is unexpected for most VL models")
```
🛠️ Refactor suggestion | 🟠 Major
Consolidate redundant if statements.
Lines 160-163 and 165-176 contain consecutive if language_model is not None: checks. The first block only prints a message before the second block performs the actual work, making the first check redundant.
Apply this diff to consolidate the logic:
elif is_vl_model:
# For VL models, try to run optimization on just the language model part
language_model, _ = get_language_model_from_vl(model)
if language_model is not None:
print(
"Found language_model component - running optimization on language model only"
)
-
- if language_model is not None:
# Run optimization on just the language model with the same input format as regular LLMs
# Use the same fake_input tensor that regular LLMs use
print(
f"Running optimization on language model with fake_input shape: {fake_input.shape}"
)
try:
language_model(fake_input)
print("✅ Language model optimization completed successfully")
except Exception as e:
print(f"Language model optimization failed: {e}")
print("Continuing with export...")
else:
print("Warning: No language_model found in VL model - skipping optimization")
print("This is unexpected for most VL models")📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
Suggested change:

```python
        elif is_vl_model:
            # For VL models, try to run optimization on just the language model part
            language_model, _ = get_language_model_from_vl(model)
            if language_model is not None:
                print(
                    "Found language_model component - running optimization on language model only"
                )
                # Run optimization on just the language model with the same input format as regular LLMs
                # Use the same fake_input tensor that regular LLMs use
                print(
                    f"Running optimization on language model with fake_input shape: {fake_input.shape}"
                )
                try:
                    language_model(fake_input)
                    print("✅ Language model optimization completed successfully")
                except Exception as e:
                    print(f"Language model optimization failed: {e}")
                    print("Continuing with export...")
            else:
                print("Warning: No language_model found in VL model - skipping optimization")
                print("This is unexpected for most VL models")
```
🤖 Prompt for AI Agents
In modelopt/torch/export/unified_export_hf.py around lines 157 to 179, there are
two consecutive checks for "if language_model is not None" where the first only
prints a message and the second does the work; consolidate them into a single
check: perform the initial print about finding the language model, then
immediately print the fake_input shape and run the try/except block inside the
same "if language_model is not None" branch; remove the redundant earlier check
and its duplicated None test so behavior remains identical but with a single
conditional guarding the prints and optimization attempt (keep the warning
branch unchanged when language_model is None).
Actionable comments posted: 0
♻️ Duplicate comments (4)
examples/llm_ptq/hf_ptq.py (2)
458-466: Auto‑quant crash + fragile globals: guard quant_cfg in VL path, pass disabled patterns into auto_quantize, and stop using global args inside auto_quantize.
- In auto‑quant mode, build_quant_cfg() returns {}, so quant_cfg["quant_cfg"][...] raises KeyError (Lines 458-466).
- auto_quantize() accesses global args (Lines 137-143), which is brittle and breaks when imported.
Fix by:
- Guarding the quant_cfg updates when not auto‑quant.
- Passing Nemotron VL disable patterns to auto_quantize via a new extra_disabled_layers param.
- Passing kv_cache_qformat into auto_quantize instead of reading global args.
Apply these diffs:
- Guard VL quant_cfg edits and collect patterns for auto‑quant:
```diff
@@
-    # For Nemotron VL models, disable quantization of vision components
-    if is_nemotron_vl:
-        print("Disabling quantization for vision components in Nemotron VL model")
-        quant_cfg["quant_cfg"]["*vision*"] = {"enable": False}
-        quant_cfg["quant_cfg"]["*image*"] = {"enable": False}
-        # Also disable radio model components specifically
-        quant_cfg["quant_cfg"]["*radio*"] = {"enable": False}
-        quant_cfg["quant_cfg"]["*visual*"] = {"enable": False}
+    # For Nemotron VL models, disable quantization of vision components
+    extra_disabled_patterns = None
+    if is_nemotron_vl:
+        print("Disabling quantization for vision components in Nemotron VL model")
+        nemotron_disabled = {
+            "*vision*": {"enable": False},
+            "*image*": {"enable": False},
+            "*radio*": {"enable": False},
+            "*visual*": {"enable": False},
+        }
+        if not args.auto_quantize_bits:
+            # Non auto-quant: mutate quant_cfg safely
+            quant_cfg.setdefault("quant_cfg", {}).update(nemotron_disabled)
+        else:
+            # Auto-quant: pass patterns down to auto_quantize()
+            extra_disabled_patterns = list(nemotron_disabled.keys())
```
- Plumb patterns and kv_cache_qformat into auto_quantize():
```diff
@@ def auto_quantize(
-def auto_quantize(
-    model, qformat, auto_quantize_bits, calib_dataloader, calibrate_loop, batch_size=1
-):
+def auto_quantize(
+    model,
+    qformat,
+    auto_quantize_bits,
+    calib_dataloader,
+    calibrate_loop,
+    batch_size=1,
+    *,
+    extra_disabled_layers=None,
+    kv_cache_qformat="none",
+):
@@
-    model, _ = mtq.auto_quantize(
+    model, _ = mtq.auto_quantize(
         model,
         constraints={"effective_bits": auto_quantize_bits},
         data_loader=calib_dataloader,
         forward_step=lambda model, batch: model(**batch),
         loss_func=loss_func,
@@
-        disabled_layers=["*lm_head*"],
+        disabled_layers=(["*lm_head*"] + (extra_disabled_layers or [])),
     )
@@
-    enable_quant_kv_cache = args.kv_cache_qformat != "none"
+    enable_quant_kv_cache = kv_cache_qformat != "none"
@@
-    kv_cache_quant_cfg = getattr(mtq, KV_QUANT_CFG_CHOICES[args.kv_cache_qformat])["quant_cfg"]
+    kv_cache_quant_cfg = getattr(mtq, KV_QUANT_CFG_CHOICES[kv_cache_qformat])["quant_cfg"]
```
- Pass the new params at the call site:
```diff
@@
-        model = auto_quantize(
+        model = auto_quantize(
             model,
             args.qformat,
             args.auto_quantize_bits,
             calib_dataloader,
             calibrate_loop,
             args.batch_size,
+            extra_disabled_layers=(extra_disabled_patterns if is_nemotron_vl else None),
+            kv_cache_qformat=args.kv_cache_qformat,
         )
```

Also applies to: 95-136, 137-153, 186-197
488-491: Plumb trust_remote_code flag to VLM helpers (don't hardcode True).

Pass args.trust_remote_code through to helper calls to honor CLI and avoid unexpected code execution.
```diff
@@
-        text_response = run_text_only_generation(
-            full_model, tokenizer, question, generation_config, args.pyt_ckpt_path
-        )
+        text_response = run_text_only_generation(
+            full_model, tokenizer, question, generation_config, args.pyt_ckpt_path, args.trust_remote_code
+        )
@@
-        run_vl_preview_generation(
-            full_model, tokenizer, args.pyt_ckpt_path, "before quantization (VL test)"
-        )
+        run_vl_preview_generation(
+            full_model, tokenizer, args.pyt_ckpt_path, "before quantization (VL test)", args.trust_remote_code
+        )
@@
-        text_response = run_text_only_generation(
-            full_model, tokenizer, question, generation_config, args.pyt_ckpt_path
-        )
+        text_response = run_text_only_generation(
+            full_model, tokenizer, question, generation_config, args.pyt_ckpt_path, args.trust_remote_code
+        )
@@
-        run_vl_preview_generation(
-            full_model, tokenizer, args.pyt_ckpt_path, "after quantization (VL test)"
-        )
+        run_vl_preview_generation(
+            full_model, tokenizer, args.pyt_ckpt_path, "after quantization (VL test)", args.trust_remote_code
+        )
```

```bash
#!/bin/bash
# Verify no hardcoded trust_remote_code=True remains in helpers or call sites
rg -n "trust_remote_code=True|run_vl_preview_generation\(|run_text_only_generation\(" examples/llm_ptq -n -C1
```

Also applies to: 512-514, 563-565, 579-581
examples/llm_ptq/vlm_utils.py (2)
86-107: Do not force trust_remote_code=True; accept a flag.

Plumb a trust_remote_code: bool parameter and forward it to from_pretrained calls.
```diff
-def run_vl_preview_generation(model, tokenizer, model_path, stage_name):
+def run_vl_preview_generation(model, tokenizer, model_path, stage_name, trust_remote_code: bool = False):
@@
-    image_processor = AutoImageProcessor.from_pretrained(model_path, trust_remote_code=True)
+    image_processor = AutoImageProcessor.from_pretrained(
+        model_path, trust_remote_code=trust_remote_code
+    )
@@
-    processor = AutoProcessor.from_pretrained(model_path, trust_remote_code=True)
+    processor = AutoProcessor.from_pretrained(model_path, trust_remote_code=trust_remote_code)
@@
-def run_text_only_generation(model, tokenizer, question, generation_config, model_path):
+def run_text_only_generation(
+    model, tokenizer, question, generation_config, model_path, trust_remote_code: bool = False
+):
@@
-    processor = AutoProcessor.from_pretrained(model_path, trust_remote_code=True)
+    processor = AutoProcessor.from_pretrained(model_path, trust_remote_code=trust_remote_code)
```

```bash
#!/bin/bash
# After applying, confirm no hardcoded True remains here
rg -n "trust_remote_code=True" examples/llm_ptq/vlm_utils.py
```

Also applies to: 191-191
86-96: Ensure tensors are created and moved to device in v1 path.

AutoImageProcessor(...) without return_tensors produces non-tensor data; your .to() loop won't execute, and model.chat may error. Request tensors and move in one step.
```diff
-    image_processor = AutoImageProcessor.from_pretrained(model_path, trust_remote_code=True)
-
-    image_features = image_processor([image])  # Pass as list with single image
-
-    # Move image features to the same device as the model
-    model_device = model.device
-    for key, value in image_features.items():
-        if hasattr(value, "to"):  # Check if it's a tensor
-            image_features[key] = value.to(model_device)
-            print(f"  Moved {key} to {model_device}")
+    image_processor = AutoImageProcessor.from_pretrained(
+        model_path, trust_remote_code=trust_remote_code
+    )
+    # Ensure PT tensors are returned and move everything in one go
+    image_features = image_processor([image], return_tensors="pt")
+    model_device = model.device
+    image_features = {
+        k: (v.to(model_device) if hasattr(v, "to") else v)
+        for k, v in image_features.items()
+    }
+    for k, v in image_features.items():
+        if hasattr(v, "device"):
+            print(f"  Moved {k} to {model_device}")
```
🧹 Nitpick comments (2)
examples/llm_ptq/hf_ptq.py (2)
287-289: VLM detection: consider model metadata in addition to path.

Path substring checks can misfire. Optionally include config- or arch-based checks (e.g., getattr(model.config, "architectures", []) or model.config.model_type) for robustness.
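A sketch of what that could look like (the architecture/model_type strings to match are illustrative assumptions):

```python
def looks_like_nemotron_vl(model, ckpt_path: str = "") -> bool:
    """Combine path heuristics with config metadata for sturdier detection."""
    config = getattr(model, "config", None)
    arch = " ".join(getattr(config, "architectures", None) or []).lower()
    model_type = str(getattr(config, "model_type", "")).lower()
    haystack = " ".join([ckpt_path.lower(), arch, model_type])
    return "nemotron" in haystack and "vl" in haystack
```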
319-333: Quantizing only language_model: verify export expectations; consider pattern-based disable.

You quantize children with a disabled config to exclude non-LM blocks. Confirm unified export relies on quantizer flags, not module identity. Alternatively, disable by pattern to avoid touching modules:
```python
# Option: Use set_quantizer_by_cfg_context around export or quant to disable non-LM modules by pattern.
# e.g., with mtq.set_quantizer_by_cfg_context(model, {"*": {"enable": False}, "language_model*": {"enable": True}}): ...
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
- examples/llm_ptq/hf_ptq.py (9 hunks)
- examples/llm_ptq/vlm_utils.py (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (2)
examples/llm_ptq/hf_ptq.py (2)
  - examples/llm_ptq/vlm_utils.py (2): run_text_only_generation (172-241), run_vl_preview_generation (24-169)
  - modelopt/torch/export/model_utils.py (2): get_language_model_from_vl (114-157), is_multimodal_model (74-111)
examples/llm_ptq/vlm_utils.py (2)
  - modelopt/torch/quantization/qtensor/base_qtensor.py (1): to (114-122)
  - modelopt/torch/speculative/utils.py (1): tokenize (213-230)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (5)
- GitHub Check: linux
- GitHub Check: wait-checks / wait
- GitHub Check: wait-checks / wait
- GitHub Check: code-quality
- GitHub Check: build-docs
🔇 Additional comments (2)
examples/llm_ptq/hf_ptq.py (2)
655-660: Saving AutoConfig with attn_implementation: confirm Transformers accepts this kwarg.

Some configs may ignore/warn on unknown kwargs. Validate that passing attn_implementation through from_pretrained is supported for your target models.
856-863: Behavior change: default --attn_implementation to "eager".

Previously None; now always "eager". Ensure this is intentional for all supported models and update docs accordingly.
@Edwardf0t1 please make sure internal cicd in gitlab passes for llm_ptq and vlm_ptq before merging
kevalmorabia97
left a comment
Adding my codeowner approval. Didnt review example changes
```python
if getattr(model.config, "is_encoder_decoder", False):
    # For encoder-decoder models, we need to pass both the encoder and decoder input ids
    model(fake_input, decoder_input_ids=decoder_fake_input)
elif is_vl_model:
```
what do we want to achieve on this code change?
Handle VLMs properly when feeding fake inputs to the model, so we can run the forward pass without issues.
why we don't need this change before the nemotron nano VLM?
This is related to the model forward pass signatures. Nemotron VLM has a tightly integrated multimodal architecture where the forward pass signature requires both vision and text inputs.
then is this nemotron VL specific. How about we just make it explicit instead of applying this to all multimodal?
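One way to make that explicit, sketched under the assumption that a Nemotron-specific gate (rather than the generic multimodal check) should control the language-model-only forward pass; the name-matching heuristic is illustrative:

```python
def should_forward_language_model_only(model, is_vl_model: bool) -> bool:
    """Gate the LM-only fake-input path on Nemotron VL checkpoints specifically."""
    name = getattr(model, "name_or_path", "") or getattr(
        getattr(model, "config", None), "_name_or_path", ""
    )
    return is_vl_model and "nemotron" in str(name).lower()
```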
Force-pushed f8d5bc8 to 57d388e
```python
model.eval()

# If device_map was disabled (None), manually move model to target device
if device_map is None and device != "cpu":
```
what if device == "cpu"
That was handled by HF's device_map="cpu" in L210.
examples/llm_ptq/hf_ptq.py (outdated)

```python
# Generate preview before quantization
if is_nemotron_vl and tokenizer is not None:
    print("Running text-only preview generation for Nemotron VL model...")
```
can you abstract lines 476-499 to a helper function? It can be re-used in 527-559
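A sketch of such a helper, assuming the vlm_utils functions shown earlier in this review (with the proposed trust_remote_code parameter) and illustrative prompt/generation settings:

```python
def run_nemotron_vl_preview(full_model, tokenizer, ckpt_path, stage, trust_remote_code=False):
    """Fold the duplicated before/after-quantization preview logic into one call."""
    question = "Describe what you would expect to see in a photo of a sunny park."  # illustrative
    generation_config = {"max_new_tokens": 64, "do_sample": False}  # illustrative defaults
    text = run_text_only_generation(
        full_model, tokenizer, question, generation_config, ckpt_path, trust_remote_code
    )
    print(f"[{stage}] text-only preview: {text}")
    run_vl_preview_generation(
        full_model, tokenizer, ckpt_path, f"{stage} (VL test)", trust_remote_code
    )
```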
cjluo-nv
left a comment
Thanks for addressing the comments!
What does this PR do?
Type of change: New model support
Overview: This PR is to enable Nemotron nano vlm v1&v2 nvfp4 PTQ flow.
Usage
Testing
Before your PR is "Ready for review"
Additional Information
Summary by CodeRabbit
New Features
Chores
Documentation