
Conversation


@Yuening-wa Yuening-wa commented Aug 23, 2025

What does this PR do?

Type of change: new feature

Overview: Add support for INT8 weight-only per-channel quantization. The output INT8 quantized checkpoint is in Hugging Face format and can be used directly in the TRTLLM PyTorch workflow.

Usage

# quantize the model by ModelOpt
python examples/llm_ptq/hf_ptq.py \
    --pyt_ckpt_path $model_path \
    --qformat "int8_wo" --kv_cache_qformat "none" \
    --export_fmt hf \
    --export_path $output_path

# do inference in TRTLLM
python3 $trtllm_path/examples/llm-api/quickstart_advanced.py --model_dir $output_path
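
For reference, the equivalent flow can also be scripted with the ModelOpt Python API. The sketch below is illustrative only: the model name, calibration text, and output directory are placeholders, and it assumes the mtq.INT8_WEIGHT_ONLY_CFG added in this PR together with the existing unified HF export entry point (export_hf_checkpoint).

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

import modelopt.torch.quantization as mtq
from modelopt.torch.export import export_hf_checkpoint

model_path = "meta-llama/Llama-3.1-8B-Instruct"  # placeholder model
model = AutoModelForCausalLM.from_pretrained(model_path, torch_dtype=torch.bfloat16).cuda()
tokenizer = AutoTokenizer.from_pretrained(model_path)

def forward_loop(m):
    # Minimal single-batch forward pass for "max" calibration; real runs use a
    # proper calibration dataset as in hf_ptq.py.
    batch = tokenizer("calibration text", return_tensors="pt").to(m.device)
    m(**batch)

# INT8 weight-only, per-channel quantization (config added by this PR).
model = mtq.quantize(model, mtq.INT8_WEIGHT_ONLY_CFG, forward_loop)

# Hugging Face-format checkpoint consumable by the TRTLLM PyTorch workflow.
export_hf_checkpoint(model, export_dir="./model-int8-wo")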

Testing

Before your PR is "Ready for review"

  • Make sure you read and follow Contributor guidelines
  • Is this change backward compatible?: Yes/No
  • Did you write any new necessary tests?: Yes/No
  • Did you add or update any necessary documentation?: Yes/No
  • Did you update Changelog?: Yes/No

Additional Information

Summary by CodeRabbit

  • New Features

    • Added weight-only INT8 quantization (int8_wo) support across export, quantization utilities, and public config options; selection and conversion paths updated.
  • Examples

    • Enabled int8_wo in the Hugging Face PTQ example and validation script, expanding accepted formats and messages.
  • Documentation

    • Fixed a typo in the quantization guide.
  • Tests

    • Extended PTQ and export tests to include and validate int8_wo.


copy-pr-bot bot commented Aug 23, 2025

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@Yuening-wa Yuening-wa force-pushed the user/yueningl/support_int8_wo_quantization branch from 4f99116 to aa960ea on August 25, 2025 08:04

codecov bot commented Aug 25, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 73.87%. Comparing base (d5c88e7) to head (1b8036e).
⚠️ Report is 10 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #263      +/-   ##
==========================================
- Coverage   73.93%   73.87%   -0.07%     
==========================================
  Files         172      172              
  Lines       17408    17440      +32     
==========================================
+ Hits        12871    12884      +13     
- Misses       4537     4556      +19     

☔ View full report in Codecov by Sentry.

@Yuening-wa Yuening-wa force-pushed the user/yueningl/support_int8_wo_quantization branch from aa960ea to d6f8908 on August 25, 2025 08:27
@kevalmorabia97 kevalmorabia97 requested review from a team as code owners September 2, 2025 14:29
@Yuening-wa Yuening-wa force-pushed the user/yueningl/support_int8_wo_quantization branch from d6f8908 to d989313 on September 8, 2025 15:27

coderabbitai bot commented Sep 8, 2025

Walkthrough

Adds a weight-only int8 quantization option ("int8_wo") across configs, export utilities, examples, scripts, and tests; updates quantization decision logic to distinguish SQ vs WO, adds related constants, and fixes a documentation typo.

Changes

  • Documentation (docs/source/guides/_compress_quantized_models.rst)
    Fixed typo: “initaializing” → “initializing”.
  • HF PTQ Example (examples/llm_ptq/hf_ptq.py)
    Added int8_wo to QUANT_CFG_CHOICES; allowed int8/int8_wo in HF export validation; extended auto-quantize to accept int8_wo.
  • HF PTQ Script (examples/llm_ptq/scripts/huggingface_example.sh)
    Accepted int8_wo in QFORMAT validation blocks; updated error messages to list int8_wo.
  • Export Core (modelopt/torch/export/model_config.py, modelopt/torch/export/quant_utils.py)
    Added QUANTIZATION_INT8_WO; updated quantization detection: 8-bit returns INT8_SQ only if the input quantizer is present/enabled, otherwise INT8_WO; unified to/from quantized weight handling for INT8_SQ/WO; clarified the weight-scaling comment.
  • Quantization Config (modelopt/torch/quantization/config.py)
    Added INT8_WEIGHT_ONLY_CFG (weight-only 8-bit config, input quantizer disabled) and exported it in available choices.
  • Tests (tests/examples/llm_ptq/test_llm_ptq.py, tests/gpu/torch/export/test_export.py)
    PTQ tests: replaced prior entries with PTQCommand(quant="int8_wo", export_fmt="hf"); export tests: import and parameterize INT8_WEIGHT_ONLY_CFG with expected block size 0.

Sequence Diagram(s)

sequenceDiagram
  autonumber
  actor User
  participant CLI as CLI/Script
  participant PTQ as HF_PTQ_Example
  participant Quant as Quant_Utils
  participant Export as Export/Conversion

  User->>CLI: run with QFORMAT=int8_wo
  CLI->>PTQ: validate args (accept int8_wo)
  PTQ->>Quant: request auto_quantize / export decision
  Note over Quant: Determine per-layer quantization mode
  Quant->>Quant: if weight.num_bits==8 and input_quant enabled → INT8_SQ\nelse → INT8_WO
  Quant->>Export: to_quantized_weight (INT8_SQ or INT8_WO)
  Export-->>PTQ: return quantized weights/artifacts
  PTQ-->>User: emit HF artifacts (int8_wo)

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~40 minutes

Pre-merge checks (2 passed, 1 warning)

❌ Failed checks (1 warning)
  • Docstring Coverage (⚠️ Warning): Docstring coverage is 62.50%, which is below the required threshold of 80.00%. You can run @coderabbitai generate docstrings to improve docstring coverage.
✅ Passed checks (2 passed)
  • Description Check (✅ Passed): Check skipped because CodeRabbit’s high-level summary is enabled.
  • Title Check (✅ Passed): The title “Support INT8 Weight-Only Quantization” clearly and concisely describes the main feature introduced by the pull request, namely adding support for weight-only 8-bit quantization. It is specific enough that a reviewer scanning the project history will immediately understand the purpose of the change without unnecessary detail or noise, and the phrasing aligns directly with the PR objectives while avoiding generic terms.

Poem

A rabbit taps code with a tiny toe,
New carrots labeled int8_wo.
Weights shrink down, the tests all show,
Scripts nod kindly, exports glow.
Docs fixed a typo — hop, go, go!


@Yuening-wa Yuening-wa force-pushed the user/yueningl/support_int8_wo_quantization branch from d989313 to 12d955e on September 8, 2025 15:28

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 5

🧹 Nitpick comments (2)
modelopt/torch/quantization/config.py (1)

634-635: Remember to surface this in user-facing docs/tables.

Please add “INT8 Weight-only (W8A16)” to the “Quantization Formats” table to avoid discoverability gaps. I can send a doc patch if you want.

modelopt/torch/export/quant_utils.py (1)

92-122: Unused helper (maybe_transpose_expert_weight_dimensions).

Defined but not used. Either wire it into the packing paths that need it or drop it to avoid dead code.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between b233ad1 and 12d955e.

📒 Files selected for processing (8)
  • docs/source/guides/_compress_quantized_models.rst (1 hunks)
  • examples/llm_ptq/hf_ptq.py (3 hunks)
  • examples/llm_ptq/scripts/huggingface_example.sh (2 hunks)
  • modelopt/torch/export/model_config.py (2 hunks)
  • modelopt/torch/export/quant_utils.py (5 hunks)
  • modelopt/torch/quantization/config.py (2 hunks)
  • tests/examples/llm_ptq/test_llm_ptq.py (1 hunks)
  • tests/gpu/torch/export/test_export.py (2 hunks)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
  • GitHub Check: linux
  • GitHub Check: code-quality
  • GitHub Check: build-docs
🔇 Additional comments (12)
docs/source/guides/_compress_quantized_models.rst (1)

35-35: Typo/clarity fix looks good.

Reads well and aligns with the section intent.

modelopt/torch/export/model_config.py (2)

32-32: Expose INT8 weight-only enum — OK.

Constant addition is consistent with other quant identifiers.


205-205: Comment update matches behavior.

The concat path now rightly mentions Int8 WO alongside SQ/AWQ/NVFP4.

modelopt/torch/quantization/config.py (1)

181-188: Config definition for INT8 weight-only — OK.

Per-channel weights-only with axis=0 and inputs disabled; sensible default “max”.
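
As a quick reference, here is a minimal sketch of a config with these semantics in ModelOpt's config-dict style. It is illustrative only; the actual INT8_WEIGHT_ONLY_CFG definition in config.py is authoritative.

# Illustrative only: per-output-channel (axis=0) 8-bit weight quantizer,
# input quantizer disabled, "max" calibration -- matching the semantics above.
INT8_WEIGHT_ONLY_SKETCH = {
    "quant_cfg": {
        "*weight_quantizer": {"num_bits": 8, "axis": 0},
        "*input_quantizer": {"enable": False},
        "default": {"enable": False},
    },
    "algorithm": "max",
}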

tests/examples/llm_ptq/test_llm_ptq.py (1)

98-98: Test matrix expanded to int8_wo (HF) — good coverage bump.

tests/gpu/torch/export/test_export.py (2)

60-60: Import of INT8_WEIGHT_ONLY_CFG — OK.


327-328: Block size expectation for INT8 WO = 0 — sane.

examples/llm_ptq/hf_ptq.py (2)

65-65: QUANT_CFG_CHOICES: added int8_wo — OK.


95-106: HF auto-quant allowlist includes int8_wo — OK.

Keeps “int8” out of auto path, as intended.

modelopt/torch/export/quant_utils.py (3)

53-54: New enum import — OK.


458-462: INT8 detection split (SQ vs WO) — OK.

Uses input_quantizer.is_enabled to distinguish; matches config semantics.
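
Illustratively, the decision reduces to something like the following (a paraphrase for readers, not the literal quant_utils.py code):

def _int8_quantization_mode(weight_quantizer, input_quantizer):
    # 8-bit weights with an enabled input quantizer => SmoothQuant (W8A8);
    # 8-bit weights with the input quantizer absent/disabled => weight-only (W8A16).
    assert weight_quantizer.num_bits == 8
    if input_quantizer is not None and input_quantizer.is_enabled:
        return "int8_sq"
    return "int8_wo"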


633-635: Per-layer config mapping for int8_wo → W8A16 — OK.

Comment on lines +234 to 236
"int8",
"int8_wo",
"int4_awq",

💡 Verification agent

🧩 Analysis chain

Mismatch with huggingface_example.sh allowlists.

HF path here allows “int8” and “int8_wo”, but the script denies “int8”, causing CLI rejection before Python runs. Update the script allowlists accordingly (see suggested patch in the script file comment).


🏁 Script executed:

#!/bin/bash
set -e

# Locate the huggingface example script
script_path=$(find . -type f -name huggingface_example.sh | head -n1)
echo "Found script at: $script_path"

# Show lines around allowlist/quantization options
echo "=== Context around allowlisted quant types ==="
grep -nE 'allowlist|quant' -C3 "$script_path"

# Show any explicit int8/int8_wo mentions
echo "=== int8 / int8_wo occurrences ==="
grep -nE 'int8|int8_wo' -n "$script_path"

# Show validation logic (case statements) for quant types
echo "=== Case statements for quant type validation ==="
grep -nR -n 'case .* in' -C2 "$script_path"

Length of output: 3127


Add “int8” to quantization allowlist in huggingface_example.sh
In examples/llm_ptq/scripts/huggingface_example.sh, update both quant-format case statements (around lines 48 and 77) to include “int8” alongside “int8_wo” and adjust the corresponding error-message lists so that “int8” is accepted.

🤖 Prompt for AI Agents
In examples/llm_ptq/scripts/huggingface_example.sh around the quant-format case
statements near lines ~48 and ~77, the script currently accepts "int8_wo" (and
other formats) but does not include "int8" in the allowed patterns or the
corresponding error-message lists; update both case branches to include "int8"
alongside "int8_wo" (e.g., add "int8" as an accepted pattern in each case
pattern list) and update the error/usage messages so that they list "int8" as an
accepted format where applicable.

Comment on lines +48 to 51
fp16 | bf16 | fp8 | fp8_pc_pt | fp8_pb_wo | int8_wo | int4_awq | nvfp4 | nvfp4_awq | w4a8_awq | w4a8_nvfp4_fp8 | w4a8_mxfp4_fp8) ;;
*)
echo "Unsupported quant argument: Expected one of: [fp16, bf16, fp8, fp8_pc_pt, fp8_pb_wo, int4_awq, nvfp4, nvfp4_awq, w4a8_awq, w4a8_nvfp4_fp8, w4a8_mxfp4_fp8]" >&2
echo "Unsupported quant argument: Expected one of: [fp16, bf16, fp8, fp8_pc_pt, fp8_pb_wo, int8_wo, int4_awq, nvfp4, nvfp4_awq, w4a8_awq, w4a8_nvfp4_fp8, w4a8_mxfp4_fp8]" >&2
exit 1

⚠️ Potential issue

HF allowlist should include ‘int8’ to match hf_ptq.py.

Without this, “--export_fmt hf --qformat int8” is blocked by the script. Add “int8” here.

Apply:

-        fp16 | bf16 | fp8 | fp8_pc_pt | fp8_pb_wo | int8_wo | int4_awq | nvfp4 | nvfp4_awq | w4a8_awq | w4a8_nvfp4_fp8 | w4a8_mxfp4_fp8) ;;
+        fp16 | bf16 | fp8 | fp8_pc_pt | fp8_pb_wo | int8 | int8_wo | int4_awq | nvfp4 | nvfp4_awq | w4a8_awq | w4a8_nvfp4_fp8 | w4a8_mxfp4_fp8) ;;
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
fp16 | bf16 | fp8 | fp8_pc_pt | fp8_pb_wo | int8_wo | int4_awq | nvfp4 | nvfp4_awq | w4a8_awq | w4a8_nvfp4_fp8 | w4a8_mxfp4_fp8) ;;
*)
echo "Unsupported quant argument: Expected one of: [fp16, bf16, fp8, fp8_pc_pt, fp8_pb_wo, int4_awq, nvfp4, nvfp4_awq, w4a8_awq, w4a8_nvfp4_fp8, w4a8_mxfp4_fp8]" >&2
echo "Unsupported quant argument: Expected one of: [fp16, bf16, fp8, fp8_pc_pt, fp8_pb_wo, int8_wo, int4_awq, nvfp4, nvfp4_awq, w4a8_awq, w4a8_nvfp4_fp8, w4a8_mxfp4_fp8]" >&2
exit 1
fp16 | bf16 | fp8 | fp8_pc_pt | fp8_pb_wo | int8 | int8_wo | int4_awq | nvfp4 | nvfp4_awq | w4a8_awq | w4a8_nvfp4_fp8 | w4a8_mxfp4_fp8) ;;
*)
echo "Unsupported quant argument: Expected one of: [fp16, bf16, fp8, fp8_pc_pt, fp8_pb_wo, int8_wo, int4_awq, nvfp4, nvfp4_awq, w4a8_awq, w4a8_nvfp4_fp8, w4a8_mxfp4_fp8]" >&2
exit 1
🤖 Prompt for AI Agents
In examples/llm_ptq/scripts/huggingface_example.sh around lines 48 to 51, the
quantization allowlist is missing the "int8" option which prevents "--export_fmt
hf --qformat int8" from being accepted; add "int8" to the case pattern list (and
update the echo message to include "int8" among the expected values) so the
script accepts that quant argument and exits successfully for valid inputs.

Comment on lines +77 to 80
fp8 | fp8_pc_pt | fp8_pb_wo | int8_wo | int8_sq | int4_awq | w4a8_awq | fp16 | bf16 | nvfp4 | nvfp4_awq | w4a8_nvfp4_fp8 | w4a8_mxfp4_fp8) ;;
*)
echo "Unknown quant argument: Expected one of: [fp8, fp8_pc_pt, fp8_pb_wo, int8_sq, int4_awq, w4a8_awq, fp16, bf16, nvfp4, nvfp4_awq, w4a8_nvfp4_fp8, w4a8_mxfp4_fp8]" >&2
echo "Unknown quant argument: Expected one of: [fp8, fp8_pc_pt, fp8_pb_wo, int8_wo, int8_sq, int4_awq, w4a8_awq, fp16, bf16, nvfp4, nvfp4_awq, w4a8_nvfp4_fp8, w4a8_mxfp4_fp8]" >&2
exit 1

🛠️ Refactor suggestion

General quant allowlist also needs ‘int8’ or guard by export_fmt.

Currently this block runs for HF too and rejects “int8”. Minimal fix: add “int8”.

Apply:

-    fp8 | fp8_pc_pt | fp8_pb_wo | int8_wo | int8_sq | int4_awq | w4a8_awq | fp16 | bf16 | nvfp4 | nvfp4_awq | w4a8_nvfp4_fp8 | w4a8_mxfp4_fp8) ;;
+    fp8 | fp8_pc_pt | fp8_pb_wo | int8 | int8_wo | int8_sq | int4_awq | w4a8_awq | fp16 | bf16 | nvfp4 | nvfp4_awq | w4a8_nvfp4_fp8 | w4a8_mxfp4_fp8) ;;

Alternative (cleaner): wrap this whole validation in if [ "$EXPORT_FORMAT" != "hf" ]; then ... fi so HF path is validated only once.

🤖 Prompt for AI Agents
In examples/llm_ptq/scripts/huggingface_example.sh around lines 77-80, the quant
argument validation excludes "int8" and incorrectly rejects it for the HF path;
either add "int8" to the list of allowed quant values in the case pattern or,
preferably, wrap this entire validation block in a guard so it only runs when
EXPORT_FORMAT is not "hf" (i.e., surround the case...esac with if [
"$EXPORT_FORMAT" != "hf" ]; then ... fi), ensuring HF exports are not blocked
and that "int8" remains accepted where appropriate.

Comment on lines +760 to 762
if quantization in [QUANTIZATION_INT8_SQ, QUANTIZATION_INT8_WO]:
return (weight / weights_scaling_factor[:, None]).round().clamp(-128, 127).to(torch.int8)


⚠️ Potential issue

INT8 pack path doesn’t handle 3D (MoE) weights.

For stacked expert weights (E, out, in), broadcasting with [:, None] is wrong. Add 3D handling to avoid shape/broadcast errors and incorrect scaling.

Apply:

-    if quantization in [QUANTIZATION_INT8_SQ, QUANTIZATION_INT8_WO]:
-        return (weight / weights_scaling_factor[:, None]).round().clamp(-128, 127).to(torch.int8)
+    if quantization in [QUANTIZATION_INT8_SQ, QUANTIZATION_INT8_WO]:
+        if weight.dim() == 3:
+            # (experts, out, in) ÷ (experts, out, 1)
+            return (
+                (weight / weights_scaling_factor.unsqueeze(-1))
+                .round()
+                .clamp(-128, 127)
+                .to(torch.int8)
+            )
+        elif weight.dim() == 2:
+            return (weight / weights_scaling_factor[:, None]).round().clamp(-128, 127).to(torch.int8)
+        else:
+            raise NotImplementedError("INT8 quantization expects 2D or 3D weight tensors")
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
if quantization in [QUANTIZATION_INT8_SQ, QUANTIZATION_INT8_WO]:
return (weight / weights_scaling_factor[:, None]).round().clamp(-128, 127).to(torch.int8)
if quantization in [QUANTIZATION_INT8_SQ, QUANTIZATION_INT8_WO]:
if weight.dim() == 3:
# (experts, out, in) ÷ (experts, out, 1)
return (
(weight / weights_scaling_factor.unsqueeze(-1))
.round()
.clamp(-128, 127)
.to(torch.int8)
)
elif weight.dim() == 2:
return (weight / weights_scaling_factor[:, None]).round().clamp(-128, 127).to(torch.int8)
else:
raise NotImplementedError("INT8 quantization expects 2D or 3D weight tensors")

Comment on lines +812 to 814
if quantization in [QUANTIZATION_INT8_SQ, QUANTIZATION_INT8_WO]:
return weight.to(torch_dtype) * weights_scaling_factor[:, None].to(torch_dtype)


⚠️ Potential issue

INT8 unpack path also misses 3D (MoE) weights.

Mirror the 3D case to restore full-precision weights correctly.

Apply:

-    if quantization in [QUANTIZATION_INT8_SQ, QUANTIZATION_INT8_WO]:
-        return weight.to(torch_dtype) * weights_scaling_factor[:, None].to(torch_dtype)
+    if quantization in [QUANTIZATION_INT8_SQ, QUANTIZATION_INT8_WO]:
+        if weight.dim() == 3:
+            return weight.to(torch_dtype) * weights_scaling_factor.unsqueeze(-1).to(torch_dtype)
+        elif weight.dim() == 2:
+            return weight.to(torch_dtype) * weights_scaling_factor[:, None].to(torch_dtype)
+        else:
+            raise NotImplementedError("INT8 dequantization expects 2D or 3D weight tensors")
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
if quantization in [QUANTIZATION_INT8_SQ, QUANTIZATION_INT8_WO]:
return weight.to(torch_dtype) * weights_scaling_factor[:, None].to(torch_dtype)
if quantization in [QUANTIZATION_INT8_SQ, QUANTIZATION_INT8_WO]:
if weight.dim() == 3:
return weight.to(torch_dtype) * weights_scaling_factor.unsqueeze(-1).to(torch_dtype)
elif weight.dim() == 2:
return weight.to(torch_dtype) * weights_scaling_factor[:, None].to(torch_dtype)
else:
raise NotImplementedError("INT8 dequantization expects 2D or 3D weight tensors")
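
For intuition on why the broadcasting shape matters, here is a self-contained round-trip sketch of per-channel INT8 weight-only quantization that covers both 2D and stacked-expert 3D weights. The amax/127 scale derivation and the helper name are assumptions for illustration, not ModelOpt APIs.

import torch

def int8_wo_roundtrip(weight: torch.Tensor) -> torch.Tensor:
    # Per-output-channel scale over the input dimension; unsqueeze(-1)
    # broadcasts correctly for both (out, in) and (experts, out, in) layouts.
    scale = weight.abs().amax(dim=-1) / 127.0
    q = (weight / scale.unsqueeze(-1)).round().clamp(-128, 127).to(torch.int8)
    return q.to(weight.dtype) * scale.unsqueeze(-1).to(weight.dtype)

for w in (torch.randn(16, 32), torch.randn(4, 16, 32)):
    err = (int8_wo_roundtrip(w) - w).abs().max().item()
    print(tuple(w.shape), f"max abs round-trip error: {err:.4f}")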

@Yuening-wa Yuening-wa force-pushed the user/yueningl/support_int8_wo_quantization branch from 12d955e to 1b8036e on September 10, 2025 09:06

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🧹 Nitpick comments (1)
tests/examples/llm_ptq/test_llm_ptq.py (1)

98-98: Explicitly set KV-cache quantization to none for INT8 weight-only HF export

Aligns the test with the documented usage and avoids relying on defaults that might change.

-        PTQCommand(quant="int8_wo", export_fmt="hf"),
+        PTQCommand(quant="int8_wo", export_fmt="hf", kv_cache_quant="none"),
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 12d955e and 1b8036e.

📒 Files selected for processing (8)
  • docs/source/guides/_compress_quantized_models.rst (1 hunks)
  • examples/llm_ptq/hf_ptq.py (3 hunks)
  • examples/llm_ptq/scripts/huggingface_example.sh (2 hunks)
  • modelopt/torch/export/model_config.py (2 hunks)
  • modelopt/torch/export/quant_utils.py (5 hunks)
  • modelopt/torch/quantization/config.py (2 hunks)
  • tests/examples/llm_ptq/test_llm_ptq.py (1 hunks)
  • tests/gpu/torch/export/test_export.py (2 hunks)
✅ Files skipped from review due to trivial changes (1)
  • docs/source/guides/_compress_quantized_models.rst
🚧 Files skipped from review as they are similar to previous changes (6)
  • modelopt/torch/quantization/config.py
  • examples/llm_ptq/scripts/huggingface_example.sh
  • modelopt/torch/export/model_config.py
  • examples/llm_ptq/hf_ptq.py
  • tests/gpu/torch/export/test_export.py
  • modelopt/torch/export/quant_utils.py
🧰 Additional context used
🧬 Code graph analysis (1)
tests/examples/llm_ptq/test_llm_ptq.py (1)
tests/_test_utils/ptq_utils.py (1)
  • PTQCommand (28-79)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)
  • GitHub Check: linux
  • GitHub Check: wait-checks / wait
  • GitHub Check: code-quality
  • GitHub Check: build-docs

PTQCommand(quant="bf16"),
PTQCommand(quant="int8_sq"),
# ("int8_sq", "tensorrt_llm", "sparsegpt"),
PTQCommand(quant="int8_wo", export_fmt="hf"),

💡 Verification agent

🧩 Analysis chain

Verify end-to-end plumbing for "int8_wo" and HF export args

Sanity-check that the CLI and scripts accept "int8_wo", pass through export_fmt="hf", and that the KV-cache flag name matches what the scripts expect.


🏁 Script executed:

#!/bin/bash
set -euo pipefail

echo "1) Confirm repo recognizes INT8 weight-only:"
rg -nP -C2 '(int8_wo|INT8_WEIGHT_ONLY|WEIGHT[-_ ]ONLY.*INT8)'

echo
echo "2) Check script argument wiring for HF PTQ path:"
rg -n -C2 'huggingface_example.sh|hf_ptq\.py' scripts examples | sed -n '1,120p' || true
rg -n -C2 -- '(^|\s)--(quant|export_fmt)\b' scripts/huggingface_example.sh examples/**/hf_ptq.py || true

echo
echo "3) Confirm KV-cache flag spelling expected by code (quant vs qformat):"
rg -nP -C2 'kv_cache_(quant|qformat)' scripts/huggingface_example.sh examples/**/hf_ptq.py

echo
echo "Expected:"
echo "- At least one config/enum or decision branch for int8_wo"
echo "- huggingface_example.sh forwards --quant and --export_fmt to hf_ptq.py"
echo "- hf_ptq.py consumes kv_cache_quant (or adjust test if it uses kv_cache_qformat)"

Length of output: 19435


🏁 Script executed:

#!/bin/bash
set -euo pipefail

echo "1) Parser args in hf_ptq.py:"
grep -R "add_argument" -n examples/llm_ptq/hf_ptq.py | grep -E "export_fmt|qformat|quant"

echo
echo "2) hf_ptq.py export_fmt flag usage:"
grep -R -- "--export_fmt" -n examples/llm_ptq/hf_ptq.py

echo
echo "3) Script forwarding in examples/llm_ptq/scripts/huggingface_example.sh:"
grep -n "hf_ptq.py" examples/llm_ptq/scripts/huggingface_example.sh | sed -n '180,230p'
grep -n -- "--export_fmt" examples/llm_ptq/scripts/huggingface_example.sh
grep -n -- "--quant" examples/llm_ptq/scripts/huggingface_example.sh

Length of output: 28


🏁 Script executed:

#!/bin/bash
set -euo pipefail

echo "1) Parser args in hf_ptq.py:"
grep -R "add_argument" -n examples/llm_ptq/hf_ptq.py | grep -E "quant|qformat|export_fmt|kv_cache" || true

echo
echo "2) Script invocation in huggingface_example.sh:"
sed -n '180,260p' examples/llm_ptq/scripts/huggingface_example.sh || true

echo
echo "3) export_fmt usage in tests:"
grep -n "export_fmt" tests/examples/llm_ptq/test_llm_ptq.py || true

echo
echo "4) kv_cache usage in tests:"
grep -n "kv_cache" tests/examples/llm_ptq/test_llm_ptq.py || true

Length of output: 4192


Align CLI flags with tests and scripts

  • In examples/llm_ptq/hf_ptq.py (parser around line 707):
    • Add a --export_fmt argument to consume the export_fmt="hf" passed by tests and huggingface_example.sh.
    • Alias --quant to --qformat (or accept both) so PTQCommand(quant=…) maps to args.qformat.
    • Alias --kv_cache_quant to --kv_cache_qformat (or accept both) so tests’ kv_cache_quant matches args.kv_cache_qformat.
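
A hedged argparse sketch of the suggested aliasing (flag names are taken from the comment above; the actual parser in hf_ptq.py may already differ, and the defaults here are arbitrary):

import argparse

parser = argparse.ArgumentParser()
# Accept both spellings so PTQCommand(quant=..., kv_cache_quant=...) and the
# shell scripts map onto the same destinations.
parser.add_argument("--qformat", "--quant", dest="qformat", default="fp8")
parser.add_argument("--kv_cache_qformat", "--kv_cache_quant", dest="kv_cache_qformat", default="fp8")
parser.add_argument("--export_fmt", choices=["tensorrt_llm", "hf"], default="tensorrt_llm")

args = parser.parse_args(["--quant", "int8_wo", "--export_fmt", "hf"])
print(args.qformat, args.export_fmt)  # -> int8_wo hf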

Contributor

@Edwardf0t1 Edwardf0t1 left a comment


Thanks @Yuening-wa for adding int8-wo support. 👍

Do we know how the accuracy and perf look compared with int8-sq?
