
Conversation

@zhanghaoc

What does this PR do?

Add KV cache quantization. Currently supports int8/fp8 with the MinMax calibration method.

Overview:

  • Add a new file, kv_cache.py. It includes a function to save calibration data, a function to read that data back and compute the scales, and logic to add the new attributes and inputs to the ONNX model (a minimal scale-computation sketch follows this list).
  • Changes in the other files only pass the new parameters through.
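
For context, here is a minimal sketch of how a per-tensor MinMax scale for an 8-bit KV cache can be derived from calibrated min/max values. The function name and the quantization ranges shown are illustrative assumptions, not the PR's actual code.

def compute_kv_scale(calib_min: float, calib_max: float, kv_cache_type: str) -> float:
    """Return a per-tensor scale from calibrated min/max values (MinMax calibration)."""
    amax = max(abs(calib_min), abs(calib_max))
    if kv_cache_type == "int8":
        qmax = 127.0  # symmetric int8 range
    elif kv_cache_type == "fp8":
        qmax = 448.0  # max representable magnitude of FP8 E4M3
    else:
        raise ValueError(f"Unsupported kv_cache_type {kv_cache_type}")
    return amax / qmax

For example, compute_kv_scale(-3.2, 2.7, "int8") returns roughly 0.0252.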

Usage

python -m modelopt.onnx.quantization --onnx_path="C:\repos\models\Llama-3.2-3B-Instruct-ONNX\cuda\cuda-fp16\model.onnx" --quantize_mode=int4 --calibration_method=rtn_dq --kv_quant_mode=PER_TENSOR --output_path="C:\repos\models\Llama-3.2-3B-Instruct-ONNX\cuda\cuda-fp16\model.int4.rtn_dq.kv_cache.onnx" --log_level=DEBUG

Testing

Testing is not done yet; still waiting for feedback.

Before your PR is "Ready for review"

  • Make sure you read and follow Contributor guidelines and your commits are signed.
  • Is this change backward compatible?: Yes
  • Did you write any new necessary tests?: No
  • Did you add or update any necessary documentation?: Yes
  • Did you update Changelog?: No

Additional Information

@zhanghaoc zhanghaoc requested a review from a team as a code owner October 31, 2025 00:16
@zhanghaoc zhanghaoc requested a review from gcunhase October 31, 2025 00:16
@copy-pr-bot

copy-pr-bot bot commented Oct 31, 2025

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@codecov

codecov bot commented Oct 31, 2025

Codecov Report

❌ Patch coverage is 26.27737% with 101 lines in your changes missing coverage. Please review.
✅ Project coverage is 73.02%. Comparing base (9e64f81) to head (a877d02).
⚠️ Report is 38 commits behind head on main.

Files with missing lines                      Patch %   Missing lines
modelopt/onnx/quantization/kv_cache.py        15.17%    95 ⚠️
modelopt/onnx/quantization/int4.py            88.88%    2 ⚠️
modelopt/onnx/quantization/ort_patching.py    33.33%    2 ⚠️
modelopt/onnx/quantization/quantize.py        50.00%    2 ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #486      +/-   ##
==========================================
- Coverage   73.38%   73.02%   -0.36%     
==========================================
  Files         180      181       +1     
  Lines       17934    18260     +326     
==========================================
+ Hits        13160    13334     +174     
- Misses       4774     4926     +152     

Signed-off-by: zhanghaoc <[email protected]>
@vishalpandya1990 vishalpandya1990 removed their assignment Oct 31, 2025
# Serialize the collected KV calibration tensors to disk (pickle, not JSON)
with open(calib_data_path, "wb") as f:
    pickle.dump(kv_tensor_data, f)
intermediate_generated_files.append(calib_data_path)
Contributor


What is the memory impact (or other issues) if we keep the KV-cache-related data in a variable instead of writing it to a disk file?

f"Unsupported kv_cache_type {kv_cache_type} for kv cache quantization"
)

kv_tensor_names_list.sort()
Contributor

@vishalpandya1990 vishalpandya1990 Nov 5, 2025


Should we add an assert/exception/suitable safe early return here if the input model is not GenAI-based, i.e. it doesn't have the expected IO bindings/names (e.g. if this list is empty, or if no GQA nodes are seen, etc.)?

I think we currently support 8-bit KV cache only with GenAI-Builder-exported ONNX LLMs, right?
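
One possible shape of such a guard, as a sketch only (the function name, error message, and placement are assumptions, not the PR's code):

import onnx

def validate_kv_model(onnx_model: onnx.ModelProto) -> list[str]:
    """Fail early if the model has no 'present.*' KV cache outputs."""
    kv_tensor_names = [o.name for o in onnx_model.graph.output if "present" in o.name]
    if not kv_tensor_names:
        raise ValueError(
            "KV cache quantization requested, but no 'present.*' graph outputs were found; "
            "8-bit KV cache currently expects a GenAI-Builder-exported ONNX LLM."
        )
    return sorted(kv_tensor_names)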

if calibration_method in ["rtn", "rtn_dq", "rtn_trt", "rtn_trt_dq"]:
    # Save kv-cache calibration data if kv_quant_mode is not NONE
    if kv_quant_mode != "NONE":
        save_kv_cache_calib_data_rtn(
Contributor


For INT4 AWQ/RTN + 8-bit KV cache quantization, can we avoid two session runs by preparing the KV tensor names before creating the augmented model, augmenting the model for these KV tensors too, and post-processing the saved KV-cache calibration data after the AWQ/RTN loop?

Just checking if we can avoid two session runs and thereby speed up the combined quantization of MatMul and KV cache.

Author


Yes, that's possible, but this change wouldn't apply to the int8/fp8 path, and awq_lite, awq_clip, and rtn would each need separate implementations, which means not much code could be reused. If you feel it's worth it, I can definitely implement it this way.

for output in onnx_model.graph.output:
    if "present" in output.name:
        kv_tensor_names_list.append(output.name)
        if kv_cache_type == "fp8":
Contributor

@vishalpandya1990 vishalpandya1990 Nov 5, 2025


I think we can simplify this a bit by creating a map, with an assert/ValueError for unsupported types. Something like:

output.type.tensor_type.elem_type = output_type_map[kv_cache_type]

where output_type_map = {"int8": ..., "fp8": ...}.

Possibly we can also create a util for validating the dtype and input model, i.e. whether they are currently supported or not.
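
A sketch of that mapping, assuming the intended enums are onnx.TensorProto.INT8 and onnx.TensorProto.FLOAT8E4M3FN (not confirmed against the PR):

import onnx

KV_OUTPUT_TYPE_MAP = {
    "int8": onnx.TensorProto.INT8,
    "fp8": onnx.TensorProto.FLOAT8E4M3FN,
}

def set_kv_output_type(output: onnx.ValueInfoProto, kv_cache_type: str) -> None:
    """Set the element type of a 'present.*' output, rejecting unsupported types."""
    if kv_cache_type not in KV_OUTPUT_TYPE_MAP:
        raise ValueError(f"Unsupported kv_cache_type {kv_cache_type} for KV cache quantization")
    output.type.tensor_type.elem_type = KV_OUTPUT_TYPE_MAP[kv_cache_type]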

# With ActivationSymmetric as True, MinMax calibration is equivalent to max calibration
else CalibrationMethod.MinMax
),
intermediate_generated_files=intermediate_generated_files,
Contributor


I didn't get how the KV cache quantization metadata is used with int8/fp8 quantization. Can you please elaborate on the flow?

Author


Both int8 and fp8 quantization call quantize_static from ort_patching. The change happens in ort_patching.py: if kv_quant_mode is not NONE, it saves additional calibration data to disk.
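
A rough sketch of the read-back half of that flow (the data layout and names here are assumptions): kv_cache.py would load the pickled stats and turn them into per-tensor scales, e.g.:

import pickle

def load_kv_scales(calib_data_path: str, kv_cache_type: str) -> dict[str, float]:
    """Turn the dumped per-tensor min/max stats into per-tensor scales."""
    with open(calib_data_path, "rb") as f:
        kv_tensor_data = pickle.load(f)  # assumed layout: {tensor_name: (min, max)}
    qmax = 127.0 if kv_cache_type == "int8" else 448.0  # int8 vs FP8 E4M3 max
    return {
        name: max(abs(t_min), abs(t_max)) / qmax
        for name, (t_min, t_max) in kv_tensor_data.items()
    }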

node.input.append("")
node.input.append(k_scale.name)
node.input.append(v_scale.name)

Contributor


I think if the KV quant mode is per-channel then things won't go well, since we don't support it but also don't check for or flag it. Is that right?

Author


Added the check.
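
A minimal version of that check, as a sketch only (the accepted values are assumed from the --kv_quant_mode usage above):

def check_kv_quant_mode(kv_quant_mode: str) -> None:
    """Reject KV quant modes other than NONE and PER_TENSOR."""
    if kv_quant_mode not in ("NONE", "PER_TENSOR"):
        raise ValueError(
            f"kv_quant_mode={kv_quant_mode} is not supported; "
            "only PER_TENSOR KV cache quantization is currently implemented."
        )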
