
Conversation

@anzr299 (Collaborator) commented Jan 12, 2026

Changes

The approach here is quite straightforward: _quantize_weights works as usual for 2D weights. The difference is in _calculate_hessian, where the Hessian is 3D in both the 2D and 3D weight cases. By default the Hessian has the shape (1, hidden_dim, hidden_dim), whereas before it was just (hidden_dim, hidden_dim). For 3D weights it is (num_experts/batch, hidden_dim, hidden_dim).

This 3D, or "batched", Hessian is then looped over: for each batch the corresponding 2D weight is extracted and passed to the old _quantize_weights function as usual, which returns the scale and zero point. These per-batch scales and zero points are stacked together in a collector variable. In the 2D case the result is flattened; in the 3D case the stacked scale and zero point are returned.
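A minimal NumPy sketch of this flow, for illustration only: quantize_2d_weights below is a hypothetical stand-in for NNCF's _quantize_weights (its dummy scale/zero-point math is not the real GPTQ logic), and only the batching, looping, and stacking mirror the description above.

```python
import numpy as np


def quantize_2d_weights(weight_2d, hessian_2d):
    # Hypothetical stand-in: the real _quantize_weights runs GPTQ column by
    # column using the Hessian; here we just return a dummy per-row scale/zp.
    scale = np.abs(weight_2d).max(axis=-1, keepdims=True) / 7.0  # INT4 SYM range
    zero_point = np.zeros_like(scale)
    return scale, zero_point


def quantize_weights(weight, hessian):
    # The Hessian always arrives batched: (1, H, H) for 2D weights,
    # (num_experts, H, H) for 3D weights.
    batched_weight = weight if weight.ndim == 3 else weight[None, ...]
    scales, zero_points = [], []
    for batch_idx in range(hessian.shape[0]):
        s, zp = quantize_2d_weights(batched_weight[batch_idx], hessian[batch_idx])
        scales.append(s)
        zero_points.append(zp)
    scale, zero_point = np.stack(scales), np.stack(zero_points)
    if weight.ndim == 2:
        # 2D case: squeeze out the artificial batch dimension again.
        scale, zero_point = scale[0], zero_point[0]
    return scale, zero_point
```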

NOTE: Scale Estimation + GPTQ support is not added for 3D weights yet

Reason for changes

Support 3D weights in GPTQ for models such as MoE.

Related tickets

175789 & 175212

Tests

Model: Qwen/Qwen3-30B-A3B
NNCF Backend: OpenVINO
Higher is better.
Task: gsm8k
Limit: 100
Max New Tokens: 10000
OpenVINO version: 2026.0.0.dev20251111 (with WA for 176465)
n-shots: 5 (default)

| Precision Type | Filter | Value |
|---|---|---|
| INT4 SYM GS128 (with GPTQ) Calibrated on GSM8k with 128 samples | flexible-extract | 0.79 |
| | strict-match | 0.64 |
| INT4 SYM GS128 (with GPTQ after bug fix commit dc355fe) Calibrated on GSM8k with 128 samples | flexible-extract | 0.78 |
| | strict-match | 0.57 |
| INT4 SYM GS128 | flexible-extract | 0.55 |
| | strict-match | 0.29 |
| FP32 | flexible-extract | 0.92 |
| | strict-match | 0.82 |

Comparison of accuracy for meta-llama/Llama-3.2-1B-Instruct between develop and this branch:

| Variant | bits_per_byte | byte_perplexity | word_perplexity |
|---|---|---|---|
| This Branch (GPTQ) | 0.7965 | 1.7368 | 19.1466 |
| develop (GPTQ) | 0.7965 | 1.7368 | 19.1466 |

@anzr299 anzr299 requested a review from a team as a code owner January 12, 2026 08:35
@anzr299 anzr299 marked this pull request as draft January 12, 2026 08:35
@github-actions github-actions bot added the NNCF OpenVINO Pull requests that updates NNCF OpenVINO label Jan 12, 2026
@anzr299 anzr299 marked this pull request as ready for review January 19, 2026 11:52
import math
from typing import Optional, TypeVar

import numpy as np
Collaborator:
@AlexanderDokuchaev Is it possible to use np here?

Collaborator Author:
If not, I can use the same approach as:

reduce(mul, shape[:act_ch_axis] + shape[act_ch_axis % len(shape) + 1 :], 1) for shape in stats.shape_values

Collaborator Author:
Done

wc_params.node_with_weight, wc_params.weight_port_id, model, graph
)
weight_tensor = fns.astype(weight_tensor, TensorDataType.float32)
if len(hessian.shape) == 3 and hessian.shape[0] == 1:
Collaborator:
In which model does this happen?

Collaborator Author:
For now this is just a safety check, since in the 3D case we also pass a 2D Hessian to this function. I added it when the older test called this function manually. Would it be better to remove it?

@ljaljushkin ljaljushkin requested a review from Copilot January 20, 2026 18:40
Copilot AI (Contributor) left a comment:
Pull request overview

This PR adds support for 3D weights in the GPTQ algorithm to enable quantization of models with 3D weight tensors, such as Mixture-of-Experts (MoE) models. The implementation extends the existing GPTQ algorithm to handle batched Hessian matrices.

Changes:

  • Extended _calculate_hessian to support 3D weight tensors by creating batched Hessian matrices (see the sketch after this list)
  • Refactored _quantize_weights to accept tensors directly instead of fetching them, enabling batch processing
  • Added loop-based quantization in the apply method to process each batch of 3D weights separately
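A hedged sketch of what the batched Hessian accumulation could look like (illustrative only, not the actual NNCF _calculate_hessian; the function name calculate_batched_hessian and the assumed activation layout are hypothetical):

```python
import numpy as np


def calculate_batched_hessian(activation_batches, num_batches, hidden_dim):
    # One GPTQ-style Hessian per batch/expert; for 2D weights num_batches == 1,
    # which gives the default (1, hidden_dim, hidden_dim) shape.
    hessian = np.zeros((num_batches, hidden_dim, hidden_dim), dtype=np.float32)
    n_samples = 0
    for x in activation_batches:
        # x is assumed to be shaped (num_batches, n_tokens, hidden_dim).
        n_tokens = x.shape[1]
        # Running-average update, mirroring the standard GPTQ step
        # H <- H * n / (n + m) + (2 / (n + m)) * X X^T.
        hessian *= n_samples / (n_samples + n_tokens)
        n_samples += n_tokens
        x = np.sqrt(2.0 / n_samples) * x.astype(np.float32)
        # Per-batch outer product: (B, hidden, tokens) @ (B, tokens, hidden).
        hessian += np.matmul(x.transpose(0, 2, 1), x)
    return hessian
```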

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 4 comments.

| File | Description |
|---|---|
| src/nncf/quantization/algorithms/weight_compression/gptq.py | Implements core GPTQ algorithm changes to support 3D weights through batched Hessian calculation and iterative quantization |
| tests/openvino/native/quantization/test_gptq.py | Adds parameterized test coverage for both 2D and 3D weight cases with reference implementation validation |


@ljaljushkin (Contributor) left a comment:

Nice addition!

anzr299 and others added 3 commits January 21, 2026 12:18
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
@MaximProshin MaximProshin merged commit 61ea196 into openvinotoolkit:develop Jan 21, 2026
18 checks passed