
Add zero point support to dp4a 2-bit dequantization in the WebGPU MatMulNBits #27325

Open
HectorSVC wants to merge 4 commits into microsoft:main from HectorSVC:hecli_webgpu_2bit_improvement

Conversation

Contributor

HectorSVC commented Feb 12, 2026

Add zero point support to dp4a 2-bit dequantization in the WebGPU MatMulNBits kernel. Previously, the dp4a path for 2-bit quantization used a hardcoded 256-entry LUT assuming zero_point=2, and was blocked from running when custom zero points were provided.

  1. dp4a_matmul_common.wgsl.template — Core LUT & dequantization function
    Added a 1024-entry LUT (4 sections × 256 entries) when has_zero_points is true. Each section corresponds to one zero point value (0–3) and pre-computes pack4xI8(value - zero_point) for every possible byte input.
    Added a new DequantizedFrom2BitsTo8Bits(in: u32, zero: i32) overload that indexes the LUT as zero * 256 + byte_value.
    The original 256-entry LUT and the parameterless function are preserved for the !has_zero_points path. (A sketch of this LUT scheme follows the list below.)

  2. dp4a_matmul.wgsl.template — Large-M tiled kernel (workgroup=256)
    loadSHMB for n_bits==2: reads the zero point via mm_read_zero() and passes it to DequantizedFrom2BitsTo8Bits(b_value, zero) when has_zero_points.
    LoadDequantizationTable: expanded to 4 calls (local_idx + 0/256/512/768) to load all 1024 entries when has_zero_points.

  3. dp4a_matmul_small_m.wgsl.template — Small-M kernel (workgroup=128)
    LoadDequantizationTable: expanded to 8 calls to load 1024 entries when has_zero_points.
    DequantizedFrom2BitsTo8Bits calls pass zero when has_zero_points.
    Bug fix: corrected an off-by-one LUT load offset (local_idx+127 → local_idx+128) in the non-zero-point path.

  4. matmul_nbits.cc — Kernel dispatch logic
    Removed the guard !(has_zero_points && nbits == 2) that previously blocked the dp4a path for 2-bit with custom zero points.
    Updated comment to document the new 1024-entry LUT support.
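
To make the LUT scheme above concrete, here is a minimal WGSL sketch of the zero-point-aware path. It is illustrative rather than the template code itself: BuildLutEntry is a hypothetical helper name, the low-bits-first layout of the four 2-bit weights within each byte is an assumption, and the real templates emit either the 256-entry or the 1024-entry variant conditionally on has_zero_points (WGSL has no user-defined overloading, so only one DequantizedFrom2BitsTo8Bits signature exists per generated shader).

```wgsl
// 4 sections x 256 entries: section z holds pack4xI8(w - z) for every byte value.
var<workgroup> dequant_lut : array<u32, 1024>;

// Hypothetical helper: expand one byte (four 2-bit weights, assumed low bits
// first) minus a zero point into four signed 8-bit lanes for dot4I8Packed.
fn BuildLutEntry(byte_value : u32, zero : i32) -> u32 {
  let w0 = i32( byte_value        & 3u) - zero;
  let w1 = i32((byte_value >> 2u) & 3u) - zero;
  let w2 = i32((byte_value >> 4u) & 3u) - zero;
  let w3 = i32((byte_value >> 6u) & 3u) - zero;
  return pack4xI8(vec4<i32>(w0, w1, w2, w3));
}

// Zero-point-aware lookup: index the LUT as zero * 256 + byte value.
fn DequantizedFrom2BitsTo8Bits(in_val : u32, zero : i32) -> u32 {
  return dequant_lut[u32(zero) * 256u + (in_val & 0xFFu)];
}

// Large-M kernel (256 threads): each invocation fills 4 of the 1024 entries,
// at local_idx + 0/256/512/768; an entry's section index is its zero point.
fn LoadDequantizationTable(local_idx : u32) {
  for (var s = 0u; s < 4u; s = s + 1u) {
    let idx = local_idx + s * 256u;
    dequant_lut[idx] = BuildLutEntry(idx & 0xFFu, i32(idx >> 8u));
  }
}
```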

@HectorSVC
Contributor Author

Reflect the review comments in PR #27285.

guschmue requested a review from Copilot on February 13, 2026 16:42
@guschmue
Contributor

/azp run Linux QNN CI Pipeline,Win_TRT_Minimal_CUDA_Test_CI,Windows ARM64 QNN CI Pipeline,Windows GPU Doc Gen CI Pipeline

@azure-pipelines

Azure Pipelines successfully started running 4 pipeline(s).

Contributor

Copilot AI left a comment


Pull request overview

This PR enables the WebGPU DP4A implementation of MatMulNBits for 2-bit quantized weights when custom zero_points are provided, by extending the dequantization LUT logic in the DP4A WGSL shaders and removing the previous dispatch guard that blocked this path.

Changes:

  • Extend the Q2 DP4A dequantization LUT from 256 entries to 1024 entries when has_zero_points is true, and route dequantization through the zero-point-aware lookup.
  • Update both DP4A kernels (large-M and small-M variants) to load the expanded LUT and pass zero points into Q2 dequantization (a small-M load sketch follows this list).
  • Remove the C++ dispatch guard that prevented selecting the DP4A path for (nbits=2, has_zero_points=true), and update the in-code comment accordingly.
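
For the small-M kernel's 128-thread workgroup, the same 1024 entries can be covered with eight strided passes per invocation. A hedged companion sketch, reusing the hypothetical BuildLutEntry helper from the block above (the function name here is illustrative; the template itself reuses LoadDequantizationTable):

```wgsl
// Small-M variant: 128 invocations x 8 entries each = 1024 LUT slots,
// written at local_idx + 0/128/256/.../896.
fn LoadDequantizationTableSmallM(local_idx : u32) {
  for (var s = 0u; s < 8u; s = s + 1u) {
    let idx = local_idx + s * 128u;
    dequant_lut[idx] = BuildLutEntry(idx & 0xFFu, i32(idx >> 8u));
  }
}
```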

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 1 comment.

File: onnxruntime/contrib_ops/webgpu/quantization/matmul_nbits.cc
  Allows DP4A dispatch for 2-bit with zero points by removing the previous guard and updating the comment.
File: onnxruntime/contrib_ops/webgpu/quantization/dp4a_matmul_common.wgsl.template
  Adds a 1024-entry LUT (4 × 256) and a zero-point-aware Q2 dequantization function.
File: onnxruntime/contrib_ops/webgpu/quantization/dp4a_matmul.wgsl.template
  Large-M DP4A kernel loads all 1024 LUT entries when needed and passes the per-block zero point into Q2 dequantization.
File: onnxruntime/contrib_ops/webgpu/quantization/dp4a_matmul_small_m.wgsl.template
  Small-M DP4A kernel loads all 1024 LUT entries when needed, uses zero-point-aware Q2 dequantization, and fixes a LUT-load offset bug in the non-zero-point path.


Contributor

Copilot AI left a comment


Pull request overview

Copilot reviewed 5 out of 5 changed files in this pull request and generated 2 comments.



HectorSVC and others added 2 commits February 13, 2026 09:53
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Contributor

Copilot AI left a comment


Pull request overview

Copilot reviewed 7 out of 7 changed files in this pull request and generated no new comments.

