[GPU] Add u2 weight quantization backend support #33243
base: master
Conversation
Part of openvinotoolkit#32716. Defines UINT2/INT2 enums, updates parameter key bitfields to support 2-bit types, and implements JIT constant generation for packed 2-bit integer types (emulated via int32).
Part of openvinotoolkit#32716.
- Adds `int2_utils.cl` for bit-wise unpacking of u2/i2 values.
- Updates `fully_connected_gpu_bfyx_ref` to support on-the-fly decompression.
- Implements `ReorderWeightsKernelInt2` for 2-bit weight formatting.
- Enables u2 graph pattern matching in `compressed_weights_pattern.hpp`.
Part of openvinotoolkit#32716. Adds 'shared_matmul_weights_decompression_u2' validating end-to-end MatMul inference with 2-bit unsigned integer weights on GPU.
Greetings @isanghao and @ljaljushkin, this PR focuses on backend enablement with reference kernels; performance optimization is intentionally deferred to keep the change reviewable and low-risk. I would appreciate your reviews if you don't have any constraints, and I'm happy to make any necessary changes. Thanks!
Hi @ruskaruma, thanks for the PR. Could you share the target model or end-user scenario for this feature? I'm trying to understand how it will benefit end users in the long term.
Hi @isanghao, thank you for taking the time to review this PR. That's a very valid question. At the moment, there aren't any production-ready 2-bit models in OpenVINO. NNCF doesn't support 2-bit quantization yet, and GPTQ and AWQ at 2 bits are still experimental. CPU already has u2 support via oneDNN, but without this change the GPU backend would fall back or implicitly expand weights. The primary purpose of this PR is backend parity and correctness.

My intent has been to establish the minimal infrastructure needed so future optimization or tooling work can be done incrementally, without coupling everything into a single large change. I intentionally kept this PR narrowly scoped and correctness-focused, since combining backend enablement, optimized kernels, and quantization tooling in one submission would be harder to review and riskier to merge. For now, the goal is simply to keep CPU and GPU behavior aligned.

As mentioned in the PR description, this work is part of a broader, staged approach. I would be very interested to hear your thoughts on whether this direction makes sense, and I appreciate the perspective behind the question.
Also, I noticed that some of the CI checks are currently failing. I've already identified the underlying issue and am working on resolving it.
Summary
This PR enables unsigned 2-bit (u2) quantized weight support in the Intel GPU plugin, aligning GPU behavior with the CPU plugin's existing u2 implementation. The change is intentionally limited to backend enablement and correctness using reference kernels; performance optimizations and tooling are deferred to follow-up work.
Background
Support for u2 compressed weights was added to the CPU plugin in September 2024 via oneDNN integration. Since the GPU plugin uses a different execution backend (kernel_selector + OpenCL), equivalent support requires a separate implementation.
Related Issues
Implements:
- 2-bit type support in the kernel selector type system (`Datatype`, `WeightsType`, `ParamsKey`)
- Weight reordering for linear layouts (`oiyx`, `ioyx`)

Intentionally deferred to future work:
- Signed 2-bit models (pending Core `element::i2` support)

All changes are gated under `#ifdef COMPRESSED_WEIGHTS_INT2` and do not affect existing paths.

The design follows the CPU implementation strategy: prioritize correctness with a reference kernel, restrict support to linear layouts to keep the change minimal, and include i2 infrastructure early to avoid future type-system churn once Core support lands.

Testing was done with 16 functional cases (`smoke_MatMulSharedCompressedWeightsU2`); IR serialization and binary size were verified to confirm true 2-bit storage, and no regressions were observed, with all changes guarded under `#ifdef COMPRESSED_WEIGHTS_INT2`.

Note: This PR mirrors the CPU enablement strategy: minimal, correct, and isolated. Performance work and broader coverage are intentionally split out to keep this change safe and reviewable.