Enable 16-bit activations and 8-bit weights in Cadence Quantizer for Matmul #15929
Conversation
🔗 Helpful Links: 🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/15929
Note: Links to docs will display an error until the docs builds have been completed.
❌ 16 New Failures, 1 Pending, 2 Unrelated Failures as of commit 562e94b with merge base a78f023.
NEW FAILURES: the following jobs have failed.
BROKEN TRUNK: the following jobs failed but were present on the merge base. 👉 Rebase onto the `viable/strict` branch to avoid these failures.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
Force-pushed from aa791a5 to 39dc222.
Pull Request Overview
This PR adds support for 16-bit activations with 8-bit weights (A16W8) in the Cadence Quantizer for matrix multiplication operations. Previously, 16-bit support required both activations and weights to be 16-bit. This change enables mixed precision quantization by dispatching int16 output tensors to a generic implementation that handles mixed-precision matmul operations.
Key Changes
- Added new quantizer classes for 16-bit activation support in convolution and matmul operations
- Modified `quantized_matmul_out` to dispatch the int16 output type to a generic implementation supporting mixed precision (see the sketch after this list)
- Added comprehensive tests for both non-transposed and transposed matmul with int16 activations and int8 weights
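To make that dispatch concrete, here is a minimal C++ sketch of the routing described above. It is illustrative only: the `Tensor`/`ScalarType` stand-ins and the `matmul_generic` and `quantized_matmul_optimized` names are assumptions, not the actual ExecuTorch/Cadence signatures.

```cpp
// Minimal sketch (assumed names, not the real ExecuTorch/Cadence API):
// int16 outputs are routed to a generic mixed-precision kernel, while all
// other dtypes keep the existing single-dtype optimized path.
#include <cstdint>

enum class ScalarType { Byte, Char, Short }; // uint8 / int8 / int16

struct Tensor {
  ScalarType dtype;
  // sizes, data pointers, and quantization params elided
};

// Hypothetical generic kernel templated separately on activation, weight,
// and output element types, so A16W8 (int16 activations, int8 weights)
// gets its own instantiation.
template <typename AT, typename WT, typename OT>
void matmul_generic(const Tensor& x, const Tensor& y, Tensor& out) {
  // mixed-precision matmul body elided
}

// Existing optimized path that assumes one dtype for every operand.
void quantized_matmul_optimized(const Tensor& x, const Tensor& y, Tensor& out) {
  // single-dtype kernel body elided
}

void quantized_matmul_out(const Tensor& x, const Tensor& y, Tensor& out) {
  if (out.dtype == ScalarType::Short) {
    // A16W8: int16 activations, int8 weights, int16 output.
    matmul_generic<int16_t, int8_t, int16_t>(x, y, out);
  } else {
    quantized_matmul_optimized(x, y, out);
  }
}
```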
Reviewed Changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| `backends/cadence/hifi/operators/tests/test_op_quantized_matmul_out.cpp` | New test file with two tests verifying int16 activations with int8 weights for both standard and transposed matmul operations |
| `backends/cadence/hifi/operators/targets.bzl` | Moved `quantized_matmul_out` from the OPERATORS list to a separate definition with additional dependencies for int16 support |
| `backends/cadence/hifi/operators/operators.h` | Added function signature declaration for `quantized_matmul_out` supporting mixed precision |
| `backends/cadence/hifi/operators/op_quantized_matmul_out.cpp` | Added dispatch logic to route int16 output tensors to the generic implementation supporting A16W8 mixed precision |
| `backends/cadence/aot/quantizer/quantizer.py` | Added two new quantizer classes, `CadenceWith16BitConvActivationsQuantizer` and `CadenceWith16BitMatmulActivationsQuantizer`, for A16W8 quantization |
The review comment below refers to this hunk in `backends/cadence/hifi/operators/targets.bzl`:

```python
define_operator(op)

# quantized_matmul_out needs additional dependency for int16 support
define_operator("quantized_matmul_out", deps=["fbcode//on_device_ai/Assistant/Jarvis/min_runtime/operators:quantize_matmul_out", "fbcode//on_device_ai/Assistant/Jarvis/min_runtime/operators:headers",])
```
Copilot AI commented on Nov 20, 2025:
The deps list has a trailing comma after the last element, before the closing bracket, which is unusual formatting. While this may not cause a syntax error in Starlark, it's inconsistent with standard formatting practices. Consider reformatting as:

```python
define_operator("quantized_matmul_out", deps=["fbcode//on_device_ai/Assistant/Jarvis/min_runtime/operators:quantize_matmul_out", "fbcode//on_device_ai/Assistant/Jarvis/min_runtime/operators:headers"])
```
Force-pushed from 39dc222 to 562e94b.
Summary:
Context
We continue from D84284794 to add support for 16-bit activations. Note that although the operators already support 16-bit activations, they do so only when the weights are also 16-bit. To support 16-bit activations with 8-bit weights, we need to change the way we template some functions.
Current Behavior
Right now, we're composing two macros together: the `ET_FORALL_JARVIS_QUANTIZED_TYPES_WITH_INT16` macro:
https://www.internalfb.com/code/fbsource/[9e8c6d8466107f58aa3de1b9e4ec71c49d670a8f]/fbcode/on_device_ai/Assistant/Jarvis/min_runtime/operators/generic/operators.h?lines=22-25
and the per-operator function macro (`quantized_linear` chosen as an example):
https://www.internalfb.com/code/fbsource/[9e8c6d8466107f58aa3de1b9e4ec71c49d670a8f]/fbcode/on_device_ai/Assistant/Jarvis/min_runtime/operators/generic/quantized_linear_out.cpp?lines=30-41
Together, they expand into a switch statement that calls the `quantized_linear` function with the correct template parameter. However, note that this assumes the input activations and the weights share the same dtype, which is not the case here. A sketch of this pattern is shown below.
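For illustration, here is a simplified C++ reconstruction of that macro pattern. The names (`FORALL_QUANTIZED_TYPES_WITH_INT16`, `quantized_linear_per_tensor`, the `Tensor`/`ScalarType` stand-ins) are assumptions rather than the actual internal definitions; the point is that the composed switch selects a single ctype and uses it for both activations and weights.

```cpp
// Simplified reconstruction of the macro composition (assumed names, not the
// real internal headers). The FORALL macro lists the supported
// (ctype, dtype tag) pairs, and the per-op macro turns each pair into a
// switch case. The composed switch picks ONE ctype and uses it for both
// activations and weights, which is the single-dtype assumption this diff
// removes.
#include <cstdint>

enum class ScalarType { Byte, Char, Short }; // uint8 / int8 / int16

struct Tensor {
  ScalarType dtype;
};

// Kernel templated on a single element type: the same ctype is used for
// every operand.
template <typename T>
void quantized_linear_per_tensor(
    const Tensor& in, const Tensor& weight, Tensor& out) {
  // kernel body elided
}

// Type-list macro, analogous to ET_FORALL_JARVIS_QUANTIZED_TYPES_WITH_INT16.
#define FORALL_QUANTIZED_TYPES_WITH_INT16(_) \
  _(uint8_t, Byte)                           \
  _(int8_t, Char)                            \
  _(int16_t, Short)

// Per-op macro: one switch case per (ctype, dtype) pair.
#define typed_quantized_linear(ctype, dtype)             \
  case ScalarType::dtype: {                              \
    quantized_linear_per_tensor<ctype>(in, weight, out); \
    break;                                               \
  }

void quantized_linear_out(const Tensor& in, const Tensor& weight, Tensor& out) {
  switch (out.dtype) {
    FORALL_QUANTIZED_TYPES_WITH_INT16(typed_quantized_linear)
    default:
      break; // unsupported dtype
  }
}
```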
This Diff
We fix this by checking for our data types explicitly and calling the functions with the correct data types, as in D86538176. A sketch of the resulting dispatch is shown below.
Differential Revision: D86644079
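As a rough illustration of the fix (with assumed, hypothetical names rather than the actual kernel signatures), the dispatcher inspects the activation and weight dtypes independently and instantiates the kernel with separate template parameters:

```cpp
// Sketch of the fixed dispatch (assumed names): instead of deriving one ctype
// from a single dtype switch, check the activation and weight dtypes
// independently and template the kernel on both.
#include <cstdint>

enum class ScalarType { Char, Short }; // int8 / int16

struct Tensor {
  ScalarType dtype;
};

// Hypothetical mixed-precision kernel: AT = activation type, WT = weight
// type, OT = output type.
template <typename AT, typename WT, typename OT>
void quantized_matmul_generic(const Tensor& x, const Tensor& y, Tensor& out) {
  // kernel body elided
}

void quantized_matmul_out(const Tensor& x, const Tensor& y, Tensor& out) {
  if (x.dtype == ScalarType::Short && y.dtype == ScalarType::Char) {
    // A16W8: 16-bit activations, 8-bit weights.
    quantized_matmul_generic<int16_t, int8_t, int16_t>(x, y, out);
  } else if (x.dtype == ScalarType::Char && y.dtype == ScalarType::Char) {
    // Existing all-int8 path.
    quantized_matmul_generic<int8_t, int8_t, int8_t>(x, y, out);
  } else {
    // unsupported dtype combination
  }
}
```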