Skip to content

Conversation

@ahmtox
Copy link
Contributor

@ahmtox ahmtox commented Jun 26, 2025

Stack from ghstack (oldest at bottom):

Context

This test framework establishes the foundation for validating the linear_qta8a_qga4w_qta8o operator implementation as part of enabling dynamic quantization. The motivation stems from advancing beyond weight-only quantization to full activation and weight quantized linear operations, enabling true integer arithmetic throughout the matrix multiplication process for improved performance on GPU hardware.

The current weight-only quantized linear implementations in ET-VK dequantize weights to floating point before computation, missing the performance benefits of integer arithmetic.

This operator nomenclature breakdown:

  • qta8a: Quantized per-token affine 8-bit activation inputs
  • qga4w: Quantized per-group affine 4-bit weights
  • qta8o: Quantized per-token affine 8-bit outputs

Changes

The reference implementation (linear_qta8a_qga4w_qta8o_4bit_dequant_impl) provides a baseline for validating the GPU shader implementation through a deliberately simplified computation path. The quantized int8 input tensor is dequantized using the standard affine transformation (quantized_input.to(at::kFloat) - input_zero_point) * input_scale. After dequantization, the implementation performs standard floating point linear operation at::linear(x_float, weights_dequantized), then manually quantizes the result using at::round(linear_result / output_scale) + output_zero_point with clamping to the int8 range [-128,127]. This two-stage approach of dequantize → compute → quantize provides a clear reference against which the GPU's integer arithmetic implementation can be validated.

Differential Revision: D77173442

# Context

This test framework establishes the foundation for validating the `linear_qta8a_qga4w_qta8o` operator implementation as part of enabling dynamic quantization. The motivation stems from advancing beyond weight-only quantization to full activation and weight quantized linear operations, enabling true integer arithmetic throughout the matrix multiplication process for improved performance on GPU hardware.

The current weight-only quantized linear implementations in ET-VK dequantize weights to floating point before computation, missing the performance benefits of integer arithmetic.

This operator nomenclature breakdown:
- **qta8a**: Quantized per-token affine 8-bit activation inputs
- **qga4w**: Quantized per-group affine 4-bit weights
- **qta8o**: Quantized per-token affine 8-bit outputs

# Changes

The reference implementation (`linear_qta8a_qga4w_qta8o_4bit_dequant_impl`) provides a baseline for validating the GPU shader implementation through a deliberately simplified computation path. The quantized int8 input tensor is dequantized using the standard affine transformation `(quantized_input.to(at::kFloat) - input_zero_point) * input_scale`. After dequantization, the implementation performs standard floating point linear operation `at::linear(x_float, weights_dequantized)`, then manually quantizes the result using `at::round(linear_result / output_scale) + output_zero_point` with clamping to the int8 range [-128,127]. This two-stage approach of dequantize → compute → quantize provides a clear reference against which the GPU's integer arithmetic implementation can be validated.

Differential Revision: [D77173442](https://our.internmc.facebook.com/intern/diff/D77173442/)

[ghstack-poisoned]
@ahmtox ahmtox requested a review from SS-JIA as a code owner June 26, 2025 16:56
@pytorch-bot
Copy link

pytorch-bot bot commented Jun 26, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/12005

Note: Links to docs will display an error until the docs builds have been completed.

❌ 1 New Failure, 2 Unrelated Failures

As of commit 838865f with merge base dd06b3b (image):

NEW FAILURE - The following job has failed:

BROKEN TRUNK - The following jobs failed but were present on the merge base:

👉 Rebase onto the `viable/strict` branch to avoid these failures

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@facebook-github-bot facebook-github-bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Jun 26, 2025
@facebook-github-bot
Copy link
Contributor

This pull request was exported from Phabricator. Differential Revision: D77173442

@ahmtox ahmtox added the release notes: vulkan Changes to the Vulkan backend delegate label Jun 26, 2025
# Context

This test framework establishes the foundation for validating the `linear_qta8a_qga4w_qta8o` operator implementation as part of enabling dynamic quantization. The motivation stems from advancing beyond weight-only quantization to full activation and weight quantized linear operations, enabling true integer arithmetic throughout the matrix multiplication process for improved performance on GPU hardware.

The current weight-only quantized linear implementations in ET-VK dequantize weights to floating point before computation, missing the performance benefits of integer arithmetic.

This operator nomenclature breakdown:
- **qta8a**: Quantized per-token affine 8-bit activation inputs
- **qga4w**: Quantized per-group affine 4-bit weights
- **qta8o**: Quantized per-token affine 8-bit outputs

# Changes

The reference implementation (`linear_qta8a_qga4w_qta8o_4bit_dequant_impl`) provides a baseline for validating the GPU shader implementation through a deliberately simplified computation path. The quantized int8 input tensor is dequantized using the standard affine transformation `(quantized_input.to(at::kFloat) - input_zero_point) * input_scale`. After dequantization, the implementation performs standard floating point linear operation `at::linear(x_float, weights_dequantized)`, then manually quantizes the result using `at::round(linear_result / output_scale) + output_zero_point` with clamping to the int8 range [-128,127]. This two-stage approach of dequantize → compute → quantize provides a clear reference against which the GPU's integer arithmetic implementation can be validated.

Differential Revision: [D77173442](https://our.internmc.facebook.com/intern/diff/D77173442/)

[ghstack-poisoned]
@facebook-github-bot
Copy link
Contributor

This pull request was exported from Phabricator. Differential Revision: D77173442

# Context

This test framework establishes the foundation for validating the `linear_qta8a_qga4w_qta8o` operator implementation as part of enabling dynamic quantization. The motivation stems from advancing beyond weight-only quantization to full activation and weight quantized linear operations, enabling true integer arithmetic throughout the matrix multiplication process for improved performance on GPU hardware.

The current weight-only quantized linear implementations in ET-VK dequantize weights to floating point before computation, missing the performance benefits of integer arithmetic.

This operator nomenclature breakdown:
- **qta8a**: Quantized per-token affine 8-bit activation inputs
- **qga4w**: Quantized per-group affine 4-bit weights
- **qta8o**: Quantized per-token affine 8-bit outputs

# Changes

The reference implementation (`linear_qta8a_qga4w_qta8o_4bit_dequant_impl`) provides a baseline for validating the GPU shader implementation through a deliberately simplified computation path. The quantized int8 input tensor is dequantized using the standard affine transformation `(quantized_input.to(at::kFloat) - input_zero_point) * input_scale`. After dequantization, the implementation performs standard floating point linear operation `at::linear(x_float, weights_dequantized)`, then manually quantizes the result using `at::round(linear_result / output_scale) + output_zero_point` with clamping to the int8 range [-128,127]. This two-stage approach of dequantize → compute → quantize provides a clear reference against which the GPU's integer arithmetic implementation can be validated.

Differential Revision: [D77173442](https://our.internmc.facebook.com/intern/diff/D77173442/)

[ghstack-poisoned]
@facebook-github-bot
Copy link
Contributor

This pull request was exported from Phabricator. Differential Revision: D77173442

# Context

This test framework establishes the foundation for validating the `linear_qta8a_qga4w_qta8o` operator implementation as part of enabling dynamic quantization. The motivation stems from advancing beyond weight-only quantization to full activation and weight quantized linear operations, enabling true integer arithmetic throughout the matrix multiplication process for improved performance on GPU hardware.

The current weight-only quantized linear implementations in ET-VK dequantize weights to floating point before computation, missing the performance benefits of integer arithmetic.

This operator nomenclature breakdown:
- **qta8a**: Quantized per-token affine 8-bit activation inputs
- **qga4w**: Quantized per-group affine 4-bit weights
- **qta8o**: Quantized per-token affine 8-bit outputs

# Changes

The reference implementation (`linear_qta8a_qga4w_qta8o_4bit_dequant_impl`) provides a baseline for validating the GPU shader implementation through a deliberately simplified computation path. The quantized int8 input tensor is dequantized using the standard affine transformation `(quantized_input.to(at::kFloat) - input_zero_point) * input_scale`. After dequantization, the implementation performs standard floating point linear operation `at::linear(x_float, weights_dequantized)`, then manually quantizes the result using `at::round(linear_result / output_scale) + output_zero_point` with clamping to the int8 range [-128,127]. This two-stage approach of dequantize → compute → quantize provides a clear reference against which the GPU's integer arithmetic implementation can be validated.

Differential Revision: [D77173442](https://our.internmc.facebook.com/intern/diff/D77173442/)

[ghstack-poisoned]
@facebook-github-bot
Copy link
Contributor

This pull request was exported from Phabricator. Differential Revision: D77173442

# Context

This test framework establishes the foundation for validating the `linear_qta8a_qga4w_qta8o` operator implementation as part of enabling dynamic quantization. The motivation stems from advancing beyond weight-only quantization to full activation and weight quantized linear operations, enabling true integer arithmetic throughout the matrix multiplication process for improved performance on GPU hardware.

The current weight-only quantized linear implementations in ET-VK dequantize weights to floating point before computation, missing the performance benefits of integer arithmetic.

This operator nomenclature breakdown:
- **qta8a**: Quantized per-token affine 8-bit activation inputs
- **qga4w**: Quantized per-group affine 4-bit weights
- **qta8o**: Quantized per-token affine 8-bit outputs

# Changes

The reference implementation (`linear_qta8a_qga4w_qta8o_4bit_dequant_impl`) provides a baseline for validating the GPU shader implementation through a deliberately simplified computation path. The quantized int8 input tensor is dequantized using the standard affine transformation `(quantized_input.to(at::kFloat) - input_zero_point) * input_scale`. After dequantization, the implementation performs standard floating point linear operation `at::linear(x_float, weights_dequantized)`, then manually quantizes the result using `at::round(linear_result / output_scale) + output_zero_point` with clamping to the int8 range [-128,127]. This two-stage approach of dequantize → compute → quantize provides a clear reference against which the GPU's integer arithmetic implementation can be validated.

Differential Revision: [D77173442](https://our.internmc.facebook.com/intern/diff/D77173442/)

[ghstack-poisoned]
facebook-github-bot pushed a commit that referenced this pull request Jul 9, 2025
Summary:

# Context

This test framework establishes the foundation for validating the `linear_qta8a_qga4w` operator implementation as part of enabling dynamic quantization. The motivation stems from advancing beyond weight-only quantization to full activation and weight quantized linear operations, enabling true integer arithmetic throughout the matrix multiplication process for improved performance on GPU hardware.

The current weight-only quantized linear implementations in ET-VK dequantize weights to floating point before computation, missing the performance benefits of integer arithmetic. 

This operator nomenclature breakdown:
- **qta8a**: Quantized per-token affine 8-bit activation inputs
- **qga4w**: Quantized per-group affine 4-bit weights

# Changes

The reference implementation (`linear_qta8a_qga4w_4bit_dequant_impl`) provides a baseline for validating the GPU shader implementation through a deliberately simplified computation path. The quantized int8 input tensor is dequantized using the standard affine transformation `(quantized_input.to(at::kFloat) - input_zero_point) * input_scale`. After dequantization, the implementation performs standard floating point linear operation `at::linear(x_float, weights_dequantized)`. This two-stage approach of dequantize → compute provides a clear reference against which the GPU's integer arithmetic implementation can be validated.
ghstack-source-id: 295152415
exported-using-ghexport

Reviewed By: SS-JIA

Differential Revision: D77173442
# Context

This test framework establishes the foundation for validating the `linear_qta8a_qga4w_qta8o` operator implementation as part of enabling dynamic quantization. The motivation stems from advancing beyond weight-only quantization to full activation and weight quantized linear operations, enabling true integer arithmetic throughout the matrix multiplication process for improved performance on GPU hardware.

The current weight-only quantized linear implementations in ET-VK dequantize weights to floating point before computation, missing the performance benefits of integer arithmetic.

This operator nomenclature breakdown:
- **qta8a**: Quantized per-token affine 8-bit activation inputs
- **qga4w**: Quantized per-group affine 4-bit weights
- **qta8o**: Quantized per-token affine 8-bit outputs

# Changes

The reference implementation (`linear_qta8a_qga4w_qta8o_4bit_dequant_impl`) provides a baseline for validating the GPU shader implementation through a deliberately simplified computation path. The quantized int8 input tensor is dequantized using the standard affine transformation `(quantized_input.to(at::kFloat) - input_zero_point) * input_scale`. After dequantization, the implementation performs standard floating point linear operation `at::linear(x_float, weights_dequantized)`, then manually quantizes the result using `at::round(linear_result / output_scale) + output_zero_point` with clamping to the int8 range [-128,127]. This two-stage approach of dequantize → compute → quantize provides a clear reference against which the GPU's integer arithmetic implementation can be validated.

Differential Revision: [D77173442](https://our.internmc.facebook.com/intern/diff/D77173442/)

[ghstack-poisoned]
@facebook-github-bot
Copy link
Contributor

This pull request was exported from Phabricator. Differential Revision: D77173442

@facebook-github-bot facebook-github-bot merged commit dfaceba into gh/ahmtox/24/base Jul 10, 2025
95 of 99 checks passed
@facebook-github-bot facebook-github-bot deleted the gh/ahmtox/24/head branch July 10, 2025 22:45
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. fb-exported release notes: vulkan Changes to the Vulkan backend delegate

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants