Improving quantized matmul performance by devectorizing shader. #15274

trivedivivek · 2025-10-20T15:46:03Z

Summary:
This diff improves the performance of quantized matrix multiplication by devectorizing the shader.

An example modification is shown below:

```glsl
// Before
VEC4_T sums[TILE_ROWS][TILE_TXCOLS];

// After
T sums[TILE_ROWS * TILE_TXCOLS * 4];

// Before
sums[r][${c}] = VEC4_T(0.0);

// After
for (int j = 0; j < 4; j++) {
    sums[r * TILE_TXCOLS * 4 + ${c} * 4 + j] = T(0.0);
}
```
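
The index arithmetic in the "After" lines can be checked outside the shader. The following is a minimal C sketch, not code from this PR: the tile sizes (`TILE_ROWS = 4`, `TILE_TXCOLS = 2`) and the helper names `flat_index` and `flat_layout_is_dense` are assumptions made for illustration. It shows that component `j` of the vec4 at `[r][c]` maps to a unique slot of the flat scalar array, so the two layouts hold the same elements in the same order.

```c
#include <assert.h>

/* Hypothetical tile sizes, chosen only for this illustration. */
#define TILE_ROWS 4
#define TILE_TXCOLS 2

/* Flat offset of component j of the vec4 at (r, c), using the same
 * expression as the "After" loop: r * TILE_TXCOLS * 4 + c * 4 + j. */
int flat_index(int r, int c, int j) {
    assert(r >= 0 && r < TILE_ROWS);
    assert(c >= 0 && c < TILE_TXCOLS);
    assert(j >= 0 && j < 4);
    return r * TILE_TXCOLS * 4 + c * 4 + j;
}

/* Returns 1 if visiting (r, c, j) in row-major order yields the flat
 * offsets 0, 1, 2, ... with no gaps or repeats, and 0 otherwise. */
int flat_layout_is_dense(void) {
    int expected = 0;
    for (int r = 0; r < TILE_ROWS; r++) {
        for (int c = 0; c < TILE_TXCOLS; c++) {
            for (int j = 0; j < 4; j++) {
                if (flat_index(r, c, j) != expected) {
                    return 0;
                }
                expected++;
            }
        }
    }
    /* Every slot of the flat array must have been visited exactly once. */
    return expected == TILE_ROWS * TILE_TXCOLS * 4;
}
```

With these assumed sizes, `flat_index(1, 0, 0)` is 8, since each row holds `TILE_TXCOLS * 4 = 8` scalars, and `flat_layout_is_dense()` returns 1.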

Differential Revision: D85023829

pytorch-bot · 2025-10-20T15:46:06Z

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/15274

📄 Preview Python docs built from this PR

Note: Links to docs will display an error until the docs builds have been completed.

❌ 8 New Failures, 2 Unrelated Failures

As of commit 29e995a with merge base 82611e9:

NEW FAILURES - The following jobs have failed:

pull / test-samsung-models-linux / linux-job (gh)
RuntimeError: Command docker exec -t 70b0ef4fc57990baa0e1f75f25a1b662861b677484453ccfdafd31f572646c74 /exec failed with exit code 1
pull / unittest-arm-backend-with-no-fvp (test_pytest_ops) / linux-job (gh)
RuntimeError: Command docker exec -t 363dcd28b6fc4eb081abe2e9f77309c518534a712813023c7d5f2f7d0ae68b4d /exec failed with exit code 1
pull / unittest-nxp-neutron / linux-job (gh)
RuntimeError: Command docker exec -t 0b9c787e006069ccb6214b652ce17c8c7924e8f59a129e6e8e418f4142129dcb /exec failed with exit code 1
Test CUDA Builds / export-gemma3-cuda-non-quantized / linux-job (gh)
RuntimeError: Command docker exec -t c9385a637d6a79120528b67ba541805f3f007ed6b49197d24f63f5ccc9cb845c /exec failed with exit code 2
Test CUDA Builds / export-voxtral-cuda-non-quantized / linux-job (gh)
RuntimeError: Command docker exec -t 0042bd78fff029268237868ce98643e46479171a31ee52cfcedb04e1b5700706 /exec failed with exit code 2
Test CUDA Builds / export-voxtral-cuda-quantized-int4-tile-packed / linux-job (gh)
RuntimeError: Command docker exec -t 4112dcd580ddb3804936bfe6fa3721b8cd704ac17c65febac3def5799d3944b1 /exec failed with exit code 2
Test CUDA Builds / export-voxtral-cuda-quantized-int4-weight-only / linux-job (gh)
RuntimeError: Command docker exec -t b943262baf1826bc6345887ecdf9dccbd79dd31d67a243604c92e34db7aab68f /exec failed with exit code 2
Test Metal Backend / export-voxtral-metal-artifact / macos-job (gh)
RuntimeError: Command bash /Users/ec2-user/runner/_work/_temp/exec_script failed with exit code 2

BROKEN TRUNK - The following jobs failed but were present on the merge base:

👉 Rebase onto the `viable/strict` branch to avoid these failures

This comment was automatically generated by Dr. CI and updates every 15 minutes.

meta-codesync · 2025-10-20T15:46:10Z

@trivedivivek has exported this pull request. If you are a Meta employee, you can view the originating Diff in D85023829.

…rch#15274)

Summary:
This diff improves the performance of quantized matrix multiplication by devectorizing the shader.

An example modification is shown below:

```glsl
// Before
VEC4_T sums[TILE_ROWS][TILE_TXCOLS];

// After
T sums[TILE_ROWS * TILE_TXCOLS * 4];

// Before
sums[r][${c}] = VEC4_T(0.0);

// After
for (int j = 0; j < 4; j++) {
    sums[r * TILE_TXCOLS * 4 + ${c} * 4 + j] = T(0.0);
}
```

Reviewed By: SS-JIA

Differential Revision: D85023829

trivedivivek requested a review from SS-JIA as a code owner October 20, 2025 15:46

meta-cla bot added the CLA Signed label (managed by the Facebook bot; authors need to sign the CLA before a PR can be reviewed) Oct 20, 2025

meta-codesync bot added the fb-exported and meta-exported labels Oct 20, 2025

trivedivivek added the release notes: vulkan label (changes to the Vulkan backend delegate) Oct 20, 2025

trivedivivek force-pushed the export-D85023829 branch 2 times, most recently from 4913210 to b831b18, on October 20, 2025 22:09

trivedivivek force-pushed the export-D85023829 branch from b831b18 to 88f7aa0 on October 21, 2025 14:01

trivedivivek force-pushed the export-D85023829 branch from 88f7aa0 to 1a04777 on October 21, 2025 14:40

SS-JIA approved these changes Oct 21, 2025

trivedivivek force-pushed the export-D85023829 branch from 1a04777 to 81e68f6 on October 21, 2025 16:31

trivedivivek force-pushed the export-D85023829 branch from 81e68f6 to cf198e1 on October 21, 2025 20:31

trivedivivek force-pushed the export-D85023829 branch from cf198e1 to 17fc715 on October 22, 2025 05:34

trivedivivek force-pushed the export-D85023829 branch from 17fc715 to 29e995a on October 22, 2025 05:41

meta-codesync bot merged commit 788ef2f into pytorch:main Oct 22, 2025
138 of 149 checks passed
