Improving quantized matmul performance by devectorizing shader. #15274

trivedivivek · 2025-10-20T15:46:03Z

Summary:
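This diff improves the performance of quantized matrix multiplication by devectorizing the shader: accumulators previously stored as VEC4_T vectors become flat arrays of scalar T that are indexed component by component.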

An example modification is shown below:

```glsl
// Before
VEC4_T sums[TILE_ROWS][TILE_TXCOLS];

// After
T sums[TILE_ROWS * TILE_TXCOLS * 4];

// Before
sums[r][${c}] = VEC4_T(0.0);

// After
for (int j = 0; j < 4; j++) {
    sums[r * TILE_TXCOLS * 4 + ${c} * 4 + j] = T(0.0);
}
```
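
The flat index in the "After" snippet, `r * TILE_TXCOLS * 4 + c * 4 + j`, can be sanity-checked on the host. What follows is a minimal C sketch and not code from this diff: the tile sizes (`TILE_ROWS = 4`, `TILE_TXCOLS = 2`) and the helper names `flat_index` and `mapping_is_bijective` are assumptions made purely for illustration. It checks that every (row, texel column, component) triple lands on its own slot of the flat array, which is the property the devectorized layout relies on.

```c
#include <assert.h>

/* Hypothetical tile sizes, chosen only for illustration; the real
 * values come from the shader's template parameters. */
#define TILE_ROWS 4
#define TILE_TXCOLS 2
#define NUM_SUMS (TILE_ROWS * TILE_TXCOLS * 4)

/* Flat offset of component j (0..3) of the 4-wide vector that used to
 * live at sums[r][c] in the vectorized layout. */
static int flat_index(int r, int c, int j) {
    return r * TILE_TXCOLS * 4 + c * 4 + j;
}

/* Returns 1 if every (r, c, j) maps to a distinct, in-range slot of the
 * flat array, and 0 otherwise. */
static int mapping_is_bijective(void) {
    int seen[NUM_SUMS] = {0};
    for (int r = 0; r < TILE_ROWS; r++) {
        for (int c = 0; c < TILE_TXCOLS; c++) {
            for (int j = 0; j < 4; j++) {
                int idx = flat_index(r, c, j);
                if (idx < 0 || idx >= NUM_SUMS || seen[idx]) {
                    return 0;
                }
                seen[idx] = 1;
            }
        }
    }
    return 1;
}
```

With these assumed sizes, the four components of one former vector occupy four consecutive slots, and consecutive rows are `TILE_TXCOLS * 4` slots apart.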

Differential Revision: D85023829

pytorch-bot · 2025-10-20T15:46:06Z

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/15274

📄 Preview Python docs built from this PR

Note: Links to docs will display an error until the docs builds have been completed.

❗ 1 Active SEV

There is 1 currently active SEV. If your PR is affected, please view it below:

AWS was down, GHA infrastructure affected / recovering

❌ 8 New Failures, 2 Unrelated Failures

As of commit b831b18 with merge base 0a1dfb2:

NEW FAILURES - The following jobs have failed:

pull / android / run-emulator (gh)
The process '/usr/bin/sh' failed with exit code 255
pull / test-samsung-models-linux / linux-job (gh)
RuntimeError: Command docker exec -t e0fef530deba5e97665fc7560aa2baf95b6d3828c3c5be2cc4fbcc554112b2f8 /exec failed with exit code 1
pull / unittest-arm-backend-with-no-fvp (test_pytest_models) / linux-job (gh)
RuntimeError: Command docker exec -t 0e2ce92b2f3c3a34e23c9ab9f08494679a6406623912142ff7b95ccb7471223b /exec failed with exit code 1
pull / unittest-nxp-neutron / linux-job (gh)
RuntimeError: Command docker exec -t a60486e2b38613961fcc5b8f0830a800995c06ea1fddea5a33a7ddf5968c9ca3 /exec failed with exit code 1
Test CUDA Builds / export-voxtral-cuda-non-quantized / linux-job (gh)
RuntimeError: Command docker exec -t f49de6da0b0be081ee1730670e545fc6d4b2701971069ba84259a493c34bb876 /exec failed with exit code 2
Test CUDA Builds / export-voxtral-cuda-quantized-int4-tile-packed / linux-job (gh)
RuntimeError: Command docker exec -t 71e42c2235aad24e10721467e3ab760960cb8e965b1db02b8ae5a0979cdeccc1 /exec failed with exit code 2
Test CUDA Builds / export-voxtral-cuda-quantized-int4-weight-only / linux-job (gh)
RuntimeError: Command docker exec -t 506a0f48cbb18fe759d3c6b3daa111055cbe4bd6af2b6e15721d87b3cf0c8dab /exec failed with exit code 2
Test Metal Backend / export-voxtral-metal-artifact / macos-job (gh)
RuntimeError: Command bash /Users/ec2-user/runner/_work/_temp/exec_script failed with exit code 2

FLAKY - The following job failed but was likely due to flakiness present on trunk:

pull / unittest / linux / linux-job (gh) (matched linux rule in flaky-rules.json)
The runner has received a shutdown signal. This can happen when the runner service is stopped, or a manually started runner is canceled.

BROKEN TRUNK - The following job failed but was present on the merge base:

👉 Rebase onto the `viable/strict` branch to avoid these failures

pull / unittest-editable / linux / linux-job (gh) (trunk failure)

This comment was automatically generated by Dr. CI and updates every 15 minutes.

meta-codesync · 2025-10-20T15:46:10Z

@trivedivivek has exported this pull request. If you are a Meta employee, you can view the originating Diff in D85023829.

trivedivivek requested a review from SS-JIA as a code owner October 20, 2025 15:46

meta-cla bot added the CLA Signed label Oct 20, 2025

meta-codesync bot added fb-exported meta-exported labels Oct 20, 2025

trivedivivek added the release notes: vulkan label Oct 20, 2025

trivedivivek force-pushed the export-D85023829 branch from 97badee to 4913210 October 20, 2025 18:40

trivedivivek added 2 commits October 20, 2025 15:08

Minor perf improvements to quantized mat mul shader. (pytorch#15261)

92f1606

Summary: The diff includes minor performance improvements to the quantized matrix multiplication shader. Differential Revision: D84998542

trivedivivek force-pushed the export-D85023829 branch from 4913210 to b831b18 October 20, 2025 22:09
