trivedivivek
Contributor

Summary:
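This diff improves the performance of quantized matrix multiplication by devectorizing the shader: accumulators previously declared as arrays of `VEC4_T` vectors become flat arrays of scalar `T` values, and each vector operation is unrolled into a loop over its four components.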

An example modification is shown below:

// Before
VEC4_T sums[TILE_ROWS][TILE_TXCOLS];

// After
T sums[TILE_ROWS * TILE_TXCOLS * 4];

// Before
sums[r][${c}] = VEC4_T(0.0);

// After
for (int j = 0; j < 4; j++) {
    sums[r * TILE_TXCOLS * 4 + ${c} * 4 + j] = T(0.0);
}
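
The index arithmetic in the "After" snippet can be checked outside the shader. The following is a minimal Python sketch, not part of this diff, which assumes the flat array is meant to hold the same values in the same row-major order as the original `VEC4_T` array. The tile sizes and the helper name `flat_index` are hypothetical and chosen only for illustration.

```python
# Hypothetical tile sizes; the real values come from the shader template.
TILE_ROWS = 4
TILE_TXCOLS = 2


def flat_index(r, c, j, tile_txcols=TILE_TXCOLS):
    """Scalar offset of component j of the vec4 formerly at sums[r][c]."""
    return r * tile_txcols * 4 + c * 4 + j


# "Before" layout: TILE_ROWS x TILE_TXCOLS four-component vectors.
vec_sums = [[[0.0] * 4 for _ in range(TILE_TXCOLS)] for _ in range(TILE_ROWS)]
# "After" layout: one flat list of scalars.
flat_sums = [0.0] * (TILE_ROWS * TILE_TXCOLS * 4)

# Write a distinct value to every component through both layouts.
for r in range(TILE_ROWS):
    for c in range(TILE_TXCOLS):
        for j in range(4):
            value = float(100 * r + 10 * c + j)
            vec_sums[r][c][j] = value
            flat_sums[flat_index(r, c, j)] = value

# Flattening the nested layout in row-major order reproduces the flat layout.
flattened = [x for row in vec_sums for vec in row for x in vec]
```

Under these assumptions `flattened` and `flat_sums` are identical, so every `(r, c, j)` triple maps to a unique slot and no two components alias.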

Differential Revision: D85023829

@trivedivivek trivedivivek requested a review from SS-JIA as a code owner October 20, 2025 15:46

pytorch-bot bot commented Oct 20, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/15274

Note: Links to docs will display an error until the docs builds have been completed.

❌ 8 New Failures, 2 Unrelated Failures

As of commit 29e995a with merge base 82611e9 (image):

NEW FAILURES - The following jobs have failed:

BROKEN TRUNK - The following jobs failed but were present on the merge base:

👉 Rebase onto the `viable/strict` branch to avoid these failures

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@meta-cla meta-cla bot added the CLA Signed label Oct 20, 2025

meta-codesync bot commented Oct 20, 2025

@trivedivivek has exported this pull request. If you are a Meta employee, you can view the originating Diff in D85023829.

@trivedivivek trivedivivek added the release notes: vulkan label Oct 20, 2025
trivedivivek added a commit to trivedivivek/executorch that referenced this pull request Oct 20, 2025
…rch#15274)
@trivedivivek trivedivivek force-pushed the export-D85023829 branch 2 times, most recently from 4913210 to b831b18 Compare October 20, 2025 22:09
trivedivivek added a commit to trivedivivek/executorch that referenced this pull request Oct 20, 2025
…rch#15274)
trivedivivek added a commit to trivedivivek/executorch that referenced this pull request Oct 21, 2025
…rch#15274)
trivedivivek added a commit to trivedivivek/executorch that referenced this pull request Oct 21, 2025
…rch#15274)
trivedivivek added a commit to trivedivivek/executorch that referenced this pull request Oct 21, 2025
…rch#15274)
Reviewed By: SS-JIA
trivedivivek added a commit to trivedivivek/executorch that referenced this pull request Oct 21, 2025
…rch#15274)
trivedivivek added a commit to trivedivivek/executorch that referenced this pull request Oct 21, 2025
…rch#15274)
trivedivivek added a commit to trivedivivek/executorch that referenced this pull request Oct 21, 2025
…rch#15274)
trivedivivek added a commit to trivedivivek/executorch that referenced this pull request Oct 21, 2025
…rch#15274)
trivedivivek added a commit to trivedivivek/executorch that referenced this pull request Oct 22, 2025
…rch#15274)
@meta-codesync meta-codesync bot merged commit 788ef2f into pytorch:main Oct 22, 2025
138 of 149 checks passed

Labels

CLA Signed, fb-exported, meta-exported, release notes: vulkan

2 participants