[ET-VK] New implementation of `cat` operator #11508

SS-JIA · 2025-06-09T22:59:13Z

Stack from ghstack (oldest at bottom):

Changes

Introduce concat_texture.glsl and concat_buffer.glsl to implement the torch.cat operator
Introduce Concat.cpp to replace Cat.cpp
Fix a bug with channels-packed buffer tensors where input data would be copied incorrectly with multiple dims have a stride of 1

Motivation

Introduce concat_texture.glsl and concat_buffer.glsl to implement the torch.cat operator

Introduce Concat.cpp to replace Cat.cpp

The existing implementation of torch.cat uses the copy_channel_offset` shaders. However, these shaders have a critical bug where the output tensor is passed in separately with difference access types, i.e.

graph.execute_nodes().emplace_back(new DispatchNode(
        graph,
        VK_KERNEL_FROM_STR(kernel_name),
        global_size,
        local_size,
        // Inputs and Outputs
        {
            {out, vkapi::kWrite},
            {out, vkapi::kRead},
            {in, vkapi::kRead},
        },

This creates many validation layer errors because the memory barriers for the resource cannot be formed properly. The shader essentially relies on undefined behaviour to work correctly. The result is that the cat operator produces incorrect result on many platforms.

Rather than fix the copy_offset shaders, I decided to just introduce new shaders to perform the concat operation. The new implementation handles both buffer and texture inputs and is agnostic to memory layout.

Differential Revision: D76305343

## Changes * Introduce `concat_texture.glsl` and `concat_buffer.glsl` to implement the `torch.cat` operator * Introduce `Concat.cpp` to replace `Cat.cpp` * Fix a bug with channels-packed buffer tensors where input data would be copied incorrectly with multiple dims have a stride of 1 ## Motivation > * Introduce `concat_texture.glsl` and `concat_buffer.glsl` to implement the `torch.cat` operator > * Introduce `Concat.cpp` to replace `Cat.cpp` The existing implementation of `torch.cat` uses the copy_channel_offset` shaders. However, these shaders have a critical bug where the output tensor is passed in separately with difference access types, i.e. ``` graph.execute_nodes().emplace_back(new DispatchNode( graph, VK_KERNEL_FROM_STR(kernel_name), global_size, local_size, // Inputs and Outputs { {out, vkapi::kWrite}, {out, vkapi::kRead}, {in, vkapi::kRead}, }, ``` This creates many validation layer errors because the memory barriers for the resource cannot be formed properly. The shader essentially relies on undefined behaviour to work correctly. The result is that the `cat` operator produces incorrect result on many platforms. Rather than fix the `copy_offset` shaders, I decided to just introduce new shaders to perform the concat operation. The new implementation handles both buffer and texture inputs and is agnostic to memory layout. Differential Revision: [D76305343](https://our.internmc.facebook.com/intern/diff/D76305343/) [ghstack-poisoned]

## Changes * Introduce `concat_texture.glsl` and `concat_buffer.glsl` to implement the `torch.cat` operator * Introduce `Concat.cpp` to replace `Cat.cpp` * Fix a bug with channels-packed buffer tensors where input data would be copied incorrectly with multiple dims have a stride of 1 ## Motivation > * Introduce `concat_texture.glsl` and `concat_buffer.glsl` to implement the `torch.cat` operator > * Introduce `Concat.cpp` to replace `Cat.cpp` The existing implementation of `torch.cat` uses the copy_channel_offset` shaders. However, these shaders have a critical bug where the output tensor is passed in separately with difference access types, i.e. ``` graph.execute_nodes().emplace_back(new DispatchNode( graph, VK_KERNEL_FROM_STR(kernel_name), global_size, local_size, // Inputs and Outputs { {out, vkapi::kWrite}, {out, vkapi::kRead}, {in, vkapi::kRead}, }, ``` This creates many validation layer errors because the memory barriers for the resource cannot be formed properly. The shader essentially relies on undefined behaviour to work correctly. The result is that the `cat` operator produces incorrect result on many platforms. Rather than fix the `copy_offset` shaders, I decided to just introduce new shaders to perform the concat operation. The new implementation handles both buffer and texture inputs and is agnostic to memory layout. Differential Revision: [D76305343](https://our.internmc.facebook.com/intern/diff/D76305343/) ghstack-source-id: 289266299 Pull Request resolved: #11508

pytorch-bot · 2025-06-09T22:59:17Z

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/11508

📄 Preview Python docs built from this PR

Note: Links to docs will display an error until the docs builds have been completed.

❌ 3 New Failures, 2 Cancelled Jobs, 4 Unrelated Failures

As of commit a30836e with merge base 4a14fdd ():

NEW FAILURES - The following jobs have failed:

Build Presets / linux (linux, linux.2xlarge, executorch-ubuntu-22.04-clang12) / build (gh)
pull / unittest / linux / linux-job (gh)
/pytorch/executorch/backends/vulkan/runtime/api/containers/Tensor.cpp:651:17: error: no matching constructor for initialization of 'vkcompute::api::vTensor::TextureLimits'
pull / unittest-editable / linux / linux-job (gh)
/pytorch/executorch/backends/vulkan/runtime/api/containers/Tensor.cpp:651:17: error: no matching constructor for initialization of 'vkcompute::api::vTensor::TextureLimits'

CANCELLED JOBS - The following jobs were cancelled. Please retry:

FLAKY - The following jobs failed but were likely due to flakiness present on trunk:

pull / test-custom-ops-linux / linux-job (gh) (matched linux rule in flaky-rules.json)
The runner has received a shutdown signal. This can happen when the runner service is stopped, or a manually started runner is canceled.
pull / test-llama-runner-linux (fp32, xnnpack+custom+quantize_kv, linux.2xlarge, executorch-ubuntu-22.04... / linux-job (gh) (matched linux rule in flaky-rules.json)
The runner has received a shutdown signal. This can happen when the runner service is stopped, or a manually started runner is canceled.
pull / test-models-linux-basic (mv3, xnnpack-quantization-delegation, cmake, linux.2xlarge, executorch-u... / linux-job (gh) (matched linux rule in flaky-rules.json)
The runner has received a shutdown signal. This can happen when the runner service is stopped, or a manually started runner is canceled.
pull / unittest-arm-backend-with-no-fvp (test_pytest_models) / linux-job (gh) (matched linux rule in flaky-rules.json)
The runner has received a shutdown signal. This can happen when the runner service is stopped, or a manually started runner is canceled.

This comment was automatically generated by Dr. CI and updates every 15 minutes.

facebook-github-bot · 2025-06-09T22:59:24Z

This pull request was exported from Phabricator. Differential Revision: D76305343

github-actions · 2025-06-09T22:59:57Z

This PR needs a `release notes:` label

If your change should be included in the release notes (i.e. would users of this library care about this change?), please use a label starting with release notes:. This helps us keep track and include your important work in the next release notes.

To add a label, you can comment to pytorchbot, for example
@pytorchbot label "release notes: none"

For more information, see
https://github.com/pytorch/pytorch/wiki/PyTorch-AutoLabel-Bot#why-categorize-for-release-notes-and-how-does-it-work.

## Changes * Introduce `concat_texture.glsl` and `concat_buffer.glsl` to implement the `torch.cat` operator * Introduce `Concat.cpp` to replace `Cat.cpp` * Fix a bug with channels-packed buffer tensors where input data would be copied incorrectly with multiple dims have a stride of 1 ## Motivation > * Introduce `concat_texture.glsl` and `concat_buffer.glsl` to implement the `torch.cat` operator > * Introduce `Concat.cpp` to replace `Cat.cpp` The existing implementation of `torch.cat` uses the copy_channel_offset` shaders. However, these shaders have a critical bug where the output tensor is passed in separately with difference access types, i.e. ``` graph.execute_nodes().emplace_back(new DispatchNode( graph, VK_KERNEL_FROM_STR(kernel_name), global_size, local_size, // Inputs and Outputs { {out, vkapi::kWrite}, {out, vkapi::kRead}, {in, vkapi::kRead}, }, ``` This creates many validation layer errors because the memory barriers for the resource cannot be formed properly. The shader essentially relies on undefined behaviour to work correctly. The result is that the `cat` operator produces incorrect result on many platforms. Rather than fix the `copy_offset` shaders, I decided to just introduce new shaders to perform the concat operation. The new implementation handles both buffer and texture inputs and is agnostic to memory layout. Differential Revision: [D76305343](https://our.internmc.facebook.com/intern/diff/D76305343/) [ghstack-poisoned]

Pull Request resolved: #11508 ## Changes * Introduce `concat_texture.glsl` and `concat_buffer.glsl` to implement the `torch.cat` operator * Introduce `Concat.cpp` to replace `Cat.cpp` * Fix a bug with channels-packed buffer tensors where input data would be copied incorrectly with multiple dims have a stride of 1 ## Motivation > * Introduce `concat_texture.glsl` and `concat_buffer.glsl` to implement the `torch.cat` operator > * Introduce `Concat.cpp` to replace `Cat.cpp` The existing implementation of `torch.cat` uses the copy_channel_offset` shaders. However, these shaders have a critical bug where the output tensor is passed in separately with difference access types, i.e. ``` graph.execute_nodes().emplace_back(new DispatchNode( graph, VK_KERNEL_FROM_STR(kernel_name), global_size, local_size, // Inputs and Outputs { {out, vkapi::kWrite}, {out, vkapi::kRead}, {in, vkapi::kRead}, }, ``` This creates many validation layer errors because the memory barriers for the resource cannot be formed properly. The shader essentially relies on undefined behaviour to work correctly. The result is that the `cat` operator produces incorrect result on many platforms. Rather than fix the `copy_offset` shaders, I decided to just introduce new shaders to perform the concat operation. The new implementation handles both buffer and texture inputs and is agnostic to memory layout. Differential Revision: [D76305343](https://our.internmc.facebook.com/intern/diff/D76305343/) ghstack-source-id: 289266299

facebook-github-bot · 2025-06-12T15:03:37Z

This pull request was exported from Phabricator. Differential Revision: D76305343

## Changes * Introduce `concat_texture.glsl` and `concat_buffer.glsl` to implement the `torch.cat` operator * Introduce `Concat.cpp` to replace `Cat.cpp` * Fix a bug with channels-packed buffer tensors where input data would be copied incorrectly with multiple dims have a stride of 1 ## Motivation > * Introduce `concat_texture.glsl` and `concat_buffer.glsl` to implement the `torch.cat` operator > * Introduce `Concat.cpp` to replace `Cat.cpp` The existing implementation of `torch.cat` uses the copy_channel_offset` shaders. However, these shaders have a critical bug where the output tensor is passed in separately with difference access types, i.e. ``` graph.execute_nodes().emplace_back(new DispatchNode( graph, VK_KERNEL_FROM_STR(kernel_name), global_size, local_size, // Inputs and Outputs { {out, vkapi::kWrite}, {out, vkapi::kRead}, {in, vkapi::kRead}, }, ``` This creates many validation layer errors because the memory barriers for the resource cannot be formed properly. The shader essentially relies on undefined behaviour to work correctly. The result is that the `cat` operator produces incorrect result on many platforms. Rather than fix the `copy_offset` shaders, I decided to just introduce new shaders to perform the concat operation. The new implementation handles both buffer and texture inputs and is agnostic to memory layout. Differential Revision: [D76305343](https://our.internmc.facebook.com/intern/diff/D76305343/) [ghstack-poisoned]

Pull Request resolved: #11508 ## Changes * Introduce `concat_texture.glsl` and `concat_buffer.glsl` to implement the `torch.cat` operator * Introduce `Concat.cpp` to replace `Cat.cpp` * Fix a bug with channels-packed buffer tensors where input data would be copied incorrectly with multiple dims have a stride of 1 ## Motivation > * Introduce `concat_texture.glsl` and `concat_buffer.glsl` to implement the `torch.cat` operator > * Introduce `Concat.cpp` to replace `Cat.cpp` The existing implementation of `torch.cat` uses the copy_channel_offset` shaders. However, these shaders have a critical bug where the output tensor is passed in separately with difference access types, i.e. ``` graph.execute_nodes().emplace_back(new DispatchNode( graph, VK_KERNEL_FROM_STR(kernel_name), global_size, local_size, // Inputs and Outputs { {out, vkapi::kWrite}, {out, vkapi::kRead}, {in, vkapi::kRead}, }, ``` This creates many validation layer errors because the memory barriers for the resource cannot be formed properly. The shader essentially relies on undefined behaviour to work correctly. The result is that the `cat` operator produces incorrect result on many platforms. Rather than fix the `copy_offset` shaders, I decided to just introduce new shaders to perform the concat operation. The new implementation handles both buffer and texture inputs and is agnostic to memory layout. ghstack-source-id: 289956459 Differential Revision: [D76305343](https://our.internmc.facebook.com/intern/diff/D76305343/)

facebook-github-bot · 2025-06-12T18:41:05Z

This pull request was exported from Phabricator. Differential Revision: D76305343

## Changes * Introduce `concat_texture.glsl` and `concat_buffer.glsl` to implement the `torch.cat` operator * Introduce `Concat.cpp` to replace `Cat.cpp` * Fix a bug with channels-packed buffer tensors where input data would be copied incorrectly with multiple dims have a stride of 1 ## Motivation > * Introduce `concat_texture.glsl` and `concat_buffer.glsl` to implement the `torch.cat` operator > * Introduce `Concat.cpp` to replace `Cat.cpp` The existing implementation of `torch.cat` uses the copy_channel_offset` shaders. However, these shaders have a critical bug where the output tensor is passed in separately with difference access types, i.e. ``` graph.execute_nodes().emplace_back(new DispatchNode( graph, VK_KERNEL_FROM_STR(kernel_name), global_size, local_size, // Inputs and Outputs { {out, vkapi::kWrite}, {out, vkapi::kRead}, {in, vkapi::kRead}, }, ``` This creates many validation layer errors because the memory barriers for the resource cannot be formed properly. The shader essentially relies on undefined behaviour to work correctly. The result is that the `cat` operator produces incorrect result on many platforms. Rather than fix the `copy_offset` shaders, I decided to just introduce new shaders to perform the concat operation. The new implementation handles both buffer and texture inputs and is agnostic to memory layout. Differential Revision: [D76305343](https://our.internmc.facebook.com/intern/diff/D76305343/) [ghstack-poisoned]

Pull Request resolved: #11508 ## Changes * Introduce `concat_texture.glsl` and `concat_buffer.glsl` to implement the `torch.cat` operator * Introduce `Concat.cpp` to replace `Cat.cpp` * Fix a bug with channels-packed buffer tensors where input data would be copied incorrectly with multiple dims have a stride of 1 ## Motivation > * Introduce `concat_texture.glsl` and `concat_buffer.glsl` to implement the `torch.cat` operator > * Introduce `Concat.cpp` to replace `Cat.cpp` The existing implementation of `torch.cat` uses the copy_channel_offset` shaders. However, these shaders have a critical bug where the output tensor is passed in separately with difference access types, i.e. ``` graph.execute_nodes().emplace_back(new DispatchNode( graph, VK_KERNEL_FROM_STR(kernel_name), global_size, local_size, // Inputs and Outputs { {out, vkapi::kWrite}, {out, vkapi::kRead}, {in, vkapi::kRead}, }, ``` This creates many validation layer errors because the memory barriers for the resource cannot be formed properly. The shader essentially relies on undefined behaviour to work correctly. The result is that the `cat` operator produces incorrect result on many platforms. Rather than fix the `copy_offset` shaders, I decided to just introduce new shaders to perform the concat operation. The new implementation handles both buffer and texture inputs and is agnostic to memory layout. ghstack-source-id: 290022825 Differential Revision: [D76305343](https://our.internmc.facebook.com/intern/diff/D76305343/)

facebook-github-bot · 2025-06-12T19:34:57Z

This pull request was exported from Phabricator. Differential Revision: D76305343

## Context Previously, I updated the implementation of `aten.cat.default` in D76305343 (#11508) since the original implementation had a bug. The new implementation only supported up to 3 input tensors, but several models require the need for up to 6 input tensors. This diff updates the capabilities of the `concat` op so that any arbitrary number of input tensors may be accepted. ## Changes * Update implementation of the concat shader to be able to be called repeatedly, allowing support for any number of input tensors. Differential Revision: [D79893084](https://our.internmc.facebook.com/intern/diff/D79893084/) [ghstack-poisoned]

… any number of input tensors" ## Context Previously, I updated the implementation of `aten.cat.default` in D76305343 (#11508) since the original implementation had a bug. The new implementation only supported up to 3 input tensors, but several models require the need for up to 6 input tensors. This diff updates the capabilities of the `concat` op so that any arbitrary number of input tensors may be accepted. ## Changes * Update implementation of the concat shader to be able to be called repeatedly, allowing support for any number of input tensors. Differential Revision: [D79893084](https://our.internmc.facebook.com/intern/diff/D79893084/) [ghstack-poisoned]

…nput tensors" ## Context Previously, I updated the implementation of `aten.cat.default` in D76305343 (#11508) since the original implementation had a bug. The new implementation only supported up to 3 input tensors, but several models require the need for up to 6 input tensors. This diff updates the capabilities of the `concat` op so that any arbitrary number of input tensors may be accepted. ## Changes * Update implementation of the concat shader to be able to be called repeatedly, allowing support for any number of input tensors. Differential Revision: [D79893084](https://our.internmc.facebook.com/intern/diff/D79893084/) [ghstack-poisoned]

#13252) ## Context Previously, I updated the implementation of `aten.cat.default` in D76305343 (#11508) since the original implementation had a bug. The new implementation only supported up to 3 input tensors, but several models require the need for up to 6 input tensors. This diff updates the capabilities of the `concat` op so that any arbitrary number of input tensors may be accepted. ## Changes * Update implementation of the concat shader to be able to be called repeatedly, allowing support for any number of input tensors. Differential Revision: [D79893084](https://our.internmc.facebook.com/intern/diff/D79893084/)

pytorch#13252) ## Context Previously, I updated the implementation of `aten.cat.default` in D76305343 (pytorch#11508) since the original implementation had a bug. The new implementation only supported up to 3 input tensors, but several models require the need for up to 6 input tensors. This diff updates the capabilities of the `concat` op so that any arbitrary number of input tensors may be accepted. ## Changes * Update implementation of the concat shader to be able to be called repeatedly, allowing support for any number of input tensors. Differential Revision: [D79893084](https://our.internmc.facebook.com/intern/diff/D79893084/)

facebook-github-bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Jun 9, 2025

facebook-github-bot added the fb-exported label Jun 9, 2025

This was referenced Jun 12, 2025

[ET-VK] Clean up vTensor member variables and expose dim order UBO and push constant #11599

Merged

[ET-VK] Use dim order when converting buffer index to tensor index #11600

Merged

trivedivivek approved these changes Jun 12, 2025

View reviewed changes

facebook-github-bot merged commit 3f2addc into gh/SS-JIA/239/base Jun 12, 2025
87 of 98 checks passed

facebook-github-bot deleted the gh/SS-JIA/239/head branch June 12, 2025 22:28

facebook-github-bot temporarily deployed to cherry-pick-bot June 12, 2025 22:28 — with GitHub Actions Inactive

pytorchbot mentioned this pull request Jun 12, 2025

[ET-VK] New implementation of cat operator #11623

Merged

SS-JIA mentioned this pull request Aug 8, 2025

[ET-VK] Allow aten.cat.default to handle any number of input tensors #13226

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Uh oh!

[ET-VK] New implementation of `cat` operator #11508

[ET-VK] New implementation of `cat` operator #11508

Uh oh!

SS-JIA commented Jun 9, 2025 •

edited

Loading

Uh oh!

pytorch-bot bot commented Jun 9, 2025 •

edited

Loading

Uh oh!

facebook-github-bot commented Jun 9, 2025

Uh oh!

github-actions bot commented Jun 9, 2025

Uh oh!

facebook-github-bot commented Jun 12, 2025

Uh oh!

facebook-github-bot commented Jun 12, 2025

Uh oh!

facebook-github-bot commented Jun 12, 2025

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

Uh oh!

[ET-VK] New implementation of cat operator #11508

[ET-VK] New implementation of cat operator #11508

Uh oh!

Conversation

SS-JIA commented Jun 9, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Changes

Motivation

Uh oh!

pytorch-bot bot commented Jun 9, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/11508

❌ 3 New Failures, 2 Cancelled Jobs, 4 Unrelated Failures

Uh oh!

facebook-github-bot commented Jun 9, 2025

Uh oh!

github-actions bot commented Jun 9, 2025

This PR needs a release notes: label

Uh oh!

facebook-github-bot commented Jun 12, 2025

Uh oh!

facebook-github-bot commented Jun 12, 2025

Uh oh!

facebook-github-bot commented Jun 12, 2025

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

[ET-VK] New implementation of `cat` operator #11508

[ET-VK] New implementation of `cat` operator #11508

SS-JIA commented Jun 9, 2025 •

edited

Loading

pytorch-bot bot commented Jun 9, 2025 •

edited

Loading

This PR needs a `release notes:` label