
Commit f8b1ee9

Authored by chunhuanMeng, with co-authors Copilot and xytintel
Add Template Parameter to gpu_kernel for Controlling Broadcasting Vectorization (#1873)
This pull request updates the `gpu_kernel` function in `src/ATen/native/xpu/sycl/Loops.h` to introduce a new template parameter for finer control over broadcasting behavior. The change lets callers enable or disable broadcasting vectorization explicitly.

### Enhancements to `gpu_kernel`

* Added a new template parameter `enable_broadcast_vec` (defaulting to `true`) to `gpu_kernel`, allowing explicit control over broadcasting vectorization.
* Updated the recursive and implementation calls within `gpu_kernel` to pass `enable_broadcast_vec` through, ensuring consistent behavior during sub-iteration and in the implementation.

### Reason for the changes

The `enable_broadcast_vec` parameter addresses an issue with output offset calculation when the iterator (`iter`) is split. With broadcasting vectorization enabled, the code path taken during the computation can produce incorrect output offsets after the iterator has been split. Allowing explicit control over broadcasting vectorization means it can be disabled for split iterators, ensuring correct output offset calculations.

Resolves #1813

Co-authored-by: Copilot <[email protected]>
Co-authored-by: Yutao Xu <[email protected]>
1 parent 65d4902 commit f8b1ee9

File tree

2 files changed (+10, -3 lines)


src/ATen/native/xpu/sycl/Loops.h

Lines changed: 5 additions & 3 deletions
@@ -620,7 +620,7 @@ void gpu_kernel_nocast(TensorIteratorBase& iter, const func_t& f) {
   gpu_kernel_impl_nocast(iter, f);
 }

-template <typename func_t>
+template <typename func_t, bool enable_broadcast_vec = true>
 void gpu_kernel(TensorIteratorBase& iter, const func_t& f) {
   for (int arg = 0; arg < iter.ntensors(); arg++) {
     TORCH_INTERNAL_ASSERT(
@@ -637,12 +637,14 @@ void gpu_kernel(TensorIteratorBase& iter, const func_t& f) {

   if (!iter.can_use_32bit_indexing()) {
     for (auto& sub_iter : iter.with_32bit_indexing()) {
-      gpu_kernel(sub_iter, f);
+      // Broadcasting vectorization is disabled for sub-iterators to prevent
+      // potential output offset calculation issues.
+      gpu_kernel<func_t, false>(sub_iter, f);
     }
     return;
   }

-  gpu_kernel_impl(iter, f);
+  gpu_kernel_impl<func_t, enable_broadcast_vec>(iter, f);
 }

 template <typename arg1_t, typename arg2_t, typename return_t, typename func_t>

test/regressions/test_loops.py

Lines changed: 5 additions & 0 deletions
@@ -70,3 +70,8 @@ def test_loops_dynamic_cast(self):
         c = a + b + 1
         c_xpu = a_xpu + b_xpu + 1
         self.assertEqual(c, c_xpu.cpu())
+
+    def test_bc_vec_large_tensor(self):
+        raw_data = torch.rand(48, 64, 64, 64, 64)
+        a = raw_data.xpu().transpose(0, 1).contiguous().transpose(0, 1)
+        self.assertEqual(a, raw_data.xpu())
