Commit 5ee2a32

optimize embedding bag (#1726)

I have removed the batch kernel config so that igc will choose grf mode 128 with the sycl assert inside the kernel with vec size = 8. We can now get an optimization equivalent to the previous result.

1. ~~remove SYCL_KERNEL_ASSERT, this will change grf mode from 256 to 128, but there is an existing issue #1052, I did not remove it in this pr. We should add NDEBUG flag later or use vec_size = 4~~
2. I saw instruction fetch stalls caused by the if branches, so I moved them to template params.
3. I also fixed the vectorization. Previously it was not actually enabled.
4. Previously we only used 256 threads per workgroup, but a workgroup size of 1024 is available.

Performance on input [409581], weight [1000000,64], offset [4096] (4096 bags), dtype = half, mode = sum:

| | PVC | BMG |
| ------------- | ------------- | ---- |
| main branch | 0.18ms | 0.43ms |
| current change | 0.08ms | 0.23ms |
| current change + if we remove assert | 0.07 ms | 0.22 ms |
| ~~remove sycl assert~~ | ~~0.10ms~~ | ~~0.30 ms~~ |
| ~~remove branching~~ | ~~0.08ms~~ | ~~0.28 ms~~ |
| ~~tiling~~ | ~~0.087ms~~ | ~~0.22 ms~~ |

~~Note: We are stalled [here](https://github.com/intel/torch-xpu-ops/blob/5b4d7444484576f721d2295761cf8fafa924ef36/src/ATen/native/xpu/sycl/EmbeddingBag.h#L68) `vec_t other = w_vec_[i_off];` when vector size is 8, the assembly is `load.ugm.d32.a64; load.ugm.d32.a64.flat[A+0x4]; load.ugm.d32.a64.flat[A+0x8]; load.ugm.d32.a64.flat[A+0xC];` After the fix, it changes to `load.ugm.d32x4`. There is no performance change at peak frequency, but when profiling at a lower frequency, I see it run 9% faster.~~

~~PVC does not benefit from tiling; in this case, there will be 32 workgroups but 64 Xe cores. However, even if we set vec_size = 4, tiling 2 batches is still a regression. The best config is vec_size = 4 with workgroup size = 512, which can reach 0.71ms. There is no benefit on BMG from setting a smaller workgroup size.~~

---------

Co-authored-by: intel <intel.com>
Co-authored-by: Copilot <[email protected]>
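To illustrate item 2, here is a minimal host-side C++ sketch of moving runtime branches into template parameters. It is not the kernel code from this commit: `Mode`, `reduce_bag`, and `reduce_bag_dispatch` are hypothetical names chosen for illustration, and the actual SYCL kernel in `EmbeddingBag.h` may be structured differently. The idea is only that the flags are tested once in a dispatcher, so each instantiated loop body contains no per-element branch.

```cpp
#include <cstddef>

// Hypothetical reduction modes; these names are illustrative, not from the commit.
enum class Mode { Sum, Mean };

// Hot loop specialized at compile time. `if constexpr` discards the untaken
// branch, so each (mode, has_weights) instantiation has a branch-free body.
template <Mode mode, bool has_weights>
float reduce_bag(const float* values, const float* weights, std::size_t n) {
  float acc = 0.0f;
  for (std::size_t i = 0; i < n; ++i) {
    if constexpr (has_weights) {
      acc += values[i] * weights[i];
    } else {
      acc += values[i];
    }
  }
  if constexpr (mode == Mode::Mean) {
    if (n > 0) {
      acc /= static_cast<float>(n);
    }
  }
  return acc;
}

// Runtime flags are checked once here, outside the hot loop, and mapped to
// the matching template instantiation.
float reduce_bag_dispatch(Mode mode, const float* values, const float* weights,
                          std::size_t n) {
  const bool has_weights = (weights != nullptr);
  if (mode == Mode::Sum) {
    return has_weights ? reduce_bag<Mode::Sum, true>(values, weights, n)
                       : reduce_bag<Mode::Sum, false>(values, weights, n);
  }
  return has_weights ? reduce_bag<Mode::Mean, true>(values, weights, n)
                     : reduce_bag<Mode::Mean, false>(values, weights, n);
}
```

The trade-off is more instantiations (larger binary) in exchange for a smaller, straight-line instruction stream per kernel variant, which is the kind of change that would plausibly reduce the instruction fetch stalls described above.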

1 parent a63be60 commit 5ee2a32

2 files changed, +356 -188 lines changed

src/ATen/native/xpu/sycl
- EmbeddingBag.cpp
- EmbeddingBag.h
