Commit 5ee2a32
optimize embedding bag (#1726)
I have removed the batch kernel config so that IGC chooses GRF mode
128 with the SYCL assert still inside the kernel at vec size = 8. We now get
optimization equivalent to the previous result.
~~1. Remove SYCL_KERNEL_ASSERT. This changes the GRF mode from 256 to
128, but because of an existing issue,
#1052, I did not remove it
in this PR. We should add the NDEBUG flag later or use vec_size = 4.~~
2. I saw instruction fetch stalls caused by the `if` branches, so I moved
them to template parameters.
3. I also fixed the vectorization. Previously it was not actually
enabled.
4. Previously we used only 256 threads per workgroup, even though the
supported workgroup size is 1024.
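The idea in item 2 can be sketched on the host in plain C++. This is only an illustration, not the kernel from this commit: `embedding_bag_ref`, `embedding_bag_dispatch`, and the flags `kHasWeights` and `kMean` are hypothetical names, and the real kernel may template on different conditions. The runtime flags are tested once at dispatch time, and each instantiation uses `if constexpr`, so the inner loop carries no branch on them.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical CPU stand-in for the kernel body. The flags are template
// parameters, so each instantiation is compiled without a runtime branch
// on them inside the loops.
template <bool kHasWeights, bool kMean>
void embedding_bag_ref(const std::vector<float>& weight, int64_t dim,
                       const std::vector<int64_t>& indices,
                       const std::vector<int64_t>& offsets,
                       const std::vector<float>& per_sample_weights,
                       std::vector<float>& out) {
  const int64_t num_bags = static_cast<int64_t>(offsets.size());
  const int64_t num_indices = static_cast<int64_t>(indices.size());
  out.assign(num_bags * dim, 0.0f);
  for (int64_t b = 0; b < num_bags; ++b) {
    const int64_t begin = offsets[b];
    const int64_t end = (b + 1 < num_bags) ? offsets[b + 1] : num_indices;
    for (int64_t i = begin; i < end; ++i) {
      float scale = 1.0f;
      if constexpr (kHasWeights) {
        scale = per_sample_weights[i];
      }
      for (int64_t d = 0; d < dim; ++d) {
        out[b * dim + d] += scale * weight[indices[i] * dim + d];
      }
    }
    if constexpr (kMean) {
      if (end > begin) {
        for (int64_t d = 0; d < dim; ++d) {
          out[b * dim + d] /= static_cast<float>(end - begin);
        }
      }
    }
  }
}

// The runtime flags are checked once here, outside the hot loops, and
// mapped to one of four compile-time instantiations.
void embedding_bag_dispatch(bool has_weights, bool mean,
                            const std::vector<float>& weight, int64_t dim,
                            const std::vector<int64_t>& indices,
                            const std::vector<int64_t>& offsets,
                            const std::vector<float>& per_sample_weights,
                            std::vector<float>& out) {
  if (has_weights) {
    if (mean) {
      embedding_bag_ref<true, true>(weight, dim, indices, offsets, per_sample_weights, out);
    } else {
      embedding_bag_ref<true, false>(weight, dim, indices, offsets, per_sample_weights, out);
    }
  } else {
    if (mean) {
      embedding_bag_ref<false, true>(weight, dim, indices, offsets, per_sample_weights, out);
    } else {
      embedding_bag_ref<false, false>(weight, dim, indices, offsets, per_sample_weights, out);
    }
  }
}
```

The trade-off is more instantiations (larger binary) in exchange for a shorter, branch-free instruction stream per kernel variant.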
Performance on input [409581], weight [1000000, 64], offset [4096] (4096
bags), dtype = half, mode = sum:
| Configuration | PVC | BMG |
| ------------- | ------------- | ---- |
| main branch | 0.18ms | 0.43ms |
| current change | 0.08ms | 0.23ms |
| current change + if we remove assert | 0.07ms | 0.22ms |
| ~~remove sycl assert~~ | ~~0.10ms~~ | ~~0.30ms~~ |
| ~~remove branching~~ | ~~0.08ms~~ | ~~0.28ms~~ |
| ~~tiling~~ | ~~0.087ms~~ | ~~0.22ms~~ |
~~Note: We are stalled
[here](https://github.com/intel/torch-xpu-ops/blob/5b4d7444484576f721d2295761cf8fafa924ef36/src/ATen/native/xpu/sycl/EmbeddingBag.h#L68)
at `vec_t other = w_vec_[i_off];`. When the vector size is 8, the assembly is
`load.ugm.d32.a64; load.ugm.d32.a64.flat[A+0x4];
load.ugm.d32.a64.flat[A+0x8]; load.ugm.d32.a64.flat[A+0xC];`. After the fix,
it changes to `load.ugm.d32x4`. There is no performance change at peak
frequency, but when profiling at a lower frequency, I see it run 9% faster.~~
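The vectorization fix in item 3, and the wide load mentioned in the note above, can be illustrated with a host-side sketch. It rests on assumptions: `AlignedVec` and `accumulate_row` are hypothetical names, and the sketch does not reproduce the `vec_t` used in `EmbeddingBag.h`. The point is only that grouping `N` elements into one aligned struct gives the compiler the chance to move them with a single wide load instead of `N` scalar loads.

```cpp
#include <cstddef>
#include <cstring>

// Hypothetical aligned vector type: N elements of T, aligned to the size of
// the whole struct so it can be moved as one unit.
template <typename T, int N>
struct alignas(sizeof(T) * N) AlignedVec {
  T val[N];
};

// Adds one embedding row into an accumulator, N elements per step.
// `dim` is assumed to be a multiple of N. std::memcpy keeps this host
// sketch free of aliasing problems; a device kernel would more likely
// index a pointer that has already been cast to the vector type.
template <typename T, int N>
void accumulate_row(const T* row, T* acc, std::size_t dim) {
  using vec_t = AlignedVec<T, N>;
  for (std::size_t i = 0; i < dim; i += N) {
    vec_t w;
    vec_t a;
    std::memcpy(&w, row + i, sizeof(vec_t));  // one N-wide load of the row
    std::memcpy(&a, acc + i, sizeof(vec_t));  // one N-wide load of the sum
    for (int k = 0; k < N; ++k) {
      a.val[k] += w.val[k];
    }
    std::memcpy(acc + i, &a, sizeof(vec_t));  // one N-wide store
  }
}
```

Whether the compiler actually emits a single wide load depends on the target and on the alignment it can prove, so the generated assembly is worth checking, as the note does.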
~~PVC does not benefit from tiling: in this case there are 32
workgroups but 64 Xe cores. However, even if we set vec_size = 4, tiling 2
batches is still a regression. The best config is vec_size = 4 with
workgroup size = 512, which reaches 0.71ms. There is no benefit on BMG from
setting a smaller workgroup size.~~
---------
Co-authored-by: intel <intel.com>
Co-authored-by: Copilot <[email protected]>
2 files changed (+356, -188 lines) in src/ATen/native/xpu/sycl