optimize embedding #1891

jianyizh · 2025-07-31T06:48:30Z

The current index select kernel uses the index kernel config, which sets the wrong batch and problem batch values. This produces launch configs that may lead to low occupancy. Also, inside the kernel, half of the threads are skipped in the following case:
index: 56103808
embedding table: [1683, 1]
previous on pvc: 19ms
now on pvc: 6.7ms

As a next step, we will try to improve the index kernel config. @yucai-intel

Copilot

Pull Request Overview

This PR optimizes the embedding operation for index select kernels on Intel XPU by introducing specialized embedding kernel functors. The optimization addresses performance issues where the previous implementation used incorrect index kernel configurations leading to low GPU occupancy.

Adds dedicated EmbeddingKernelFunctor and EmbeddingKernelSLMFunctor classes for optimized embedding operations
Implements dynamic kernel selection based on shared local memory availability
Integrates the new embedding path into the existing index select kernel for specific tensor configurations
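
The overview does not state the criterion used to pick between the two functors. As a rough sketch only, one plausible rule is to take the SLM path when the whole embedding table fits in the shared local memory available to a work-group. Everything named here (`EmbeddingPath`, `choose_embedding_path`, `slm_bytes_per_group`) is hypothetical and is not taken from the PR.

```cpp
#include <cstddef>

// Hypothetical labels for the two code paths named in the overview.
enum class EmbeddingPath { Global, SLM };

// Assumed rule (not confirmed by the PR): use the SLM path only when the
// entire table fits in the per-work-group shared local memory budget.
EmbeddingPath choose_embedding_path(std::size_t num_embeddings,
                                    std::size_t embedding_dim,
                                    std::size_t element_size,
                                    std::size_t slm_bytes_per_group) {
  const std::size_t table_bytes = num_embeddings * embedding_dim * element_size;
  return table_bytes <= slm_bytes_per_group ? EmbeddingPath::SLM
                                            : EmbeddingPath::Global;
}
```

Under this assumed rule, the [1683, 1] table from the description would occupy 6732 bytes with 4-byte elements, so it would qualify for the SLM path on any device with at least that much local memory.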

Reviewed Changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 4 comments.

File	Description
src/ATen/native/xpu/sycl/Indexing.h	Adds new embedding kernel functor classes with optimized launch configurations
src/ATen/native/xpu/sycl/Indexing.cpp	Implements embedding function and integrates it into index select kernel for 2D contiguous tensors

Copilot · 2025-07-31T06:49:26Z

src/ATen/native/xpu/sycl/Indexing.h

+    for (auto thread_id = item.get_global_linear_id();
+         thread_id < indices_length_ * embedding_dim_;
+         thread_id += item.get_local_range(0) * item.get_group_range(0)) {
+      SYCL_KERNEL_ASSERT(index_[thread_id / embedding_dim_] < num_embeddings_);


The assertion index_[thread_id / embedding_dim_] < num_embeddings_ is evaluated inside the tight loop for every thread iteration. Consider moving this validation outside the performance-critical path or using a debug-only assertion to avoid overhead in production builds.
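
For readers unfamiliar with the loop shape in the hunk above, here is a host-side emulation in plain C++ of a grid-stride embedding gather. It is a sketch under assumptions: the function name `embedding_gather`, the `total_threads` parameter (standing in for `get_local_range(0) * get_group_range(0)`), the row-major weight layout, and the column computation `i % embedding_dim` do not appear in the diff and are only illustrative.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Host-side emulation of a grid-stride embedding gather (illustrative only).
// weight is assumed row-major with shape [num_embeddings, embedding_dim].
std::vector<float> embedding_gather(const std::vector<float>& weight,
                                    const std::vector<int64_t>& index,
                                    std::size_t num_embeddings,
                                    std::size_t embedding_dim,
                                    std::size_t total_threads) {
  const std::size_t total = index.size() * embedding_dim;
  std::vector<float> out(total);
  // The outer loop plays the role of the launched work-items.
  for (std::size_t tid = 0; tid < total_threads; ++tid) {
    // Grid-stride loop: each work-item handles elements tid, tid + stride, ...
    for (std::size_t i = tid; i < total; i += total_threads) {
      const std::size_t slot = i / embedding_dim;  // which index entry
      const std::size_t col = i % embedding_dim;   // which column of the row
      const int64_t row = index[slot];
      // Same bound as the kernel assertion, evaluated once per output element.
      assert(row >= 0 && static_cast<std::size_t>(row) < num_embeddings);
      out[i] = weight[static_cast<std::size_t>(row) * embedding_dim + col];
    }
  }
  return out;
}
```

Because the stride equals the total number of work-items, a fixed launch size covers any `indices_length * embedding_dim`, which matches the intent of the "add for loop to avoid too much workgroup" commit.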

Copilot · 2025-07-31T06:49:27Z

src/ATen/native/xpu/sycl/Indexing.h

+    for (auto thread_id = item.get_global_linear_id();
+         thread_id < indices_length_ * embedding_dim_;
+         thread_id += item.get_local_range(0) * item.get_group_range(0)) {
+      SYCL_KERNEL_ASSERT(index_[thread_id / embedding_dim_] < num_embeddings_);


Similar to the regular embedding kernel, this assertion inside the loop may impact performance. The same index bounds check is duplicated between both kernel functors and could benefit from optimization or debug-only execution.

Suggested change

-      SYCL_KERNEL_ASSERT(index_[thread_id / embedding_dim_] < num_embeddings_);
+#ifdef DEBUG
+      if (index_[thread_id / embedding_dim_] >= num_embeddings_) {
+        SYCL_KERNEL_ASSERT(false && "Index out of bounds in EmbeddingKernelSLMFunctor");
+      }
+#endif
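
An alternative to a debug-only guard is to validate the indices in a separate pass, so the check runs once per index instead of once per output element. The following plain C++ host-side sketch illustrates that idea; it is not what the PR implements, and the helper names (`first_invalid_index`, `checks_per_element`, `checks_per_index`) are hypothetical.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Returns the position of the first out-of-range index, or -1 if all are valid.
std::ptrdiff_t first_invalid_index(const std::vector<int64_t>& index,
                                   int64_t num_embeddings) {
  for (std::size_t i = 0; i < index.size(); ++i) {
    if (index[i] < 0 || index[i] >= num_embeddings) {
      return static_cast<std::ptrdiff_t>(i);
    }
  }
  return -1;
}

// Bounds checks performed when the check sits inside the per-element loop.
std::size_t checks_per_element(std::size_t indices_length,
                               std::size_t embedding_dim) {
  return indices_length * embedding_dim;
}

// Bounds checks performed when validation is a separate pass over the indices.
std::size_t checks_per_index(std::size_t indices_length) {
  return indices_length;
}
```

The saving scales with `embedding_dim`. For the shape benchmarked in the description, where `embedding_dim` is 1, both strategies perform the same number of checks, so hoisting the check would not reduce the count there.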

Copilot · 2025-07-31T06:49:28Z

src/ATen/native/xpu/sycl/Indexing.cpp

-                src_info, dst_info, index_info, new_indexing_dim);
-          else
+          if (dst.is_contiguous() && indices.is_contiguous()) {
+            if (src.dim() == 2 && indices.dim() == 1 && src.is_contiguous()) {


The condition src.is_contiguous() is redundant here since it's already checked in the parent condition on line 244. This duplicate check adds unnecessary complexity to the conditional logic.

Suggested change

-            if (src.dim() == 2 && indices.dim() == 1 && src.is_contiguous()) {
+            if (src.dim() == 2 && indices.dim() == 1) {

Co-authored-by: Copilot <[email protected]>

jianyizh added 3 commits July 17, 2025 07:57

add a simple kernel

b834118

add for loop to avoid too much workgroup

d04b03c

slm

d64f678

Copilot AI review requested due to automatic review settings July 31, 2025 06:48

jianyizh added the kernel_optimization label Jul 31, 2025

jianyizh requested review from EikanWang and xytintel July 31, 2025 06:48

Copilot AI reviewed Jul 31, 2025

View reviewed changes

jianyizh assigned jianyizh and yucai-intel Jul 31, 2025

Apply suggestions from copilot

38ff50c

Co-authored-by: Copilot <[email protected]>
