
Conversation

yucai-intel
Contributor

No description provided.

@CuiYifeng CuiYifeng requested a review from Copilot October 16, 2025 08:45

@Copilot (Copilot AI) left a comment


Pull Request Overview

Split IndexToOffset into compile-time (static Dims) and runtime (-1) variants to reduce nvcc compilation time and refactor all kernel call sites to use the new template form. Key changes remove the previous contiguous fast path flag and introduce dimension template parameters across many XPU SYCL kernels.

  • Introduced IndexToOffset<T, IndexType, Dims> and dynamic specialization with Dims = -1; removed strict/non-strict contiguous branching.
  • Updated all kernel usages to pass an explicit Dims (positive, -1, or new sentinel -2) and added indexing_kind template parameters in RNN kernels.
  • Replaced standard SYCL subgroup size attribute with Intel-specific attribute in one kernel.
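The bullets above describe the new dispatch shape. A minimal host-side sketch of how a call site might select a compile-time `Dims` instantiation and fall back to the runtime variant (simplified stand-ins for `TensorInfo` and `IndexToOffset`, not the real `src/comm/TensorInfo.h`; the `offset_of` dispatcher is hypothetical):

```cpp
#include <cassert>

// Simplified stand-in for the PR's TensorInfo (fixed max rank of 8).
template <typename T, typename IndexType>
struct TensorInfo {
  IndexType sizes[8];
  IndexType strides[8];
  int dims;
};

// Primary template: Dims is a compile-time constant, so the loop bound
// is known to the device compiler and can be fully unrolled.
template <typename T, typename IndexType, int Dims>
struct IndexToOffset {
  static IndexType get(IndexType linearId,
                       const TensorInfo<T, IndexType>& info) {
    IndexType offset = 0;
    for (int i = Dims - 1; i > 0; --i) {
      offset += (linearId % info.sizes[i]) * info.strides[i];
      linearId /= info.sizes[i];
    }
    return offset + linearId * info.strides[0];
  }
};

// Dims == -1: rank is only known at runtime; loop over info.dims instead.
template <typename T, typename IndexType>
struct IndexToOffset<T, IndexType, -1> {
  static IndexType get(IndexType linearId,
                       const TensorInfo<T, IndexType>& info) {
    IndexType offset = 0;
    for (int i = info.dims - 1; i > 0; --i) {
      offset += (linearId % info.sizes[i]) * info.strides[i];
      linearId /= info.sizes[i];
    }
    return offset + linearId * info.strides[0];
  }
};

// Hypothetical dispatcher: dedicated instantiations for common low ranks,
// runtime variant for everything else.
template <typename T, typename IndexType>
IndexType offset_of(IndexType linearId, const TensorInfo<T, IndexType>& info) {
  switch (info.dims) {
    case 1:  return IndexToOffset<T, IndexType, 1>::get(linearId, info);
    case 2:  return IndexToOffset<T, IndexType, 2>::get(linearId, info);
    default: return IndexToOffset<T, IndexType, -1>::get(linearId, info);
  }
}
```

The compile-time and runtime variants must agree on any tensor, which is easy to spot-check on a small strided (transposed) case.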

Reviewed Changes

Copilot reviewed 11 out of 11 changed files in this pull request and generated 4 comments.

Summary per file:

  • src/comm/TensorInfo.h — Replaced the old IndexToOffset implementation with compile-time and runtime (-1) variants; removed the contiguous fast path.
  • src/ATen/native/xpu/sycl/WeightNormKernels.cpp — Updated all IndexToOffset calls to the new API (runtime -1).
  • src/ATen/native/xpu/sycl/TensorModeKernel.cpp — Switched the subgroup size attribute to the Intel-specific spelling and updated IndexToOffset usage.
  • src/ATen/native/xpu/sycl/TensorApplyUtils.h — Updated ApplyOp2 to use the runtime (-1) IndexToOffset.
  • src/ATen/native/xpu/sycl/SummaryOpsKernels.cpp — Added ADims/BDims template parameters and updated IndexToOffset calls.
  • src/ATen/native/xpu/sycl/Sorting.cpp — Passes a compile-time Dim to IndexToOffset.
  • src/ATen/native/xpu/sycl/ScanUtils.h — Migrated to runtime (-1) IndexToOffset calls.
  • src/ATen/native/xpu/sycl/RNNKernels.cpp — Added an indexing_kind template parameter and adjusted macros to the new IndexToOffset signature.
  • src/ATen/native/xpu/sycl/Indexing.h — Updated offset calculations to the new runtime form.
  • src/ATen/native/xpu/sycl/Indexing.cpp — Added DstDim/SrcDim/IdxDim template parameters; macros now emit various Dims (including -2) for IndexToOffset.
  • src/ATen/native/xpu/sycl/Dropout.cpp — Added ADims/BDims template-based offset computation.


Comment on lines +155 to 172
```diff
 template <typename T, typename IndexType, int Dims>
 struct IndexToOffset {
-  static constexpr bool STRICT_CONTIGUOUS = true;
-  static constexpr bool NON_STRICT_CONTIGUOUS = false;
-  static inline IndexType get(
+  static IndexType get(
       IndexType linearId,
-      const TensorInfo<T, IndexType>& info,
-      bool strict_contiguous = true) {
+      const TensorInfo<T, IndexType>& info) {
     IndexType offset = 0;

-    if (info.isContiguousCheckStrict(strict_contiguous)) {
-      return linearId;
-    }
-
-    for (int dim = info.dims - 1; dim > 0; --dim) {
-      IndexType curDimIndex = linearId % info.sizes[dim];
-      IndexType curDimOffset = curDimIndex * info.strides[dim];
+    // Uses static dims
+    for (int i = Dims - 1; i > 0; --i) {
+      IndexType curDimIndex = linearId % info.sizes[i];
+      IndexType curDimOffset = curDimIndex * info.strides[i];
       offset += curDimOffset;
-      linearId /= info.sizes[dim];
+      linearId /= info.sizes[i];
     }

     return offset + linearId * info.strides[0];
   }
 };
```

Copilot AI Oct 16, 2025


No safeguard enforces that runtime info.dims equals the compile-time Dims; if a caller instantiates this with Dims > info.dims, the loop will read past the valid sizes/strides arrays causing undefined behavior. Add a static_assert for Dims > 0 and a runtime (debug) assert (e.g., TORCH_INTERNAL_ASSERT(info.dims == Dims)) to prevent misuse.
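A hedged sketch of the guard this comment asks for, with `TensorInfo` simplified and `TORCH_INTERNAL_ASSERT` stood in by a plain `assert` for illustration:

```cpp
#include <cassert>

// Simplified stand-in for the PR's TensorInfo.
template <typename T, typename IndexType>
struct TensorInfo {
  IndexType sizes[8];
  IndexType strides[8];
  int dims;
};

template <typename T, typename IndexType, int Dims>
struct IndexToOffset {
  // Reject non-positive Dims at compile time; runtime rank belongs to
  // the Dims == -1 specialization.
  static_assert(Dims > 0, "use the Dims == -1 specialization for runtime rank");

  static IndexType get(IndexType linearId,
                       const TensorInfo<T, IndexType>& info) {
    // Debug-only misuse check: the instantiated rank must match the
    // tensor's actual rank, or the loop below reads past sizes/strides.
    assert(info.dims == Dims);
    IndexType offset = 0;
    for (int i = Dims - 1; i > 0; --i) {
      offset += (linearId % info.sizes[i]) * info.strides[i];
      linearId /= info.sizes[i];
    }
    return offset + linearId * info.strides[0];
  }
};
```

Note that a `static_assert(Dims > 0)` as written would also reject the -2 sentinel used elsewhere in this PR, so the sentinel would need its own documented specialization.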


Comment on lines 175 to 191
```diff
 template <typename T, typename IndexType>
-struct IndexToOffset<T, IndexType, true> {
-  static constexpr bool STRICT_CONTIGUOUS = true;
-  static constexpr bool NON_STRICT_CONTIGUOUS = false;
+struct IndexToOffset<T, IndexType, -1> {
   static inline IndexType get(
       IndexType linearId,
-      const TensorInfo<T, IndexType>& info,
-      bool strict_contiguous = true) {
-    return linearId;
+      const TensorInfo<T, IndexType>& info) {
+    IndexType offset = 0;
+
+    for (int i = info.dims - 1; i > 0; --i) {
+      IndexType curDimIndex = linearId % info.sizes[i];
+      IndexType curDimOffset = curDimIndex * info.strides[i];
+      offset += curDimOffset;
+      linearId /= info.sizes[i];
+    }
+
+    return offset + linearId * info.strides[0];
   }
 };
```

Copilot AI Oct 16, 2025


Previous implementation returned linearId immediately for (non-)strict contiguous tensors; removing that fast path forces modulus/divide per dimension for contiguous cases, increasing kernel instruction count. Consider an early check (e.g., if (info.isContiguous()) return linearId;) or a dims==1 shortcut to preserve performance.
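A sketch of the suggested fast path on the runtime variant, written as a free function for brevity; `isContiguous()` here is a stand-in for whatever predicate `TensorInfo` actually exposes, and the `is_contiguous` flag is assumed to be precomputed on the host:

```cpp
#include <cassert>

template <typename T, typename IndexType>
struct TensorInfo {
  IndexType sizes[8];
  IndexType strides[8];
  int dims;
  bool is_contiguous;  // assumed precomputed on the host
  bool isContiguous() const { return is_contiguous; }
};

template <typename T, typename IndexType>
IndexType dynamic_index_to_offset(IndexType linearId,
                                  const TensorInfo<T, IndexType>& info) {
  if (info.isContiguous()) {
    return linearId;  // fast path: no per-dimension div/mod
  }
  IndexType offset = 0;
  for (int i = info.dims - 1; i > 0; --i) {
    offset += (linearId % info.sizes[i]) * info.strides[i];
    linearId /= info.sizes[i];
  }
  return offset + linearId * info.strides[0];
}
```

For a contiguous tensor the linear id is the offset, so the branch trades one predictable comparison for `dims - 1` divisions and modulos per element.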


```diff
 template <typename T, unsigned int Power2Size>
 struct ComputeModeKernelFunctor : public __SYCL_KER_CONFIG_CONVENTION__ {
-  [[sycl::reqd_sub_group_size(32)]] void operator()(
+  [[intel::reqd_sub_group_size(32)]] void operator()(
```

Copilot AI Oct 16, 2025


Replacing the standard [[sycl::reqd_sub_group_size(32)]] with vendor-specific [[intel::reqd_sub_group_size(32)]] reduces portability across SYCL implementations. Prefer the standard attribute or guard the Intel extension with preprocessor checks.

Suggested change

```diff
-  [[intel::reqd_sub_group_size(32)]] void operator()(
+  [[sycl::reqd_sub_group_size(32)]] void operator()(
```
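If the vendor spelling is wanted when building with Intel's toolchain, one hedged way to keep the standard attribute as a fallback is a preprocessor guard. This assumes `__INTEL_LLVM_COMPILER` is the right detection macro for the toolchain in use, and the `XPU_REQD_SUB_GROUP_SIZE` macro name is hypothetical:

```cpp
// Sketch of a portability guard: use the Intel attribute only under
// Intel's LLVM-based SYCL compiler, otherwise the SYCL 2020 spelling.
#if defined(__INTEL_LLVM_COMPILER)
#define XPU_REQD_SUB_GROUP_SIZE(n) [[intel::reqd_sub_group_size(n)]]
#else
#define XPU_REQD_SUB_GROUP_SIZE(n) [[sycl::reqd_sub_group_size(n)]]
#endif

// Usage inside the kernel functor:
//   XPU_REQD_SUB_GROUP_SIZE(32) void operator()(sycl::nd_item<3> item) const;
```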


Comment on lines +1539 to +1541
```cpp
if (selfInfo.dims == 1 && sourceInfo.dims == 1 && indContig) {
  auto caller = SMALL_INDEX(
      scalar_t, index_t, unsigned int, 1, 1, -2, func_t);
```

Copilot AI Oct 16, 2025


The sentinel value -2 for IdxDim is undocumented and differs from the established -1 dynamic case; it implicitly relies on the primary template's loop skipping logic and produces offset = linearId * stride[0], which would be incorrect if indices_ is not 1D with stride[0]==1. Replace -2 with an explicit dimension (e.g., 1) or unify on -1 with a clear fast path, and document the intent.
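To see what -2 actually computes: with the primary template's loop `for (int i = Dims - 1; i > 0; --i)`, any `Dims <= 1` makes the loop dead code, so `get()` reduces to `linearId * strides[0]` — the same result an explicit `Dims = 1` instantiation yields. A small demonstration with simplified stand-ins (not the real header):

```cpp
#include <cassert>

template <typename T, typename IndexType>
struct TensorInfo {
  IndexType sizes[8];
  IndexType strides[8];
  int dims;
};

template <typename T, typename IndexType, int Dims>
struct IndexToOffset {
  static IndexType get(IndexType linearId,
                       const TensorInfo<T, IndexType>& info) {
    IndexType offset = 0;
    // Dims == -2 (or 1) makes this loop dead code: i starts non-positive.
    for (int i = Dims - 1; i > 0; --i) {
      offset += (linearId % info.sizes[i]) * info.strides[i];
      linearId /= info.sizes[i];
    }
    return offset + linearId * info.strides[0];
  }
};
```

Since the two instantiations are observably identical, spelling the intent as `Dims = 1` (or documenting -2 as a "1-D, possibly strided" sentinel) costs nothing and removes the ambiguity.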

