
Conversation

fengyuan14
Contributor

No description provided.

mengfei25 and others added 27 commits September 16, 2025 01:38
1. Fix torchbench issues caused by deps installation
2. Fix pt2e test dataset path and deps installation
3. Fix op benchmark GitHub access permission
4. Enhance bisect search

disable_distributed
This PR intends to add some ported distributed cases to torch-xpu-ops
CI.
- Add ZE_AFFINITY_MASK to ensure Xelink is used.
- Add CCL_ROOT for Xelink; this workaround can be removed after the oneCCL upgrade
to 2021.16.2.
- Increase the distributed test time limit. Currently, the test part needs
about 1 hour after adding the ported cases.

disable_e2e
disable_ut
Support high-priority streams for xccl; the test case is added in
#2049.
We need to merge this PR first, together with the upstream op registration
pytorch/pytorch#163049, before the test case can
pass.

---------

Co-authored-by: mengfei25 <[email protected]>
- Early return if the input to nonzero has no elements,
- Change PyTorch `MAX_DIMS` to its XPU equivalent `XPU_MAX_TENSORINFO_DIMS`,
- Rename `tensor` to `out` to match the op schema.

---------

Co-authored-by: Tadeusz Krawiec <[email protected]>
# Motivation
Clean up `getDeviceIndexOfCurrentQueue` since it duplicates
`current_device` and provides no additional functionality.
disable_ut
disable_e2e
disable_distributed
…#2076)

The callback used to track work status in ProcessGroupXCCL was
causing an unintended memory leak by keeping the work objects, and
therefore the stashed tensors, alive. For now, I'm removing the callback and I
have added a unit test to ensure this memory leak doesn't return.
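
As a rough illustration of the leak pattern described above (the names below are hypothetical stand-ins, not the actual ProcessGroupXCCL code): a status-tracking callback that captures the work object keeps it, and any tensors stashed in it, alive for as long as the callback itself is stored.
```cpp
#include <functional>
#include <memory>
#include <vector>

// Stand-in for a work object that stashes tensor storage until completion.
struct Work {
  std::vector<float> stashed_tensors;
};

int main() {
  std::vector<std::function<void()>> callbacks;  // long-lived callback registry

  auto work = std::make_shared<Work>();
  work->stashed_tensors.resize(1 << 20);

  // The callback copies the shared_ptr, so the Work (and its stashed tensors)
  // cannot be freed while the registry still holds the callback.
  callbacks.push_back([work] { /* track work status */ });

  work.reset();  // the caller drops its reference...
  // ...but the storage is still alive: the stored callback owns a reference.
  return 0;
}
```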

Fix #2084
Some versions of the DPC++ compiler pass paths to SYCL headers as user
include paths (`-I`) rather than system paths (`-isystem`). This makes the
host compiler report warnings encountered in the SYCL headers, such
as deprecation warnings, even if the warned-about API is not actually used in the
program. We expect this issue to be addressed in a later
version of the DPC++ compiler. To work around it, we wrap the paths to SYCL
headers in `-isystem`.
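
A small, hypothetical illustration of the difference (neither the header nor the flags below come from this repository): with GCC or Clang, warnings that originate in headers found via `-isystem` are suppressed by default, while the same headers found via `-I` still produce them.
```cpp
// include/legacy.h (hypothetical header standing in for a SYCL header):
//   [[deprecated("use new_api() instead")]] inline int old_api() { return 1; }
//   inline int wrapper() { return old_api(); }  // deprecation warning originates here
//
// main.cpp:
#include <legacy.h>

int main() { return wrapper(); }

// g++ -Wall -I include main.cpp        -> warns about old_api() used inside legacy.h
// g++ -Wall -isystem include main.cpp  -> no warning: legacy.h is treated as a
//                                         system header, and warnings originating
//                                         in system headers are suppressed
```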

disable_ut
disable_e2e
disable_distributed

CC: @EikanWang @chuanqi129

Signed-off-by: Dmitry Rogozhkin <[email protected]>
Add the new LLM model dependency `accelerate`

disable_ut
disable_distributed
Fixes #1737 and the flaky backward accuracy issue uncovered by fixing
the forward pass, using sycl_global_and_local_fence to sync data in
global memory (log_alpha_data_).
- Remove unused variables,
- Add a compiler flag to prevent this in the future.
# Motivation
To resolve #2034, expose
all xpu internal headers to PyTorch.
This PR fixes access to the exception vector in error handling. The
original implementation could cause a segmentation fault due to
out-of-bounds access.
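A minimal sketch of the guarded-access pattern implied by this fix (illustrative names only, not the actual error-handling code): check the index against the vector size before dereferencing, instead of indexing unconditionally.
```cpp
#include <string>
#include <vector>

// Return the message for an error code, falling back to a generic message
// instead of reading past the end of the vector.
std::string message_for(const std::vector<std::string>& messages, size_t code) {
  if (code < messages.size()) {
    return messages[code];
  }
  return "unknown error code " + std::to_string(code);
}
```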
- Add a template for dynamic skips, making it easier to create issues that
require skipping

disable_all

---------

Co-authored-by: Wang, Chuanqi <[email protected]>
1. Use `timm_nfnet` for the CI test instead of `timm_regnet`, which has a
known accuracy issue (#1334).
2. Fix the on-demand test issue.

disable_build
disable_ut
disable_distributed
# Motivation
The original issue occurs on some old iGPUs when running the following code on
Windows:
```python
import torch
import torch.nn.functional as F

print(torch.xpu.get_device_properties())
arr = torch.rand(1, 2, 5, 5, device='xpu')
pts = torch.rand(1, 3, 3, 2, device='xpu')
out = F.grid_sample(arr, pts, align_corners=False)
```
The failure output is:
```bash
Traceback (most recent call last):
  File "C:\Vesuvius\urerr\urerr.py", line 22, in <module>
    out = F.grid_sample(arr, pts, align_corners=False)
  File "C:\Anaconda3\envs\pytn\Lib\site-packages\torch\nn\functional.py", line 5118, in grid_sample
    return torch.grid_sampler(input, grid, mode_enum, padding_mode_enum, align_corners)
           ~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: UR error
```

The Driver team's analysis located the crash in the generated SPIR-V IR
code.
```asm
; Function Attrs: nounwind
define internal spir_func float @_ZN2at6native3xpuL19compute_coordinatesIfEET_S3_iNS0_6detail18GridSamplerPaddingEb(float %0, i32 %1, i32 %2, i1 zeroext %3) #0 !spirv.ParameterDecorations !393 {
  %5 = alloca double, align 8, !spirv.Decorations !394
  switch i32 %2, label %58 [
    i32 1, label %6
    i32 2, label %14
  ]

6:                                                ; preds = %4
  %7 = sext i32 %1 to i64
  %8 = add nsw i64 %7, -1, !spirv.Decorations !387
  %9 = sitofp i64 %8 to float
  %10 = fcmp olt float %0, 0.000000e+00
  %11 = select i1 %10, float 0.000000e+00, float %0
  %12 = fcmp olt float %11, %9
  %13 = select i1 %12, float %11, float %9
  br label %58

14:                                               ; preds = %4
  br i1 %3, label %15, label %32

15:                                               ; preds = %14
  %16 = shl i32 %1, 1
  %17 = add i32 %16, -2
  %18 = icmp eq i32 %17, 0
  br i1 %18, label %49, label %19

19:                                               ; preds = %15
  %20 = sitofp i32 %17 to float
  %21 = fmul float %20, 5.000000e-01
  %22 = call spir_func float @_Z16__spirv_ocl_fabsf(float %0) #0
  %23 = call spir_func float @fmodf(float %22, float %21) #3
  %24 = fdiv float %22, %21
  %25 = call spir_func float @_Z17__spirv_ocl_floorf(float %24) #0
  %26 = fptosi float %25 to i32
  %27 = and i32 %26, 1
  %28 = icmp eq i32 %27, 0
  %29 = fsub float %21, %23
  %30 = select i1 %28, float %23, float %29
  %31 = fadd float %30, 0.000000e+00
  br label %49

32:                                               ; preds = %14
  %33 = icmp eq i32 %1, 0
  br i1 %33, label %49, label %34

34:                                               ; preds = %32
  %35 = shl nsw i32 %1, 1, !spirv.Decorations !387
  %36 = sitofp i32 %35 to float
  %37 = fmul float %36, 5.000000e-01
  %38 = fadd float %0, 5.000000e-01
  %39 = call spir_func float @_Z16__spirv_ocl_fabsf(float %38) #0
  %40 = call spir_func float @fmodf(float %39, float %37) #3
  %41 = fdiv float %39, %37
  %42 = call spir_func float @_Z17__spirv_ocl_floorf(float %41) #0
  %43 = fptosi float %42 to i32
  %44 = and i32 %43, 1
  %45 = icmp eq i32 %44, 0
  %46 = fsub float %37, %40
  %47 = select i1 %45, float %40, float %46
  %48 = fadd float %47, -5.000000e-01
  br label %49

49:                                               ; preds = %34, %32, %19, %15
  %50 = phi float [ %31, %19 ], [ 0.000000e+00, %15 ], [ %48, %34 ], [ 0.000000e+00, %32 ]
  %51 = sext i32 %1 to i64
  %52 = add nsw i64 %51, -1, !spirv.Decorations !387
  %53 = sitofp i64 %52 to float
  %54 = fcmp olt float %50, 0.000000e+00
  %55 = select i1 %54, float 0.000000e+00, float %50
  %56 = fcmp olt float %55, %53
  %57 = select i1 %56, float %55, float %53
  br label %58

58:                                               ; preds = %49, %6, %4
  %59 = phi float [ %13, %6 ], [ %57, %49 ], [ %0, %4 ]
  %60 = fptosi float %59 to i64
  %61 = icmp sgt i64 %60, 2147483646
  %62 = fcmp olt float %59, 0xC1E0000000000000
  %63 = or i1 %61, %62
  br i1 %63, label %72, label %64

64:                                               ; preds = %58
  %65 = fpext float %59 to double
  %66 = bitcast double* %5 to i8*
  call void @llvm.lifetime.start.p0i8(i64 8, i8* %66)
  %67 = addrspacecast double* %5 to double addrspace(4)*
  store double %65, double* %5, align 8
  %68 = call spir_func signext i16 @_dtest(double addrspace(4)* %67) #3
  %69 = bitcast double* %5 to i8*
  call void @llvm.lifetime.end.p0i8(i64 8, i8* %69)
  %70 = icmp slt i16 %68, 1
  %71 = select i1 %70, float %59, float -1.000000e+02
  br label %72

72:                                               ; preds = %64, %58
  %73 = phi float [ %71, %64 ], [ -1.000000e+02, %58 ]
  ret float %73
}
```
We can see that the SPIR-V IR code uses a double type and calls the @_dtest
function in block 64. According to the [MSVC
documentation](https://learn.microsoft.com/en-us/cpp/c-runtime-library/reference/floating-point-primitives?view=msvc-170#_dtest-_ldtest-_fdtest),
_dtest is used to detect whether a number is `NaN` or `INFINITE`.
This allows us to locate the root cause of the crash, which corresponds
to the following C++ logic:
```cpp
if (static_cast<int64_t>(x) > INT_MAX - 1 || x < INT_MIN ||
      !std::isfinite(static_cast<double>(x)))
    return static_cast<scalar_t>(-100.0);
  return x;
```
In other words, the crash occurs when the GPU executes code that tries
to convert a floating-point value (Half or BFloat16) to a double and
checks whether it is finite.

# Solution
- For Half and BFloat16, `std::isfinite(x)` promotes `x` to `float`,
providing enough precision for finiteness checks. Casting to double is
redundant and can safely be removed.
- Explicitly return `-100.0f` instead of a double literal.
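
A minimal sketch of the resulting logic (illustrative only, not the exact kernel code; the function name and signature are assumptions): the finiteness check stays in `float` and the sentinel is a `float` literal, so no double-precision instructions need to be emitted for Half/BFloat16 inputs.
```cpp
#include <climits>
#include <cmath>
#include <cstdint>

// scalar_t stands for Half/BFloat16, which are promoted to float before this
// check; the helper name is hypothetical.
template <typename scalar_t>
scalar_t clip_or_flag(float x) {
  if (static_cast<int64_t>(x) > INT_MAX - 1 || x < INT_MIN ||
      !std::isfinite(x)) {                  // float check, no cast to double
    return static_cast<scalar_t>(-100.0f);  // float literal, no double
  }
  return static_cast<scalar_t>(x);
}
```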

# Additional Context
I couldn't find an iGPU on which to verify the fix, but it is unlikely to
introduce any additional error.
To fix #2070.
This PR updates several SYCL kernel launch functions in
`src/ATen/native/xpu/sycl/Loops.h` to use `int64_t` for workgroup size
and workgroup count calculations. This change prevents overflow
issues when handling large tensor sizes.
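A minimal sketch of the overflow this guards against (illustrative numbers only; the real computation lives in `src/ATen/native/xpu/sycl/Loops.h`): with 32-bit arithmetic, the element count of a sufficiently large tensor wraps around before the workgroup count is computed, while `int64_t` keeps it correct.
```cpp
#include <cstdint>
#include <iostream>

int main() {
  const int64_t numel = (int64_t{1} << 31) + 1024;  // just past INT32_MAX
  const int64_t wg_size = 512;

  // If the element count were stored in 32 bits, it would wrap to a negative
  // value and produce a nonsensical workgroup count.
  const int32_t numel32 = static_cast<int32_t>(numel);
  const int32_t groups32 = (numel32 + 511) / 512;

  // With 64-bit arithmetic the count stays correct.
  const int64_t groups64 = (numel + wg_size - 1) / wg_size;

  std::cout << "32-bit: " << groups32 << ", 64-bit: " << groups64 << "\n";
  return 0;
}
```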
…2091)

Fixed the following issues found by
test/test_nn.py::TestNNDeviceTypeXPU::test_avg_pool_large_tensor2_xpu
1. A segmentation fault caused by a data type conversion error that
invalidated the memory address.
2. A calculation error caused by data overflow.

---------

Co-authored-by: Cui, Yifeng <[email protected]>
…#2036)

Fixes #1978 

In ProcessGroupNCCL, `globalRank()` returns a static int globalRank,
which is first initialized during ProcessGroup setup, so the globalRank
assigned to each thread matches the id of the CUDA device. However, we
were not using this same pattern for XCCL. Instead, we were just using
the assigned rank of the thread, which was not necessarily the same as
the globalRank.

The failing test `test_barrier` created two separate groups of 2 ranks
each, and then 4 threads called barrier, but all on the same 2-thread
group. Since the device id is not initially specified in this barrier
call, the thread attempts to "guess" the device index. In the previous
code, this guess would be 0 or 1, since the rank of each thread was not
actually the globalRank. In `barrier`, this guessed id was used to
initialize XCCLComm objects and then call allreduce on a single-element
tensor. However, this resulted in allreduce being called twice on each
device, which could hang depending on the execution order of the
4 threads.

With the update in this PR, PGXCCL now uses the static globalRank in the
same places as ProcessGroupNCCL, so the initialized XCCLComm objects are
for unique devices and allreduce is not called on the same device
multiple times.
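A small illustration of the device-index "guess" discussed above (the helper is hypothetical, not the actual ProcessGroupXCCL code): with 4 threads split into two 2-rank groups, the group-local ranks are {0, 1} in both groups, so two threads guess the same device, whereas the global ranks {0, 1, 2, 3} map every thread to a unique device.
```cpp
#include <iostream>

// Hypothetical fallback used when no device id is specified.
int guess_device_index(int rank, int num_devices) {
  return rank % num_devices;
}

int main() {
  const int num_devices = 4;
  // (global_rank, group_local_rank) for 4 threads in two 2-rank groups.
  const int threads[4][2] = {{0, 0}, {1, 1}, {2, 0}, {3, 1}};
  for (const auto& t : threads) {
    std::cout << "global rank " << t[0]
              << ": guess from local rank = " << guess_device_index(t[1], num_devices)
              << ", guess from global rank = " << guess_device_index(t[0], num_devices)
              << "\n";
  }
  return 0;
}
```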
As a follow-up to #1867, this PR includes tests for the FlightRecorder
on XCCL, and moves some definitions from
ProcessGroupXCCL::Options to Backend::Options.

These tests are largely based on
`pytorch/test/distributed/test_c10d_nccl.py`, but don't include some
tests:
- `test_short_json` since json dumps are not supported in
ProcessGroupXCCL
- `test_trace_while_all_works_retired`: `_wait_for_pending_works` isn't
supported by XCCL
- `test_trace_while_active`: XCCL hangs when op is called on only one
rank
- `test_trace_while_stuck`: XCCL hangs when op is called on only one
rank

---------

Co-authored-by: Yu, Guangye <[email protected]>
Observed that recompilations are triggered by the install_xpu_headers.py
script updating files. It turns out that the script does not change the
files in any way, but rewriting the same content into the files updates
their timestamps, causing multiple dependent files to recompile.

This PR makes sure that `install_xpu_headers.py` changes or creates
files only when the content should change. This speeds up
recompilation several times over; in my experience, from a few minutes to a few
seconds.


This fixes: #2093
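
The idea, sketched below in C++ for illustration (the actual script is Python, and the helper name is hypothetical): compare the new content against what is already on disk and skip the write when they match, so timestamps stay untouched and dependents are not rebuilt.
```cpp
#include <fstream>
#include <sstream>
#include <string>

// Write `content` to `path` only if it differs from what is already there,
// leaving the file's timestamp alone when nothing changed.
void write_if_changed(const std::string& path, const std::string& content) {
  std::ifstream in(path, std::ios::binary);
  if (in) {
    std::ostringstream current;
    current << in.rdbuf();
    if (current.str() == content) {
      return;  // identical content: keep the old mtime, no recompilation
    }
  }
  std::ofstream out(path, std::ios::binary | std::ios::trunc);
  out << content;
}
```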

---------

Co-authored-by: Pawel Swider <[email protected]>
@fengyuan14 fengyuan14 changed the title Rebase/sycl free func Rebase dev/sycl-free-func Oct 9, 2025
@fengyuan14 fengyuan14 requested a review from tye1 October 9, 2025 05:09