
Conversation

fengyuan14
Contributor

No description provided.

mengfei25 and others added 27 commits September 16, 2025 01:38
1. Fix torchbench issues caused by deps installation
2. Fix pt2e test dataset path and deps installation
3. Fix op benchmark GitHub access permission
4. Enhance bisect search

disable_distributed
This PR intends to add some ported distributed cases to torch-xpu-ops
CI.
- Add ZE_AFFINITY_MASK to ensure Xelink is used.
- Add CCL_ROOT for Xelink; this workaround can be removed after the oneCCL upgrade
to 2021.16.2.
- Increase the distributed test time limit. Currently, the test part needs
about 1 hour after adding the ported cases.

disable_e2e
disable_ut
Support high-priority streams for xccl; the test case is added in
#2049.
We need to merge this PR first, together with the upstream op registration
pytorch/pytorch#163049, before the test case can
pass.

---------

Co-authored-by: mengfei25 <[email protected]>
- Early return if the input to nonzero has no elements,
- Change PyTorch `MAX_DIMS` to its XPU equivalent `XPU_MAX_TENSORINFO_DIMS`,
- Rename `tensor` to `out` to match the op schema.

---------

Co-authored-by: Tadeusz Krawiec <[email protected]>
# Motivation
Clean up `getDeviceIndexOfCurrentQueue` since it duplicates
`current_device` and provides no additional functionality.
disable_ut
disable_e2e
disable_distributed
…#2076)

The callback used to track work status in ProcessGroupXCCL was
causing an unintended memory leak by keeping the work objects, and
therefore the stashed tensors, alive. For now, I'm removing the callback and I
have added a unit test to ensure this memory leak doesn't return.
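
As a rough illustration of the leak pattern described above (the names below are hypothetical stand-ins, not the actual ProcessGroupXCCL code): a status-tracking callback that captures the work object keeps it, and any tensors stashed in it, alive for as long as the callback itself is stored.
```cpp
#include <functional>
#include <memory>
#include <vector>

// Stand-in for a work object that stashes tensor storage until completion.
struct Work {
  std::vector<float> stashed_tensors;
};

int main() {
  std::vector<std::function<void()>> callbacks;  // long-lived callback registry

  auto work = std::make_shared<Work>();
  work->stashed_tensors.resize(1 << 20);

  // The callback copies the shared_ptr, so the Work (and its stashed tensors)
  // cannot be freed while the registry still holds the callback.
  callbacks.push_back([work] { /* track work status */ });

  work.reset();  // the caller drops its reference...
  // ...but the storage is still alive: the stored callback owns a reference.
  return 0;
}
```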

Fix #2084
Some versions of the DPC++ compiler pass paths to SYCL headers as user
include paths (`-I`) rather than system paths (`-isystem`). This makes the
host compiler report warnings encountered in the SYCL headers, such
as deprecation warnings, even if the warned-about API is not actually used in the
program. We expect this issue to be addressed in a later
version of the DPC++ compiler. To work around it, we wrap the paths to SYCL
headers in `-isystem`.
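
A small, hypothetical illustration of the difference (neither the header nor the flags below come from this repository): with GCC or Clang, warnings that originate in headers found via `-isystem` are suppressed by default, while the same headers found via `-I` still produce them.
```cpp
// include/legacy.h (hypothetical header standing in for a SYCL header):
//   [[deprecated("use new_api() instead")]] inline int old_api() { return 1; }
//   inline int wrapper() { return old_api(); }  // deprecation warning originates here
//
// main.cpp:
#include <legacy.h>

int main() { return wrapper(); }

// g++ -Wall -I include main.cpp        -> warns about old_api() used inside legacy.h
// g++ -Wall -isystem include main.cpp  -> no warning: legacy.h is treated as a
//                                         system header, and warnings originating
//                                         in system headers are suppressed
```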

disable_ut
disable_e2e
disable_distributed

CC: @EikanWang @chuanqi129

Signed-off-by: Dmitry Rogozhkin <[email protected]>
Add the new LLM model dependency `accelerate`

disable_ut
disable_distributed
Fixes #1737 and the flaky backward accuracy issue uncovered by fixing
the forward pass, using sycl_global_and_local_fence to sync data in
global memory (log_alpha_data_).
- Remove unused variables,
- Add a compiler flag to prevent this in the future.
# Motivation
To resolve #2034, expose
all xpu internal headers to PyTorch.
This PR fixes access to the exception vector in error handling. The
original implementation could cause a segmentation fault due to
out-of-bounds access.
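A minimal sketch of the guarded-access pattern implied by this fix (illustrative names only, not the actual error-handling code): check the index against the vector size before dereferencing, instead of indexing unconditionally.
```cpp
#include <string>
#include <vector>

// Return the message for an error code, falling back to a generic message
// instead of reading past the end of the vector.
std::string message_for(const std::vector<std::string>& messages, size_t code) {
  if (code < messages.size()) {
    return messages[code];
  }
  return "unknown error code " + std::to_string(code);
}
```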
- Add a template for dynamic skips, making it easier to create issues that
require skipping

disable_all

---------

Co-authored-by: Wang, Chuanqi <[email protected]>
1. Use `timm_nfnet` for the CI test instead of `timm_regnet`, which has a
known accuracy issue (#1334).
2. Fix the on-demand test issue.

disable_build
disable_ut
disable_distributed
# Motivation
The original issue occurs on some old iGPUs when running the following code on
Windows:
```python
import torch
import torch.nn.functional as F

print(torch.xpu.get_device_properties())
arr = torch.rand(1, 2, 5, 5, device='xpu')
pts = torch.rand(1, 3, 3, 2, device='xpu')
out = F.grid_sample(arr, pts, align_corners=False)
```
The failure output is:
```bash
Traceback (most recent call last):
  File "C:\Vesuvius\urerr\urerr.py", line 22, in <module>
    out = F.grid_sample(arr, pts, align_corners=False)
  File "C:\Anaconda3\envs\pytn\Lib\site-packages\torch\nn\functional.py", line 5118, in grid_sample
    return torch.grid_sampler(input, grid, mode_enum, padding_mode_enum, align_corners)
           ~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: UR error
```

The Driver team's analysis located the crash in the generated SPIR-V IR
code.
```asm
; Function Attrs: nounwind
define internal spir_func float @_ZN2at6native3xpuL19compute_coordinatesIfEET_S3_iNS0_6detail18GridSamplerPaddingEb(float %0, i32 %1, i32 %2, i1 zeroext %3) #0 !spirv.ParameterDecorations !393 {
  %5 = alloca double, align 8, !spirv.Decorations !394
  switch i32 %2, label %58 [
    i32 1, label %6
    i32 2, label %14
  ]

6:                                                ; preds = %4
  %7 = sext i32 %1 to i64
  %8 = add nsw i64 %7, -1, !spirv.Decorations !387
  %9 = sitofp i64 %8 to float
  %10 = fcmp olt float %0, 0.000000e+00
  %11 = select i1 %10, float 0.000000e+00, float %0
  %12 = fcmp olt float %11, %9
  %13 = select i1 %12, float %11, float %9
  br label %58

14:                                               ; preds = %4
  br i1 %3, label %15, label %32

15:                                               ; preds = %14
  %16 = shl i32 %1, 1
  %17 = add i32 %16, -2
  %18 = icmp eq i32 %17, 0
  br i1 %18, label %49, label %19

19:                                               ; preds = %15
  %20 = sitofp i32 %17 to float
  %21 = fmul float %20, 5.000000e-01
  %22 = call spir_func float @_Z16__spirv_ocl_fabsf(float %0) #0
  %23 = call spir_func float @fmodf(float %22, float %21) #3
  %24 = fdiv float %22, %21
  %25 = call spir_func float @_Z17__spirv_ocl_floorf(float %24) #0
  %26 = fptosi float %25 to i32
  %27 = and i32 %26, 1
  %28 = icmp eq i32 %27, 0
  %29 = fsub float %21, %23
  %30 = select i1 %28, float %23, float %29
  %31 = fadd float %30, 0.000000e+00
  br label %49

32:                                               ; preds = %14
  %33 = icmp eq i32 %1, 0
  br i1 %33, label %49, label %34

34:                                               ; preds = %32
  %35 = shl nsw i32 %1, 1, !spirv.Decorations !387
  %36 = sitofp i32 %35 to float
  %37 = fmul float %36, 5.000000e-01
  %38 = fadd float %0, 5.000000e-01
  %39 = call spir_func float @_Z16__spirv_ocl_fabsf(float %38) #0
  %40 = call spir_func float @fmodf(float %39, float %37) #3
  %41 = fdiv float %39, %37
  %42 = call spir_func float @_Z17__spirv_ocl_floorf(float %41) #0
  %43 = fptosi float %42 to i32
  %44 = and i32 %43, 1
  %45 = icmp eq i32 %44, 0
  %46 = fsub float %37, %40
  %47 = select i1 %45, float %40, float %46
  %48 = fadd float %47, -5.000000e-01
  br label %49

49:                                               ; preds = %34, %32, %19, %15
  %50 = phi float [ %31, %19 ], [ 0.000000e+00, %15 ], [ %48, %34 ], [ 0.000000e+00, %32 ]
  %51 = sext i32 %1 to i64
  %52 = add nsw i64 %51, -1, !spirv.Decorations !387
  %53 = sitofp i64 %52 to float
  %54 = fcmp olt float %50, 0.000000e+00
  %55 = select i1 %54, float 0.000000e+00, float %50
  %56 = fcmp olt float %55, %53
  %57 = select i1 %56, float %55, float %53
  br label %58

58:                                               ; preds = %49, %6, %4
  %59 = phi float [ %13, %6 ], [ %57, %49 ], [ %0, %4 ]
  %60 = fptosi float %59 to i64
  %61 = icmp sgt i64 %60, 2147483646
  %62 = fcmp olt float %59, 0xC1E0000000000000
  %63 = or i1 %61, %62
  br i1 %63, label %72, label %64

64:                                               ; preds = %58
  %65 = fpext float %59 to double
  %66 = bitcast double* %5 to i8*
  call void @llvm.lifetime.start.p0i8(i64 8, i8* %66)
  %67 = addrspacecast double* %5 to double addrspace(4)*
  store double %65, double* %5, align 8
  %68 = call spir_func signext i16 @_dtest(double addrspace(4)* %67) #3
  %69 = bitcast double* %5 to i8*
  call void @llvm.lifetime.end.p0i8(i64 8, i8* %69)
  %70 = icmp slt i16 %68, 1
  %71 = select i1 %70, float %59, float -1.000000e+02
  br label %72

72:                                               ; preds = %64, %58
  %73 = phi float [ %71, %64 ], [ -1.000000e+02, %58 ]
  ret float %73
}
```
We can see that the SPIR-V IR code uses a double type and calls the @_dtest
function in block 64. According to the [MSVC
documentation](https://learn.microsoft.com/en-us/cpp/c-runtime-library/reference/floating-point-primitives?view=msvc-170#_dtest-_ldtest-_fdtest),
_dtest is used to detect whether a number is `NaN` or `INFINITE`.
This allows us to locate the root cause of the crash, which corresponds
to the following C++ logic:
```cpp
if (static_cast<int64_t>(x) > INT_MAX - 1 || x < INT_MIN ||
      !std::isfinite(static_cast<double>(x)))
    return static_cast<scalar_t>(-100.0);
  return x;
```
In other words, the crash occurs when the GPU executes code that tries
to convert a floating-point value (Half or BFloat16) to a double and
checks whether it is finite.

# Solution
- For Half and BFloat16, `std::isfinite(x)` promotes `x` to `float`,
providing enough precision for finiteness checks. Casting to double is
redundant and can safely be removed.
- Explicitly return `-100.0f` instead of a double literal.
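
A minimal sketch of the resulting logic (illustrative only, not the exact kernel code; the function name and signature are assumptions): the finiteness check stays in `float` and the sentinel is a `float` literal, so no double-precision instructions need to be emitted for Half/BFloat16 inputs.
```cpp
#include <climits>
#include <cmath>
#include <cstdint>

// scalar_t stands for Half/BFloat16, which are promoted to float before this
// check; the helper name is hypothetical.
template <typename scalar_t>
scalar_t clip_or_flag(float x) {
  if (static_cast<int64_t>(x) > INT_MAX - 1 || x < INT_MIN ||
      !std::isfinite(x)) {                  // float check, no cast to double
    return static_cast<scalar_t>(-100.0f);  // float literal, no double
  }
  return static_cast<scalar_t>(x);
}
```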

# Additional Context
I couldn't find an iGPU on which to verify the fix, but it is unlikely to
introduce any additional error.
To fix #2070.
This PR updates several SYCL kernel launch functions in
`src/ATen/native/xpu/sycl/Loops.h` to use `int64_t` for workgroup size
and workgroup count calculations. This change prevents overflow
issues when handling large tensor sizes.
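A minimal sketch of the overflow this guards against (illustrative numbers only; the real computation lives in `src/ATen/native/xpu/sycl/Loops.h`): with 32-bit arithmetic, the element count of a sufficiently large tensor wraps around before the workgroup count is computed, while `int64_t` keeps it correct.
```cpp
#include <cstdint>
#include <iostream>

int main() {
  const int64_t numel = (int64_t{1} << 31) + 1024;  // just past INT32_MAX
  const int64_t wg_size = 512;

  // If the element count were stored in 32 bits, it would wrap to a negative
  // value and produce a nonsensical workgroup count.
  const int32_t numel32 = static_cast<int32_t>(numel);
  const int32_t groups32 = (numel32 + 511) / 512;

  // With 64-bit arithmetic the count stays correct.
  const int64_t groups64 = (numel + wg_size - 1) / wg_size;

  std::cout << "32-bit: " << groups32 << ", 64-bit: " << groups64 << "\n";
  return 0;
}
```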
…2091)

Fixed the following issues found by
test/test_nn.py::TestNNDeviceTypeXPU::test_avg_pool_large_tensor2_xpu
1. A segmentation fault caused by a data type conversion error that
invalidated the memory address.
2. A calculation error caused by data overflow.

---------

Co-authored-by: Cui, Yifeng <[email protected]>
…#2036)

Fixes #1978 

In ProcessGroupNCCL, `globalRank()` returns a static int globalRank,
which is first initialized during ProcessGroup setup, so the globalRank
assigned to each thread matches the id of the CUDA device. However, we
were not using this same pattern for XCCL. Instead, we were just using
the assigned rank of the thread, which was not necessarily the same as
the globalRank.

The failing test `test_barrier` created two separate groups of 2 ranks
each, and then 4 threads called barrier, but all on the same 2-thread
group. Since the device id is not initially specified in this barrier
call, the thread attempts to "guess" the device index. In the previous
code, this guess would be 0 or 1, since the rank of each thread was not
actually the globalRank. In `barrier`, this guessed id was used to
initialize XCCLComm objects and then call allreduce on a single-element
tensor. However, this resulted in allreduce being called twice on each
device, which could hang depending on the execution order of the
4 threads.

With the update in this PR, PGXCCL now uses the static globalRank in the
same places as ProcessGroupNCCL, so the initialized XCCLComm objects are
for unique devices and allreduce is not called on the same device
multiple times.
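A small illustration of the device-index "guess" discussed above (the helper is hypothetical, not the actual ProcessGroupXCCL code): with 4 threads split into two 2-rank groups, the group-local ranks are {0, 1} in both groups, so two threads guess the same device, whereas the global ranks {0, 1, 2, 3} map every thread to a unique device.
```cpp
#include <iostream>

// Hypothetical fallback used when no device id is specified.
int guess_device_index(int rank, int num_devices) {
  return rank % num_devices;
}

int main() {
  const int num_devices = 4;
  // (global_rank, group_local_rank) for 4 threads in two 2-rank groups.
  const int threads[4][2] = {{0, 0}, {1, 1}, {2, 0}, {3, 1}};
  for (const auto& t : threads) {
    std::cout << "global rank " << t[0]
              << ": guess from local rank = " << guess_device_index(t[1], num_devices)
              << ", guess from global rank = " << guess_device_index(t[0], num_devices)
              << "\n";
  }
  return 0;
}
```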
As a follow-up to #1867, this PR includes tests for the FlightRecorder
on XCCL, and moves some definitions from
ProcessGroupXCCL::Options to Backend::Options.

These tests are largely based on
`pytorch/test/distributed/test_c10d_nccl.py`, but don't include some
tests:
- `test_short_json` since json dumps are not supported in
ProcessGroupXCCL
- `test_trace_while_all_works_retired`: `_wait_for_pending_works` isn't
supported by XCCL
- `test_trace_while_active`: XCCL hangs when op is called on only one
rank
- `test_trace_while_stuck`: XCCL hangs when op is called on only one
rank

---------

Co-authored-by: Yu, Guangye <[email protected]>
Observed that recompilations are triggered by the install_xpu_headers.py
script updating files. It turns out that the script does not change the
files in any way, but rewriting the same content into the files updates
their timestamps, causing multiple dependent files to recompile.

This PR makes sure that `install_xpu_headers.py` changes or creates
files only when the content should change. This speeds up
recompilation several times over; in my experience, from a few minutes to a few
seconds.


This fixes: #2093
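
The idea, sketched below in C++ for illustration (the actual script is Python, and the helper name is hypothetical): compare the new content against what is already on disk and skip the write when they match, so timestamps stay untouched and dependents are not rebuilt.
```cpp
#include <fstream>
#include <sstream>
#include <string>

// Write `content` to `path` only if it differs from what is already there,
// leaving the file's timestamp alone when nothing changed.
void write_if_changed(const std::string& path, const std::string& content) {
  std::ifstream in(path, std::ios::binary);
  if (in) {
    std::ostringstream current;
    current << in.rdbuf();
    if (current.str() == content) {
      return;  // identical content: keep the old mtime, no recompilation
    }
  }
  std::ofstream out(path, std::ios::binary | std::ios::trunc);
  out << content;
}
```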

---------

Co-authored-by: Pawel Swider <[email protected]>
@fengyuan14 fengyuan14 changed the title Rebase/sycl free func Rebase dev/sycl-free-func Oct 9, 2025
@fengyuan14 fengyuan14 requested a review from tye1 October 9, 2025 05:09