Rebase dev/sycl-free-func #2141
Open
fengyuan14 wants to merge 28 commits into dev/sycl-free-func from rebase/sycl-free-func
Conversation
1. Fix torchbench issues caused by dependency installation. 2. Fix the pt2e test dataset path and dependency installation. 3. Fix op benchmark GitHub access permissions. 4. Enhance bisect search. disable_distributed
This PR intends to add some ported distributed cases to torch-xpu-ops CI. - Add ZE_AFFINITY_MASK to ensure Xelink is used. - Add CCL_ROOT for Xelink; this workaround can be removed after oneCCL is upgraded to 2021.16.2. - Increase the distributed test time limit. Currently, the test part needs about 1 hour after adding the ported cases. disable_e2e disable_ut
Support high-priority streams for XCCL; a test case is added in #2049. We need to merge this PR first, together with the upstream op registration pytorch/pytorch#163049, and then the test case can pass. --------- Co-authored-by: mengfei25 <[email protected]>
- Early return if the input to nonzero has no elements, - Change PyTorch `MAX_DIMS` to its XPU equivalent `XPU_MAX_TENSORINFO_DIMS`, - Rename `tensor` to `out` to match the op schema. A sketch of the early return is shown below. --------- Co-authored-by: Tadeusz Krawiec <[email protected]>
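The early return described in the first bullet can be sketched as follows; the function name and surrounding launch code are assumptions for illustration, not the exact torch-xpu-ops diff.

```cpp
#include <ATen/ATen.h>

// Minimal sketch, assuming a hypothetical out-variant wrapper around the XPU
// nonzero kernel. An empty input has no nonzero elements, so we can produce an
// empty (0 x ndim) index tensor and skip the SYCL kernel launch entirely.
at::Tensor& nonzero_out_sketch(const at::Tensor& self, at::Tensor& out) {
  if (self.numel() == 0) {
    out.resize_({0, self.dim()});
    return out;
  }
  // ... otherwise launch the SYCL nonzero kernel and fill `out` ...
  return out;
}
```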
…tion library (#2048) Co-authored-by: Chao Han <[email protected]>
# Motivation Clean up `getDeviceIndexOfCurrentQueue` since it duplicates `current_device` and provides no additional functionality.
disable_ut disable_e2e disable_distributed
Some versions of the DPC++ compiler pass paths to SYCL headers as user include paths (`-I`) rather than system paths (`-isystem`). This makes the host compiler report warnings encountered in the SYCL headers, such as deprecation warnings, even if the warned-about API is not actually used in the program. We expect this issue to be addressed in a later version of the DPC++ compiler. To work around the issue, we wrap the paths to SYCL headers in `-isystem`. disable_ut disable_e2e disable_distributed CC: @EikanWang @chuanqi129 Signed-off-by: Dmitry Rogozhkin <[email protected]>
Add the new LLM model dependency `accelerate`. disable_ut disable_distributed
Fixes #1737 and the flaky backward accuracy issue uncovered by fixing the forward pass, using `sycl_global_and_local_fence` to synchronize data in global memory (`log_alpha_data_`).
- Remove unused variables, - Add a compiler flag to prevent this in the future
# Motivation To resolve #2034, expose all xpu internal headers to PyTorch.
This PR fixes the access to the exception vector in error handling. The original implementation may cause a segmentation fault due to out-of-bounds access.
- Add a template for dynamic skips, which makes it easier to create issues that require skipping. disable_all --------- Co-authored-by: Wang, Chuanqi <[email protected]>
1. Use `timm_nfnet` for the CI test instead of `timm_regnet`, which has known accuracy issue #1334. 2. Fix the on-demand test issue. disable_build disable_ut disable_distributed
# Motivation The original issue occurs on some old iGPU running the following code on Windows: ```python import torch import torch.nn.functional as F print(torch.xpu.get_device_properties()) arr = torch.rand(1, 2, 5, 5, device='xpu') pts = torch.rand(1, 3, 3, 2, device='xpu') out = F.grid_sample(arr, pts, align_corners=False) ``` The failure output is: ```bash Traceback (most recent call last): File "C:\Vesuvius\urerr\urerr.py", line 22, in <module> out = F.grid_sample(arr, pts, align_corners=False) File "C:\Anaconda3\envs\pytn\Lib\site-packages\torch\nn\functional.py", line 5118, in grid_sample return torch.grid_sampler(input, grid, mode_enum, padding_mode_enum, align_corners) ~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ RuntimeError: UR error ``` The Driver team analysis located the crash in the generated spriv IR code. ```asm ; Function Attrs: nounwind define internal spir_func float @_ZN2at6native3xpuL19compute_coordinatesIfEET_S3_iNS0_6detail18GridSamplerPaddingEb(float %0, i32 %1, i32 %2, i1 zeroext %3) #0 !spirv.ParameterDecorations !393 { %5 = alloca double, align 8, !spirv.Decorations !394 switch i32 %2, label %58 [ i32 1, label %6 i32 2, label %14 ] 6: ; preds = %4 %7 = sext i32 %1 to i64 %8 = add nsw i64 %7, -1, !spirv.Decorations !387 %9 = sitofp i64 %8 to float %10 = fcmp olt float %0, 0.000000e+00 %11 = select i1 %10, float 0.000000e+00, float %0 %12 = fcmp olt float %11, %9 %13 = select i1 %12, float %11, float %9 br label %58 14: ; preds = %4 br i1 %3, label %15, label %32 15: ; preds = %14 %16 = shl i32 %1, 1 %17 = add i32 %16, -2 %18 = icmp eq i32 %17, 0 br i1 %18, label %49, label %19 19: ; preds = %15 %20 = sitofp i32 %17 to float %21 = fmul float %20, 5.000000e-01 %22 = call spir_func float @_Z16__spirv_ocl_fabsf(float %0) #0 %23 = call spir_func float @fmodf(float %22, float %21) #3 %24 = fdiv float %22, %21 %25 = call spir_func float @_Z17__spirv_ocl_floorf(float %24) #0 %26 = fptosi float %25 to i32 %27 = and i32 %26, 1 %28 = icmp eq i32 %27, 0 %29 = fsub float %21, %23 %30 = select i1 %28, float %23, float %29 %31 = fadd float %30, 0.000000e+00 br label %49 32: ; preds = %14 %33 = icmp eq i32 %1, 0 br i1 %33, label %49, label %34 34: ; preds = %32 %35 = shl nsw i32 %1, 1, !spirv.Decorations !387 %36 = sitofp i32 %35 to float %37 = fmul float %36, 5.000000e-01 %38 = fadd float %0, 5.000000e-01 %39 = call spir_func float @_Z16__spirv_ocl_fabsf(float %38) #0 %40 = call spir_func float @fmodf(float %39, float %37) #3 %41 = fdiv float %39, %37 %42 = call spir_func float @_Z17__spirv_ocl_floorf(float %41) #0 %43 = fptosi float %42 to i32 %44 = and i32 %43, 1 %45 = icmp eq i32 %44, 0 %46 = fsub float %37, %40 %47 = select i1 %45, float %40, float %46 %48 = fadd float %47, -5.000000e-01 br label %49 49: ; preds = %34, %32, %19, %15 %50 = phi float [ %31, %19 ], [ 0.000000e+00, %15 ], [ %48, %34 ], [ 0.000000e+00, %32 ] %51 = sext i32 %1 to i64 %52 = add nsw i64 %51, -1, !spirv.Decorations !387 %53 = sitofp i64 %52 to float %54 = fcmp olt float %50, 0.000000e+00 %55 = select i1 %54, float 0.000000e+00, float %50 %56 = fcmp olt float %55, %53 %57 = select i1 %56, float %55, float %53 br label %58 58: ; preds = %49, %6, %4 %59 = phi float [ %13, %6 ], [ %57, %49 ], [ %0, %4 ] %60 = fptosi float %59 to i64 %61 = icmp sgt i64 %60, 2147483646 %62 = fcmp olt float %59, 0xC1E0000000000000 %63 = or i1 %61, %62 br i1 %63, label %72, label %64 64: ; preds = %58 %65 = fpext float %59 to double %66 = bitcast double* %5 to i8* call void 
@llvm.lifetime.start.p0i8(i64 8, i8* %66) %67 = addrspacecast double* %5 to double addrspace(4)* store double %65, double* %5, align 8 %68 = call spir_func signext i16 @_dtest(double addrspace(4)* %67) #3 %69 = bitcast double* %5 to i8* call void @llvm.lifetime.end.p0i8(i64 8, i8* %69) %70 = icmp slt i16 %68, 1 %71 = select i1 %70, float %59, float -1.000000e+02 br label %72 72: ; preds = %64, %58 %73 = phi float [ %71, %64 ], [ -1.000000e+02, %58 ] ret float %73 } ``` We can see that the SPIR-V IR uses a double and calls the @_dtest function in block 64. According to the [MSVC documentation](https://learn.microsoft.com/en-us/cpp/c-runtime-library/reference/floating-point-primitives?view=msvc-170#_dtest-_ldtest-_fdtest), _dtest is used to detect whether a number is `NaN` or `INFINITE`. This allows us to locate the root cause of the crash, which corresponds to the following C++ logic: ```cpp if (static_cast<int64_t>(x) > INT_MAX - 1 || x < INT_MIN || !std::isfinite(static_cast<double>(x))) return static_cast<scalar_t>(-100.0); return x; ``` In other words, the crash occurs when the GPU executes code that converts a floating-point value (Half or BFloat16) to a double and checks whether it is finite. # Solution - For Half and BFloat16, `std::isfinite(x)` promotes `x` to `float`, which provides enough precision for the finiteness check. Casting to double is redundant and can be safely removed. - Explicitly return `-100.0f` instead of a double literal. A sketch of the corrected check is shown below. # Additional Context I couldn't find an iGPU on which to verify the fix, but it is unlikely to introduce any additional error.
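A minimal sketch of the corrected clipping logic, assuming a hypothetical helper name; only the shape of the check mirrors the solution described above.

```cpp
#include <climits>
#include <cmath>
#include <cstdint>

// Minimal sketch with an assumed name. For Half/BFloat16 inputs,
// std::isfinite on a float promotion is already precise enough, so the kernel
// no longer needs a double (and thus no device-side _dtest call).
template <typename scalar_t>
scalar_t clip_or_flag_sketch(scalar_t x) {
  if (static_cast<int64_t>(x) > INT_MAX - 1 || x < INT_MIN ||
      !std::isfinite(static_cast<float>(x))) {
    return static_cast<scalar_t>(-100.0f);  // float literal, no double involved
  }
  return x;
}
```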
To fix #2070. This PR updates several SYCL kernel launch functions in `src/ATen/native/xpu/sycl/Loops.h` to use `int64_t` for workgroup size and number of workgroups calculations. This change prevents overflow issues when handling large tensor sizes.
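As a rough illustration of the overflow being guarded against, here is a hedged sketch of a launch-size calculation carried out in `int64_t`; the function name is hypothetical and this is not the actual Loops.h code.

```cpp
#include <cstdint>

// Minimal sketch with an assumed name. In 32-bit arithmetic,
// numel + wg_size - 1 can overflow for very large tensors, yielding a wrong
// (or negative) workgroup count; doing the math in int64_t avoids that.
int64_t num_workgroups_sketch(int64_t numel, int64_t wg_size) {
  return (numel + wg_size - 1) / wg_size;  // ceiling division without overflow
}
```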
…2091) Fixed the following issues found by test/test_nn.py::TestNNDeviceTypeXPU::test_avg_pool_large_tensor2_xpu: 1. A segmentation fault caused by a data type conversion error that invalidated the memory address. 2. A calculation error caused by data overflow. --------- Co-authored-by: Cui, Yifeng <[email protected]>
…#2036) Fixes #1978 In ProcessGroupNCCL, `globalRank()` returns a static int globalRank, which is first initialized by the ProcessGroup setup, so the globalRank assigned to each thread matches the id of the CUDA device. However, we were not using this same pattern for XCCL. Instead, we were just using the assigned rank of the thread, which was not necessarily the same as the globalRank. The failing test `test_barrier` created two separate groups of 2 ranks each, and then 4 threads called barrier, but all on the same 2-thread group. Since the device id is not initially specified in this barrier call, the thread attempts to "guess" the device index. In the previous code, this guess would be 0 or 1, since the rank of each thread was not actually the globalRank. In `barrier`, this guessed id was used to initialize XCCLComm objects and then to call allreduce on a single-element tensor. However, this resulted in allreduce being called twice on each device, which could cause a hang depending on the execution order of the 4 threads. With the update in this PR, PGXCCL now uses the static globalRank in the same places as ProcessGroupNCCL, so the initialized XCCLComm objects are for unique devices and allreduce is not called on the same device multiple times. A minimal sketch of the device-index guess is shown below.
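A minimal sketch of the device-index guess mentioned above, assuming hypothetical names; the real logic lives in ProcessGroupXCCL/ProcessGroupNCCL.

```cpp
// Minimal sketch with assumed names. When barrier() is called without an
// explicit device id, the backend has to guess one. Deriving the guess from
// the process-global rank (as ProcessGroupNCCL does) keeps the guessed device
// unique per thread; using the per-group rank can map threads from different
// groups onto the same device and trigger duplicate allreduce calls.
int guess_device_index_sketch(int global_rank, int device_count) {
  return global_rank % device_count;
}
```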
As a follow-up to #1867, this PR adds tests for the FlightRecorder on XCCL and moves some definitions from ProcessGroupXCCL::Options to Backend::Options. The tests are largely based on `pytorch/test/distributed/test_c10d_nccl.py`, but some tests are not included: - `test_short_json`, since JSON dumps are not supported in ProcessGroupXCCL - `test_trace_while_all_works_retired`: `_wait_for_pending_works` isn't supported by XCCL - `test_trace_while_active`: XCCL hangs when an op is called on only one rank - `test_trace_while_stuck`: XCCL hangs when an op is called on only one rank --------- Co-authored-by: Yu, Guangye <[email protected]>
Observed that recompilations are triggered when files are updated by the install_xpu_headers.py script. It turns out the script does not change the files in any way, but rewriting the same content into them updates their timestamps, causing many dependent files to recompile. This PR makes sure that `install_xpu_headers.py` changes or creates files only when their content should change. In my experience this speeds up recompilation several times over, from a few minutes to a few seconds. A sketch of the write-only-if-changed pattern is shown below. This fixes: #2093 --------- Co-authored-by: Pawel Swider <[email protected]>
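The script itself is Python, but the write-only-if-changed idea is language-agnostic; below is a hedged C++ sketch of the same pattern with an assumed helper name, not the actual install_xpu_headers.py code.

```cpp
#include <fstream>
#include <sstream>
#include <string>

// Minimal sketch with an assumed name: only rewrite the file (and bump its
// timestamp) when the content actually differs, so that build systems do not
// needlessly recompile everything that depends on it.
void write_if_changed_sketch(const std::string& path, const std::string& content) {
  std::ifstream in(path);
  if (in) {
    std::ostringstream existing;
    existing << in.rdbuf();
    if (existing.str() == content) {
      return;  // identical content: leave the file untouched
    }
  }
  std::ofstream out(path);
  out << content;
}
```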