Atomic Exchange CUDA Error

### Describe the bug

Hey! I am learning to use SYCL but I encountered a little issue when using `sycl::atomic_ref::exchange`. Things work fine on CPU, but when I switched to GPU even a very simple test (see below) crash with a CUDA error. Other atomic primitives such as `store` or `load` works fine.

### To reproduce

1. Include code snippet as short as possible

```cpp
#include <sycl.hpp>

int main()
{
    sycl::queue queue(sycl::gpu_selector_v);
    std::cout << "Device: " << queue.get_device().get_info<sycl::info::device::name>() << std::endl;

    queue.submit([&](sycl::handler& diana)
    {
        sycl::stream out(1024, 256, diana);

        diana.parallel_for(1, [=](sycl::id<> id)
        {
            int memory = 3;
            sycl::atomic_ref<int,
                sycl::memory_order::relaxed,
                sycl::memory_scope::work_item> at(memory);

            int load = at.exchange(123);
            out << "id " << id << " load " << load << sycl::endl;
        });
    });

    queue.wait_and_throw();
}
```

2. Specify the command which should be used to compile the program

```sh
icpx -fsycl -fsycl-targets=nvptx64-nvidia-cuda main.cpp 
```

3. Specify the command which should be used to launch the program

```sh
./a.out
```

4. Indicate what is wrong and what was expected

This is my output; obviously it crashes which is not what one would expect.

```
Device: NVIDIA GeForce RTX 4090
<CUDA>[ERROR]: 
UR CUDA ERROR:
        Value:           719
        Name:            CUDA_ERROR_LAUNCH_FAILED
        Description:     unspecified launch failure
        Function:        urEnqueueMemBufferRead
        Source Location: /tmp/tmp.nlKu2FwFq5/intel-llvm-mirror/build/_deps/unified-runtime-src/source/adapters/cuda/enqueue.cpp:1777

terminate called after throwing an instance of 'sycl::_V1::exception'
  what():  Native API failed. Native API returns: 2147483646 (UR_RESULT_ERROR_UNKNOWN)
Aborted (core dumped)
```

### Environment

- OS: Fedora Linux 40 x86_64 6.11.6-200.fc40.x86_64
- Target device and vendor: NVIDIA GeForce RTX 4090
- DPC++ version: `icpx --version` output:

```
Intel(R) oneAPI DPC++/C++ Compiler 2025.0.0 (2025.0.0.20241008)
Target: x86_64-unknown-linux-gnu
Thread model: posix
InstalledDir: /opt/intel/oneapi/compiler/2025.0/bin/compiler
Configuration file: /opt/intel/oneapi/compiler/2025.0/bin/compiler/../icpx.cfg
```

- Dependencies version: Header of `nvidia-smi`:

```
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 560.35.03              Driver Version: 560.35.03      CUDA Version: 12.6     |
|-----------------------------------------+------------------------+----------------------+
```

And output of `sycl-ls --verbose`:
```
[opencl:cpu][opencl:0] Intel(R) OpenCL, AMD Ryzen 9 3900X 12-Core Processor             OpenCL 3.0 (Build 0) [2024.18.10.0.08_160000]
[cuda:gpu][cuda:0] NVIDIA CUDA BACKEND, NVIDIA GeForce RTX 4090 8.9 [CUDA 12.6]

Platforms: 2
Platform [#1]:
    Version  : OpenCL 3.0 LINUX
    Name     : Intel(R) OpenCL
    Vendor   : Intel(R) Corporation
    Devices  : 1
        Device [#0]:
        Type              : cpu
        Version           : OpenCL 3.0 (Build 0)
        Name              : AMD Ryzen 9 3900X 12-Core Processor            
        Vendor            : Intel(R) Corporation
        Driver            : 2024.18.10.0.08_160000
        Num SubDevices    : 0
        Num SubSubDevices : 0
        Aspects           : cpu fp16 fp64 online_compiler online_linker queue_profiling usm_device_allocations usm_host_allocations usm_shared_allocations usm_system_allocations usm_atomic_host_allocations usm_atomic_shared_allocations atomic64 ext_oneapi_srgb ext_oneapi_native_assert ext_intel_legacy_image ext_oneapi_ballot_group ext_oneapi_fixed_size_group ext_oneapi_opportunistic_group ext_oneapi_tangle_group ext_oneapi_private_alloca
        info::device::sub_group_sizes: 4 8 16 32 64
        Architecture: x86_64
Platform [#2]:
    Version  : CUDA 12.6
    Name     : NVIDIA CUDA BACKEND
    Vendor   : NVIDIA Corporation
    Devices  : 1
        Device [#0]:
        Type              : gpu
        Version           : 8.9
        Name              : NVIDIA GeForce RTX 4090
        Vendor            : NVIDIA Corporation
        Driver            : CUDA 12.6
        UUID              : 1367131105491041301142711512019110415220878
        Num SubDevices    : 0
        Num SubSubDevices : 0
        Aspects           : gpu fp16 fp64 online_compiler online_linker queue_profiling usm_device_allocations usm_host_allocations usm_shared_allocations ext_intel_pci_address usm_atomic_shared_allocations atomic64 ext_intel_device_info_uuid ext_oneapi_cuda_async_barrier ext_intel_free_memory ext_intel_device_id ext_intel_memory_clock_rate ext_intel_memory_bus_widthImages are not fully supported by the CUDA BE, their support is disabled by default. Their partial support can be activated by setting SYCL_PI_CUDA_ENABLE_IMAGE_SUPPORT environment variable at runtime.
 ext_oneapi_bindless_images ext_oneapi_bindless_images_shared_usm ext_oneapi_bindless_images_1d_usm ext_oneapi_bindless_images_2d_usm ext_oneapi_external_memory_import ext_oneapi_external_semaphore_import ext_oneapi_mipmap ext_oneapi_mipmap_anisotropy ext_oneapi_mipmap_level_reference ext_oneapi_ballot_group ext_oneapi_fixed_size_group ext_oneapi_opportunistic_group ext_oneapi_graph ext_oneapi_limited_graph ext_oneapi_cubemap ext_oneapi_cubemap_seamless_filtering ext_oneapi_bindless_sampled_image_fetch_1d_usm ext_oneapi_bindless_sampled_image_fetch_2d_usm ext_oneapi_bindless_sampled_image_fetch_2d ext_oneapi_bindless_sampled_image_fetch_3d ext_oneapi_queue_profiling_tag ext_oneapi_virtual_mem ext_oneapi_image_array ext_oneapi_unique_addressing_per_dim ext_oneapi_bindless_images_sample_1d_usm ext_oneapi_bindless_images_sample_2d_usm
        info::device::sub_group_sizes: 32
        Architecture: nvidia_gpu_sm_89
default_selector()      : gpu, NVIDIA CUDA BACKEND, NVIDIA GeForce RTX 4090 8.9 [CUDA 12.6]
accelerator_selector()  : No device of requested type available. Please chec...
cpu_selector()          : cpu, Intel(R) OpenCL, AMD Ryzen 9 3900X 12-Core Processor             OpenCL 3.0 (Build 0) [2024.18.10.0.08_160000]
gpu_selector()          : gpu, NVIDIA CUDA BACKEND, NVIDIA GeForce RTX 4090 8.9 [CUDA 12.6]
custom_selector(gpu)    : gpu, NVIDIA CUDA BACKEND, NVIDIA GeForce RTX 4090 8.9 [CUDA 12.6]
custom_selector(cpu)    : cpu, Intel(R) OpenCL, AMD Ryzen 9 3900X 12-Core Processor             OpenCL 3.0 (Build 0) [2024.18.10.0.08_160000]
custom_selector(acc)    : No device of requested type available. Please chec...

```


### Additional context

_No response_

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Atomic Exchange CUDA Error #16037

Describe the bug

To reproduce

Environment

Additional context

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Atomic Exchange CUDA Error #16037

Description

Describe the bug

To reproduce

Environment

Additional context

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions