Skip to content

SIGBUS during memcpy when trying to use level_zero:gpu while opencl:gpu works #847

@mkottman

Description

@mkottman

I am trying to use llama.cpp with SYCL and when running with default settings I'm getting a "Bus error" (SIGBUS) when loading models:

$ ./bin/llama-bench -m models/phi-4-Q3_K_M.gguf
WARNING: Small BAR detected for device 0000:03:00.0
| model                          |       size |     params | backend    | ngl |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: |
Bus error                  (core dumped) ./bin/llama-bench -m models/phi-4-Q3_K_M.gguf

That is using the level_zero device by default. When using the OpenCL version using ONEAPI_DEVICE_SELECTOR the code works fine:

$ ONEAPI_DEVICE_SELECTOR=opencl:gpu ./bin/llama-bench -m models/phi-4-Q3_K_M.gguf
WARNING: Small BAR detected for device 0000:03:00.0
| model                          |       size |     params | backend    | ngl |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: |
| llama 13B Q3_K - Medium        |   6.69 GiB |    14.66 B | SYCL       |  99 |           pp512 |       333.50 ± 20.98 |
...

I'm aware of the "small bar" warning as I'm running on older hardware (Asus Z170-A + Intel(R) Core(TM) i7-7700K CPU @ 4.20GHz) with Arc A750:

$ sycl-ls
WARNING: Small BAR detected for device 0000:03:00.0
WARNING: Small BAR detected for device 0000:03:00.0
[level_zero:gpu][level_zero:0] Intel(R) oneAPI Unified Runtime over Level-Zero, Intel(R) Arc(TM) A750 Graphics 12.55.8 [1.6.34666]
[opencl:fpga][opencl:0] Intel(R) FPGA Emulation Platform for OpenCL(TM), Intel(R) FPGA Emulation Device OpenCL 1.2  [2024.18.12.0.05_160000]
[opencl:cpu][opencl:1] Intel(R) OpenCL, Intel(R) Core(TM) i7-7700K CPU @ 4.20GHz OpenCL 3.0 (Build 0) [2024.18.12.0.05_160000]
[opencl:gpu][opencl:2] Intel(R) OpenCL Graphics, Intel(R) Arc(TM) A750 Graphics OpenCL 3.0 NEO  [25.31.34666]

I'm using Arch Linux and the latest version of intel-compute-runtime I could find:

$ uname -a
Linux hostname 6.16.1-arch1-1 #1 SMP PREEMPT_DYNAMIC Fri, 15 Aug 2025 16:04:43 +0000 x86_64 GNU/Linux
$ pacman -Q intel-compute-runtime
intel-compute-runtime 25.31.34666.3-1

Some more crash details with GDB:

$ gdb --args ./bin/llama-bench -m models/phi-4-Q3_K_M.gguf
...
Thread 1 "llama-bench" received signal SIGBUS, Bus error.
(gdb) bt full
#0  0x00007ffff636e087 in ?? () from /usr/lib/libc.so.6
No symbol table info available.
#1  0x00007fffdcab128c in memcpy_s (dst=0x7ffcfaa60000, destSize=<optimized out>, src=0x5d556b0, count=<optimized out>)
    at /src/arch/intel-compute-runtime/src/compute-runtime-25.31.34666.3/shared/source/helpers/string.h:71
No locals.
#2  L0::CommandListCoreFamilyImmediate<(GFXCORE_FAMILY)3079>::performCpuMemcpy (this=this@entry=0x5d5a6c0, cpuMemCopyInfo=...,
    hSignalEvent=hSignalEvent@entry=0x2cd9e18, numWaitEvents=numWaitEvents@entry=0, phWaitEvents=phWaitEvents@entry=0x0)
    at /src/arch/intel-compute-runtime/src/compute-runtime-25.31.34666.3/level_zero/core/source/cmdlist/cmdlist_hw_immediate.inl:1444
        lockingFailed = false
        srcLockPointer = <optimized out>
        dstLockPointer = <optimized out>
        signalEvent = 0x2cd9e10
        cpuMemcpySrcPtr = 0x5d556b0
        cpuMemcpyDstPtr = 0x7ffcfaa60000
#3  0x00007fffdcabf240 in L0::CommandListCoreFamilyImmediate<(GFXCORE_FAMILY)3079>::appendMemoryCopy (this=0x5d5a6c0, dstptr=0xffffd556aaa00000,
    srcptr=0x5d556b0, size=20480, hSignalEvent=0x2cd9e18, numWaitEvents=0, phWaitEvents=0x0, memoryCopyParams=...)
    at /src/arch/intel-compute-runtime/src/compute-runtime-25.31.34666.3/level_zero/core/source/cmdlist/cmdlist_hw_immediate.inl:683
        estimatedSize = <optimized out>
        hasStallindCmds = false
        ret = <optimized out>
        cpuMemCopyInfo = {dstPtr = 0xffffd556aaa00000, srcPtr = 0x5d556b0, size = 20480, dstAllocData = 0x5c78bc0, srcAllocData = 0x0,
          dstIsImportedHostPtr = false, srcIsImportedHostPtr = false}
        direction = 32767
        isSplitNeeded = <optimized out>
#4  0x00007fffdc8e35f2 in L0::zeCommandListAppendMemoryCopy (hCommandList=<optimized out>, dstptr=<optimized out>, srcptr=<optimized out>, size=20480,
    hSignalEvent=0x2cd9e18, numWaitEvents=<optimized out>, phWaitEvents=0x0)
    at /src/arch/intel-compute-runtime/src/compute-runtime-25.31.34666.3/level_zero/api/core/ze_copy_api_entrypoints.h:32
        cmdList = 0x5d5a6c0
        ret = ZE_RESULT_ERROR_NOT_AVAILABLE
        memoryCopyParams = {relaxedOrderingDispatch = false, forceDisableCopyOnlyInOrderSignaling = false, copyOffloadAllowed = false}
#5  0x00007fffee78237b in enqueueMemCopyHelper(ur_command_t, ur_queue_handle_legacy_t_*, void*, unsigned char, unsigned long, void const*, unsigned int, ur_event_handle_t_* const*, ur_event_handle_t_**, bool) () from /opt/intel/oneapi/compiler/2025.0/lib/libur_adapter_level_zero.so.0
No symbol table info available.
#6  0x00007fffee78bf87 in ur_queue_handle_legacy_t_::enqueueUSMMemcpy(bool, void*, void const*, unsigned long, unsigned int, ur_event_handle_t_* const*, ur_event_handle_t_**) () from /opt/intel/oneapi/compiler/2025.0/lib/libur_adapter_level_zero.so.0
No symbol table info available.
#7  0x00007fffe0cf4db7 in ur_loader::urEnqueueUSMMemcpy(ur_queue_handle_t_*, bool, void*, void const*, unsigned long, unsigned int, ur_event_handle_t_* const*, ur_event_handle_t_**) () from /opt/intel/oneapi/compiler/2025.0/lib/libur_loader.so.0
No symbol table info available.
#8  0x00007fffe0d07eff in urEnqueueUSMMemcpy () from /opt/intel/oneapi/compiler/2025.0/lib/libur_loader.so.0
No symbol table info available.
#9  0x00007fffe1c45aa9 in sycl::_V1::detail::MemoryManager::copy_usm(void const*, std::shared_ptr<sycl::_V1::detail::queue_impl>, unsigned long, void*, std::vector<ur_event_handle_t_*, std::allocator<ur_event_handle_t_*> >, ur_event_handle_t_**, std::shared_ptr<sycl::_V1::detail::event_impl> const&) ()
   from /opt/intel/oneapi/compiler/2025.0/lib/libsycl.so.8
No symbol table info available.
#10 0x00007fffe1c8b83d in sycl::_V1::detail::queue_impl::memcpy(std::shared_ptr<sycl::_V1::detail::queue_impl> const&, void*, void const*, unsigned long, std::vector<sycl::_V1::event, std::allocator<sycl::_V1::event> > const&, bool, sycl::_V1::detail::code_location const&) ()
   from /opt/intel/oneapi/compiler/2025.0/lib/libsycl.so.8
No symbol table info available.
#11 0x00007fffe1d36421 in sycl::_V1::queue::memcpy(void*, void const*, unsigned long, sycl::_V1::detail::code_location const&) ()
   from /opt/intel/oneapi/compiler/2025.0/lib/libsycl.so.8
No symbol table info available.
#12 0x00007ffff6a37f9d in ggml_backend_sycl_buffer_set_tensor(ggml_backend_buffer*, ggml_tensor*, void const*, unsigned long, unsigned long)::{lambda()#2}::operator()() const (this=<optimized out>) at /src/llama.cpp/ggml/src/ggml-sycl/ggml-sycl.cpp:399
        e = <optimized out>
(gdb) list
...
1444        memcpy_s(cpuMemcpyDstPtr, cpuMemCopyInfo.size, cpuMemcpySrcPtr, cpuMemCopyInfo.size);
...
(gdb) p cpuMemCopyInfo
$7 = (const L0::CpuMemCopyInfo &) @0x7fffffffaea0: {dstPtr = 0xffffd556aaa00000, srcPtr = 0x5d556b0, size = 20480, dstAllocData = 0x5c78bc0,
  srcAllocData = 0x0, dstIsImportedHostPtr = false, srcIsImportedHostPtr = false}
(gdb) p cpuMemcpyDstPtr
$8 = (void *) 0x7ffcfaa60000
(gdb) info proc mappings
Mapped address spaces:

Start Addr         End Addr           Size               Offset             Perms File
...
0x00000000004d5000 0x0000000005d6a000 0x5895000          0x0                rw-p  [heap]
0x00007ffcfaa60000 0x00007ffdf9000000 0xfe5a0000         0x1e929a000        rw-s  anon_inode:i915.gem
0x00007ffdf9000000 0x00007fffa59b7000 0x1ac9b7000        0x0                r--s  /data/llama-models/phi-4-Q3_K_M.gguf
0x00007fffa5a00000 0x00007fffa5a3f000 0x3f000            0x0                r--p  /usr/lib/libopencl-clang.so.15
...

While I understand the issue might be due to the "small BAR" error, I would appreciate a helpful error message rather than a SIGBUS that requires rebuilding the intel-compute-runtime with debug symbols to understand where the issue is coming from. Even better - make level_zero work with small BAR, even if with reduced performance.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions