This repository was archived by the owner on Jan 28, 2026. It is now read-only.
SYCL multi-GPU inference fails with UR_RESULT_ERROR_OUT_OF_DEVICE_MEMORY on Intel Arc Pro (single GPU works) #13335
Summary
Multi-GPU inference using the SYCL / oneAPI backend fails with an out-of-device-memory error during a SYCL memcpy().wait() call, even though sufficient VRAM is available on each GPU. The same model and configuration work reliably on a single Intel GPU.
This appears to be a multi-GPU SYCL / Level Zero pipeline or cross-device copy issue, not a real VRAM exhaustion problem.
Environment
- OS: Windows 11
- Ollama build: ollama-ipex-llm 2.3.0b20250725 (Windows portable ZIP)
- Ollama version: 0.9.3
- Backend: SYCL / oneAPI (ggml-sycl)
- oneAPI / Level Zero: oneAPI 2024.2 (bundled with build)
- Model: nemotron-mini (GGUF, Q4_K, ~2.5 GiB)
- Context size: 4096 (also reproduced with 2048)
- Batch size: 512 (also reproduced with smaller values)
- Parallel sequences: 1
- KV cache: f16
GPUs
- 2× Intel Arc Pro B60 (24 GiB VRAM each)
- NVIDIA GPU also present in system, but Intel SYCL backend is explicitly selected
Reproduction Steps
Works (single GPU)
set ONEAPI_DEVICE_SELECTOR=level_zero:0
start-ollama.bat
ollama run nemotron-mini:latest
- Model loads and runs correctly
- Stable inference, high tokens/sec
Fails (multi-GPU)
set ONEAPI_DEVICE_SELECTOR=level_zero:0;level_zero:1
start-ollama.bat
ollama run nemotron-mini:latest
- Model loads successfully
- Fails on first inference request
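For reference, the `ONEAPI_DEVICE_SELECTOR` value used above is a semicolon-separated list of `backend:device` filters, so the failing run exposes exactly two Level Zero devices. A minimal sketch of how the string decomposes (this parser is illustrative only, not the oneAPI runtime's actual implementation):

```python
# Illustrative parse of the ONEAPI_DEVICE_SELECTOR value used above.
# Mirrors the documented "backend:device" filter syntax; a sketch,
# not the oneAPI runtime's real selector logic.
def parse_selector(value: str) -> list[tuple[str, str]]:
    filters = []
    for term in value.split(";"):
        backend, _, device = term.partition(":")
        filters.append((backend, device))
    return filters

print(parse_selector("level_zero:0;level_zero:1"))
# two filters -> both Arc GPUs visible to the SYCL backend
```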
Observed Error
Native API failed. Native API returns: 39 (UR_RESULT_ERROR_OUT_OF_DEVICE_MEMORY)
Exception caught at:
ggml-sycl.cpp:4602
func: operator()
SYCL error:
CHECK_TRY_ERROR(
(stream)->memcpy(data, (const char *)tensor->data + offset, size).wait()
)
in function:
ggml_backend_sycl_get_tensor_async
common.hpp:115: SYCL error
ERROR source=server.go:827 msg="post predict"
ERROR source=server.go:484 msg="llama runner terminated"
exit status 0xc0000409
(0xc0000409 is Windows STATUS_STACK_BUFFER_OVERRUN, the code produced by a fail-fast abort, consistent with the runner crashing rather than exiting cleanly.)
Key Observations
This is not a real VRAM exhaustion issue:
- Each GPU has ~22 GiB free at runtime
- Model uses <3 GiB weights + ~512 MiB KV cache

The failure happens after successful model load, during inference.

The error is triggered inside a SYCL memcpy + wait, suggesting:
- cross-device tensor movement,
- pipeline parallelism, or
- Level Zero memory management issues

With two GPUs visible, logs show:
- pipeline parallelism enabled
- multiple graph splits

With one GPU:
- no pipeline parallelism
- stable execution
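The headroom claim above can be sanity-checked with a rough budget. The layer/head counts below are assumed placeholder values (not verified nemotron-mini dimensions), chosen only to show how a ~512 MiB f16 KV cache at ctx 4096 arises and how far total usage sits below per-GPU VRAM:

```python
# Rough per-GPU memory budget using this report's figures.
# Model dimensions are ASSUMED placeholders, not verified nemotron-mini values.
GIB = 1024 ** 3
MIB = 1024 ** 2

# KV cache: 2 tensors (K and V) * layers * context * kv_heads * head_dim * bytes(f16)
n_layers, n_ctx, n_kv_heads, head_dim, f16_bytes = 32, 4096, 8, 128, 2
kv_bytes = 2 * n_layers * n_ctx * n_kv_heads * head_dim * f16_bytes
print(f"KV cache: {kv_bytes / MIB:.0f} MiB")  # ~512 MiB, matching the report

weights_bytes = 2.5 * GIB   # Q4_K GGUF size from the report
vram_bytes = 24 * GIB       # per Arc Pro B60
used = weights_bytes + kv_bytes
print(f"used {used / GIB:.2f} GiB of {vram_bytes / GIB:.0f} GiB per GPU")
```

Even doubling these estimates leaves each 24 GiB card with large headroom, which is why the error looks like a cross-device copy or allocator problem rather than true exhaustion.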
Any help would be appreciated.