Name and Version
$ ./llama-server --version
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA GB10, compute capability 12.1, VMM: yes
version: 7515 (dfc959b)
built with GNU 13.3.0 for Linux aarch64
Operating systems
Linux
Which llama.cpp modules do you know to be affected?
llama-server
Command line
./llama-server -m ~/Models/lmstudio-community/gpt-oss-20b-GGUF/gpt-oss-20b-MXFP4.gguf \
-t 4 \
-c 131072 \
-n 8192 \
--embedding \
--pooling last \
--parallel 30 \
--cont-batching \
--jinja \
--reasoning-format deepseek \
--chat-template-kwargs '{"reasoning_effort": "low", "reasoning_format": "auto"}' \
--n-gpu-layers 999 \
--swa-full \
--flash-attn on \
-b 32768 \
-ub 16384 \
--no-mmap \
--cache-type-k f16 \
--cache-type-v f16 \
--cache-ram 0 \
--cache-reuse 256 \
--defrag-thold 0.1 \
--temp 1.0 \
--top-p 1.0 \
--top-k 0 \
--min-p 0 \
--host 0.0.0.0 \
--port 8888 \
--api-key "blahblah.."Problem description & steps to reproduce
Summary
A race condition exists in llama_context::decode(): ggml_backend_tensor_get_async() is called, but no synchronization occurs before the function returns. If the async copy is still in progress when subsequent operations (such as output-buffer reallocation or defragmentation) modify the source memory, the result is an intermittent "CUDA error: an illegal memory access was encountered" crash.
Environment
- OS: Linux (aarch64)
- GPU: NVIDIA GB10 (unified memory architecture)
- CUDA Version: 13.0
- llama.cpp version: master (tested on commit dfc959b88)
- Model: 20B parameter model (non-T5 architecture)
Symptoms
CUDA error: an illegal memory access was encountered
current device: 0, in function ggml_backend_cuda_get_tensor_async at ggml-cuda.cu:2786
cudaMemcpyAsync(data, (const char *)tensor->data + offset, size, cudaMemcpyDeviceToHost, cuda_ctx->stream())
Backtrace:
#4 ggml_backend_cuda_get_tensor_async()
#5 llama_context::decode()
#6 llama_decode()
#7 server_context_impl::update_slots()
Root Cause Analysis
The Problem
In src/llama-context.cpp, the decode() function calls ggml_backend_tensor_get_async() to copy logits/embeddings from GPU to host:
// Line 1221
ggml_backend_tensor_get_async(backend_res, t_logits, logits_out, 0, n_outputs*n_vocab*sizeof(float));
// Line 1240, 1255, 1270 - similar calls for embeddings
However, the synchronization at the end is commented out:
// Line 1333-1334
// wait for the computation to finish (automatically done when obtaining the model output)
//synchronize();
Why This Causes Crashes
1. decode() initiates the async copy operations
2. The function returns immediately without waiting
3. The next batch processing or defragmentation begins
4. GPU memory is reallocated/modified while the async copy is still in progress
5. cudaMemcpyAsync accesses freed/invalid memory → CRASH
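This is the generic async-copy ordering hazard rather than anything model-specific: cudaMemcpyAsync only enqueues the transfer on a stream, so the source buffer must stay valid and unmodified until that stream is synchronized. A standalone sketch of the same ordering problem (plain CUDA runtime API, not llama.cpp code; the buffer size, the two-stream setup, and the deliberately missing synchronization are illustrative):
// standalone sketch of the ordering bug; build with: nvcc async_race.cu -o async_race
#include <cuda_runtime.h>
#include <cstdio>

int main() {
    cudaStream_t copy_stream, reuse_stream;
    cudaStreamCreate(&copy_stream);
    cudaStreamCreate(&reuse_stream);

    const size_t n = 152000;                      // roughly one vocab-sized logits row
    float * d_logits = nullptr;
    float * h_logits = nullptr;
    cudaMalloc(&d_logits, n * sizeof(float));
    cudaMallocHost(&h_logits, n * sizeof(float)); // pinned, so the copy is truly async
    cudaMemsetAsync(d_logits, 0, n * sizeof(float), copy_stream);

    // 1) enqueue the device-to-host copy; this call returns immediately
    cudaMemcpyAsync(h_logits, d_logits, n * sizeof(float),
                    cudaMemcpyDeviceToHost, copy_stream);

    // 2) BUG (mirrors decode() returning without synchronize()): the source
    //    buffer is recycled for the next batch on another stream, with no
    //    ordering against the copy above
    cudaMemsetAsync(d_logits, 0xFF, n * sizeof(float), reuse_stream);

    // Depending on how the allocator recycles the memory (plain overwrite vs.
    // unmap/remap in a VMM pool), the outcome ranges from silently corrupted
    // h_logits to the illegal-memory-access error from the report.
    // Correct ordering: cudaStreamSynchronize(copy_stream) before step 2.

    cudaDeviceSynchronize();
    printf("last CUDA error: %s\n", cudaGetErrorString(cudaGetLastError()));

    cudaStreamDestroy(copy_stream);
    cudaStreamDestroy(reuse_stream);
    cudaFreeHost(h_logits);
    cudaFree(d_logits);
    return 0;
}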
The Flawed Assumption
The comment says synchronization is "automatically done when obtaining the model output". This refers to functions like llama_get_logits_ith() which does call synchronize():
// Line 2552-2553
float * llama_get_logits_ith(llama_context * ctx, int32_t i) {
ctx->synchronize(); // OK
...
}
However, get_logits() does NOT synchronize:
// Line 614-617
float * llama_context::get_logits() {
output_reorder(); // NO synchronize!
return logits;
}
More critically, the synchronization happens too late. The crash occurs during update_slots() before any logits retrieval function is called.
T5 Exception
Interestingly, T5 models have explicit synchronization:
// Line 1007-1010
if (model.arch == LLM_ARCH_T5 && t_embd) {
synchronize(); // Only for T5!
...
}
This explains why T5 models work correctly while other architectures crash.
Reproduction
Step 1: Start server with these settings
llama-server \
-m model.gguf \
-c 131072 \
--parallel 10 \
--cont-batching \
-b 32768 -ub 16384 \
--defrag-thold 0.1 \
  --flash-attn
Step 2: Send continuous parallel requests
The bug triggers when multiple slots are actively processing requests simultaneously. For example, sending 10 concurrent embedding requests with ~200-500 tokens each:
# Example: 10 parallel requests continuously
# Each request contains ~200-500 tokens of text
for i in {1..1000}; do
for j in {1..10}; do
curl -s http://localhost:8888/v1/embeddings \
-H "Content-Type: application/json" \
-d '{"input": "A paragraph of text with approximately 200-500 tokens..."}' &
done
wait
done
In our case, we were processing 70,000 text samples (average ~300 tokens each) with 10-30 parallel embedding requests. The crash typically occurred within a few minutes of sustained parallel load, once the race between the async tensor copy and memory operations (defrag, buffer reallocation) was triggered.
The crash is more likely with:
- Large context sizes
- Multiple parallel slots
- Continuous batching enabled
- Defragmentation enabled
- High batch sizes
Related Commits
- f30ea47a8 (Mar 2024): Pipeline parallelism - the original comment-out with the flawed assumption
- 5266379bc (Dec 2025): Partial fix - added synchronize() before output buffer reallocation
- e0dbec0bc (Mar 2025): Refactor that preserved the commented-out synchronize()
Proposed Fix
Uncomment the synchronize() call at the end of decode():
File: src/llama-context.cpp (around line 1333)
// BEFORE (buggy):
// wait for the computation to finish (automatically done when obtaining the model output)
//synchronize();
return 0;
}
// AFTER (fixed):
// wait for the computation to finish
synchronize();
return 0;
}
Performance Note
This was NOT a performance optimization. Looking at the original comment:
// wait for the computation to finish (automatically done when obtaining the model output)
//synchronize();
The developer assumed synchronization would happen automatically when calling llama_get_logits() or similar functions. This was a design mistake, not a deliberate performance trade-off.
In reality:
- llama_get_logits_ith() → calls synchronize() ✓
- llama_get_logits() → does NOT synchronize ✗
- The crash occurs in update_slots() before any logits retrieval function is called
Even if there were performance concerns, the actual impact would be minimal:
- synchronize() only waits for the async copy to complete (GPU compute is already done)
- Data size is small: ~600 KB for logits (1 token × 152K vocab × 4 bytes)
- Copy time: < 0.1 ms on modern hardware
- This is negligible compared to token generation time (tens of ms)
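As a sanity check on those numbers, a back-of-envelope calculation (the vocab size is taken from this report; the ~20 GB/s effective device-to-host bandwidth is an assumption, and a unified-memory system like the GB10 should do even better):
#include <cstdio>

int main() {
    const double n_vocab  = 152000.0;                 // vocab size from the report
    const double n_tokens = 1.0;                      // one output token per request
    const double bytes    = n_tokens * n_vocab * 4.0; // f32 logits
    const double bw       = 20e9;                     // assumed ~20 GB/s effective D2H bandwidth
    printf("logits: ~%.0f KB, copy: ~%.3f ms\n", bytes / 1e3, bytes / bw * 1e3);
    // prints: logits: ~608 KB, copy: ~0.030 ms -- well under the 0.1 ms bound above
    return 0;
}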
Unexpected Performance Improvement
After applying the patch, we observed not only zero crashes but also a 14x speed improvement:
| Metric | Before Patch | After Patch |
|---|---|---|
| Crashes | Frequent (every few minutes) | None |
| Speed | ~3 samples/s | ~44 samples/s |
| Parallel requests | 30 | 30 |
We expected synchronize() to add overhead and slow things down. Instead, it dramatically increased throughput. The likely explanation:
- Before: Frequent crashes → server restarts → model reloading → unstable GPU pipeline
- After: Stable execution → no restart overhead → GPU runs at full capacity
Ironically, adding synchronization made everything faster by eliminating the instability caused by the race condition.
Workaround
Until fixed, users can try:
- Disable defragmentation: --defrag-thold -1
- Reduce parallel slots: --parallel 1
- Use smaller batch sizes: -b 2048 -ub 1024
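For programs that embed libllama directly (not llama-server), the same window can be closed from the caller side by synchronizing explicitly right after llama_decode() instead of relying on the retrieval helpers. A minimal sketch using the public llama_synchronize() API (decode_and_sync is a hypothetical wrapper, not an existing function):
// Caller-side mitigation sketch for programs embedding libllama directly
// (llama-server itself still needs the in-tree fix).
#include "llama.h"

static int32_t decode_and_sync(llama_context * ctx, llama_batch batch) {
    const int32_t ret = llama_decode(ctx, batch);
    if (ret != 0) {
        return ret; // error, or "retry with a smaller batch"
    }
    // Close the race window: wait for the in-flight computation and async
    // output copies before anything can reallocate or defragment memory.
    llama_synchronize(ctx);
    return 0;
}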
First Bad Commit
No response