Misc. bug: Race condition in decode(): missing synchronize() after async tensor copy #18310

Description

Name and Version

$ ./llama-server --version
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA GB10, compute capability 12.1, VMM: yes
version: 7515 (dfc959b)
built with GNU 13.3.0 for Linux aarch64

Operating systems

Linux

Which llama.cpp modules do you know to be affected?

llama-server

Command line

./llama-server -m ~/Models/lmstudio-community/gpt-oss-20b-GGUF/gpt-oss-20b-MXFP4.gguf \
    -t 4 \
    -c 131072 \
    -n 8192 \
    --embedding \
    --pooling last \
    --parallel 30 \
    --cont-batching \
    --jinja \
    --reasoning-format deepseek \
    --chat-template-kwargs '{"reasoning_effort": "low", "reasoning_format": "auto"}' \
    --n-gpu-layers 999 \
    --swa-full \
    --flash-attn on \
    -b 32768 \
    -ub 16384 \
    --no-mmap \
    --cache-type-k f16 \
    --cache-type-v f16 \
    --cache-ram 0 \
    --cache-reuse 256 \
    --defrag-thold 0.1 \
    --temp 1.0 \
    --top-p 1.0 \
    --top-k 0 \
    --min-p 0 \
    --host 0.0.0.0 \
    --port 8888 \
    --api-key "blahblah.."

Problem description & steps to reproduce

Summary

A race condition exists in llama_context::decode(): ggml_backend_tensor_get_async() is called, but no synchronization occurs before the function returns. If subsequent operations (such as buffer reallocation or defragmentation) modify the source memory while the async copy is still in flight, the result is a random crash with "CUDA error: an illegal memory access was encountered".

Environment

  • OS: Linux (aarch64)
  • GPU: NVIDIA GB10 (unified memory architecture)
  • CUDA Version: 13.0
  • llama.cpp version: master (tested on commit dfc959b88)
  • Model: 20B parameter model (non-T5 architecture)

Symptoms

CUDA error: an illegal memory access was encountered
  current device: 0, in function ggml_backend_cuda_get_tensor_async at ggml-cuda.cu:2786
  cudaMemcpyAsync(data, (const char *)tensor->data + offset, size, cudaMemcpyDeviceToHost, cuda_ctx->stream())

Backtrace:

#4  ggml_backend_cuda_get_tensor_async()
#5  llama_context::decode()
#6  llama_decode()
#7  server_context_impl::update_slots()

Root Cause Analysis

The Problem

In src/llama-context.cpp, the decode() function calls ggml_backend_tensor_get_async() to copy logits/embeddings from GPU to host:

// Line 1221
ggml_backend_tensor_get_async(backend_res, t_logits, logits_out, 0, n_outputs*n_vocab*sizeof(float));

// Line 1240, 1255, 1270 - similar calls for embeddings

However, the synchronization at the end is commented out:

// Line 1333-1334
// wait for the computation to finish (automatically done when obtaining the model output)
//synchronize();

Why This Causes Crashes

  1. decode() initiates async copy operations
  2. Function returns immediately without waiting
  3. Next batch processing or defragmentation begins
  4. GPU memory is reallocated/modified while async copy still in progress
  5. cudaMemcpyAsync accesses freed/invalid memory
  6. CRASH
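
The pattern is the classic async-copy hazard: the source of a cudaMemcpyAsync() must stay valid and unmodified until the stream is synchronized. Below is a minimal, generic CUDA sketch of the ordering rule that the missing synchronize() is meant to enforce (hypothetical standalone code, not llama.cpp internals):

// Generic illustration of the rule violated above: an async device-to-host
// copy only becomes safe to rely on -- and its source safe to reuse or free --
// after the stream has been synchronized.
#include <cuda_runtime.h>
#include <cstdio>
#include <vector>

int main() {
    const size_t n = 152000;                 // roughly one row of logits
    std::vector<float> host(n);

    float * dev = nullptr;
    cudaMalloc(&dev, n * sizeof(float));
    cudaMemset(dev, 0, n * sizeof(float));

    cudaStream_t stream;
    cudaStreamCreate(&stream);

    // Enqueue the copy; control returns to the host immediately, just like
    // ggml_backend_tensor_get_async() inside llama_context::decode().
    cudaMemcpyAsync(host.data(), dev, n * sizeof(float),
                    cudaMemcpyDeviceToHost, stream);

    // Without this, reusing or releasing `dev` (the analogue of defrag or
    // buffer reallocation for the next batch) races with the copy above.
    cudaStreamSynchronize(stream);

    cudaFree(dev);                           // safe only after the sync
    cudaStreamDestroy(stream);

    printf("logits[0] = %f\n", host[0]);
    return 0;
}

In decode(), the commented-out synchronize() plays the role of the cudaStreamSynchronize() call in this sketch.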

The Flawed Assumption

The comment says synchronization is "automatically done when obtaining the model output". This refers to functions like llama_get_logits_ith(), which does call synchronize():

// Line 2552-2553
float * llama_get_logits_ith(llama_context * ctx, int32_t i) {
    ctx->synchronize();  // OK
    ...
}

However, the member function llama_context::get_logits() does NOT synchronize:

// Line 614-617
float * llama_context::get_logits() {
    output_reorder();    // NO synchronize!
    return logits;
}

More critically, the synchronization happens too late. The crash occurs during update_slots() before any logits retrieval function is called.

T5 Exception

Interestingly, T5 models have explicit synchronization:

// Line 1007-1010
if (model.arch == LLM_ARCH_T5 && t_embd) {
    synchronize();  // Only for T5!
    ...
}

This explains why T5 models work correctly while other architectures crash.

Reproduction

Step 1: Start server with these settings

llama-server \
    -m model.gguf \
    -c 131072 \
    --parallel 10 \
    --cont-batching \
    -b 32768 -ub 16384 \
    --defrag-thold 0.1 \
    --flash-attn

Step 2: Send continuous parallel requests

The bug triggers when multiple slots are actively processing requests simultaneously. For example, sending 10 concurrent embedding requests with ~200-500 tokens each:

# Example: 10 parallel requests continuously
# Each request contains ~200-500 tokens of text
for i in {1..1000}; do
  for j in {1..10}; do
    curl -s http://localhost:8888/v1/embeddings \
      -H "Content-Type: application/json" \
      -d '{"input": "A paragraph of text with approximately 200-500 tokens..."}' &
  done
  wait
done

In our case, we were processing 70,000 text samples (average ~300 tokens each) with 10-30 parallel embedding requests. The crash typically occurred within a few minutes of sustained parallel load, once the race between the async tensor copy and memory operations (defrag, buffer reallocation) was triggered.

The crash is more likely with:

  • Large context sizes
  • Multiple parallel slots
  • Continuous batching enabled
  • Defragmentation enabled
  • High batch sizes

Related Commits

  • f30ea47a8 (Mar 2024): Pipeline parallelism - the synchronize() call was originally commented out here, based on the flawed assumption
  • e0dbec0bc (Mar 2025): Refactor that preserved the commented-out synchronize()
  • 5266379bc (Dec 2025): Partial fix - added synchronize() before output buffer reallocation

Proposed Fix

Uncomment the synchronize() call at the end of decode():

File: src/llama-context.cpp (around line 1333)

// BEFORE (buggy):
    // wait for the computation to finish (automatically done when obtaining the model output)
    //synchronize();

    return 0;
}

// AFTER (fixed):
    // wait for the computation to finish
    synchronize();

    return 0;
}

Performance Note

This was NOT a performance optimization. Looking at the original comment:

// wait for the computation to finish (automatically done when obtaining the model output)
//synchronize();

The developer assumed synchronization would happen automatically when calling llama_get_logits() or similar functions. This was a design mistake, not a deliberate performance trade-off.

In reality:

  • llama_get_logits_ith() → calls synchronize()
  • llama_get_logits() → does NOT synchronize ✗
  • The crash occurs in update_slots() before any logits retrieval function is called

Even if there were performance concerns, the actual impact would be minimal:

  • synchronize() only waits for the async copy to complete (GPU compute is already done)
  • Data size is small: ~600KB for logits (1 token × 152K vocab × 4 bytes)
  • Copy time: < 0.1ms on modern hardware
  • This is negligible compared to token generation time (tens of ms)
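
As a rough sanity check of the copy-time figure (assuming, for illustration, ~20 GB/s effective device-to-host bandwidth; the GB10's unified memory should do better): 152,000 floats × 4 bytes ≈ 608 KB, and 608 KB ÷ 20 GB/s ≈ 30 µs per decoded token, comfortably under 0.1 ms.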

Unexpected Performance Improvement

After applying the patch, we observed not only zero crashes but also a 14x speed improvement:

Metric               Before Patch                   After Patch
Crashes              Frequent (every few minutes)   None
Speed                ~3 samples/s                   ~44 samples/s
Parallel requests    30                             30

We expected synchronize() to add overhead and slow things down. Instead, it dramatically increased throughput. The likely explanation:

  • Before: Frequent crashes → server restarts → model reloading → unstable GPU pipeline
  • After: Stable execution → no restart overhead → GPU runs at full capacity

Ironically, adding synchronization made everything faster by eliminating the instability caused by the race condition.

Workaround

Until a fix lands, users can try the following mitigations, which make the race less likely; a source-level mitigation is sketched after the list:

  • Disable defragmentation: --defrag-thold -1
  • Reduce parallel slots: --parallel 1
  • Use smaller batch sizes: -b 2048 -ub 1024
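
For users who build from source and call libllama directly (or who are willing to patch their build), an application-level mitigation is to force the pending copy to complete immediately after each decode. A minimal sketch using the public llama_synchronize() API (the wrapper name is illustrative, not part of llama.cpp):

#include "llama.h"

// Illustrative wrapper: block until the async logits/embeddings copy issued by
// llama_decode() has finished, before any defrag or buffer reallocation can run.
static int32_t decode_and_sync(llama_context * ctx, llama_batch batch) {
    const int32_t ret = llama_decode(ctx, batch);
    llama_synchronize(ctx);   // roughly the same effect as the proposed in-tree fix
    return ret;
}

This is essentially the same change as the proposed fix, applied from the caller's side instead of inside decode().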

First Bad Commit

No response

Relevant log output
