Name and Version
$ ./llama-server --version
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA GB10, compute capability 12.1, VMM: yes
version: 7515 (dfc959b)
built with GNU 13.3.0 for Linux aarch64
Operating systems
Linux
Which llama.cpp modules do you know to be affected?
llama-server
Command line
./llama-server -m ~/Models/lmstudio-community/gpt-oss-20b-GGUF/gpt-oss-20b-MXFP4.gguf \
-t 4 \
-c 131072 \
-n 8192 \
--embedding \
--pooling last \
--parallel 30 \
--cont-batching \
--jinja \
--reasoning-format deepseek \
--chat-template-kwargs '{"reasoning_effort": "low", "reasoning_format": "auto"}' \
--n-gpu-layers 999 \
--swa-full \
--flash-attn on \
-b 32768 \
-ub 16384 \
--no-mmap \
--cache-type-k f16 \
--cache-type-v f16 \
--cache-ram 0 \
--cache-reuse 256 \
--defrag-thold 0.1 \
--temp 1.0 \
--top-p 1.0 \
--top-k 0 \
--min-p 0 \
--host 0.0.0.0 \
--port 8888 \
--api-key "blahblah.."Problem description & steps to reproduce
Summary
A race condition exists in llama_context::decode(): ggml_backend_tensor_get_async() is called, but no synchronization occurs before the function returns. If the async copy is still in progress when subsequent operations (such as output-buffer reallocation or defragmentation) modify the source memory, the result is an intermittent "CUDA error: an illegal memory access was encountered" crash.
Environment
- OS: Linux (aarch64)
- GPU: NVIDIA GB10 (unified memory architecture)
- CUDA Version: 13.0
- llama.cpp version: master (tested on commit dfc959b88)
- Model: 20B parameter model (non-T5 architecture)
Symptoms
CUDA error: an illegal memory access was encountered
current device: 0, in function ggml_backend_cuda_get_tensor_async at ggml-cuda.cu:2786
cudaMemcpyAsync(data, (const char *)tensor->data + offset, size, cudaMemcpyDeviceToHost, cuda_ctx->stream())
Backtrace:
#4 ggml_backend_cuda_get_tensor_async()
#5 llama_context::decode()
#6 llama_decode()
#7 server_context_impl::update_slots()
Root Cause Analysis
The Problem
In src/llama-context.cpp, the decode() function calls ggml_backend_tensor_get_async() to copy logits/embeddings from GPU to host:
// Line 1221
ggml_backend_tensor_get_async(backend_res, t_logits, logits_out, 0, n_outputs*n_vocab*sizeof(float));
// Line 1240, 1255, 1270 - similar calls for embeddings
However, the synchronization at the end is commented out:
// Line 1333-1334
// wait for the computation to finish (automatically done when obtaining the model output)
//synchronize();
Why This Causes Crashes
1. decode() initiates the async copy operations
2. The function returns immediately without waiting
3. The next batch processing or defragmentation begins
4. GPU memory is reallocated/modified while the async copy is still in progress
5. cudaMemcpyAsync accesses freed/invalid memory → CRASH
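This is the generic async-copy ordering hazard rather than anything model-specific: cudaMemcpyAsync only enqueues the transfer on a stream, so the source buffer must stay valid and unmodified until that stream is synchronized. A standalone sketch of the same ordering problem (plain CUDA runtime API, not llama.cpp code; the buffer size, the two-stream setup, and the deliberately missing synchronization are illustrative):
// standalone sketch of the ordering bug; build with: nvcc async_race.cu -o async_race
#include <cuda_runtime.h>
#include <cstdio>

int main() {
    cudaStream_t copy_stream, reuse_stream;
    cudaStreamCreate(&copy_stream);
    cudaStreamCreate(&reuse_stream);

    const size_t n = 152000;                      // roughly one vocab-sized logits row
    float * d_logits = nullptr;
    float * h_logits = nullptr;
    cudaMalloc(&d_logits, n * sizeof(float));
    cudaMallocHost(&h_logits, n * sizeof(float)); // pinned, so the copy is truly async
    cudaMemsetAsync(d_logits, 0, n * sizeof(float), copy_stream);

    // 1) enqueue the device-to-host copy; this call returns immediately
    cudaMemcpyAsync(h_logits, d_logits, n * sizeof(float),
                    cudaMemcpyDeviceToHost, copy_stream);

    // 2) BUG (mirrors decode() returning without synchronize()): the source
    //    buffer is recycled for the next batch on another stream, with no
    //    ordering against the copy above
    cudaMemsetAsync(d_logits, 0xFF, n * sizeof(float), reuse_stream);

    // Depending on how the allocator recycles the memory (plain overwrite vs.
    // unmap/remap in a VMM pool), the outcome ranges from silently corrupted
    // h_logits to the illegal-memory-access error from the report.
    // Correct ordering: cudaStreamSynchronize(copy_stream) before step 2.

    cudaDeviceSynchronize();
    printf("last CUDA error: %s\n", cudaGetErrorString(cudaGetLastError()));

    cudaStreamDestroy(copy_stream);
    cudaStreamDestroy(reuse_stream);
    cudaFreeHost(h_logits);
    cudaFree(d_logits);
    return 0;
}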
The Flawed Assumption
The comment says synchronization is "automatically done when obtaining the model output". This refers to functions like llama_get_logits_ith() which does call synchronize():
// Line 2552-2553
float * llama_get_logits_ith(llama_context * ctx, int32_t i) {
ctx->synchronize(); // OK
...
}
However, get_logits() does NOT synchronize:
// Line 614-617
float * llama_context::get_logits() {
output_reorder(); // NO synchronize!
return logits;
}
More critically, the synchronization happens too late. The crash occurs during update_slots() before any logits retrieval function is called.
T5 Exception
Interestingly, T5 models have explicit synchronization:
// Line 1007-1010
if (model.arch == LLM_ARCH_T5 && t_embd) {
synchronize(); // Only for T5!
...
}
This explains why T5 models work correctly while other architectures crash.
Reproduction
Step 1: Start server with these settings
llama-server \
-m model.gguf \
-c 131072 \
--parallel 10 \
--cont-batching \
-b 32768 -ub 16384 \
--defrag-thold 0.1 \
  --flash-attn
Step 2: Send continuous parallel requests
The bug triggers when multiple slots are actively processing requests simultaneously. For example, sending 10 concurrent embedding requests with ~200-500 tokens each:
# Example: 10 parallel requests continuously
# Each request contains ~200-500 tokens of text
for i in {1..1000}; do
for j in {1..10}; do
curl -s http://localhost:8888/v1/embeddings \
-H "Content-Type: application/json" \
-d '{"input": "A paragraph of text with approximately 200-500 tokens..."}' &
done
wait
done
In our case, we were processing 70,000 text samples (average ~300 tokens each) with 10-30 parallel embedding requests. The crash typically occurred within a few minutes of sustained parallel load, once the race between the async tensor copy and memory operations (defrag, buffer reallocation) was triggered.
The crash is more likely with:
- Large context sizes
- Multiple parallel slots
- Continuous batching enabled
- Defragmentation enabled
- High batch sizes
Related Commits
- f30ea47a8 (Mar 2024): Pipeline parallelism - the original comment-out with the flawed assumption
- 5266379bc (Dec 2025): Partial fix - added synchronize() before output buffer reallocation
- e0dbec0bc (Mar 2025): Refactor that preserved the commented-out synchronize()
Proposed Fix
Uncomment the synchronize() call at the end of decode():
File: src/llama-context.cpp (around line 1333)
// BEFORE (buggy):
// wait for the computation to finish (automatically done when obtaining the model output)
//synchronize();
return 0;
}
// AFTER (fixed):
// wait for the computation to finish
synchronize();
return 0;
}
Performance Note
This was NOT a performance optimization. Looking at the original comment:
// wait for the computation to finish (automatically done when obtaining the model output)
//synchronize();
The developer assumed synchronization would happen automatically when calling llama_get_logits() or similar functions. This was a design mistake, not a deliberate performance trade-off.
In reality:
- llama_get_logits_ith() → calls synchronize() ✓
- llama_get_logits() → does NOT synchronize ✗
- The crash occurs in update_slots() before any logits retrieval function is called
Even if there were performance concerns, the actual impact would be minimal:
- synchronize() only waits for the async copy to complete (GPU compute is already done)
- Data size is small: ~600 KB for logits (1 token × 152K vocab × 4 bytes)
- Copy time: < 0.1 ms on modern hardware
- This is negligible compared to token generation time (tens of ms)
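As a sanity check on those numbers, a back-of-envelope calculation (the vocab size is taken from this report; the ~20 GB/s effective device-to-host bandwidth is an assumption, and a unified-memory system like the GB10 should do even better):
#include <cstdio>

int main() {
    const double n_vocab  = 152000.0;                 // vocab size from the report
    const double n_tokens = 1.0;                      // one output token per request
    const double bytes    = n_tokens * n_vocab * 4.0; // f32 logits
    const double bw       = 20e9;                     // assumed ~20 GB/s effective D2H bandwidth
    printf("logits: ~%.0f KB, copy: ~%.3f ms\n", bytes / 1e3, bytes / bw * 1e3);
    // prints: logits: ~608 KB, copy: ~0.030 ms -- well under the 0.1 ms bound above
    return 0;
}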
Unexpected Performance Improvement
After applying the patch, we observed not only zero crashes but also a 14x speed improvement:
| Metric | Before Patch | After Patch |
|---|---|---|
| Crashes | Frequent (every few minutes) | None |
| Speed | ~3 samples/s | ~44 samples/s |
| Parallel requests | 30 | 30 |
We expected synchronize() to add overhead and slow things down. Instead, it dramatically increased throughput. The likely explanation:
- Before: Frequent crashes → server restarts → model reloading → unstable GPU pipeline
- After: Stable execution → no restart overhead → GPU runs at full capacity
Ironically, adding synchronization made everything faster by eliminating the instability caused by the race condition.
Workaround
Until fixed, users can try:
- Disable defragmentation: --defrag-thold -1
- Reduce parallel slots: --parallel 1
- Use smaller batch sizes: -b 2048 -ub 1024
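For programs that embed libllama directly (not llama-server), the same window can be closed from the caller side by synchronizing explicitly right after llama_decode() instead of relying on the retrieval helpers. A minimal sketch using the public llama_synchronize() API (decode_and_sync is a hypothetical wrapper, not an existing function):
// Caller-side mitigation sketch for programs embedding libllama directly
// (llama-server itself still needs the in-tree fix).
#include "llama.h"

static int32_t decode_and_sync(llama_context * ctx, llama_batch batch) {
    const int32_t ret = llama_decode(ctx, batch);
    if (ret != 0) {
        return ret; // error, or "retry with a smaller batch"
    }
    // Close the race window: wait for the in-flight computation and async
    // output copies before anything can reallocate or defragment memory.
    llama_synchronize(ctx);
    return 0;
}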
First Bad Commit
No response