
BatchRotatingKVCache.merge() crashes on concurrent requests with different prompt lengths #983

@juliensimon

Description


mlx_lm.server crashes when handling concurrent requests to models that use RotatingKVCache (e.g. models with sliding window attention). The crash occurs in BatchRotatingKVCache.merge() when trying to merge caches with different _idx values, producing a shape mismatch.

This is reproducible with any client that sends multiple requests concurrently (e.g. OpenCode, which fires a short system probe and a full prompt simultaneously).
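The failure mode can be illustrated without mlx-lm at all. The sketch below is a NumPy analogue (not the actual mlx code) of the slice assignment in BatchRotatingKVCache.merge(): the target slice width is taken from one cache's _idx while the data being copied comes from a cache with a different _idx, so the assignment cannot broadcast.

```python
import numpy as np

# Illustrative only: shapes mirror the error message
# (batch, heads, seq, head_dim); mlx-lm itself is not used here.
merged = np.zeros((2, 2, 3935, 128), dtype=np.float16)      # sized from the longer cache (_idx = 3935)
short_keys = np.zeros((1, 2, 2048, 128), dtype=np.float16)  # cache with _idx = 2048

error = None
try:
    # Same pattern as cache.py: keys[i : i + 1, :, p : p + c._idx] = ...
    merged[0:1, :, 0:3935] = short_keys   # 3935-wide slice vs 2048-wide data
except ValueError as exc:
    error = exc
print(error)
```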

Environment

  • mlx-lm: 0.31.1
  • mlx: 0.24.2
  • macOS: 15.5 (Apple M3 Max, 48 GB)
  • Python: 3.12
  • Model: mlx-community/Trinity-Nano-Preview-8bit (afmoe architecture, uses RotatingKVCache)

Also reproduced with mlx-community/Trinity-Mini-8bit.

Steps to Reproduce

  1. Start the server:
uvx --from "mlx-lm>=0.28.4" mlx_lm.server \
    --model mlx-community/Trinity-Nano-Preview-8bit \
    --port 8080
  2. Send two concurrent requests with different prompt lengths:
# In parallel (e.g. using & or a client that sends concurrent requests)
curl -s http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"messages": [{"role": "user", "content": "Hi"}], "max_tokens": 10}' &

curl -s http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"messages": [{"role": "system", "content": "You are a helpful assistant. '"$(python3 -c "print('x ' * 5000)")"'"}, {"role": "user", "content": "Hello"}], "max_tokens": 10}' &

wait

The first request succeeds, but the second crashes the server's generate thread.

Error

Exception in thread Thread-2 (_generate):
Traceback (most recent call last):
  File ".../mlx_lm/server.py", line 948, in _generate
    responses = batch_generator.next()
  File ".../mlx_lm/generate.py", line 1329, in next
    return self._next()
  File ".../mlx_lm/generate.py", line 1257, in _next
    batch = self._process_prompts(prompts)
  File ".../mlx_lm/generate.py", line 1126, in _process_prompts
    prompt_cache = _merge_caches(caches)
  File ".../mlx_lm/generate.py", line 922, in _merge_caches
    batch_cache.append(caches[0][i].merge([c[i] for c in caches]))
  File ".../mlx_lm/models/cache.py", line 580, in merge
    return BatchRotatingKVCache.merge(caches)
  File ".../mlx_lm/models/cache.py", line 1364, in merge
    keys[i : i + 1, :, p : p + c._idx] = c._temporal_order(c.keys)
ValueError: [broadcast_shapes] Shapes (1,2,3935,128) and (1,2,2048,128) cannot be broadcast.

Analysis

In cache.py:1364, BatchRotatingKVCache.merge() assumes all caches being merged have compatible _idx values. When two requests have very different prompt lengths (e.g. 519 tokens vs 10,081 tokens), their RotatingKVCache entries end up with different _idx values (e.g. 3935 vs 2048), and the slice assignment fails because the shapes don't match.

The issue is in the merge logic: it allocates a target tensor based on one cache's dimensions but then tries to copy data from another cache with a different _idx.
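One possible shape of a fix, sketched below in NumPy as a standalone illustration (merge_rotating_keys is a hypothetical helper, not mlx-lm code): size the target tensor for the largest _idx and right-align each cache's keys into it, zero-padding shorter caches on the left. A real fix would also need to adjust each request's offset/mask so attention ignores the padded region.

```python
import numpy as np

def merge_rotating_keys(per_request_keys):
    # per_request_keys: one (1, heads, idx_i, head_dim) array per request,
    # standing in for each cache's temporally-ordered keys.
    target = max(k.shape[2] for k in per_request_keys)   # largest _idx
    heads = per_request_keys[0].shape[1]
    head_dim = per_request_keys[0].shape[3]
    out = np.zeros((len(per_request_keys), heads, target, head_dim),
                   dtype=per_request_keys[0].dtype)
    for i, k in enumerate(per_request_keys):
        n = k.shape[2]                          # this cache's _idx
        out[i : i + 1, :, target - n : target] = k   # right-align newest tokens
    return out
```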

Workaround

Disabling batching in server.py avoids the crash:

# In ModelProvider.load(), after the is_batchable check:
self.is_batchable = False

This forces sequential request processing. The --prompt-cache-size 1 and --decode-concurrency 1 flags do not prevent the crash because the server still attempts to batch concurrent requests.

Expected Behavior

Concurrent requests with different prompt lengths should either:

  • Be merged correctly (padding/truncating caches to compatible shapes), or
  • Fall back to sequential processing when cache shapes are incompatible
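The sequential-fallback option could be gated by a compatibility check before merging. The guard below is a hypothetical sketch (the function name and cache-list layout are assumptions, not the mlx-lm API): it verifies that, layer by layer, every request's rotating cache reports the same _idx.

```python
def caches_mergeable(per_request_caches):
    """Return True only if every per-layer cache has a matching _idx
    across all requests; otherwise the server could process the
    requests sequentially instead of batching them."""
    # per_request_caches: one list of per-layer cache objects per request
    for layer_caches in zip(*per_request_caches):
        idxs = {getattr(c, "_idx", None) for c in layer_caches}
        if len(idxs) > 1:
            return False
    return True
```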
