Description
mlx_lm.server crashes when handling concurrent requests to models that use RotatingKVCache (e.g. models with sliding window attention). The crash occurs in BatchRotatingKVCache.merge() when trying to merge caches with different _idx values, producing a shape mismatch.
This is reproducible with any client that sends multiple requests concurrently (e.g. OpenCode, which fires a short system probe and a full prompt simultaneously).
Environment
- mlx-lm: 0.31.1
- mlx: 0.24.2
- macOS: 15.5 (Apple M3 Max, 48 GB)
- Python: 3.12
- Model: mlx-community/Trinity-Nano-Preview-8bit (afmoe architecture, uses RotatingKVCache)
Also reproduced with mlx-community/Trinity-Mini-8bit.
Steps to Reproduce
- Start the server:

```shell
uvx --from "mlx-lm>=0.28.4" mlx_lm.server \
  --model mlx-community/Trinity-Nano-Preview-8bit \
  --port 8080
```

- Send two concurrent requests with different prompt lengths:

```shell
# In parallel (e.g. using & or a client that sends concurrent requests)
curl -s http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"messages": [{"role": "user", "content": "Hi"}], "max_tokens": 10}' &
curl -s http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"messages": [{"role": "system", "content": "You are a helpful assistant. '"$(python3 -c "print('x ' * 5000)")"'"}, {"role": "user", "content": "Hello"}], "max_tokens": 10}' &
wait
```

The first request succeeds, but the second crashes the server's generate thread.
Error

```
Exception in thread Thread-2 (_generate):
Traceback (most recent call last):
  File ".../mlx_lm/server.py", line 948, in _generate
    responses = batch_generator.next()
  File ".../mlx_lm/generate.py", line 1329, in next
    return self._next()
  File ".../mlx_lm/generate.py", line 1257, in _next
    batch = self._process_prompts(prompts)
  File ".../mlx_lm/generate.py", line 1126, in _process_prompts
    prompt_cache = _merge_caches(caches)
  File ".../mlx_lm/generate.py", line 922, in _merge_caches
    batch_cache.append(caches[0][i].merge([c[i] for c in caches]))
  File ".../mlx_lm/models/cache.py", line 580, in merge
    return BatchRotatingKVCache.merge(caches)
  File ".../mlx_lm/models/cache.py", line 1364, in merge
    keys[i : i + 1, :, p : p + c._idx] = c._temporal_order(c.keys)
ValueError: [broadcast_shapes] Shapes (1,2,3935,128) and (1,2,2048,128) cannot be broadcast.
```
Analysis
In cache.py:1364, BatchRotatingKVCache.merge() assumes all caches being merged have compatible _idx values. When two requests have very different prompt lengths (e.g. 519 tokens vs 10,081 tokens), their RotatingKVCache entries end up with different _idx values (e.g. 3935 vs 2048), and the slice assignment fails because the shapes don't match.
The issue is in the merge logic — it allocates a target tensor based on one cache's dimensions but then tries to copy data from another cache with a different _idx.
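The shape mismatch can be reproduced in isolation. The sketch below uses NumPy in place of mlx.core arrays (the broadcasting rules behave the same way for this case) and the shapes from the traceback above; it illustrates the failing slice assignment, not the exact internals of `BatchRotatingKVCache.merge()`:

```python
import numpy as np

# Dimensions taken from the traceback: 2 KV heads, head_dim 128.
# The two requests leave their RotatingKVCache entries with different
# numbers of valid positions (_idx): 3935 vs 2048.
n_kv_heads, head_dim = 2, 128
idx_a, idx_b = 3935, 2048

# Merge allocates the batched key tensor using one cache's dimensions...
keys = np.zeros((2, n_kv_heads, idx_a, head_dim))

# ...then copies each per-request cache in. The second cache's reordered
# keys only cover 2048 positions, so the assignment cannot broadcast,
# analogous to: keys[i : i + 1, :, p : p + c._idx] = c._temporal_order(c.keys)
src = np.zeros((1, n_kv_heads, idx_b, head_dim))
try:
    keys[1:2, :, 0:idx_a] = src
except ValueError as e:
    print(type(e).__name__)  # ValueError
```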
Workaround
Disabling batching in server.py avoids the crash:

```python
# In ModelProvider.load(), after the is_batchable check:
self.is_batchable = False
```

This forces sequential request processing. The `--prompt-cache-size 1` and `--decode-concurrency 1` flags do not prevent the crash because the server still attempts to batch concurrent requests.
Expected Behavior
Concurrent requests with different prompt lengths should either:
- Be merged correctly (padding/truncating caches to compatible shapes), or
- Fall back to sequential processing when cache shapes are incompatible
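As a rough illustration of the first option, the sketch below pads each cache's key tensor along the sequence axis up to the largest `_idx` before stacking. It uses NumPy and hypothetical names (`pad_to_len`, `per_cache_keys`) rather than mlx_lm internals; a real fix inside `BatchRotatingKVCache.merge()` would also need to track per-sequence offsets and attention masks so padded positions are never attended to:

```python
import numpy as np

def pad_to_len(arr: np.ndarray, target_len: int) -> np.ndarray:
    """Zero-pad axis 2 (the sequence axis) of a (1, heads, seq, dim) array
    up to target_len, leaving longer arrays unchanged."""
    pad = target_len - arr.shape[2]
    if pad <= 0:
        return arr
    padding = np.zeros(
        (arr.shape[0], arr.shape[1], pad, arr.shape[3]), dtype=arr.dtype
    )
    return np.concatenate([arr, padding], axis=2)

# Per-request key tensors with the mismatched lengths from the traceback.
per_cache_keys = [
    np.ones((1, 2, 3935, 128)),
    np.ones((1, 2, 2048, 128)),
]

# Pad every cache to the longest valid length, then stack along the batch axis.
max_len = max(k.shape[2] for k in per_cache_keys)
batched = np.concatenate([pad_to_len(k, max_len) for k in per_cache_keys], axis=0)
print(batched.shape)  # (2, 2, 3935, 128)
```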