Description
mlx_lm.server crashes when handling concurrent requests to models that use RotatingKVCache (e.g. models with sliding window attention). The crash occurs in BatchRotatingKVCache.merge() when trying to merge caches with different _idx values, producing a shape mismatch.
This is reproducible with any client that sends multiple requests concurrently (e.g. OpenCode, which fires a short system probe and a full prompt simultaneously).
Environment
- mlx-lm: 0.31.1
- mlx: 0.24.2
- macOS: 15.5 (Apple M3 Max, 48 GB)
- Python: 3.12
- Model: mlx-community/Trinity-Nano-Preview-8bit (afmoe architecture, uses RotatingKVCache)
Also reproduced with mlx-community/Trinity-Mini-8bit.
Steps to Reproduce
- Start the server:

```shell
uvx --from "mlx-lm>=0.28.4" mlx_lm.server \
  --model mlx-community/Trinity-Nano-Preview-8bit \
  --port 8080
```

- Send two concurrent requests with different prompt lengths:

```shell
# In parallel (e.g. using & or a client that sends concurrent requests)
curl -s http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"messages": [{"role": "user", "content": "Hi"}], "max_tokens": 10}' &
curl -s http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"messages": [{"role": "system", "content": "You are a helpful assistant. '"$(python3 -c "print('x ' * 5000)")"'"}, {"role": "user", "content": "Hello"}], "max_tokens": 10}' &
wait
```

The first request succeeds, but the second crashes the server's generate thread.
Error

```
Exception in thread Thread-2 (_generate):
Traceback (most recent call last):
  File ".../mlx_lm/server.py", line 948, in _generate
    responses = batch_generator.next()
  File ".../mlx_lm/generate.py", line 1329, in next
    return self._next()
  File ".../mlx_lm/generate.py", line 1257, in _next
    batch = self._process_prompts(prompts)
  File ".../mlx_lm/generate.py", line 1126, in _process_prompts
    prompt_cache = _merge_caches(caches)
  File ".../mlx_lm/generate.py", line 922, in _merge_caches
    batch_cache.append(caches[0][i].merge([c[i] for c in caches]))
  File ".../mlx_lm/models/cache.py", line 580, in merge
    return BatchRotatingKVCache.merge(caches)
  File ".../mlx_lm/models/cache.py", line 1364, in merge
    keys[i : i + 1, :, p : p + c._idx] = c._temporal_order(c.keys)
ValueError: [broadcast_shapes] Shapes (1,2,3935,128) and (1,2,2048,128) cannot be broadcast.
```
Analysis
In cache.py:1364, BatchRotatingKVCache.merge() assumes all caches being merged have compatible _idx values. When two requests have very different prompt lengths (e.g. 519 tokens vs 10,081 tokens), their RotatingKVCache entries end up with different _idx values (e.g. 3935 vs 2048), and the slice assignment fails because the shapes don't match.
The issue is in the merge logic — it allocates a target tensor based on one cache's dimensions but then tries to copy data from another cache with a different _idx.
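The shape mismatch can be reproduced in isolation. The sketch below uses NumPy in place of mlx.core arrays (the broadcasting rules behave the same way for this case) and the shapes from the traceback above; it illustrates the failing slice assignment, not the exact internals of `BatchRotatingKVCache.merge()`:

```python
import numpy as np

# Dimensions taken from the traceback: 2 KV heads, head_dim 128.
# The two requests leave their RotatingKVCache entries with different
# numbers of valid positions (_idx): 3935 vs 2048.
n_kv_heads, head_dim = 2, 128
idx_a, idx_b = 3935, 2048

# Merge allocates the batched key tensor using one cache's dimensions...
keys = np.zeros((2, n_kv_heads, idx_a, head_dim))

# ...then copies each per-request cache in. The second cache's reordered
# keys only cover 2048 positions, so the assignment cannot broadcast,
# analogous to: keys[i : i + 1, :, p : p + c._idx] = c._temporal_order(c.keys)
src = np.zeros((1, n_kv_heads, idx_b, head_dim))
try:
    keys[1:2, :, 0:idx_a] = src
except ValueError as e:
    print(type(e).__name__)  # ValueError
```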
Workaround
Disabling batching in server.py avoids the crash:

```python
# In ModelProvider.load(), after the is_batchable check:
self.is_batchable = False
```

This forces sequential request processing. The `--prompt-cache-size 1` and `--decode-concurrency 1` flags do not prevent the crash because the server still attempts to batch concurrent requests.
Expected Behavior
Concurrent requests with different prompt lengths should either:
- Be merged correctly (padding/truncating caches to compatible shapes), or
- Fall back to sequential processing when cache shapes are incompatible
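As a rough illustration of the first option, the sketch below pads each cache's key tensor along the sequence axis up to the largest `_idx` before stacking. It uses NumPy and hypothetical names (`pad_to_len`, `per_cache_keys`) rather than mlx_lm internals; a real fix inside `BatchRotatingKVCache.merge()` would also need to track per-sequence offsets and attention masks so padded positions are never attended to:

```python
import numpy as np

def pad_to_len(arr: np.ndarray, target_len: int) -> np.ndarray:
    """Zero-pad axis 2 (the sequence axis) of a (1, heads, seq, dim) array
    up to target_len, leaving longer arrays unchanged."""
    pad = target_len - arr.shape[2]
    if pad <= 0:
        return arr
    padding = np.zeros(
        (arr.shape[0], arr.shape[1], pad, arr.shape[3]), dtype=arr.dtype
    )
    return np.concatenate([arr, padding], axis=2)

# Per-request key tensors with the mismatched lengths from the traceback.
per_cache_keys = [
    np.ones((1, 2, 3935, 128)),
    np.ones((1, 2, 2048, 128)),
]

# Pad every cache to the longest valid length, then stack along the batch axis.
max_len = max(k.shape[2] for k in per_cache_keys)
batched = np.concatenate([pad_to_len(k, max_len) for k in per_cache_keys], axis=0)
print(batched.shape)  # (2, 2, 3935, 128)
```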