
Add continuous batching support for concurrent LLM inference#150

Open
ronaldmannak wants to merge 101 commits into ml-explore:main from PicoMLX:batching3

Conversation

@ronaldmannak
Contributor

Proposed changes

  • Continuous batching engine that transparently serves multiple concurrent requests in a shared decode loop, with zero overhead for single requests
  • BatchKVCache & BatchRotatingKVCache implementations with left-padding strategy for batched attention across sequences of different lengths
  • InferenceScheduler actor that starts in single-request mode (TokenIterator) and automatically upgrades to BatchTokenIterator when a second request arrives; 3rd+ requests join the existing batch on the fly
  • LRUPromptCache with trie-based prefix matching and LRU eviction for reusing KV state across requests with shared prefixes
  • Batch-aware RoPE via the new applyRotaryPosition helper — all 45 modified MLXLLM model files with RoPE-based attention paths now use batch-aware position handling that works in both single and batched modes
  • Automatic fallback for incompatible requests (VLMs, hybrid/SSM models using MambaCache or CacheList, and quantized KV-cache requests) — routed to the single-request path with no caller changes needed
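To make the left-padding strategy above concrete, here is a minimal sketch of the alignment idea (plain Swift for illustration; it mirrors the description, not the PR's actual padding code):

```swift
// Illustrative left-padding: shorter prompts are padded on the left so
// every sequence in the batch ends at the same position. This keeps the
// causal mask and per-sequence RoPE offsets aligned during decode.
let prompts: [[Int]] = [[11, 12, 13, 14, 15], [21, 22]]
let maxLen = prompts.map(\.count).max() ?? 0
let leftPadding = prompts.map { maxLen - $0.count }   // [0, 3]
let padded = prompts.map { Array(repeating: 0, count: maxLen - $0.count) + $0 }
// padded = [[11, 12, 13, 14, 15], [0, 0, 0, 21, 22]]
```

The per-sequence `leftPadding` values are what `BatchKVCache` carries so that padded positions can be masked out and RoPE positions shifted per sequence.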

Design

This PR adds batching support for text-generation LLMs. VLM requests are not batched in this PR, and embedding models are out of scope for this batching path.

Most LLM model types are supported. Batched generation is currently limited to models whose cache stack uses KVCacheSimple and/or RotatingKVCache, and is unavailable for models that use MambaCache or CacheList. Requests using those models automatically fall back to the single inference path. Support for MambaCache and CacheList will be added in a separate PR.

QuantizedKVCache is also not supported for batching. Requests with kvBits != nil, or requests that already carry quantized KV cache state, automatically fall back to the single inference path.

The batching system is opt-in via ModelContainer.scheduler:

```swift
let container = ModelContainer(context: context)
container.scheduler = InferenceScheduler()
// Existing generate() calls now batch transparently when the request is compatible
```
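For example, two concurrent requests against the same container end up sharing the batched decode loop (a hedged sketch: the exact `generate` call shape is illustrative, following the existing MLXLMCommon-style API rather than quoting it):

```swift
// Illustrative sketch: with `scheduler` set, the second in-flight request
// upgrades the scheduler from single-request mode to batched mode; callers
// do not change anything.
let container = ModelContainer(context: context)
container.scheduler = InferenceScheduler()

async let first = container.generate(prompt: "Summarize this document.")
async let second = container.generate(prompt: "Translate this to French.")
let (a, b) = await (first, second)
```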

Incompatible models that use MambaCache and/or CacheList are:

  • falcon_h1
  • baichuan_m1
  • lfm2
  • lfm2_moe
  • granitemoehybrid
  • qwen3_next
  • qwen3_5, qwen3_5_text, qwen3_5_moe
  • nemotron_h
  • jamba_3b
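The fallback rule boils down to a cache-type and quantization check. The helper name `isBatchCompatible` appears in this PR's commits; the body below is an illustrative reconstruction under that assumption, not the actual implementation:

```swift
// Illustrative reconstruction of the fallback rule: a request is batch-
// compatible only if no KV quantization is requested and every cache
// layer is a KVCacheSimple or RotatingKVCache (i.e. no MambaCache,
// CacheList, or QuantizedKVCache anywhere in the stack).
func isBatchCompatible(caches: [KVCache], kvBits: Int?) -> Bool {
    guard kvBits == nil else { return false }
    return caches.allSatisfy { cache in
        cache is KVCacheSimple || cache is RotatingKVCache
    }
}
```

Requests that fail this check are routed to the single-request path, so callers never see an error for an unsupported model.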

Checklist

  • I have read the CONTRIBUTING document
  • I have run pre-commit run --all-files to format my code / installed pre-commit prior to committing changes
  • I have added tests that prove my fix is effective or that my feature works
  • I have updated the necessary documentation (if needed)

ronaldmannak and others added 24 commits March 27, 2026 15:01
Set up .factory/ directory with worker skills, services manifest,
init script, and library knowledge files for the batching mission.

Co-authored-by: factory-droid[bot] <138933559+factory-droid[bot]@users.noreply.github.com>
…hing

Add Libraries/MLXLMCommon/Batching/BatchKVCache.swift porting Python mlx-lm's
BatchKVCache. Includes: init with leftPadding, update with step-based buffer
allocation, filter(batchIndices:) with left-shift optimization, extend(other:)
with right-justification, extract(idx:) returning KVCacheSimple with padding
stripped, merge([KVCache]) class method, fromSingle/toSingle conversion,
state serialization, and empty batch handling.

Add comprehensive XCTest suite in Tests/MLXLMTests/BatchKVCacheTests.swift
with 22 test cases covering all validation contract assertions (VAL-CACHE-001
through VAL-CACHE-021).

Co-authored-by: factory-droid[bot] <138933559+factory-droid[bot]@users.noreply.github.com>
…taryPosition helper

- Add leftPadding parameter to createCausalMask() for per-sequence padding masks (backward compatible)
- Implement makeMask() on BatchKVCache that always masks padding (including n=1 decode steps)
- Create BatchPositionedKVCache protocol with batchOffset for per-sequence RoPE offsets
- Implement applyRotaryPosition() dispatching to ArrayOffsetLayer for batch, OffsetLayer for single
- Add isBatchCompatible() detection for CacheList, MambaCache, and QuantizedKVCache
- Make BatchKVCache conform to BatchPositionedKVCache
- Add 18 unit tests covering all validation assertions

Co-authored-by: factory-droid[bot] <138933559+factory-droid[bot]@users.noreply.github.com>
Port BatchRotatingKVCache from Python mlx-lm for models using sliding-window
attention. Supports init with maxSize/leftPadding, multi-token concat path,
single-token in-place rotation, temporal ordering, filter/extend/extract,
merge from RotatingKVCache instances (with maxSize mismatch rejection),
makeMask with window and left-padding, and fromSingle/toSingle conversions.
Conforms to BatchPositionedKVCache protocol. Extract returns RotatingKVCache.

Co-authored-by: factory-droid[bot] <138933559+factory-droid[bot]@users.noreply.github.com>
The MLX Metal shader library (.metallib) is not bundled in SPM debug
builds, causing tests that trigger GPU evaluation to crash the test
runner. This adds an MLXMetalGuard helper that probes Metal availability
using withError/eval, and XCTSkipUnless/.enabled(if:) guards to all
MLX-dependent tests across the test suite.

Changes:
- New MLXMetalGuard.swift with cached Metal availability detection
- skipIfMetalUnavailable() helper for XCTest-based tests
- BatchKVCacheTests: all 22 tests guarded, fixed always-true 'is' check
- BatchMaskingAndPositionTests: 11 Metal tests guarded, fixed unused k/v bindings
- BatchRotatingKVCacheTests: all 22 tests guarded, fixed always-true 'is' checks
- KVCacheTests: .enabled(if:) guard for Swift Testing
- ChatSessionTests, EvalTests, SampleTests, NemotronHTests,
  MediaProcessingTests: guarded Metal-dependent tests

swift test --filter MLXLMTests now exits with code 0 (117 skipped, 20 pass).

Co-authored-by: factory-droid[bot] <138933559+factory-droid[bot]@users.noreply.github.com>
Auto-format 4 batch files using swift-format: fix import ordering
(@testable imports after regular imports) in 3 test files, and fix
line length violation in BatchKVCache.swift.

Co-authored-by: factory-droid[bot] <138933559+factory-droid[bot]@users.noreply.github.com>
Capture the milestone synthesis, feature review reports, and shared validation knowledge for the next fix round.

Co-authored-by: factory-droid[bot] <138933559+factory-droid[bot]@users.noreply.github.com>
Co-authored-by: factory-droid[bot] <138933559+factory-droid[bot]@users.noreply.github.com>
…dable warning

- State getter now always includes batchOffsets and leftPadding even when
  keys/values are nil (fresh cache or emptied by filter([])). Setter handles
  both 2-element (empty) and 4-element (populated) state arrays.
- makeMask() now uses _idx directly as the offset (pre-update value) instead
  of _idx - n, aligning with how models call makeMask before cache.update().
- KVCacheTests.swift closure arguments annotated with @sendable to fix
  Swift Testing strict concurrency warning.
- Added round-trip tests for fresh and filter-emptied caches, plus makeMask
  pre-update call order tests.

Co-authored-by: factory-droid[bot] <138933559+factory-droid[bot]@users.noreply.github.com>
…atchRotatingKVCache

- Implement prepare(leftPadding:lengths:rightPadding:) and finalize() methods
  matching Python mlx-lm's BatchRotatingKVCache for cached-prompt batch prefill
- Add dynamicRoll helper for per-batch element rolling
- Preserve RotatingKVCache.keep through merge/extract/fromSingle/toSingle paths
- Reject caches with different keep values in merge (same as maxSize rejection)
- Make RotatingKVCache.keep internal for cross-file access within module
- Update metaState serialization to include keep value
- Add _lengths state and integration with updateConcat/updateInPlace
- Add 15 new tests: keep round-trip, prepare/finalize, filter-extend with keep

Co-authored-by: factory-droid[bot] <138933559+factory-droid[bot]@users.noreply.github.com>
Co-authored-by: factory-droid[bot] <138933559+factory-droid[bot]@users.noreply.github.com>
…g-window overflow

- trim(): Preserves first keep positions, only trims from window portion
- updateInPlace(): Wraps _idx to keep (not 0) so keep positions never overwritten
- temporalOrder(): Handles keep prefix correctly during rotation unrolling
- makeMask(): Rolls only the window portion of the mask when keep > 0
- extract(): Uses keep-aware rolling for rotated cache extraction
- Added 6 tests covering overflow preservation, wrap semantics, temporal
  ordering with keep, merge-extract round-trip after overflow, keep=0
  regression, and multiple rotation cycles

Co-authored-by: factory-droid[bot] <138933559+factory-droid[bot]@users.noreply.github.com>
Co-authored-by: factory-droid[bot] <138933559+factory-droid[bot]@users.noreply.github.com>
Clamp leftPadding to non-negative (max(0, rawPadding)) before slicing
in extract() to prevent invalid array indices when the rotating cache
has overflowed. Updated testKeepOverflowMergeExtractRoundTrip to assert
actual key/value tensor contents, and added two new tests covering
negative leftPadding scenarios with and without keep prefix.

Co-authored-by: factory-droid[bot] <138933559+factory-droid[bot]@users.noreply.github.com>
Co-authored-by: factory-droid[bot] <138933559+factory-droid[bot]@users.noreply.github.com>
Co-authored-by: factory-droid[bot] <138933559+factory-droid[bot]@users.noreply.github.com>
The mask key dimension must equal _idx (total cached positions), not
_idx + n. Pass offset = _idx - n to createCausalMask so the produced
key-width is _idx. Fixes VAL-CACHE-011 (prefill doubling width) and
VAL-CACHE-020 (decode adding extra column).

Co-authored-by: factory-droid[bot] <138933559+factory-droid[bot]@users.noreply.github.com>
…during overflow

When keep > 0 and sequences have left-padding, the global keep zone
(positions 0..<keep) could contain padding zeros instead of actual
keep-prefix tokens. During rotation, writes into the window zone would
overwrite the real keep tokens that were shifted rightward by padding.

Fix: at the first rotation boundary, roll away each sequence's
left-padding via dynamicRoll so that per-sequence data starts at
position 0. This aligns the keep-prefix with the global keep zone,
preventing data corruption. Subsequent wraps are no-ops since
leftPadding is already ≤ 0.

Co-authored-by: factory-droid[bot] <138933559+factory-droid[bot]@users.noreply.github.com>
Co-authored-by: factory-droid[bot] <138933559+factory-droid[bot]@users.noreply.github.com>
Port Python mlx-lm BatchGenerator to Swift. Includes:
- PendingPrompt: queued prompt with tokens, sampler, processor, maxTokens
- ActiveBatch: holds UIDs, current tokens, caches, per-request state
- BatchTokenIterator: insert/next/remove/close API with prefill scheduling
- Left-padding, prompt sorting by length, chunked prefill
- Per-request sampler and LogitProcessor support
- 16 unit tests covering VAL-ENGINE-001 through VAL-ENGINE-012

Co-authored-by: factory-droid[bot] <138933559+factory-droid[bot]@users.noreply.github.com>
…atchTokenIterator

- Fix LogitProcessor lifecycle: add prompt() initialization during prefill and
  didSample() callback after sampling so penalty state tracks correctly
- Make step() accept processors as inout for proper mutation of penalty state
- Add 10 new tests: per-request sampler independence, processor state isolation,
  batch-vs-single numerical correctness with ArgMax, concurrent safety, asyncEval
  pipelining, processor prompt/didSample verification

Fulfills: VAL-ENGINE-013, VAL-ENGINE-014, VAL-ENGINE-015, VAL-ENGINE-016

Co-authored-by: factory-droid[bot] <138933559+factory-droid[bot]@users.noreply.github.com>
Co-authored-by: factory-droid[bot] <138933559+factory-droid[bot]@users.noreply.github.com>
…enIterator

- Decouple completionBatchSize from prefillBatchSize (no longer clamped)
- Admit min(freeSlots, prefillBatchSize, pendingCount) prompts per step
  so free decode slots are filled even when < prefillBatchSize available
- Add NSLock-based thread safety around all shared mutable state
- Mark BatchTokenIterator as @unchecked Sendable, Response as Sendable
- Update concurrency test with structural invariant assertions
- Add tests for independent batch sizes and partial admission

Co-authored-by: factory-droid[bot] <138933559+factory-droid[bot]@users.noreply.github.com>
Co-authored-by: factory-droid[bot] <138933559+factory-droid[bot]@users.noreply.github.com>
ronaldmannak and others added 27 commits March 27, 2026 15:05
Co-authored-by: factory-droid[bot] <138933559+factory-droid[bot]@users.noreply.github.com>
Remove the fallback path that skipped cache-type assertions when the
scheduler state was missed. Two changes:

1. Add testFromSinglePreservesRotatingKVCacheData: tests the
   BatchRotatingKVCache.fromSingle() conversion directly at the cache
   level with known data, verifying maxSize, keep, non-nil keys/values,
   correct offset, and data dimensions.

2. Rewrite testUpgradePreservesRotatingKVCacheState: use maxTokens:1000
   for the first request to guarantee it is still active when the second
   request arrives, ensuring the scheduler always reaches batched state.
   Remove the else branch that fell back to token-only checks.

The test now ALWAYS verifies cache layer types (BatchKVCache for
KVCacheSimple layers, BatchRotatingKVCache for RotatingKVCache layers).

Co-authored-by: factory-droid[bot] <138933559+factory-droid[bot]@users.noreply.github.com>
Store cache entries under inputTokens + generatedTokens instead of just
inputTokens, so the trie key depth matches the actual KV cache depth.
This matches upstream mlx-lm behavior where the prompt cache stores the
full context so prefix matches work correctly on subsequent lookups.

Changes:
- Single path: collect generated token IDs and write back under full sequence
- Batch path: track per-UID generated tokens and write back under full sequence
- Fix _deepCopy crash when cache has empty state (nil keys/values)
- Add regression tests: same prompt twice gets cache hit, key depth matches cache
- Update existing write-back tests for new key format

Co-authored-by: factory-droid[bot] <138933559+factory-droid[bot]@users.noreply.github.com>
Replace the fixed 50ms delay with a synchronization mechanism that waits
for the first stream to produce at least one token before submitting
the second request. This guarantees the first request is actively
generating when the upgrade triggers, eliminating timing-dependent
flakiness.

Also remove assertions about non-nil keys/values and offset > 0 in the
upgraded BatchRotatingKVCache, since the mock model does not call
cache.update(). Data preservation is already verified by the separate
testFromSinglePreservesRotatingKVCacheData test.

Co-authored-by: factory-droid[bot] <138933559+factory-droid[bot]@users.noreply.github.com>
Co-authored-by: factory-droid[bot] <138933559+factory-droid[bot]@users.noreply.github.com>
Change mock model formula from (lastToken + 1) % vocabSize to
(sum of input tokens % (vocabSize - 1)) + 1, guaranteeing output
tokens are always in range [1, vocabSize-1] and never hit EOS.
Keeps existing AsyncStream synchronization for deterministic upgrade.

Co-authored-by: factory-droid[bot] <138933559+factory-droid[bot]@users.noreply.github.com>
When the first request is upgraded from single to batch mode, its tokens
generated on the single path were not included in the batch write-back key.
This caused the trie key to be shorter than the actual KV cache depth.

Fix: carry generatedTokenIds through LiveIteratorState into the batch loop
and seed the first request's token list with those pre-upgrade tokens.

Co-authored-by: factory-droid[bot] <138933559+factory-droid[bot]@users.noreply.github.com>
Co-authored-by: factory-droid[bot] <138933559+factory-droid[bot]@users.noreply.github.com>
Co-authored-by: factory-droid[bot] <138933559+factory-droid[bot]@users.noreply.github.com>
Move isRotating type check inside the per-layer loop in
processPartialCacheHits() so each layer is individually dispatched
to the correct batch cache path. Previously the blanket first-layer
check silently dropped RotatingKVCache data for mixed-layer models
like Gemma3. Add regression test with MockMixedLayerCacheModel.

Co-authored-by: factory-droid[bot] <138933559+factory-droid[bot]@users.noreply.github.com>
Co-authored-by: factory-droid[bot] <138933559+factory-droid[bot]@users.noreply.github.com>
Make the prompt-cache batching mock models advance KV caches so the mixed-depth cached-prefill test exercises the real final-cache extraction path and keeps cache metadata aligned. Strengthen the integration test to assert each finished request returns an extractable final cache.

Co-authored-by: factory-droid[bot] <138933559+factory-droid[bot]@users.noreply.github.com>
Co-authored-by: factory-droid[bot] <138933559+factory-droid[bot]@users.noreply.github.com>
Co-authored-by: factory-droid[bot] <138933559+factory-droid[bot]@users.noreply.github.com>
Co-authored-by: factory-droid[bot] <138933559+factory-droid[bot]@users.noreply.github.com>
Co-authored-by: factory-droid[bot] <138933559+factory-droid[bot]@users.noreply.github.com>
Co-authored-by: factory-droid[bot] <138933559+factory-droid[bot]@users.noreply.github.com>
@ronaldmannak ronaldmannak marked this pull request as ready for review March 28, 2026 21:40