Commit 27fc9b9

Update SKILL.md
1 parent 1160f8a commit 27fc9b9

File tree: 8 files changed, +609 −12 lines

skills/mlx-swift-lm/SKILL.md

Lines changed: 84 additions & 4 deletions
@@ -1,6 +1,6 @@
 ---
 name: swift-mlx-lm
-description: MLX Swift LM - Run LLMs and VLMs on Apple Silicon using MLX. Covers local inference, streaming, wired memory coordination, tool calling, LoRA fine-tuning, embeddings, and model porting.
+description: MLX Swift LM - Run LLMs and VLMs on Apple Silicon using MLX. Covers local inference, streaming, batched inference, wired memory coordination, tool calling, LoRA fine-tuning, embeddings, and model porting.
 triggers:
 - mlx
 - mlx-swift
@@ -14,26 +14,34 @@ triggers:
 - wired memory ticket
 - model porting
 - add model support
+- batching
+- batch inference
+- continuous batching
+- inference scheduler
+- prompt cache
 ---
 
 # mlx-swift-lm Skill
 
 ## 1. Overview & Triggers
 
-mlx-swift-lm is a Swift package for running Large Language Models (LLMs) and Vision-Language Models (VLMs) on Apple Silicon using MLX. It supports local inference, streaming generation, wired-memory coordination, tool calling, LoRA/DoRA fine-tuning, and embeddings.
+mlx-swift-lm is a Swift package for running Large Language Models (LLMs) and Vision-Language Models (VLMs) on Apple Silicon using MLX. It supports local inference, streaming generation, continuous batching (multiple concurrent requests), wired-memory coordination, prompt caching, tool calling, LoRA/DoRA fine-tuning, and embeddings.
 
 ### When to Use This Skill
 - Running LLM/VLM inference on macOS/iOS with Apple Silicon
 - Streaming text generation from local models
+- Batching multiple concurrent inference requests for throughput
 - Coordinating concurrent inference with wired-memory policies and tickets
+- Caching prompt prefill KV state across requests
 - Tool calling / function calling with models
 - LoRA adapter training and fine-tuning
 - Text embeddings for RAG/semantic search
 - Porting model architectures from Python MLX-LM to Swift
 
 ### Architecture Overview
 ```
-MLXLMCommon - Core infra (ModelContainer, ChatSession, Evaluate, KVCache, wired memory helpers)
+MLXLMCommon - Core infra (ModelContainer, ChatSession, Evaluate, KVCache, Batching, wired memory helpers)
+  Batching/ - InferenceScheduler, BatchKVCache, BatchTokenIterator, LRUPromptCache
 MLXLLM - Text-only LLM support (Llama, Qwen, Gemma, Phi, DeepSeek, etc.)
 MLXVLM - Vision-Language Models (Qwen-VL, PaliGemma, Gemma3, etc.)
 MLXEmbedders - Embedding models and pooling utilities
@@ -47,6 +55,11 @@ MLXEmbedders - Embedding models and pooling utilities
 | Simplified chat API | `Libraries/MLXLMCommon/ChatSession.swift` |
 | Generation & streaming APIs | `Libraries/MLXLMCommon/Evaluate.swift` |
 | KV cache types | `Libraries/MLXLMCommon/KVCache.swift` |
+| Batch inference scheduler | `Libraries/MLXLMCommon/Batching/InferenceScheduler.swift` |
+| Batch KV caches | `Libraries/MLXLMCommon/Batching/BatchKVCache.swift`, `BatchRotatingKVCache.swift` |
+| Batch token engine | `Libraries/MLXLMCommon/Batching/BatchTokenIterator.swift` |
+| Batch-aware RoPE helper | `Libraries/MLXLMCommon/Batching/BatchPositionedCache.swift` |
+| Prompt cache (LRU) | `Libraries/MLXLMCommon/Batching/LRUPromptCache.swift` |
 | Wired-memory policies | `Libraries/MLXLMCommon/WiredMemoryPolicies.swift` |
 | Wired-memory measurement helpers | `Libraries/MLXLMCommon/WiredMemoryUtils.swift` |
 | Model configuration | `Libraries/MLXLMCommon/ModelConfiguration.swift` |
@@ -224,8 +237,14 @@ let params = GenerateParameters(
     quantizedKVStart: 0,       // Token index to start KV quantization
     temperature: 0.7,          // 0 = greedy / argmax
     topP: 0.9,                 // Nucleus sampling
+    topK: 40,                  // Top-K sampling (0 disables)
+    minP: 0.05,                // Min-P threshold relative to max prob (0 disables)
     repetitionPenalty: 1.1,    // Penalize repeats
     repetitionContextSize: 20, // Penalty window
+    presencePenalty: 0.0,      // Additive penalty for tokens in recent context
+    presenceContextSize: 20,   // Presence penalty window
+    frequencyPenalty: 0.0,     // Additive penalty scaling with token frequency
+    frequencyContextSize: 20,  // Frequency penalty window
     prefillStepSize: 512       // Prompt prefill chunk size
 )
 ```
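
As a rough illustration of what the `topK` and `minP` knobs in this hunk do, here is a self-contained sketch in plain Swift over a toy probability array (no MLX; `filterTopK` and `filterMinP` are hypothetical names for this sketch, not mlx-swift-lm API):

```swift
// Illustration only: how top-K and min-P filters prune a probability
// distribution before sampling. Not the mlx-swift-lm implementation.

// Keep only the k highest-probability entries (k = 0 disables the filter).
func filterTopK(_ probs: [Double], k: Int) -> [Double] {
    guard k > 0, k < probs.count else { return probs }
    let threshold = probs.sorted(by: >)[k - 1]
    return probs.map { $0 >= threshold ? $0 : 0 }
}

// Drop entries below minP * max(probs) (minP = 0 disables the filter).
func filterMinP(_ probs: [Double], minP: Double) -> [Double] {
    guard minP > 0, let maxP = probs.max() else { return probs }
    return probs.map { $0 >= minP * maxP ? $0 : 0 }
}

let probs = [0.5, 0.3, 0.15, 0.04, 0.01]
let afterTopK = filterTopK(probs, k: 2)       // [0.5, 0.3, 0, 0, 0]
let afterMinP = filterMinP(probs, minP: 0.1)  // keeps entries >= 0.05
```

Note that min-P is adaptive: its cutoff scales with the peak probability, so it prunes aggressively only when the model is confident.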
@@ -256,6 +275,46 @@ for await generation in stream {
 
 For policy selection, reservations, and measurement-based budgeting, see [references/wired-memory.md](references/wired-memory.md).
 
+### Batched Inference (Continuous Batching)
+
+Enable transparent batching to serve multiple concurrent requests through a single model:
+
+```swift
+let scheduler = InferenceScheduler()
+let promptCache = LRUPromptCache(maxSize: 10)
+
+let container = try await LLMModelFactory.shared.loadContainer(
+    configuration: .init(id: "mlx-community/Qwen3-4B-4bit")
+)
+container.scheduler = scheduler
+container.promptCache = promptCache
+
+// Multiple concurrent requests are automatically batched
+async let r1 = container.generate(input: input1, parameters: params)
+async let r2 = container.generate(input: input2, parameters: params)
+```
+
+The scheduler uses a single-first upgrade strategy:
+- First request runs via the fast `TokenIterator` path (zero batch overhead)
+- When a second request arrives, the scheduler upgrades to `BatchTokenIterator` by migrating KV caches
+- State machine: `.idle` → `.single` → `.batched`
+
+Raw token batching is also supported:
+```swift
+let tokenStream = try await container.generateTokens(
+    input: lmInput,
+    parameters: params
+)
+for await event in tokenStream {
+    switch event {
+    case .token(let tokenID): print(tokenID)
+    case .info(let info): print("stop=\(info.stopReason)")
+    }
+}
+```
+
+See [references/batching.md](references/batching.md) for full API details.
+
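
The single-first upgrade strategy described in this hunk can be sketched as a tiny state machine (plain Swift, illustration only; the real `InferenceScheduler` also migrates KV caches and is not shown here):

```swift
// Simplified sketch of the .idle → .single → .batched upgrade path.
// Not the actual InferenceScheduler implementation.
enum SchedulerState: Equatable {
    case idle           // no active requests
    case single         // one request on the fast TokenIterator path
    case batched(Int)   // n requests sharing a BatchTokenIterator
}

// What happens when one more request is admitted.
func admit(_ state: SchedulerState) -> SchedulerState {
    switch state {
    case .idle:            return .single         // first request: fast path
    case .single:          return .batched(2)     // second request: upgrade
    case .batched(let n):  return .batched(n + 1) // join the existing batch
    }
}

var state = SchedulerState.idle
state = admit(state)  // .single
state = admit(state)  // .batched(2)
```

The design keeps the common single-user case free of batching overhead: the upgrade cost is paid only when a second request actually arrives.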
 ### Prompt Caching / History Re-hydration
 
 ```swift
@@ -331,6 +390,14 @@ await task.value
 // DO: Use wired tickets when coordinating concurrent workloads
 let ticket = WiredSumPolicy().ticket(size: estimatedBytes)
 let _ = try await modelContainer.generate(input: lmInput, parameters: params, wiredMemoryTicket: ticket)
+
+// DO: Enable batching for multi-user/multi-request scenarios
+container.scheduler = InferenceScheduler()
+container.promptCache = LRUPromptCache(maxSize: 10)
+
+// DO: Use applyRotaryPosition() in model implementations for batch compatibility
+queries = applyRotaryPosition(rope, to: queries, cache: cache)
+keys = applyRotaryPosition(rope, to: keys, cache: cache)
 ```
 
 ### DON'T
@@ -348,6 +415,13 @@ for await item in stream {
     if shouldStop { break }
 }
 // await task.value is required for deterministic cleanup
+
+// DON'T: Use scalar rope(x, offset: cache.offset) in models.
+// Use applyRotaryPosition(rope, to: x, cache: cache) instead.
+// Scalar offset breaks RoPE for left-padded batch sequences.
+
+// DON'T: Use deprecated createAttentionMask(h:cache:[KVCache]?)
+// Use cache.makeMask(n:windowSize:returnArray:) or the single-cache overload
 ```
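
To see why a shared scalar offset fails for left-padded batches, here is a minimal position calculation in plain Swift (`positions` is a hypothetical helper for this sketch, not the library's `applyRotaryPosition`): each padded row needs its own position sequence, which one scalar cannot express.

```swift
// Illustration only: per-row RoPE positions in a left-padded batch.
// Padding columns are clamped to position 0; real tokens count from 0
// within their own sequence, so rows with different padding need
// different position vectors.
func positions(paddedLength: Int, realLength: Int) -> [Int] {
    let pad = paddedLength - realLength
    return (0..<paddedLength).map { max(0, $0 - pad) }
}

// Batch of two sequences padded to length 5:
let shortRow = positions(paddedLength: 5, realLength: 3)  // [0, 0, 0, 1, 2]
let fullRow  = positions(paddedLength: 5, realLength: 5)  // [0, 1, 2, 3, 4]
```

A single `cache.offset` would assign the same positions to both rows, rotating the short sequence's tokens as if its padding were real context.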
 
 ### Thread Safety
@@ -370,6 +444,7 @@ await session.clear()
 |-----------|-------------|
 | [references/model-container.md](references/model-container.md) | Loading models, ModelContainer API, ModelConfiguration |
 | [references/generation.md](references/generation.md) | `generate`, `generateTask`, raw token streaming APIs |
+| [references/batching.md](references/batching.md) | InferenceScheduler, BatchKVCache, BatchTokenIterator, LRUPromptCache |
 | [references/wired-memory.md](references/wired-memory.md) | Wired tickets, policies, budgeting, reservations |
 | [references/kv-cache.md](references/kv-cache.md) | Cache types, memory optimization, cache serialization |
 | [references/concurrency.md](references/concurrency.md) | Thread safety, SerialAccessContainer, async patterns |
@@ -389,7 +464,8 @@ await session.clear()
 | `perform { model, tokenizer in }` | `perform { context in }` |
 | `TokenIterator(prompt: MLXArray)` | `TokenIterator(input: LMInput)` |
 | `ModelRegistry` typealias | `LLMRegistry` or `VLMRegistry` |
-| `createAttentionMask(h:cache:[KVCache]?)` | `createAttentionMask(h:cache:KVCache?)` |
+| `createAttentionMask(h:cache:[KVCache]?)` | `createAttentionMask(h:cache:KVCache?)` or `cache.makeMask(n:windowSize:returnArray:)` |
+| `rope(x, offset: cache.offset)` (scalar) | `applyRotaryPosition(rope, to: x, cache: cache)` (batch-safe) |
 
 ## 9. Automatic vs Manual Configuration
 
@@ -415,5 +491,9 @@ await session.clear()
 | `toolCallFormat` | Override auto-detected tool parser format |
 | `maxKVSize` | Enable sliding window cache |
 | `kvBits`, `kvGroupSize`, `quantizedKVStart` | Enable and tune KV quantization |
+| `topK`, `minP` | Enable top-K / min-P sampling filters |
+| `presencePenalty`, `frequencyPenalty` | Penalize repeated tokens by presence/frequency |
 | `prefillStepSize` | Tune prompt prefill chunking/perf tradeoff |
 | `wiredMemoryTicket` | Coordinate policy-based wired-memory limits |
+| `container.scheduler` | Enable continuous batching for concurrent requests |
+| `container.promptCache` | Enable LRU prompt cache across requests |
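
As a sketch of the eviction behavior a `maxSize`-bounded prompt cache implies, here is a generic LRU cache in plain Swift (illustration only; the real `LRUPromptCache` keys on prompt prefixes and stores prefill KV state, none of which is shown here):

```swift
// Minimal LRU cache sketch: recently used keys survive, the least
// recently used entry is evicted once maxSize is exceeded.
// Not the mlx-swift-lm LRUPromptCache API.
struct LRUCache<Key: Hashable, Value> {
    private var order: [Key] = []           // least recently used first
    private var store: [Key: Value] = [:]
    let maxSize: Int

    init(maxSize: Int) { self.maxSize = maxSize }

    mutating func get(_ key: Key) -> Value? {
        guard let value = store[key] else { return nil }
        order.removeAll { $0 == key }
        order.append(key)                   // mark as most recently used
        return value
    }

    mutating func put(_ key: Key, _ value: Value) {
        order.removeAll { $0 == key }
        order.append(key)
        store[key] = value
        if store.count > maxSize {
            let evicted = order.removeFirst()
            store[evicted] = nil
        }
    }
}

var cache = LRUCache<String, Int>(maxSize: 2)
cache.put("promptA", 1)
cache.put("promptB", 2)
_ = cache.get("promptA")   // touch A so it is most recent
cache.put("promptC", 3)    // evicts B, the least recently used
```

The practical upshot for serving: hot prompt prefixes (e.g. a shared system prompt) stay cached across requests, while rarely reused prefixes age out instead of growing memory without bound.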
