skills/mlx-swift-lm/SKILL.md (84 additions, 4 deletions)
@@ -1,6 +1,6 @@
 ---
 name: swift-mlx-lm
-description: MLX Swift LM - Run LLMs and VLMs on Apple Silicon using MLX. Covers local inference, streaming, wired memory coordination, tool calling, LoRA fine-tuning, embeddings, and model porting.
+description: MLX Swift LM - Run LLMs and VLMs on Apple Silicon using MLX. Covers local inference, streaming, batched inference, wired memory coordination, tool calling, LoRA fine-tuning, embeddings, and model porting.
 triggers:
 - mlx
 - mlx-swift
@@ -14,26 +14,34 @@ triggers:
 - wired memory ticket
 - model porting
 - add model support
+- batching
+- batch inference
+- continuous batching
+- inference scheduler
+- prompt cache
 ---

 # mlx-swift-lm Skill

 ## 1. Overview & Triggers

-mlx-swift-lm is a Swift package for running Large Language Models (LLMs) and Vision-Language Models (VLMs) on Apple Silicon using MLX. It supports local inference, streaming generation, wired-memory coordination, tool calling, LoRA/DoRA fine-tuning, and embeddings.
+mlx-swift-lm is a Swift package for running Large Language Models (LLMs) and Vision-Language Models (VLMs) on Apple Silicon using MLX. It supports local inference, streaming generation, continuous batching (multiple concurrent requests), wired-memory coordination, prompt caching, tool calling, LoRA/DoRA fine-tuning, and embeddings.

 ### When to Use This Skill
 - Running LLM/VLM inference on macOS/iOS with Apple Silicon
 - Streaming text generation from local models
+- Batching multiple concurrent inference requests for throughput
 - Coordinating concurrent inference with wired-memory policies and tickets
+- Caching prompt prefill KV state across requests
 - Tool calling / function calling with models
 - LoRA adapter training and fine-tuning
 - Text embeddings for RAG/semantic search
 - Porting model architectures from Python MLX-LM to Swift
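
The streaming generation and per-session prompt caching that the overview describes can be sketched with the simplified session API from the MLX Swift LM packages. This is a hedged sketch, not code from this PR: the names `loadModel(id:)`, `ChatSession`, `respond(to:)`, and `streamResponse(to:)` are recalled from the mlx-swift-examples packages and may differ between versions, and the model id is a placeholder for illustration.

```swift
import MLXLLM
import MLXLMCommon

// Placeholder model id (assumption); any MLX-converted chat model on the Hub works.
let model = try await loadModel(id: "mlx-community/Qwen3-0.6B-4bit")

// A ChatSession keeps conversation state across turns, so the prefill
// KV state of earlier turns is reused rather than recomputed.
let session = ChatSession(model)

// One-shot generation:
print(try await session.respond(to: "Name two planets in our solar system."))

// Streaming generation: chunks arrive incrementally as an AsyncSequence,
// which is what powers token-by-token UI updates.
for try await chunk in session.streamResponse(to: "Why is the sky blue?") {
    print(chunk, terminator: "")
}
```

The session object is the natural place to hang the skill's other concerns (wired-memory tickets, batching across sessions), since each session corresponds to one logical request stream.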