oMLX - MLX inference server with paged SSD caching for coding agents on Apple Silicon #3203
jundot started this conversation in Show and tell
Hey everyone, big thanks to the MLX team - MLX is what made this project possible.
I built oMLX, an MLX-based LLM inference server with a native macOS menu bar app, designed to make local coding agents (Claude Code, OpenClaw, Cursor) actually usable on Apple Silicon.
The problem
Coding agents send dozens of requests where the prompt prefix keeps shifting. Every existing MLX server invalidates the KV cache when this happens, forcing a full re-computation of the entire context. A few turns into a coding session and you're waiting 30-90s per response. This made local inference practically unusable for agentic workflows.
What oMLX does differently
The core feature is paged SSD caching - KV cache blocks are persisted to disk, so when a previous prefix comes back, it's restored instantly instead of being recomputed. Users are reporting TTFT dropping from 30-90s down to 1-3s on long contexts.
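To make the idea concrete, here is a toy sketch of block-level prefix caching persisted to disk. KV state is split into fixed-size token blocks, each keyed by a rolling hash of the prefix up to that block; a returning prompt reloads its longest cached prefix instead of recomputing it. The class, block size, and file layout below are all illustrative assumptions, not oMLX's actual implementation.

```python
import hashlib
import pickle
from pathlib import Path

BLOCK = 256  # tokens per KV-cache block (hypothetical size)


class PagedPrefixCache:
    """Toy sketch: persist per-block KV state to SSD, keyed by a rolling
    hash of the token prefix, so a returning prefix is restored from disk
    rather than recomputed. Illustrative only, not oMLX's real code."""

    def __init__(self, root="kv_cache"):
        self.root = Path(root)
        self.root.mkdir(exist_ok=True)

    def _block_keys(self, tokens):
        # Rolling hash: each block's key depends on the whole prefix,
        # so a key match implies the entire prefix up to it matches.
        h = hashlib.sha256()
        keys = []
        for i in range(0, len(tokens) - len(tokens) % BLOCK, BLOCK):
            h.update(str(tokens[i:i + BLOCK]).encode("utf-8"))
            keys.append(h.hexdigest())
        return keys

    def longest_cached_prefix(self, tokens):
        """Return (n_cached_tokens, restored_kv_blocks) for this prompt."""
        blocks = []
        for key in self._block_keys(tokens):
            path = self.root / key
            if not path.exists():
                break  # prefix diverges here; recompute from this point on
            blocks.append(pickle.loads(path.read_bytes()))
        return len(blocks) * BLOCK, blocks

    def store(self, tokens, kv_blocks):
        """Persist computed KV blocks so a later request can reuse them."""
        for key, kv in zip(self._block_keys(tokens), kv_blocks):
            (self.root / key).write_bytes(pickle.dumps(kv))
```

Because the hash is cumulative, a prompt that extends a previously seen prefix hits every stored block up to the point of divergence, which is exactly the agentic pattern where only the tail of the context changes between turns.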
Features
Quick start
Download the DMG from Releases, drag to Applications, done. No terminal required to get started. The web dashboard generates the exact CLI command for tools like OpenClaw and Claude Code.
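For example, assuming the server exposes an OpenAI-compatible chat endpoint on a local port (the port and model name below are placeholders; the dashboard shows the exact values for your setup), any standard client can talk to it:

```shell
# Hypothetical example: substitute the port and model shown in the oMLX dashboard.
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "local-model", "messages": [{"role": "user", "content": "Hello"}]}'
```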
oMLX uses vllm-mlx as a starting point for the basic MLX serving layer. Everything on top of that (SSD tiering, continuous batching, VLM support, the Anthropic API, and the native macOS app) is original work. 100% open source, Apache 2.0.
Currently at v0.2.2 with 110+ GitHub stars and 200 commits. I'd love any feedback on the architecture or feature requests - happy to answer questions!
GitHub: https://github.com/jundot/omlx