oMLX - MLX inference server with paged SSD caching for coding agents on Apple Silicon #3203
jundot started this conversation in Show and tell
Hey everyone, big thanks to the MLX team - MLX is what made this project possible.
I built oMLX, an MLX-based LLM inference server with a native macOS menu bar app, designed to make local coding agents (Claude Code, OpenClaw, Cursor) actually usable on Apple Silicon.
The problem
Coding agents send dozens of requests where the prompt prefix keeps shifting. Every existing MLX server invalidates the KV cache when this happens, forcing a full re-computation of the entire context. A few turns into a coding session and you're waiting 30-90s per response. This made local inference practically unusable for agentic workflows.
What oMLX does differently
The core feature is paged SSD caching - KV cache blocks are persisted to disk, so when a previous prefix comes back, it's restored instantly instead of being recomputed. Users are reporting TTFT dropping from 30-90s down to 1-3s on long contexts.
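To make the idea concrete, here is a toy sketch of block-level prefix caching persisted to disk. KV state is split into fixed-size token blocks, each keyed by a rolling hash of the prefix up to that block; a returning prompt reloads its longest cached prefix instead of recomputing it. The class, block size, and file layout below are all illustrative assumptions, not oMLX's actual implementation.

```python
import hashlib
import pickle
from pathlib import Path

BLOCK = 256  # tokens per KV-cache block (hypothetical size)


class PagedPrefixCache:
    """Toy sketch: persist per-block KV state to SSD, keyed by a rolling
    hash of the token prefix, so a returning prefix is restored from disk
    rather than recomputed. Illustrative only, not oMLX's real code."""

    def __init__(self, root="kv_cache"):
        self.root = Path(root)
        self.root.mkdir(exist_ok=True)

    def _block_keys(self, tokens):
        # Rolling hash: each block's key depends on the whole prefix,
        # so a key match implies the entire prefix up to it matches.
        h = hashlib.sha256()
        keys = []
        for i in range(0, len(tokens) - len(tokens) % BLOCK, BLOCK):
            h.update(str(tokens[i:i + BLOCK]).encode("utf-8"))
            keys.append(h.hexdigest())
        return keys

    def longest_cached_prefix(self, tokens):
        """Return (n_cached_tokens, restored_kv_blocks) for this prompt."""
        blocks = []
        for key in self._block_keys(tokens):
            path = self.root / key
            if not path.exists():
                break  # prefix diverges here; recompute from this point on
            blocks.append(pickle.loads(path.read_bytes()))
        return len(blocks) * BLOCK, blocks

    def store(self, tokens, kv_blocks):
        """Persist computed KV blocks so a later request can reuse them."""
        for key, kv in zip(self._block_keys(tokens), kv_blocks):
            (self.root / key).write_bytes(pickle.dumps(kv))
```

Because the hash is cumulative, a prompt that extends a previously seen prefix hits every stored block up to the point of divergence, which is exactly the agentic pattern where only the tail of the context changes between turns.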
Features
Quick start
Download the DMG from Releases, drag to Applications, done. No terminal required to get started. The web dashboard generates the exact CLI command for tools like OpenClaw and Claude Code.
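For example, assuming the server exposes an OpenAI-compatible chat endpoint on a local port (the port and model name below are placeholders; the dashboard shows the exact values for your setup), any standard client can talk to it:

```shell
# Hypothetical example: substitute the port and model shown in the oMLX dashboard.
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "local-model", "messages": [{"role": "user", "content": "Hello"}]}'
```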
oMLX uses vllm-mlx as a starting point for the basic MLX serving layer. Everything on top of that (SSD tiering, continuous batching, VLM support, the Anthropic API, and the native macOS app) is original work. 100% open source, Apache 2.0.
Currently at v0.2.2 with 110+ GitHub stars and 200 commits. I'd love any feedback on the architecture or feature requests - happy to answer questions!
GitHub: https://github.com/jundot/omlx