Releases: jundot/omlx
v0.2.5
What's New
Features
- Presence penalty & min_p sampling: added `presence_penalty` and `min_p` as new sampling parameters for finer control over generation behavior. Configurable per model from the admin panel's model settings. (#94)
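As context for these two parameters, here is a minimal, framework-free sketch of how a presence penalty and min_p filtering typically act on next-token logits and probabilities. The function names and plain-list representation are illustrative, not oMLX's actual implementation:

```python
def apply_presence_penalty(logits, generated_ids, penalty):
    # Subtract a flat penalty from every token id that has already been
    # generated, nudging the model toward tokens it has not used yet.
    out = list(logits)
    for tid in set(generated_ids):
        out[tid] -= penalty
    return out

def min_p_filter(probs, min_p):
    # Drop tokens whose probability is below min_p * max(probs), then
    # renormalize the survivors. Unlike top_p, the cutoff scales with the
    # model's confidence in its best token.
    threshold = min_p * max(probs)
    kept = [p if p >= threshold else 0.0 for p in probs]
    total = sum(kept)
    return [p / total for p in kept]
```

The appeal of min_p is that when the model is confident, the threshold is high and the tail is pruned aggressively; when the distribution is flat, more candidates survive.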
Bug Fixes
- Metal crash on concurrent add_request: serialized `add_request` calls through the MLX executor to prevent Metal GPU crashes under concurrent request submission. (#95)
- HuggingFace model search broken: removed the deprecated `direction` parameter from `huggingface_hub.list_models()` that was silently breaking model search results.
Dependencies
- mlx-vlm updated to 348466f: adds support for new VLM model types (MiniCPM-O, Phi-4-reasoning-vision, Phi-4-Multimodal) and includes various bug fixes. oMLX's model discovery and vision input pipeline updated accordingly.
Thanks to @rsnow for reporting the Metal crash issue!
v0.2.4
What's New
Features
- Skip API key verification (localhost): when the server is bound to localhost, you can now disable API key verification for all API endpoints from global settings. This makes local-only workflows frictionless: no more dummy keys needed. The option automatically resets when switching to a public host. (#92)
- Model alias: set a custom API-visible name for any model via the model settings modal. `/v1/models` returns the alias instead of the directory name, and requests accept both the alias and the original name. Useful when switching between inference providers without reconfiguring clients. (#92)
- Version display: the CLI now shows the version in the startup banner, and the admin navbar displays the running version. (#90)
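A minimal sketch of the localhost-only guard described in the API-key bullet above. The function and variable names are hypothetical; only the behavior (skip applies solely to loopback binds and resets for public hosts) comes from the release note:

```python
# Hypothetical sketch: only skip the API-key check when both the server's
# bind host and the requesting client are loopback addresses, so the
# shortcut cannot apply once the server is exposed publicly.
LOOPBACK = {"127.0.0.1", "::1", "localhost"}

def needs_api_key(bind_host: str, client_host: str, skip_enabled: bool) -> bool:
    local_only = bind_host in LOOPBACK and client_host in LOOPBACK
    return not (skip_enabled and local_only)
```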
Bug Fixes
- Loaded model lost after re-discovery: deleting a model or changing settings triggered model re-discovery, which dropped already-loaded engines from the pool. Loaded models now preserve their runtime state across re-discovery. (#89)
- Text-only VLM quant misdetection: text-only quantizations of natively multimodal models (e.g. Qwen 3.5 122B converted via `mlx_lm.convert`) were misdetected as VLM, causing a failed load attempt on every restart. Now correctly classified as LLM when `vision_config` is absent. (#84)
- SSD cache utilization over 100%: cache utilization could exceed 100% when available disk space shrank after the initial calculation. Now clamped properly.
- Reasoning model output token caching: output tokens from reasoning models (with `<think>` tags) were being cached unnecessarily. Now skipped to avoid polluting the prefix cache.
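The utilization clamp from the SSD cache fix amounts to a one-liner; this is a sketch, not the project's actual function:

```python
def cache_utilization(used_bytes: int, capacity_bytes: int) -> float:
    # Clamp to [0.0, 1.0]: if free disk space shrinks after capacity was
    # first measured, used/capacity could otherwise exceed 100%.
    if capacity_bytes <= 0:
        return 1.0
    return min(1.0, used_bytes / capacity_bytes)
```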
UI Improvements
- Model settings modal reordered: alias / model type / ctx window / max tokens / temperature / top p / top k / rep. penalty / ttl / load defaults
- Alias badge shown next to model name in both model settings list and model manager
New Contributors
Thanks to @rsnow for the contribution!
v0.2.3.post4
Hotfix: Fix crash when running multiple models simultaneously
Fixed a bug where the server process terminated when two or more models received requests at the same time.
Symptom: Server crashes when multiple models are used concurrently (e.g., VLM as interface model + LLM for chat in Open WebUI)
Cause: Each model engine ran GPU operations on a separate thread, causing Metal command buffer races on Apple Silicon
Fix: All model GPU operations now run on a single shared thread. No impact on single-model performance.
v0.2.3.post3
Hotfix
Bug fixes
- Fix VLM concurrent request GPU race condition causing TransferEncodingError and server crash (#80)
- Remove `mx.clear_cache()` from the event loop thread to prevent Metal GPU contention with `_mlx_executor` during concurrent VLM requests
- Always synchronize `generation_stream` on request completion regardless of cache setting (previously skipped when oMLX cache was disabled)
- Add `clear_pending_embeddings()` to the normal completion path for consistency with the abort path
v0.2.3.post2
Hotfix
Bug fixes
- Fix VLM multi-request blocking: second request now starts immediately instead of waiting for the first to finish
- Fix segfault when sending concurrent VLM image requests by ensuring all scheduler steps run on the MLX executor thread (#81)
- Fix missing mcp package crash on server start
- Fix memory limit UI showing incorrect label when set to 0
v0.2.3
Highlight: Vision-Language Model Support with Tiered Caching
Starting with v0.2.0, oMLX sees the world - not just text.
Vision-language models now run natively on your Mac with the same continuous batching, paged KV cache, and SSD-tiered caching that powers text inference. Combined with production-grade tool calling, your Apple Silicon machine becomes a local inference server that doesn't just demo well - it actually works. Agentic coding, OpenClaw, multi-turn vision chat: real workloads, real performance, no cloud required.
For full v0.2.0 feature details, see v0.2.0 release notes.
New Features (v0.2.3)
Option to disable model memory limit
- Added option to disable model memory limit by setting the slider to 0 in the admin dashboard
Bug Fixes (v0.2.3)
Streaming response corruption on keep-alive connections (#80)
- Fixed `TransferEncodingError` when sending a second message in the same Open WebUI conversation over a local connection
- Removed duplicate ASGI `receive()` consumers that corrupted HTTP keep-alive state
- Replaced `BaseHTTPMiddleware` with a pure ASGI middleware to avoid streaming response pipe interference
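For readers unfamiliar with the distinction: Starlette's `BaseHTTPMiddleware` materializes the response, while a pure ASGI middleware wraps `send` directly and lets body chunks stream through. A sketch that adds a response header without touching the body stream (the header name is illustrative):

```python
class PureASGIMiddleware:
    def __init__(self, app):
        self.app = app

    async def __call__(self, scope, receive, send):
        if scope["type"] != "http":
            await self.app(scope, receive, send)
            return

        async def send_wrapper(message):
            # Only touch the response-start message; body chunks pass
            # through untouched, so chunked streaming is never buffered.
            if message["type"] == "http.response.start":
                headers = list(message.get("headers", []))
                headers.append((b"x-example", b"1"))
                message = {**message, "headers": headers}
            await send(message)

        await self.app(scope, receive, send_wrapper)
```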
VLM batch generation shape mismatch (#79)
- Fixed shape mismatch error during VLM batch generation
Homebrew install failure (#78)
- Fixed brew install by making MCP an optional dependency
SSD cache fallback robustness (#74, #75)
- Fixed block metadata not being rolled back when SSD cache save fails
- Fixed SSD fallback block registration in paged cache
Scheduler cache corruption recovery
- Broadened recovery to also catch `AttributeError` and `ValueError`
Full changelog: v0.2.2...v0.2.3
New Contributors
Thanks to @lyonsno for the contribution!
v0.2.2
Highlight: Vision-Language Model Support with Tiered Caching
Starting with v0.2.0, oMLX sees the world — not just text.
Vision-language models now run natively on your Mac with the same continuous batching, paged KV cache, and SSD-tiered caching that powers text inference. Combined with production-grade tool calling, your Apple Silicon machine becomes a local inference server that doesn't just demo well — it actually works. Agentic coding, OpenClaw, multi-turn vision chat: real workloads, real performance, no cloud required.
For full v0.2.0 feature details, see v0.2.0 release notes.
New Features (v0.2.2)
Model type override and VLM-to-LLM fallback (#72)
- Added model type override support — manually set a model as LLM or VLM regardless of auto-detection
- VLM models can fall back to LLM mode for text-only workloads
MCP tool auto-injection
- Added automatic MCP tool injection into chat completion requests
- Added MCP config loading from `settings.json` with `mcpServers` key support
Bug Fixes (v0.2.2)
RGBA image broadcast error
- Fixed crash when loading RGBA images by converting to RGB before processing
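The RGBA fix amounts to alpha-compositing onto an opaque background before handing pixels to a 3-channel vision encoder. A dependency-free sketch (oMLX itself presumably converts via the imaging library rather than per-pixel Python):

```python
def rgba_to_rgb(pixels, background=(255, 255, 255)):
    # Composite each RGBA pixel over an opaque background so downstream
    # code expecting exactly 3 channels never sees a shape mismatch.
    out = []
    for r, g, b, a in pixels:
        alpha = a / 255.0
        out.append(tuple(round(c * alpha + bg * (1.0 - alpha))
                         for c, bg in zip((r, g, b), background)))
    return out
```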
MCP tool definition serialization
- Fixed Pydantic `ToolDefinition` not being converted to a dict before the MCP merge
Admin dashboard layout
- Fixed repetition penalty label abbreviation and reordered sampling parameter row to top_p / top_k / rep_penalty
Full changelog: v0.2.1...v0.2.2
v0.2.1
Highlight: Vision-Language Model Support with Tiered Caching
Starting with v0.2.0, oMLX sees the world — not just text.
Vision-language models now run natively on your Mac with the same continuous batching, paged KV cache, and SSD-tiered caching that powers text inference. Combined with production-grade tool calling, your Apple Silicon machine becomes a local inference server that doesn't just demo well — it actually works. Agentic coding, OpenClaw, multi-turn vision chat: real workloads, real performance, no cloud required.
For full v0.2.0 feature details, see v0.2.0 release notes.
Bug Fixes (v0.2.1)
VLM multi-turn image token mismatch (#69)
- Fixed "Image features and image tokens do not match: tokens: 0, features N" error when using VLM with multi-turn conversation history
- oMLX now uses content-aware assignment that places image placeholders on whichever user turn actually contains image content, regardless of position
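The content-aware assignment can be sketched roughly as follows; the message shape and placeholder token are illustrative, not oMLX's internal format:

```python
def assign_image_placeholders(messages, placeholder="<image>"):
    # Attach one placeholder per image to the user turn that actually
    # carries the image, instead of assuming images always sit on a
    # fixed turn (e.g. the last one).
    out = []
    for msg in messages:
        if msg.get("role") == "user" and msg.get("images"):
            text = placeholder * len(msg["images"]) + msg.get("text", "")
            out.append({**msg, "text": text})
        else:
            out.append(dict(msg))
    return out
```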
VLM abort crash during prefill
- Fixed crash when aborting a VLM request during the prefill phase (batch_generator None check)
Responses API content format support
- Added `input_text` / `input_image` content type normalization for clients using the OpenAI Responses API format
Full changelog: v0.2.0...v0.2.1
v0.2.0
Highlight: Vision-Language Model Support with Tiered Caching
Starting with v0.2.0, oMLX sees the world — not just text.
Vision-language models now run natively on your Mac with the same continuous batching, paged KV cache, and SSD-tiered caching that powers text inference. Combined with production-grade tool calling, your Apple Silicon machine becomes a local inference server that doesn't just demo well — it actually works. Agentic coding, OpenClaw, multi-turn vision chat: real workloads, real performance, no cloud required.
What's New
VLM Engine (omlx/engine/vlm.py, omlx/models/vlm.py)
- Vision-Language Model engine via mlx-vlm integration for vision encoding + mlx-lm `BatchGenerator` for inference
- `VLMModelAdapter` wrapping the VLM's `language_model` for full `BatchGenerator` compatibility
- Batched VLM prefill with per-UID embeddings in `_BoundarySnapshotBatchGenerator`
- Chunked prefill support with embedding offset tracking for large vision inputs
- Prefix cache and paged cache support for VLM requests (vision context reuse)
Image Processing (omlx/utils/image.py)
- Image input support: base64 data URIs, HTTP/HTTPS URLs, local file paths
- Multi-image chat for supported models (Qwen2.5-VL, GLM-4V, etc.)
- SHA256 image hashing for prefix cache deduplication
- Anthropic API vision support: base64 `image_url` conversion for `/v1/messages`
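As an illustration of the dedup hashing above: identical image bytes hash to the same prefix-cache key regardless of how they arrived. The helper names are hypothetical:

```python
import base64
import hashlib

def decode_data_uri(uri: str) -> bytes:
    # "data:image/png;base64,<payload>" -> raw image bytes
    header, _, payload = uri.partition(",")
    if not (header.startswith("data:") and header.endswith("base64")):
        raise ValueError("not a base64 data URI")
    return base64.b64decode(payload)

def image_cache_key(image_bytes: bytes) -> str:
    # SHA256 of the raw bytes: the same image deduplicates in the prefix
    # cache whether it came as a data URI, an HTTP URL, or a local file.
    return hashlib.sha256(image_bytes).hexdigest()
```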
OCR Models
- Auto-prompts for DeepSeek-OCR, DOTS-OCR, GLM-OCR with forced `temperature=0.0`
- Stop token resolution for OCR-specific sequences (`<|user|>`, `<|im_end|>`, etc.)
Tool Calling for VLM
- mlx-lm native tool parser injection into VLM tokenizer at engine start
- Image + tool calling: tool definitions included in vision prompts via HF `apply_chat_template`
- Supports json_tools, qwen3_coder, glm47, mistral, and all mlx-lm parsers
Benchmark
- VLM image benchmark: "Include sample image" checkbox in continuous batching tests
Model Discovery
- VLM auto-detection via `mlx-vlm` config patterns (`vision_config`, processor files)
- VLM model settings modal in admin dashboard
- Bench model filter updated to include VLM models
Tests
- `test_vlm_engine.py` — 30 tests covering tool calling injection, chat template, OCR prompts, message processing, vision inputs
- `test_vlm_model_adapter.py` — VLM adapter property, cache, embedding, forward pass tests
- `test_image_utils.py` — Image loading, extraction, hashing tests
- `test_model_discovery.py` — VLM model detection tests
33 files changed, +3,414 / -68 lines
Full changelog: v0.1.15...v0.2.0
v0.1.15
What's New
Features
- Persistent serving stats: All-time token usage stats now survive server restarts, saved to `~/.omlx/stats.json`. Session and all-time stats are displayed separately in both the admin dashboard and menubar app. Includes a confirmation dialog before clearing stats. (#51)
- Configurable initial cache blocks: Allow setting `initial_cache_blocks` to control pre-allocated KV cache memory at startup. (#35)
- Internationalization: Admin dashboard now supports English, Korean, Japanese, and Chinese with self-hosted Noto Sans CJK fonts and language-based switching. (#64, #67)
- Claude Code mode toggle: Replace single model selector with mode (Cloud/Local) toggle and per-tier (Opus/Sonnet/Haiku) model selectors in the admin dashboard. (#66)
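A minimal sketch of restart-surviving stats as described in the first bullet: merge the session's counters into a JSON file on save. The schema and function name are illustrative; only the `~/.omlx/stats.json` path comes from the release note:

```python
import json
import pathlib

def save_stats(session_stats, path=pathlib.Path.home() / ".omlx" / "stats.json"):
    # Merge this session's counters into the persisted all-time totals so
    # the totals survive a server restart.
    path = pathlib.Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    all_time = json.loads(path.read_text()) if path.exists() else {}
    for key, value in session_stats.items():
        all_time[key] = all_time.get(key, 0) + value
    path.write_text(json.dumps(all_time))
    return all_time
```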
Bug Fixes
- Fix build script error when using `--skip-venv` by using an `onerror` callback instead of a double `rmtree` with `ignore_errors`. (#65)
- Fix Claude Code tier selectors layout to a horizontal 3-column arrangement.
- Restore scaling/target fields in Claude Code log output.
New Contributors
Thanks to @Heyjoy, @buftar, and @thornad for their contributions!