Releases: jundot/omlx

v0.2.5

07 Mar 18:31

What's New

Features

  • Presence penalty & min_p sampling: added presence_penalty and min_p as new sampling parameters for finer control over generation behavior. Configurable per model from the admin panel's model settings. (#94)
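
The two new parameters behave like their counterparts in other OpenAI-compatible servers. A minimal pure-Python sketch of what they typically do (function names are illustrative; oMLX's actual implementation operates on MLX arrays inside the sampler):

```python
def apply_presence_penalty(logits, generated_ids, presence_penalty):
    # Subtract a flat penalty from every token id that has already been
    # generated. Unlike a frequency penalty, the repeat count is ignored:
    # appearing once is enough to incur the full penalty.
    penalized = list(logits)
    for tok in set(generated_ids):
        penalized[tok] -= presence_penalty
    return penalized

def min_p_filter(probs, min_p):
    # Drop tokens whose probability falls below min_p times the top
    # token's probability, then renormalize the survivors. The cutoff
    # scales with model confidence: a peaked distribution prunes hard,
    # a flat one keeps more candidates.
    threshold = min_p * max(probs)
    kept = [p if p >= threshold else 0.0 for p in probs]
    total = sum(kept)
    return [p / total for p in kept]
```

Setting min_p around 0.05–0.1 is a common starting point; presence_penalty defaults to 0.0 (disabled).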

Bug Fixes

  • Metal crash on concurrent add_request: serialized add_request calls through the MLX executor to prevent Metal GPU crashes under concurrent request submission. (#95)
  • HuggingFace model search broken: removed deprecated direction parameter from huggingface_hub.list_models() that was silently breaking model search results.

Dependencies

  • mlx-vlm updated to 348466f: adds support for new VLM model types (MiniCPM-O, Phi-4-reasoning-vision, Phi-4-Multimodal) and includes various bug fixes. oMLX's model discovery and vision input pipeline updated accordingly.

Thanks to @rsnow for reporting the Metal crash issue!

v0.2.4

06 Mar 17:38

What's New

Features

  • Skip API key verification (localhost): when the server is bound to localhost, you can now disable API key verification for all API endpoints from global settings. This makes local-only workflows frictionless: no more dummy keys needed. The option automatically resets when switching to a public host. (#92)
  • Model alias: set a custom API-visible name for any model via the model settings modal. /v1/models returns the alias instead of the directory name, and requests accept both the alias and the original name. Useful when switching between inference providers without reconfiguring clients. (#92)
  • Version display: the CLI now shows the version in the startup banner, and the admin navbar displays the running version. (#90)

Bug Fixes

  • Loaded model lost after re-discovery: deleting a model or changing settings triggered model re-discovery, which dropped already-loaded engines from the pool. Loaded models now preserve their runtime state across re-discovery. (#89)
  • Text-only VLM quant misdetection: text-only quantizations of natively multimodal models (e.g. Qwen 3.5 122B converted via mlx_lm.convert) were misdetected as VLM, causing a failed load attempt on every restart. Such models are now correctly classified as LLM when vision_config is absent. (#84)
  • SSD cache utilization over 100%: cache utilization could exceed 100% when available disk space shrank after initial calculation. The value is now clamped properly.
  • Reasoning model output token caching: output tokens from reasoning models (with <think> tags) were being cached unnecessarily. They are now skipped to avoid polluting the prefix cache.
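
The quant misdetection fix boils down to checking for a vision tower in the model's config. A hypothetical sketch of the corrected classification (oMLX's real detection also consults mlx-vlm config patterns and processor files):

```python
import json
from pathlib import Path

def classify_model(model_dir):
    # A model only counts as a VLM if its config.json actually carries a
    # vision tower. Text-only quantizations produced by mlx_lm.convert
    # drop vision_config, so they must be treated as plain LLMs even when
    # the base model is natively multimodal.
    config = json.loads((Path(model_dir) / "config.json").read_text())
    return "vlm" if "vision_config" in config else "llm"
```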

UI Improvements

  • Model settings modal reordered: alias / model type / ctx window / max tokens / temperature / top p / top k / rep. penalty / ttl / load defaults
  • Alias badge shown next to model name in both model settings list and model manager

New Contributors

  • @rsnow made their first contribution in #84

Thanks to @rsnow for the contribution!

v0.2.3.post4

06 Mar 06:13

Hotfix: Fix crash when running multiple models simultaneously

Fixed a bug where the server process terminated when two or more models received requests at the same time.

Symptom: Server crashes when multiple models are used concurrently (e.g., VLM as interface model + LLM for chat in Open WebUI)

Cause: Each model engine ran GPU operations on a separate thread, causing Metal command buffer races on Apple Silicon

Fix: All model GPU operations now run on a single shared thread. No impact on single-model performance.
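
The standard pattern for this kind of fix is a single-threaded executor shared by every engine, so Metal command buffers are never built from two threads at once. A minimal sketch under that assumption (names are illustrative, not oMLX's actual internals):

```python
import asyncio
from concurrent.futures import ThreadPoolExecutor

# One single-threaded executor shared by all model engines: every GPU
# operation is funneled through the same OS thread, serializing Metal
# command buffer construction while the event loop stays responsive.
_mlx_executor = ThreadPoolExecutor(max_workers=1, thread_name_prefix="mlx")

async def run_gpu_op(fn, *args):
    # Submit a blocking GPU operation from async code without blocking
    # the event loop; concurrent callers are queued, not parallelized.
    loop = asyncio.get_running_loop()
    return await loop.run_in_executor(_mlx_executor, fn, *args)
```

Because the queue serializes work rather than rejecting it, a single model's throughput is unchanged; only truly concurrent multi-model submissions are ordered.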

Closes #85 / Ref #80

v0.2.3.post3

06 Mar 02:26

Hotfix

Bug fixes

  • Fix VLM concurrent request GPU race condition causing TransferEncodingError and server crash (#80)
    • Remove mx.clear_cache() from event loop thread to prevent Metal GPU contention with _mlx_executor during concurrent VLM requests
    • Always synchronize generation_stream on request completion regardless of cache setting (previously skipped when oMLX cache was disabled)
    • Add clear_pending_embeddings() to normal completion path for consistency with abort path

v0.2.3.post2

05 Mar 18:41

Hotfix

Bug fixes

  • Fix VLM multi-request blocking: second request now starts immediately instead of waiting for the first to finish
    • Reverted vision encoding to use _mlx_executor instead of asyncio.to_thread() to avoid Metal GPU thread contention (#80, #81)
    • Changed prefill_batch_size default to prevent continuous batching from being disabled when it equaled completion_batch_size
  • Fix segfault when sending concurrent VLM image requests by ensuring all scheduler steps run on the MLX executor thread (#81)
  • Fix missing mcp package crash on server start
  • Fix memory limit UI showing incorrect label when set to 0

v0.2.3

05 Mar 07:17

Highlight: Vision-Language Model Support with Tiered Caching

Starting with v0.2.0, oMLX sees the world - not just text.

Vision-language models now run natively on your Mac with the same continuous batching, paged KV cache, and SSD-tiered caching that powers text inference. Combined with production-grade tool calling, your Apple Silicon machine becomes a local inference server that doesn't just demo well - it actually works. Agentic coding, OpenClaw, multi-turn vision chat: real workloads, real performance, no cloud required.

For full v0.2.0 feature details, see v0.2.0 release notes.

New Features (v0.2.3)

Option to disable model memory limit

  • Added option to disable model memory limit by setting the slider to 0 in the admin dashboard

Bug Fixes (v0.2.3)

Streaming response corruption on keep-alive connections (#80)

  • Fixed TransferEncodingError when sending a second message in the same Open WebUI conversation over a local connection
  • Removed duplicate ASGI receive() consumers that corrupted HTTP keep-alive state
  • Replaced BaseHTTPMiddleware with a pure ASGI middleware to avoid streaming response pipe interference
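
Starlette's BaseHTTPMiddleware wraps the request in its own receive() consumer, which is a known source of interference with streaming responses. The pure-ASGI replacement wraps only `send` and passes `receive` through untouched. A minimal sketch of the pattern (the class name and injected header are illustrative):

```python
class PureASGIMiddleware:
    # Wrap only `send`; never call or re-wrap `receive`, since a second
    # receive() consumer can corrupt HTTP keep-alive state and break
    # chunked streaming responses.
    def __init__(self, app):
        self.app = app

    async def __call__(self, scope, receive, send):
        if scope["type"] != "http":
            await self.app(scope, receive, send)
            return

        async def send_wrapper(message):
            if message["type"] == "http.response.start":
                # Example mutation: tag every response with a header.
                message.setdefault("headers", []).append((b"x-omlx", b"1"))
            await send(message)

        await self.app(scope, receive, send_wrapper)
```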

VLM batch generation shape mismatch (#79)

  • Fixed shape mismatch error during VLM batch generation

Homebrew install failure (#78)

  • Fixed brew install by making MCP an optional dependency

SSD cache fallback robustness (#74, #75)

  • Fixed block metadata not being rolled back when SSD cache save fails
  • Fixed SSD fallback block registration in paged cache

Scheduler cache corruption recovery

  • Broadened scheduler cache corruption recovery to also catch AttributeError and ValueError

Full changelog: v0.2.2...v0.2.3

New Contributors

Thanks to @lyonsno for the contribution!

v0.2.2

04 Mar 13:58

Highlight: Vision-Language Model Support with Tiered Caching

For full v0.2.0 feature details, see v0.2.0 release notes.

New Features (v0.2.2)

Model type override and VLM-to-LLM fallback (#72)

  • Added model type override support — manually set a model as LLM or VLM regardless of auto-detection
  • VLM models can fall back to LLM mode for text-only workloads

MCP tool auto-injection

  • Added automatic MCP tool injection into chat completion requests
  • Added MCP config loading from settings.json with mcpServers key support
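
An illustrative settings.json fragment, assuming the mcpServers key follows the command/args convention common to MCP clients (the server name and package here are examples, not defaults shipped with oMLX):

```json
{
  "mcpServers": {
    "filesystem": {
      "command": "npx",
      "args": ["-y", "@modelcontextprotocol/server-filesystem", "/tmp"]
    }
  }
}
```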

Bug Fixes (v0.2.2)

RGBA image broadcast error

  • Fixed crash when loading RGBA images by converting to RGB before processing
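
The underlying issue is that a 4-channel RGBA array broadcasts incorrectly against 3-channel normalization constants in the vision preprocessor. A sketch of the guard, assuming PIL images in the pipeline (the helper name is illustrative):

```python
from PIL import Image

def ensure_rgb(img):
    # Drop the alpha channel (and normalize any other mode) before the
    # preprocessor converts the image to a numeric array, so the result
    # is always 3-channel.
    return img if img.mode == "RGB" else img.convert("RGB")
```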

MCP tool definition serialization

  • Fixed Pydantic ToolDefinition not being converted to dict before MCP merge

Admin dashboard layout

  • Fixed repetition penalty label abbreviation and reordered sampling parameter row to top_p / top_k / rep_penalty

Full changelog: v0.2.1...v0.2.2

v0.2.1

03 Mar 15:43

Highlight: Vision-Language Model Support with Tiered Caching

For full v0.2.0 feature details, see v0.2.0 release notes.

Bug Fixes (v0.2.1)

VLM multi-turn image token mismatch (#69)

  • Fixed "Image features and image tokens do not match: tokens: 0, features N" error when using VLM with multi-turn conversation history
  • oMLX now uses content-aware assignment that places image placeholders on whichever user turn actually contains image content, regardless of position
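
A hypothetical sketch of content-aware assignment: emit one placeholder per image part in whichever user turn actually carries the images, instead of assuming images always sit on the latest turn (function name and placeholder token are illustrative; real chat templates use model-specific image tokens):

```python
def assign_image_placeholders(messages, placeholder="<image>"):
    # Walk the full conversation history; only user turns whose content
    # is a multi-part list can carry images. Each image part contributes
    # exactly one placeholder, keeping token and feature counts aligned.
    rendered = []
    for msg in messages:
        content = msg.get("content")
        if msg.get("role") == "user" and isinstance(content, list):
            n_images = sum(1 for part in content
                           if part.get("type") == "image_url")
            text = " ".join(part.get("text", "") for part in content
                            if part.get("type") == "text")
            rendered.append({"role": "user",
                             "content": placeholder * n_images + text})
        else:
            rendered.append(msg)
    return rendered
```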

VLM abort crash during prefill

  • Fixed crash when aborting a VLM request during the prefill phase (batch_generator None check)

Responses API content format support

  • Added input_text / input_image content type normalization for clients using the OpenAI Responses API format
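
In the Responses API, image parts carry the URL as a plain `image_url` string rather than the nested object used by chat completions. A sketch of the normalization (function name is illustrative):

```python
def normalize_responses_content(parts):
    # Map Responses API part types onto the chat-completions equivalents
    # that the rest of the pipeline understands; unknown parts pass
    # through unchanged.
    normalized = []
    for part in parts:
        ptype = part.get("type")
        if ptype == "input_text":
            normalized.append({"type": "text", "text": part.get("text", "")})
        elif ptype == "input_image":
            normalized.append({"type": "image_url",
                               "image_url": {"url": part.get("image_url", "")}})
        else:
            normalized.append(part)
    return normalized
```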

Full changelog: v0.2.0...v0.2.1

v0.2.0

03 Mar 10:56

Highlight: Vision-Language Model Support with Tiered Caching

Starting with v0.2.0, oMLX sees the world — not just text.

Vision-language models now run natively on your Mac with the same continuous batching, paged KV cache, and SSD-tiered caching that powers text inference. Combined with production-grade tool calling, your Apple Silicon machine becomes a local inference server that doesn't just demo well — it actually works. Agentic coding, OpenClaw, multi-turn vision chat: real workloads, real performance, no cloud required.

What's New

VLM Engine (omlx/engine/vlm.py, omlx/models/vlm.py)

  • Vision-Language Model engine via mlx-vlm integration for vision encoding + mlx-lm BatchGenerator for inference
  • VLMModelAdapter wrapping VLM's language_model for full BatchGenerator compatibility
  • Batched VLM prefill with per-UID embeddings in _BoundarySnapshotBatchGenerator
  • Chunked prefill support with embedding offset tracking for large vision inputs
  • Prefix cache and paged cache support for VLM requests (vision context reuse)

Image Processing (omlx/utils/image.py)

  • Image input support: base64 data URIs, HTTP/HTTPS URLs, local file paths
  • Multi-image chat for supported models (Qwen2.5-VL, GLM-4V, etc.)
  • SHA256 image hashing for prefix cache deduplication
  • Anthropic API vision support: base64 image_url conversion for /v1/messages
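
A sketch of the three supported input forms and the hashing step (function names are illustrative; actual HTTP fetching and file validation in omlx/utils/image.py are elided here):

```python
import base64
import hashlib

def image_bytes_from_source(source):
    # Dispatch on the three accepted forms: data URI, remote URL, or a
    # local file path. Only the data-URI and file branches are shown.
    if source.startswith("data:"):
        _, encoded = source.split(",", 1)
        return base64.b64decode(encoded)
    if source.startswith(("http://", "https://")):
        raise NotImplementedError("fetch over HTTP elided in this sketch")
    with open(source, "rb") as f:
        return f.read()

def image_cache_key(data: bytes) -> str:
    # SHA256 over the raw bytes: identical images hash identically, so
    # repeated uploads of the same image dedupe in the prefix cache.
    return hashlib.sha256(data).hexdigest()
```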

OCR Models

  • Auto-prompts for DeepSeek-OCR, DOTS-OCR, GLM-OCR with forced temperature=0.0
  • Stop token resolution for OCR-specific sequences (<|user|>, <|im_end|>, etc.)

Tool Calling for VLM

  • mlx-lm native tool parser injection into VLM tokenizer at engine start
  • Image + tool calling: tool definitions included in vision prompts via HF apply_chat_template
  • Supports json_tools, qwen3_coder, glm47, mistral, and all mlx-lm parsers

Benchmark

  • VLM image benchmark: "Include sample image" checkbox in continuous batching tests

Model Discovery

  • VLM auto-detection via mlx-vlm config patterns (vision_config, processor files)
  • VLM model settings modal in admin dashboard
  • Bench model filter updated to include VLM models

Tests

  • test_vlm_engine.py — 30 tests covering tool calling injection, chat template, OCR prompts, message processing, vision inputs
  • test_vlm_model_adapter.py — VLM adapter property, cache, embedding, forward pass tests
  • test_image_utils.py — Image loading, extraction, hashing tests
  • test_model_discovery.py — VLM model detection tests

33 files changed, +3,414 / -68 lines

Full changelog: v0.1.15...v0.2.0

v0.1.15

03 Mar 04:29

What's New

Features

  • Persistent serving stats: All-time token usage stats now survive server restarts, saved to ~/.omlx/stats.json. Session and all-time stats are displayed separately in both the admin dashboard and menubar app. Includes confirmation dialog before clearing stats. (#51)
  • Configurable initial cache blocks: Allow setting initial_cache_blocks to control pre-allocated KV cache memory at startup. (#35)
  • Internationalization: Admin dashboard now supports English, Korean, Japanese, and Chinese with self-hosted Noto Sans CJK fonts and language-based switching. (#64, #67)
  • Claude Code mode toggle: Replace single model selector with mode (Cloud/Local) toggle and per-tier (Opus/Sonnet/Haiku) model selectors in the admin dashboard. (#66)
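
The persistence scheme in the first bullet can be sketched as follows. The stats.json path comes from the release notes; field names and the merge-on-save behavior are assumptions for illustration:

```python
import json
from pathlib import Path

STATS_PATH = Path.home() / ".omlx" / "stats.json"

def load_all_time_stats(path=STATS_PATH):
    # All-time counters persist on disk; session counters start from
    # zero on every launch and are tracked separately in memory.
    if path.exists():
        return json.loads(path.read_text())
    return {"prompt_tokens": 0, "completion_tokens": 0}

def save_all_time_stats(all_time, session, path=STATS_PATH):
    # Fold the current session into the all-time totals and write back.
    merged = {k: all_time.get(k, 0) + session.get(k, 0) for k in all_time}
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(merged))
    return merged
```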

Bug Fixes

  • Fix build script error when using --skip-venv by using onerror callback instead of double rmtree with ignore_errors. (#65)
  • Fix Claude Code tier selectors layout to horizontal 3-column arrangement.
  • Restore scaling/target fields in Claude Code log output.

New Contributors

Thanks to @Heyjoy, @buftar, and @thornad for their contributions!