Releases: jundot/omlx
v0.2.5
What's New
Features
- Presence penalty & min_p sampling: added `presence_penalty` and `min_p` as new sampling parameters for finer control over generation behavior. Configurable per model from the admin panel's model settings. (#94)
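As context for these two parameters, here is a minimal, framework-free sketch of how a presence penalty and min_p filtering typically act on next-token logits and probabilities. The function names and plain-list representation are illustrative, not oMLX's actual implementation:

```python
def apply_presence_penalty(logits, generated_ids, penalty):
    # Subtract a flat penalty from every token id that has already been
    # generated, nudging the model toward tokens it has not used yet.
    out = list(logits)
    for tid in set(generated_ids):
        out[tid] -= penalty
    return out

def min_p_filter(probs, min_p):
    # Drop tokens whose probability is below min_p * max(probs), then
    # renormalize the survivors. Unlike top_p, the cutoff scales with the
    # model's confidence in its best token.
    threshold = min_p * max(probs)
    kept = [p if p >= threshold else 0.0 for p in probs]
    total = sum(kept)
    return [p / total for p in kept]
```

The appeal of min_p is that when the model is confident, the threshold is high and the tail is pruned aggressively; when the distribution is flat, more candidates survive.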
Bug Fixes
- Metal crash on concurrent add_request: serialized `add_request` calls through the MLX executor to prevent Metal GPU crashes under concurrent request submission. (#95)
- HuggingFace model search broken: removed the deprecated `direction` parameter from `huggingface_hub.list_models()` that was silently breaking model search results.
Dependencies
- mlx-vlm updated to 348466f: adds support for new VLM model types (MiniCPM-O, Phi-4-reasoning-vision, Phi-4-Multimodal) and includes various bug fixes. oMLX's model discovery and vision input pipeline updated accordingly.
Thanks to @rsnow for reporting the Metal crash issue!
v0.2.4
What's New
Features
- Skip API key verification (localhost): when the server is bound to localhost, you can now disable API key verification for all API endpoints from global settings. This makes local-only workflows frictionless: no more dummy keys needed. The option automatically resets when switching to a public host. (#92)
- Model alias: set a custom API-visible name for any model via the model settings modal. `/v1/models` returns the alias instead of the directory name, and requests accept both the alias and the original name. Useful when switching between inference providers without reconfiguring clients. (#92)
- Version display: the CLI now shows the version in the startup banner, and the admin navbar displays the running version. (#90)
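A minimal sketch of the localhost-only guard described in the API-key bullet above. The function and variable names are hypothetical; only the behavior (skip applies solely to loopback binds and resets for public hosts) comes from the release note:

```python
# Hypothetical sketch: only skip the API-key check when both the server's
# bind host and the requesting client are loopback addresses, so the
# shortcut cannot apply once the server is exposed publicly.
LOOPBACK = {"127.0.0.1", "::1", "localhost"}

def needs_api_key(bind_host: str, client_host: str, skip_enabled: bool) -> bool:
    local_only = bind_host in LOOPBACK and client_host in LOOPBACK
    return not (skip_enabled and local_only)
```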
Bug Fixes
- Loaded model lost after re-discovery: deleting a model or changing settings triggered model re-discovery, which dropped already-loaded engines from the pool. Loaded models now preserve their runtime state across re-discovery. (#89)
- Text-only VLM quant misdetection: text-only quantizations of natively multimodal models (e.g. Qwen 3.5 122B converted via `mlx_lm.convert`) were misdetected as VLM, causing a failed load attempt on every restart. Now correctly classified as LLM when `vision_config` is absent. (#84)
- SSD cache utilization over 100%: cache utilization could exceed 100% when available disk space shrank after the initial calculation. Now clamped properly.
- Reasoning model output token caching: output tokens from reasoning models (with `<think>` tags) were being cached unnecessarily. Now skipped to avoid polluting the prefix cache.
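The utilization clamp from the SSD cache fix amounts to a one-liner; this is a sketch, not the project's actual function:

```python
def cache_utilization(used_bytes: int, capacity_bytes: int) -> float:
    # Clamp to [0.0, 1.0]: if free disk space shrinks after capacity was
    # first measured, used/capacity could otherwise exceed 100%.
    if capacity_bytes <= 0:
        return 1.0
    return min(1.0, used_bytes / capacity_bytes)
```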
UI Improvements
- Model settings modal reordered: alias / model type / ctx window / max tokens / temperature / top p / top k / rep. penalty / ttl / load defaults
- Alias badge shown next to model name in both model settings list and model manager
New Contributors
Thanks to @rsnow for the contribution!
v0.2.3.post4
Hotfix: Fix crash when running multiple models simultaneously
Fixed a bug where the server process terminated when two or more models received requests at the same time.
Symptom: Server crashes when multiple models are used concurrently (e.g., VLM as interface model + LLM for chat in Open WebUI)
Cause: Each model engine ran GPU operations on a separate thread, causing Metal command buffer races on Apple Silicon
Fix: All model GPU operations now run on a single shared thread. No impact on single-model performance.
v0.2.3.post3
Hotfix
Bug fixes
- Fix VLM concurrent request GPU race condition causing TransferEncodingError and server crash (#80)
- Remove `mx.clear_cache()` from the event loop thread to prevent Metal GPU contention with `_mlx_executor` during concurrent VLM requests
- Always synchronize `generation_stream` on request completion regardless of cache setting (previously skipped when oMLX cache was disabled)
- Add `clear_pending_embeddings()` to the normal completion path for consistency with the abort path
v0.2.3.post2
Hotfix
Bug fixes
- Fix VLM multi-request blocking: second request now starts immediately instead of waiting for the first to finish
- Fix segfault when sending concurrent VLM image requests by ensuring all scheduler steps run on the MLX executor thread (#81)
- Fix missing mcp package crash on server start
- Fix memory limit UI showing incorrect label when set to 0
v0.2.3
Highlight: Vision-Language Model Support with Tiered Caching
Starting with v0.2.0, oMLX sees the world - not just text.
Vision-language models now run natively on your Mac with the same continuous batching, paged KV cache, and SSD-tiered caching that powers text inference. Combined with production-grade tool calling, your Apple Silicon machine becomes a local inference server that doesn't just demo well - it actually works. Agentic coding, OpenClaw, multi-turn vision chat: real workloads, real performance, no cloud required.
For full v0.2.0 feature details, see v0.2.0 release notes.
New Features (v0.2.3)
Option to disable model memory limit
- Added option to disable model memory limit by setting the slider to 0 in the admin dashboard
Bug Fixes (v0.2.3)
Streaming response corruption on keep-alive connections (#80)
- Fixed `TransferEncodingError` when sending a second message in the same Open WebUI conversation over a local connection
- Removed duplicate ASGI `receive()` consumers that corrupted HTTP keep-alive state
- Replaced `BaseHTTPMiddleware` with a pure ASGI middleware to avoid streaming response pipe interference
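For readers unfamiliar with the distinction: Starlette's `BaseHTTPMiddleware` materializes the response, while a pure ASGI middleware wraps `send` directly and lets body chunks stream through. A sketch that adds a response header without touching the body stream (the header name is illustrative):

```python
class PureASGIMiddleware:
    def __init__(self, app):
        self.app = app

    async def __call__(self, scope, receive, send):
        if scope["type"] != "http":
            await self.app(scope, receive, send)
            return

        async def send_wrapper(message):
            # Only touch the response-start message; body chunks pass
            # through untouched, so chunked streaming is never buffered.
            if message["type"] == "http.response.start":
                headers = list(message.get("headers", []))
                headers.append((b"x-example", b"1"))
                message = {**message, "headers": headers}
            await send(message)

        await self.app(scope, receive, send_wrapper)
```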
VLM batch generation shape mismatch (#79)
- Fixed shape mismatch error during VLM batch generation
Homebrew install failure (#78)
- Fixed brew install by making MCP an optional dependency
SSD cache fallback robustness (#74, #75)
- Fixed block metadata not being rolled back when SSD cache save fails
- Fixed SSD fallback block registration in paged cache
Scheduler cache corruption recovery
- Broadened recovery to also catch `AttributeError` and `ValueError`
Full changelog: v0.2.2...v0.2.3
New Contributors
Thanks to @lyonsno for the contribution!
v0.2.2
Highlight: Vision-Language Model Support with Tiered Caching
Starting with v0.2.0, oMLX sees the world — not just text.
Vision-language models now run natively on your Mac with the same continuous batching, paged KV cache, and SSD-tiered caching that powers text inference. Combined with production-grade tool calling, your Apple Silicon machine becomes a local inference server that doesn't just demo well — it actually works. Agentic coding, OpenClaw, multi-turn vision chat: real workloads, real performance, no cloud required.
For full v0.2.0 feature details, see v0.2.0 release notes.
New Features (v0.2.2)
Model type override and VLM-to-LLM fallback (#72)
- Added model type override support — manually set a model as LLM or VLM regardless of auto-detection
- VLM models can fall back to LLM mode for text-only workloads
MCP tool auto-injection
- Added automatic MCP tool injection into chat completion requests
- Added MCP config loading from `settings.json` with `mcpServers` key support
Bug Fixes (v0.2.2)
RGBA image broadcast error
- Fixed crash when loading RGBA images by converting to RGB before processing
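The RGBA fix amounts to alpha-compositing onto an opaque background before handing pixels to a 3-channel vision encoder. A dependency-free sketch (oMLX itself presumably converts via the imaging library rather than per-pixel Python):

```python
def rgba_to_rgb(pixels, background=(255, 255, 255)):
    # Composite each RGBA pixel over an opaque background so downstream
    # code expecting exactly 3 channels never sees a shape mismatch.
    out = []
    for r, g, b, a in pixels:
        alpha = a / 255.0
        out.append(tuple(round(c * alpha + bg * (1.0 - alpha))
                         for c, bg in zip((r, g, b), background)))
    return out
```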
MCP tool definition serialization
- Fixed Pydantic `ToolDefinition` not being converted to a dict before the MCP merge
Admin dashboard layout
- Fixed repetition penalty label abbreviation and reordered sampling parameter row to top_p / top_k / rep_penalty
Full changelog: v0.2.1...v0.2.2
v0.2.1
Highlight: Vision-Language Model Support with Tiered Caching
Starting with v0.2.0, oMLX sees the world — not just text.
Vision-language models now run natively on your Mac with the same continuous batching, paged KV cache, and SSD-tiered caching that powers text inference. Combined with production-grade tool calling, your Apple Silicon machine becomes a local inference server that doesn't just demo well — it actually works. Agentic coding, OpenClaw, multi-turn vision chat: real workloads, real performance, no cloud required.
For full v0.2.0 feature details, see v0.2.0 release notes.
Bug Fixes (v0.2.1)
VLM multi-turn image token mismatch (#69)
- Fixed "Image features and image tokens do not match: tokens: 0, features N" error when using VLM with multi-turn conversation history
- oMLX now uses content-aware assignment that places image placeholders on whichever user turn actually contains image content, regardless of position
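The content-aware assignment can be sketched roughly as follows; the message shape and placeholder token are illustrative, not oMLX's internal format:

```python
def assign_image_placeholders(messages, placeholder="<image>"):
    # Attach one placeholder per image to the user turn that actually
    # carries the image, instead of assuming images always sit on a
    # fixed turn (e.g. the last one).
    out = []
    for msg in messages:
        if msg.get("role") == "user" and msg.get("images"):
            text = placeholder * len(msg["images"]) + msg.get("text", "")
            out.append({**msg, "text": text})
        else:
            out.append(dict(msg))
    return out
```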
VLM abort crash during prefill
- Fixed crash when aborting a VLM request during the prefill phase (batch_generator None check)
Responses API content format support
- Added `input_text` / `input_image` content type normalization for clients using the OpenAI Responses API format
Full changelog: v0.2.0...v0.2.1
v0.2.0
Highlight: Vision-Language Model Support with Tiered Caching
Starting with v0.2.0, oMLX sees the world — not just text.
Vision-language models now run natively on your Mac with the same continuous batching, paged KV cache, and SSD-tiered caching that powers text inference. Combined with production-grade tool calling, your Apple Silicon machine becomes a local inference server that doesn't just demo well — it actually works. Agentic coding, OpenClaw, multi-turn vision chat: real workloads, real performance, no cloud required.
What's New
VLM Engine (omlx/engine/vlm.py, omlx/models/vlm.py)
- Vision-Language Model engine via mlx-vlm integration for vision encoding + mlx-lm `BatchGenerator` for inference
- `VLMModelAdapter` wrapping the VLM's `language_model` for full `BatchGenerator` compatibility
- Batched VLM prefill with per-UID embeddings in `_BoundarySnapshotBatchGenerator`
- Chunked prefill support with embedding offset tracking for large vision inputs
- Prefix cache and paged cache support for VLM requests (vision context reuse)
Image Processing (omlx/utils/image.py)
- Image input support: base64 data URIs, HTTP/HTTPS URLs, local file paths
- Multi-image chat for supported models (Qwen2.5-VL, GLM-4V, etc.)
- SHA256 image hashing for prefix cache deduplication
- Anthropic API vision support: base64 `image_url` conversion for `/v1/messages`
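As an illustration of the dedup hashing above: identical image bytes hash to the same prefix-cache key regardless of how they arrived. The helper names are hypothetical:

```python
import base64
import hashlib

def decode_data_uri(uri: str) -> bytes:
    # "data:image/png;base64,<payload>" -> raw image bytes
    header, _, payload = uri.partition(",")
    if not (header.startswith("data:") and header.endswith("base64")):
        raise ValueError("not a base64 data URI")
    return base64.b64decode(payload)

def image_cache_key(image_bytes: bytes) -> str:
    # SHA256 of the raw bytes: the same image deduplicates in the prefix
    # cache whether it came as a data URI, an HTTP URL, or a local file.
    return hashlib.sha256(image_bytes).hexdigest()
```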
OCR Models
- Auto-prompts for DeepSeek-OCR, DOTS-OCR, GLM-OCR with forced `temperature=0.0`
- Stop token resolution for OCR-specific sequences (`<|user|>`, `<|im_end|>`, etc.)
Tool Calling for VLM
- mlx-lm native tool parser injection into VLM tokenizer at engine start
- Image + tool calling: tool definitions included in vision prompts via HF `apply_chat_template`
- Supports json_tools, qwen3_coder, glm47, mistral, and all mlx-lm parsers
Benchmark
- VLM image benchmark: "Include sample image" checkbox in continuous batching tests
Model Discovery
- VLM auto-detection via `mlx-vlm` config patterns (`vision_config`, processor files)
- VLM model settings modal in admin dashboard
- Bench model filter updated to include VLM models
Tests
- `test_vlm_engine.py` — 30 tests covering tool calling injection, chat template, OCR prompts, message processing, vision inputs
- `test_vlm_model_adapter.py` — VLM adapter property, cache, embedding, forward pass tests
- `test_image_utils.py` — Image loading, extraction, hashing tests
- `test_model_discovery.py` — VLM model detection tests
33 files changed, +3,414 / -68 lines
Full changelog: v0.1.15...v0.2.0
v0.1.15
What's New
Features
- Persistent serving stats: All-time token usage stats now survive server restarts, saved to `~/.omlx/stats.json`. Session and all-time stats are displayed separately in both the admin dashboard and menubar app. Includes a confirmation dialog before clearing stats. (#51)
- Configurable initial cache blocks: Allow setting `initial_cache_blocks` to control pre-allocated KV cache memory at startup. (#35)
- Internationalization: Admin dashboard now supports English, Korean, Japanese, and Chinese with self-hosted Noto Sans CJK fonts and language-based switching. (#64, #67)
- Claude Code mode toggle: Replace single model selector with mode (Cloud/Local) toggle and per-tier (Opus/Sonnet/Haiku) model selectors in the admin dashboard. (#66)
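A minimal sketch of restart-surviving stats as described in the first bullet: merge the session's counters into a JSON file on save. The schema and function name are illustrative; only the `~/.omlx/stats.json` path comes from the release note:

```python
import json
import pathlib

def save_stats(session_stats, path=pathlib.Path.home() / ".omlx" / "stats.json"):
    # Merge this session's counters into the persisted all-time totals so
    # the totals survive a server restart.
    path = pathlib.Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    all_time = json.loads(path.read_text()) if path.exists() else {}
    for key, value in session_stats.items():
        all_time[key] = all_time.get(key, 0) + value
    path.write_text(json.dumps(all_time))
    return all_time
```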
Bug Fixes
- Fix build script error when using `--skip-venv` by using an `onerror` callback instead of a double `rmtree` with `ignore_errors`. (#65)
- Fix Claude Code tier selectors layout to a horizontal 3-column arrangement.
- Restore scaling/target fields in Claude Code log output.
New Contributors
Thanks to @Heyjoy, @buftar, and @thornad for their contributions!