Minimal OpenAI-compatible server for GPT-OSS/Harmony models on Apple Silicon.
Built with mlx-lm (inference), openai-harmony (prompt formatting), and FastAPI (HTTP API).
- OpenAI-style
/v1/chat/completionsendpoint - Streaming (
SSE) and non-streaming responses - Harmony
reasoning_effortsupport (low,medium,high) - OpenAI tool-calling response format
- Robust Harmony tool-calling parser and stream recovery paths
- Usage token counts in responses
/healthqueue stats and/v1/modelscompatibility endpoint- Single-model runtime with FIFO request queueing
- macOS on Apple Silicon
- Python
>=3.11
pip install mlx-gpt-oss
mlx-gpt-oss --model mlx-community/gpt-oss-20b-MXFP4-Q8Default bind: http://0.0.0.0:8000
python3 -m venv .venv
source .venv/bin/activate
pip install -e .
mlx-gpt-oss --model mlx-community/gpt-oss-20b-MXFP4-Q8| Endpoint | Method | Purpose |
|---|---|---|
/health |
GET |
Server health + active/queued request counts |
/v1/models |
GET |
Loaded model metadata |
/v1/chat/completions |
POST |
OpenAI-compatible chat completion |
modelis required for compatibility, but the server always uses the single model loaded at startup.- Supports OpenAI-style
messages,stream,tools,tool_choice,stop, and common sampling params. top_kis accepted but generation remains pinned totop_k=0for GPT-OSS behavior.reasoning_effortcan be set directly, or viachat_template_kwargs.reasoning_effort.- Streaming returns
chat.completion.chunkevents and ends with[DONE].
- Uses official Harmony assistant-action stop tokens from
openai-harmony(no hardcoded token IDs). - Handles streaming edge cases: unfinished tool-call endings, buffered fallback dedupe, and repeated identical tool calls.
- Addresses a class of tool-calling failures seen in other MLX servers.
| Flag | Default | Description |
|---|---|---|
--model |
required | Model path or Hugging Face ID |
--host |
0.0.0.0 |
Bind address |
--port |
8000 |
Bind port |
--context-length |
8196 |
Max KV cache context length |
--log-level |
INFO |
DEBUG, INFO, WARNING, ERROR |
--log-file |
disabled | Optional rotating file log output |
--debug-raw-preview-chars |
0 |
In DEBUG, preview N chars of prompts/output |
--http-access-log |
False |
Emit one access log line per HTTP request |
- No built-in auth or API key checks, this is your responsibility.
- Default host is
0.0.0.0for local/LAN self-hosting. - CORS is permissive (
*, credentials disabled). - Use
--host 127.0.0.1for local-only access.