# install
uv tool install --python=python3.12 mlx-omni-server@latest
# upgrade
uv tool upgrade --all
# start on default port 10240
mlx-omni-serverCurrent models:
- Text - mlx-community/Qwen3-4B-4bit, mlx-community/Qwen3-8B-4bit-DWQ, mlx-community/gemme-3-4b-it-4bit-DWQ, mlx-community/gemma-3-12b-it-4bit-DWQ
- TTS - mlx-community/Dia-1.6B-4bit
- STT - mlx-community/whisper-large-v3-turbo
- Image - dhairyashil/FLUX.1-schnell-mflux-4bit
- Embeddings - mlx-community/all-MiniLM-L6-v2-4bit
Open WebUI and MCPO can be hosted as docker containers.
Tip
Each tool needs to be added individually at a different route.
servers.json
{
"mcpServers": {
"time": {
"args": ["mcp-server-time", "--local-timezone=America/New_York"],
"command": "uvx"
},
"context7": {
"args": ["-y", "@upstash/context7-mcp@latest"],
"command": "bunx"
},
"memory": {
"args": ["-y", "@modelcontextprotocol/server-memory"],
"command": "bunx"
},
"fetch": {
"args": ["mcp-server-fetch"],
"command": "uvx"
},
"sequentialthinking": {
"args": ["-y", "@modelcontextprotocol/server-sequential-thinking"],
"command": "bunx"
}
}
}@COPILOT_API_KEY = api_key
###
# Get available models
GET https://api.githubcopilot.com/models HTTP/1.1
Authorization: Bearer {{COPILOT_API_KEY}}
Copilot-Integration-Id: vscode-chat
Content-Type: application/jsonLM-Studio can be configured to start a server as a service.
Image Generation
- Current LLM: argmaxinc/mlx-FLUX.1-schnell-4bit-quantized
- Other: Dreamshaper XL 2.1 Turbo
- LORAs: Detailed Style XL and Detailed Perfection Style XL
- Negative embeddings: BadDream + UnrealisticDream and FastNegativeV2
- Upscaler: RealESRGAN x2
Install ComfyUI and ComfyUI-Manager according to the instructions. Copy [./llm/pyproject.toml] to the ComfyUI directory and run uv sync. Enable Dev Mode.
Custom plist can be moved to ~/Library/LaunchAgents to automatically start on login listening on 0.0.0.0 for local network access.
- Current workflow: Flux.1 Schnell
- Custom nodes (use ComfyUI-Manager to install):
- ComfyUI Impact Pack
- WAS Node Suite
- rgthree's ComfyUI Nodes
- ComfyUI-Custom-Scripts
- ComfyUI MLX Nodes
- ComfyUI Easy Use
Note
To expose on local network, edit: ~/Library/LaunchAgents/homebrew.mxcl.ollama.plist by adding the following:
<key>EnvironmentVariables</key>
<dict>
<key>OLLAMA_HOST</key>
<string>0.0.0.0</string>
</dict>Ollama can be started as a service by running brew services start ollama.
Tip
Generate Ollama Modelfiles for the LLMs with the following format to use multiple GPU threads:
FROM deepseek-coder-v2:16b-lite-instruct-q3_K_M
PARAMETER num_gpu 99
- Select embedding model. Current model is
sentence-transformers/all-MiniLM-L6-v2. Top Kis set to10,Chunk Sizeis set to2000,Chunk Overlapis set to500.
Prompt:
**Generate Response to User Query**
**Step 1: Parse Context Information**
Extract and utilize relevant knowledge from the provided context within `<context></context>` XML tags.
**Step 2: Analyze User Query**
Carefully read and comprehend the user's query, pinpointing the key concepts, entities, and intent behind the question.
**Step 3: Determine Response**
If the answer to the user's query can be directly inferred from the context information, provide a concise and accurate response in the same language as the user's query.
**Step 4: Handle Uncertainty**
If the answer is not clear, ask the user for clarification to ensure an accurate response.
**Step 5: Avoid Context Attribution**
When formulating your response, do not indicate that the information was derived from the context.
**Step 6: Respond in User's Language**
Maintain consistency by ensuring the response is in the same language as the user's query.
**Step 7: Provide Response**
Generate a clear, concise, and informative response to the user's query, adhering to the guidelines outlined above.
User Query: [query]
<context>
[context]
</context>