---
title: Local Inference
description: Run Morphik completely offline with local embedding and completion models
---
Morphik comes with built-in support for running both embeddings and completions locally, ensuring your data never leaves your machine. Choose between two powerful local inference engines:
- Lemonade - Windows-only, optimized for AMD GPUs and NPUs
- Ollama - Cross-platform (Windows, macOS, Linux), supports various hardware
Both are pre-configured in Morphik and can be selected through the UI or configuration file.
Running models locally provides several key advantages:
- Complete Privacy: Your data never leaves your machine
- No API Costs: Eliminate ongoing API expenses
- Low Latency: No network round-trips for inference
- Offline Capability: Work without internet connectivity
- Hardware Acceleration: Leverage your local GPU, NPU, or specialized AI processors
## Lemonade

Lemonade SDK provides high-performance local inference on Windows, with optimizations for AMD hardware. It exposes an OpenAI-compatible API and is **already configured in Morphik**.
<Note>
**Built-in Support**: Lemonade models are pre-configured in `morphik.toml` for both embeddings and completions. Simply install Lemonade Server and select the models in the UI.
</Note>
### System Requirements
- **Windows 10/11 only** (x86/x64)
- **8GB+ RAM** (16GB recommended)
- **Python 3.10+**
- **Optional but recommended**:
- AMD Ryzen AI 300 series (NPU acceleration)
- AMD Radeon 7000/9000 series (GPU acceleration)
### Quick Start
<Steps>
<Step title="Download Lemonade">
Download and install Lemonade from the official site:
[lemonade-server.ai](https://lemonade-server.ai/).
</Step>
<Step title="Start Lemonade Server">
Start the Lemonade server by following its documentation. Make sure it is running and note the port.
The API is OpenAI-compatible (e.g., `/api/v1/models`).
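Because the API is OpenAI-compatible, any OpenAI-style HTTP client can talk to it. Here is a minimal stdlib-only sketch, assuming Lemonade is listening on `localhost:8020` (adjust host, port, and model name to your setup):

```python
import json
from urllib import request

# Assumed base URL for a local Lemonade server (adjust to your setup)
LEMONADE_BASE = "http://localhost:8020/api/v1"

def chat(prompt: str, model: str = "Qwen2.5-VL-7B-Instruct-GGUF") -> str:
    """Send one chat-completion request to the OpenAI-compatible endpoint."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    req = request.Request(
        f"{LEMONADE_BASE}/chat/completions",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]
```

This is a smoke-test helper, not part of Morphik itself; Morphik talks to the same endpoint internally once configured.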
</Step>
<Step title="Configure Morphik - Two Options">
### Option 1: Using the UI (Recommended)
1. Open the Morphik UI and go to Settings → API Keys
2. Select "Lemonade" (🍋). No API key is required
3. Enter the host and port where Lemonade is running
<img src="/images/add_port_to_lemonade.png"
alt="Lemonade provider settings with host and port"
style={{maxWidth: '640px', margin: '16px 0'}} />
4. Open Chat and use the model selector pill (top left) to pick a Lemonade model
<img src="/images/see_lemonade_models_in_chat.png"
alt="Chat model selector showing Lemonade models"
style={{maxWidth: '640px', margin: '16px 0'}} />
<Warning>
Running inside Docker? Use `host.docker.internal` instead of `localhost` for the host field.
</Warning>
<Warning>
If you are not using a vision-capable model, turn off ColPali in the chat settings (Settings → ColPali) so queries avoid vision-dependent code paths.
</Warning>
### Option 2: Edit morphik.toml
You can also set Lemonade models directly in `morphik.toml` so they're used by default.
Ensure the `api_base` points to your Lemonade server:
```toml
lemonade_qwen = { model_name = "openai/Qwen2.5-VL-7B-Instruct-GGUF", api_base = "http://localhost:8020/api/v1", vision = true }
lemonade_embedding = { model_name = "openai/nomic-embed-text-v1-GGUF", api_base = "http://localhost:8020/api/v1" }
[completion]
model = "lemonade_qwen"
[embedding]
model = "lemonade_embedding"
```
</Step>
</Steps>
<Warning>
If your system has under 16GB RAM, prefer models under ~4B parameters or smaller quantizations
(e.g., Q4/Q5). Larger models may fail to load or will be very slow on low-memory systems.
</Warning>
### Performance Tips
- **Model Quantization**: Use GGUF quantized models for better performance
- **Low-memory systems**: Under 16GB RAM, prefer models under 4B parameters
- **Hardware Acceleration**: Lemonade automatically detects and uses AMD GPUs/NPUs when available
- **Memory Management**: Models are cached after first download
### Troubleshooting
<AccordionGroup>
<Accordion title="Connection Issues">
- Verify server health: `curl http://localhost:8020/health`
- List models: `curl http://localhost:8020/api/v1/models`
- For Docker: Use `host.docker.internal` instead of `localhost`
- Check firewall settings for port 8020
</Accordion>
<Accordion title="Model Loading Errors">
- Ensure sufficient disk space (5-15GB per model)
- Try smaller quantized versions (Q4, Q5)
- Check model compatibility with `lemonade list`
</Accordion>
<Accordion title="Performance Issues">
- Use GGUF quantized models for better performance
- Monitor GPU/NPU usage with system tools
- Adjust batch size and context length in model config
</Accordion>
</AccordionGroup>
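When connection issues persist, the fastest first check is whether anything is listening on the expected port at all. A small stdlib-only helper (the host and port are assumptions; substitute your own):

```python
import socket

def port_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# Example: check the default Lemonade port before debugging further
# port_open("localhost", 8020)
```

If this returns `False`, the problem is the server or firewall, not Morphik's configuration; if `True`, move on to the API-level checks above.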
## Ollama

Ollama provides cross-platform local inference for both embeddings and completions. It's **already configured in Morphik** and supports various hardware accelerators.
<Note>
**Built-in Support**: Ollama models are pre-configured in `morphik.toml` for both embeddings and completions. Simply install Ollama and select the models in the UI.
</Note>
### System Requirements
- **macOS**: Apple Silicon (M1/M2/M3) or Intel Mac with 8GB+ RAM
- **Linux**: x86_64 or ARM64, 8GB+ RAM, optional NVIDIA GPU
- **Windows**: Windows 10/11, 8GB+ RAM, optional NVIDIA GPU
### Quick Start
<Steps>
<Step title="Install Ollama">
<Tabs>
<Tab title="macOS">
```bash
brew install ollama
# Or: curl -fsSL https://ollama.com/install.sh | sh
```
</Tab>
<Tab title="Linux">
```bash
curl -fsSL https://ollama.com/install.sh | sh
```
</Tab>
<Tab title="Windows">
Download installer from [ollama.com/download](https://ollama.com/download/windows)
</Tab>
</Tabs>
</Step>
<Step title="Start Ollama">
```bash
# Start Ollama service
ollama serve
```
Or use Docker Compose with Morphik:
```bash
docker compose --profile ollama -f docker-compose.run.yml up -d
```
</Step>
<Step title="Configure Morphik - Two Options">
### Option 1: Using the UI (Recommended)
1. Open Morphik UI and navigate to Settings
2. Select Ollama models from the dropdown for:
- **Completion Model**: `ollama_qwen_vision` or `ollama_llama_vision`
- **Embedding Model**: `ollama_embedding` (nomic-embed-text)
### Option 2: Edit morphik.toml
Morphik comes with pre-configured Ollama models:
```toml
# Already configured in morphik.toml
ollama_qwen_vision = { model_name = "ollama_chat/qwen2.5vl:latest", api_base = "http://localhost:11434", vision = true }
ollama_embedding = { model_name = "ollama/nomic-embed-text", api_base = "http://localhost:11434" }
# To use Ollama as default:
[completion]
model = "ollama_qwen_vision"
[embedding]
model = "ollama_embedding"
```
<Warning>
When running Morphik in Docker, change `localhost` to `ollama:11434` if using the Ollama profile, or `host.docker.internal:11434` if running Ollama separately.
</Warning>
</Step>
<Step title="Download and Use Models">
Pull the pre-configured models:
```bash
# For embeddings (required for RAG)
ollama pull nomic-embed-text
# For completions (choose one)
ollama pull qwen2.5vl:latest # Vision-capable, 7B
ollama pull llama3.2-vision # Vision-capable, 11B
ollama pull qwen2:1.5b # Text-only, fast
```
Then select them in the UI chat interface!
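To confirm the embedding model works before wiring it into a RAG flow, you can hit Ollama's REST API directly. A stdlib-only sketch, assuming the default port 11434:

```python
import json
from urllib import request

OLLAMA_BASE = "http://localhost:11434"  # Ollama's default port

def embed(text: str, model: str = "nomic-embed-text") -> list[float]:
    """Request an embedding vector from a locally running Ollama server."""
    payload = json.dumps({"model": model, "prompt": text}).encode()
    req = request.Request(
        f"{OLLAMA_BASE}/api/embeddings",
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    with request.urlopen(req) as resp:
        return json.load(resp)["embedding"]
```

A non-empty list of floats back from `embed("hello")` means the model is pulled, loaded, and ready for Morphik to use.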
</Step>
</Steps>
### Hardware Acceleration
**Apple Silicon (M1/M2/M3)**
- Ollama automatically uses Metal for GPU acceleration
- No additional configuration needed
- Excellent performance on unified memory architecture
**NVIDIA GPUs**
- Install CUDA drivers (11.8+ recommended)
- Ollama auto-detects and uses available GPUs
- Monitor usage: `nvidia-smi`
**AMD GPUs (Linux)**
- ROCm support is experimental
- Set environment variable: `HSA_OVERRIDE_GFX_VERSION=10.3.0`
### Performance Tuning
**Memory Management**
```bash
# Set GPU memory limit (NVIDIA)
OLLAMA_MAX_VRAM=8GB ollama serve
# Adjust number of parallel requests
OLLAMA_NUM_PARALLEL=4 ollama serve
# Keep models loaded in memory
OLLAMA_KEEP_ALIVE=30m ollama serve
```
**Model Quantization**
Ollama supports various quantization levels:
- `q4_0` - 4-bit quantization (smallest, fastest)
- `q5_1` - 5-bit quantization (balanced)
- `q8_0` - 8-bit quantization (best quality)
```bash
# Pull specific quantization
ollama pull llama3.2:3b-q4_0 # Smaller, faster
ollama pull llama3.2:3b-q8_0 # Better quality
```
### Monitoring & Management
**Check Status**
```bash
# List loaded models
ollama list
# View running models
ollama ps
# Check API health
curl http://localhost:11434/api/tags
```
**Resource Usage**
```bash
# Monitor in real-time
watch -n 1 ollama ps
# Check model details
ollama show llama3.2 --modelfile
```
### Creating Custom Models
Create specialized models for your use case:
```dockerfile
# Modelfile
FROM llama3.2:3b
# Set parameters
PARAMETER temperature 0.1
PARAMETER num_ctx 4096
# Add system prompt
SYSTEM """You are a helpful assistant specialized in document analysis
and information retrieval. Always provide accurate, concise responses
based on the provided context."""
```
Build and use:
```bash
ollama create morphik-assistant -f Modelfile
ollama run morphik-assistant
```
