---
title: Local Inference
description: Run Morphik completely offline with local embedding and completion models
---

Morphik comes with built-in support for running both embeddings and completions locally, ensuring your data never leaves your machine. Choose between two powerful local inference engines:

- **Lemonade** - Windows-only, optimized for AMD GPUs and NPUs
- **Ollama** - Cross-platform (Windows, macOS, Linux), supports various hardware

Both are pre-configured in Morphik and can be selected through the UI or configuration file.

## Why Local Inference?

Running models locally provides several key advantages:

- **Complete Privacy**: Your data never leaves your machine
- **No API Costs**: Eliminate ongoing API expenses
- **Low Latency**: No network round-trips for inference
- **Offline Capability**: Work without internet connectivity
- **Hardware Acceleration**: Leverage your local GPU, NPU, or specialized AI processors

## 🍋 Lemonade

Run embeddings & completions locally with AMD GPU/NPU acceleration.

Lemonade SDK provides high-performance local inference on Windows, with optimizations for AMD hardware. It exposes an OpenAI-compatible API and is **already configured in Morphik**.

<Note>
  **Built-in Support**: Lemonade models are pre-configured in `morphik.toml` for both embeddings and completions. Simply install Lemonade Server and select the models in the UI.
</Note>

### System Requirements

- **Windows 10/11 only** (x86/x64)
- **8GB+ RAM** (16GB recommended)
- **Python 3.10+**
- **Optional but recommended**: 
  - AMD Ryzen AI 300 series (NPU acceleration)
  - AMD Radeon 7000/9000 series (GPU acceleration)

### Quick Start

<Steps>
  <Step title="Download Lemonade">
    Download and install Lemonade from the official site: 
    [lemonade-server.ai](https://lemonade-server.ai/).
  </Step>
  
  <Step title="Start Lemonade Server">
    Start the Lemonade server following their documentation. Make sure it is running and note the port.
    The API is OpenAI-compatible (e.g., `/api/v1/models`).
  </Step>
  
  <Step title="Configure Morphik - Two Options">

    ### Option 1: Using the UI (Recommended)

    1. Open the Morphik UI and go to Settings → API Keys
    2. Select "Lemonade" (🍋). No API key is required
    3. Enter the host and port where Lemonade is running
      <img src="/images/add_port_to_lemonade.png" 
           alt="Lemonade provider settings with host and port"
           style={{maxWidth: '640px', margin: '16px 0'}} />
    4. Open Chat and use the model selector pill (top left) to pick a Lemonade model
      <img src="/images/see_lemonade_models_in_chat.png" 
           alt="Chat model selector showing Lemonade models"
           style={{maxWidth: '640px', margin: '16px 0'}} />

    <Warning>
      Running inside Docker? Use `host.docker.internal` instead of `localhost` for the host field.
    </Warning>

    <Warning>
      If you are not using a vision-capable model, disable ColPali in the chat settings (Settings → ColPali) to avoid vision-dependent processing paths.
    </Warning>

    ### Option 2: Edit morphik.toml

    You can also set Lemonade models directly in `morphik.toml` so they're used by default.
    Ensure the `api_base` points to your Lemonade server:

    ```toml
    lemonade_qwen = {
      model_name = "openai/Qwen2.5-VL-7B-Instruct-GGUF",
      api_base = "http://localhost:8020/api/v1",
      vision = true
    }
    lemonade_embedding = {
      model_name = "openai/nomic-embed-text-v1-GGUF",
      api_base = "http://localhost:8020/api/v1"
    }

    [completion]
    model = "lemonade_qwen"

    [embedding]
    model = "lemonade_embedding"
    ```
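
    When Morphik itself runs in Docker, swap `localhost` for the Docker host alias in `api_base` (same default Lemonade port assumed):

    ```toml
    lemonade_qwen = {
      model_name = "openai/Qwen2.5-VL-7B-Instruct-GGUF",
      api_base = "http://host.docker.internal:8020/api/v1",
      vision = true
    }
    ```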
  </Step>
</Steps>

<Warning>
  If your system has under 16GB RAM, prefer models under ~4B parameters or smaller quantizations
  (e.g., Q4/Q5). Larger models may fail to load or will be very slow on low-memory systems.
</Warning>

### Performance Tips

- **Model Quantization**: Use GGUF quantized models for better performance
- **Low-memory systems**: Under 16GB RAM, prefer models under 4B parameters
- **Hardware Acceleration**: Automatically detects and uses AMD GPUs/NPUs when available
- **Memory Management**: Models are cached after first download
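
Putting the low-memory advice into practice, you could register a smaller quantized completion model in `morphik.toml` (the model name below is illustrative; check your Lemonade server's `/api/v1/models` endpoint for what is actually available):

```toml
lemonade_small = {
  model_name = "openai/Qwen2.5-1.5B-Instruct-GGUF",
  api_base = "http://localhost:8020/api/v1"
}

[completion]
model = "lemonade_small"
```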

### Troubleshooting

<AccordionGroup>
  <Accordion title="Connection Issues">
    - Verify server health: `curl http://localhost:8020/health`
    - List models: `curl http://localhost:8020/api/v1/models`
    - For Docker: Use `host.docker.internal` instead of `localhost`
    - Check firewall settings for port 8020
  </Accordion>
  
  <Accordion title="Model Loading Errors">
    - Ensure sufficient disk space (5-15GB per model)
    - Try smaller quantized versions (Q4, Q5)
    - Check model compatibility with `lemonade list`
  </Accordion>
  
  <Accordion title="Performance Issues">
    - Use GGUF quantized models for better performance
    - Monitor GPU/NPU usage with system tools
    - Adjust batch size and context length in model config
  </Accordion>
</AccordionGroup>

## Ollama

Run embeddings & completions locally on Windows, macOS, or Linux.

Ollama provides cross-platform local inference for both embeddings and completions. It's **already configured in Morphik** and supports various hardware accelerators.

<Note>
  **Built-in Support**: Ollama models are pre-configured in `morphik.toml` for both embeddings and completions. Simply install Ollama and select the models in the UI.
</Note>

### System Requirements

- **macOS**: Apple Silicon (M1/M2/M3) or Intel Mac with 8GB+ RAM
- **Linux**: x86_64 or ARM64, 8GB+ RAM, optional NVIDIA GPU
- **Windows**: Windows 10/11, 8GB+ RAM, optional NVIDIA GPU

### Quick Start

<Steps>
  <Step title="Install Ollama">
    <Tabs>
      <Tab title="macOS">
        ```bash
        brew install ollama
        # Or: curl -fsSL https://ollama.com/install.sh | sh
        ```
      </Tab>
      <Tab title="Linux">
        ```bash
        curl -fsSL https://ollama.com/install.sh | sh
        ```
      </Tab>
      <Tab title="Windows">
        Download installer from [ollama.com/download](https://ollama.com/download/windows)
      </Tab>
    </Tabs>
  </Step>
  
  <Step title="Start Ollama">
    ```bash
    # Start Ollama service
    ollama serve
    ```
    
    Or use Docker Compose with Morphik:
    ```bash
    docker compose --profile ollama -f docker-compose.run.yml up -d
    ```
  </Step>
  
  <Step title="Configure Morphik - Two Options">
    
    ### Option 1: Using the UI (Recommended)
    
    1. Open Morphik UI and navigate to Settings
    2. Select Ollama models from the dropdown for:
       - **Completion Model**: `ollama_qwen_vision` or `ollama_llama_vision`
       - **Embedding Model**: `ollama_embedding` (nomic-embed-text)
    
    ### Option 2: Edit morphik.toml
    
    Morphik comes with pre-configured Ollama models:
    
    ```toml
    # Already configured in morphik.toml
    ollama_qwen_vision = { 
      model_name = "ollama_chat/qwen2.5vl:latest", 
      api_base = "http://localhost:11434", 
      vision = true 
    }
    ollama_embedding = { 
      model_name = "ollama/nomic-embed-text", 
      api_base = "http://localhost:11434" 
    }
    
    # To use Ollama as default:
    [completion]
    model = "ollama_qwen_vision"
    
    [embedding]
    model = "ollama_embedding"
    ```
    
    <Warning>
      When running Morphik in Docker, change `localhost` to `ollama:11434` if using the Ollama profile, or `host.docker.internal:11434` if running Ollama separately.
    </Warning>
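
    For example, when using the Ollama profile from Docker Compose, the same entries would point at the `ollama` service name:

    ```toml
    ollama_qwen_vision = {
      model_name = "ollama_chat/qwen2.5vl:latest",
      api_base = "http://ollama:11434",
      vision = true
    }
    ollama_embedding = {
      model_name = "ollama/nomic-embed-text",
      api_base = "http://ollama:11434"
    }
    ```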
  </Step>
  
  <Step title="Download and Use Models">
    Pull the pre-configured models:
    
    ```bash
    # For embeddings (required for RAG)
    ollama pull nomic-embed-text
    
    # For completions (choose one)
    ollama pull qwen2.5vl:latest    # Vision-capable, 7B
    ollama pull llama3.2-vision      # Vision-capable, 11B
    ollama pull qwen2:1.5b          # Text-only, fast
    ```
    
    Then select them in the UI chat interface!
  </Step>
</Steps>

### Hardware Acceleration

**Apple Silicon (M1/M2/M3)**
- Ollama automatically uses Metal for GPU acceleration
- No additional configuration needed
- Excellent performance on unified memory architecture

**NVIDIA GPUs**
- Install CUDA drivers (11.8+ recommended)
- Ollama auto-detects and uses available GPUs
- Monitor usage: `nvidia-smi`

**AMD GPUs (Linux)**
- ROCm support is experimental
- Set environment variable: `HSA_OVERRIDE_GFX_VERSION=10.3.0`

### Performance Tuning

**Memory Management**
```bash
# Set GPU memory limit (NVIDIA)
OLLAMA_MAX_VRAM=8GB ollama serve

# Adjust number of parallel requests
OLLAMA_NUM_PARALLEL=4 ollama serve

# Keep models loaded in memory
OLLAMA_KEEP_ALIVE=30m ollama serve
```

**Model Quantization**

Ollama supports various quantization levels:
- `q4_0` - 4-bit quantization (smallest, fastest)
- `q5_1` - 5-bit quantization (balanced)
- `q8_0` - 8-bit quantization (best quality)

```bash
# Pull specific quantization
ollama pull llama3.2:3b-q4_0  # Smaller, faster
ollama pull llama3.2:3b-q8_0  # Better quality
```

### Monitoring & Management

**Check Status**
```bash
# List loaded models
ollama list

# View running models
ollama ps

# Check API health
curl http://localhost:11434/api/tags
```

**Resource Usage**
```bash
# Monitor in real-time
watch -n 1 ollama ps

# Check model details
ollama show llama3.2 --modelfile
```

### Creating Custom Models

Create specialized models for your use case:

```dockerfile
# Modelfile
FROM llama3.2:3b

# Set parameters
PARAMETER temperature 0.1
PARAMETER num_ctx 4096

# Add system prompt
SYSTEM """You are a helpful assistant specialized in document analysis 
and information retrieval. Always provide accurate, concise responses 
based on the provided context."""
```

Build and use:
```bash
ollama create morphik-assistant -f Modelfile
ollama run morphik-assistant
```
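
To make the custom model available in Morphik, register it in `morphik.toml` following the same pattern as the other Ollama entries (the `ollama_chat/` prefix here is assumed from the earlier examples):

```toml
ollama_custom = {
  model_name = "ollama_chat/morphik-assistant",
  api_base = "http://localhost:11434"
}

[completion]
model = "ollama_custom"
```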