LLMKube Metal Agent for macOS

This directory contains the macOS launchd configuration for the LLMKube Metal Agent, which enables Metal GPU acceleration for local Kubernetes LLM deployments.

Prerequisites

macOS with Apple Silicon (M1/M2/M3/M4) or Intel Mac with Metal 2+ support
Access to a Kubernetes cluster — either a remote cluster (recommended) or local minikube
llama.cpp with Metal support:
```
brew install llama.cpp
```

LLMKube operator installed in your cluster:

kubectl apply -f https://github.com/defilantech/llmkube/releases/latest/download/install.yaml

--host-ip flag (required when using a remote cluster): the Metal Agent must be started with --host-ip <your-mac-ip> so that Kubernetes endpoints point to the Mac's reachable IP address instead of localhost

Installation

Option 1: Using Makefile (Recommended)

# Build and install Metal agent
make install-metal-agent

This will:

Build the Metal agent binary
Install to /usr/local/bin/llmkube-metal-agent
Install launchd service
Start the service automatically

Option 2: Manual Installation

# Build the agent
make build-metal-agent

# Copy to /usr/local/bin
sudo cp bin/llmkube-metal-agent /usr/local/bin/

# Install launchd plist
mkdir -p ~/Library/LaunchAgents
cp deployment/macos/com.llmkube.metal-agent.plist ~/Library/LaunchAgents/

# Load the service
launchctl load ~/Library/LaunchAgents/com.llmkube.metal-agent.plist

Usage

Once installed, the Metal agent runs automatically in the background and watches for InferenceService resources in your Kubernetes cluster.

Deploy a Model with Metal Acceleration

# Deploy from catalog
llmkube deploy llama-3.1-8b --accelerator metal

# Or deploy custom model
llmkube deploy my-model --accelerator metal \
  --source https://huggingface.co/.../model.gguf

Check Agent Status

# Check if agent is running
launchctl list | grep llmkube

# View agent logs
tail -f /tmp/llmkube-metal-agent.log

# Check running processes
ps aux | grep llmkube-metal-agent

# Health check (liveness)
curl http://localhost:9090/healthz

# Readiness check (at least one process healthy, or no processes yet)
curl http://localhost:9090/readyz

Verify Metal Acceleration

# Check Metal support
system_profiler SPDisplaysDataType | grep Metal

# Monitor GPU usage while inference is running
sudo powermetrics --samplers gpu_power -i 1000

Configuration

The launchd plist can be customized by editing com.llmkube.metal-agent.plist:

<key>ProgramArguments</key>
<array>
    <string>/usr/local/bin/llmkube-metal-agent</string>
    <string>--namespace</string>
    <string>default</string>              <!-- Kubernetes namespace to watch -->
    <string>--model-store</string>
    <string>/tmp/llmkube-models</string>  <!-- Where to store downloaded models -->
    <string>--llama-server</string>
    <string>/usr/local/bin/llama-server</string>  <!-- Path to llama-server binary -->
    <string>--port</string>
    <string>9090</string>                 <!-- Agent metrics port -->
</array>

`--host-ip` flag (remote cluster)

When your Kubernetes cluster runs on a different machine (Linux server, cloud, etc.), the Metal Agent needs to register the Mac's reachable IP address so that pods in the cluster can route traffic to llama-server:

# Find your Mac's IP on the local network
ipconfig getifaddr en0

# Start the agent with --host-ip
llmkube-metal-agent --host-ip 192.168.1.50

# Or with a Tailscale / WireGuard address
llmkube-metal-agent --host-ip 100.64.0.10

Without --host-ip, the agent registers localhost as the endpoint, which only works when Kubernetes runs on the same machine (e.g. minikube).

To set this in the launchd plist, add these lines to the ProgramArguments array:

    <string>--host-ip</string>
    <string>192.168.1.50</string>         <!-- Your Mac's reachable IP -->

After editing, reload the service:

launchctl unload ~/Library/LaunchAgents/com.llmkube.metal-agent.plist
launchctl load ~/Library/LaunchAgents/com.llmkube.metal-agent.plist
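
If you would rather script this edit than change the XML by hand, Python's standard-library plistlib can rewrite ProgramArguments. The following is a minimal sketch, not part of LLMKube: the helper name set_agent_flag is hypothetical, and it assumes the plist contains the ProgramArguments array shown earlier in this section.

```python
import plistlib


def set_agent_flag(plist_bytes: bytes, flag: str, value: str) -> bytes:
    """Return plist XML with `flag value` present exactly once in ProgramArguments.

    Hypothetical helper for illustration; it only edits the argument list.
    """
    plist = plistlib.loads(plist_bytes)
    args = list(plist["ProgramArguments"])
    if flag in args:
        idx = args.index(flag)
        if idx + 1 < len(args):
            args[idx + 1] = value   # overwrite the existing value
        else:
            args.append(value)      # flag was last with no value after it
    else:
        args += [flag, value]       # append a new flag/value pair
    plist["ProgramArguments"] = args
    return plistlib.dumps(plist)    # XML format by default


# Example (paths taken from the installation steps):
#   from pathlib import Path
#   p = Path.home() / "Library/LaunchAgents/com.llmkube.metal-agent.plist"
#   p.write_bytes(set_agent_flag(p.read_bytes(), "--host-ip", "192.168.1.50"))
```

Try it on a copy of the plist first, then reload the service with the launchctl commands above. The same helper works for any other flag/value pair in the array.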

`--memory-fraction` flag (memory budget)

The Metal Agent estimates model memory requirements (weights + KV cache + overhead) before starting llama-server. If the model won't fit in the memory budget, the agent refuses to start it and sets the InferenceService status to InsufficientMemory.
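
The agent's exact estimator is not documented here, but the three terms can be approximated by hand. The sketch below assumes an f16 KV cache, uses the standard transformer KV-cache size formula, and picks an arbitrary 512 MiB overhead. ModelShape, estimate_bytes, and fits are hypothetical names, and the 5 GiB weight size is an illustrative round number rather than a measured file size.

```python
from dataclasses import dataclass

GIB = 1024 ** 3


@dataclass
class ModelShape:
    weights_bytes: int          # size of the model weights (GGUF file)
    n_layers: int
    n_kv_heads: int
    head_dim: int
    context_tokens: int
    kv_bytes_per_elem: int = 2  # assumption: f16 KV cache


def kv_cache_bytes(m: ModelShape) -> int:
    # K and V each hold n_kv_heads * head_dim values per layer per token.
    return (2 * m.n_layers * m.n_kv_heads * m.head_dim
            * m.context_tokens * m.kv_bytes_per_elem)


def estimate_bytes(m: ModelShape, overhead_bytes: int = 512 * 1024 ** 2) -> int:
    # weights + KV cache + overhead (overhead value is an assumption)
    return m.weights_bytes + kv_cache_bytes(m) + overhead_bytes


def fits(m: ModelShape, budget_bytes: int) -> bool:
    return estimate_bytes(m) <= budget_bytes


# Llama 3.1 8B shape (32 layers, 8 KV heads, head dim 128) at 8192 tokens.
shape = ModelShape(weights_bytes=5 * GIB, n_layers=32, n_kv_heads=8,
                   head_dim=128, context_tokens=8192)
print(f"KV cache: {kv_cache_bytes(shape) / GIB:.2f} GiB")
print(f"Estimate: {estimate_bytes(shape) / GIB:.2f} GiB")
```

Because the KV cache term grows linearly with context length, reducing the context size is often the quickest way to bring an estimate under the budget.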

By default, the budget is auto-detected based on total system RAM:

| Total RAM | Default Fraction | Budget |
|-----------|------------------|--------|
| 16 GB | 67% | ~10.7 GB |
| 36 GB | 67% | ~24.1 GB |
| 48 GB | 75% | 36 GB |
| 64 GB | 75% | 48 GB |
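
The budget column is simply total RAM multiplied by the fraction. A small sketch that reproduces the rows above follows; memory_budget_gb is a hypothetical name, and the range check on the fraction is an assumption rather than the agent's documented behavior.

```python
def memory_budget_gb(total_ram_gb: float, fraction: float) -> float:
    """Budget = total RAM x fraction.

    The fraction is either the auto-detected default or the value
    passed to --memory-fraction.
    """
    if not 0.0 < fraction <= 1.0:
        # Assumption: values outside (0, 1] are rejected here;
        # the agent's own validation may differ.
        raise ValueError("fraction must be in (0, 1]")
    return total_ram_gb * fraction


# Rows from the table of defaults
for ram, frac in [(16, 0.67), (36, 0.67), (48, 0.75), (64, 0.75)]:
    print(f"{ram} GB x {frac:.0%} = {memory_budget_gb(ram, frac):.1f} GB")
```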

To override:

# Use 50% of memory (conservative, leaves room for other apps)
llmkube-metal-agent --memory-fraction 0.5

# Use 90% of memory (dedicated inference machine)
llmkube-metal-agent --memory-fraction 0.9

To set this in the launchd plist:

    <string>--memory-fraction</string>
    <string>0.75</string>                 <!-- 75% of system memory -->

Health Checks & Monitoring

The Metal Agent exposes an HTTP server on 127.0.0.1:9090 (configurable via --port) with health check and Prometheus metrics endpoints. The server binds to localhost only; to expose it for remote Prometheus scraping, use a reverse proxy or SSH tunnel.

Endpoints

| Endpoint | Purpose | Success | Failure |
|----------|---------|---------|---------|
| `GET /healthz` | Liveness probe: agent process is alive | Always 200 | n/a |
| `GET /readyz` | Readiness probe: at least one process healthy (or no processes) | 200 | 503 (all unhealthy) |
| `GET /metrics` | Prometheus metrics | 200 | n/a |
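
The readiness rule is easy to misread, so here it is as a small sketch. This is an illustration of the rule described in the endpoint list, not the agent's source; readyz_status is a hypothetical name.

```python
from typing import Mapping


def readyz_status(process_healthy: Mapping[str, bool]) -> int:
    """HTTP status /readyz would return for a map of process name -> healthy."""
    if not process_healthy:
        return 200  # no managed processes yet still counts as ready
    if any(process_healthy.values()):
        return 200  # at least one healthy process
    return 503      # every managed process is unhealthy
```

Note that a single healthy process keeps the agent ready even if others are failing, so use the per-process metrics to detect individual failures.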

Prometheus Metrics

| Metric | Type | Description |
|--------|------|-------------|
| `llmkube_metal_agent_managed_processes` | Gauge | Number of llama-server processes currently managed |
| `llmkube_metal_agent_process_healthy` | Gauge | Whether a process is healthy (1) or not (0). Labels: `name`, `namespace` |
| `llmkube_metal_agent_process_restarts_total` | Counter | Total process restarts triggered by health monitoring. Labels: `name`, `namespace` |
| `llmkube_metal_agent_health_check_duration_seconds` | Histogram | Duration of health check probes. Labels: `name`, `namespace` |
| `llmkube_metal_agent_memory_budget_bytes` | Gauge | Total memory budget for model serving |
| `llmkube_metal_agent_memory_estimated_bytes` | Gauge | Estimated memory per process. Labels: `name`, `namespace` |

Standard Go runtime and process metrics (go_*, process_*) are also available.

Continuous Health Monitoring

The agent polls each managed llama-server process every 30 seconds via its /health endpoint. On failure:

The process is marked unhealthy (Healthy=false, process_healthy gauge set to 0)
The agent re-fetches the InferenceService from Kubernetes
ensureProcess() is called to restart the llama-server
The process_restarts_total counter is incremented

When a previously unhealthy process recovers, it is marked healthy again automatically.
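
The failure and recovery behavior can be sketched as one monitoring pass. This is a simplified model under stated assumptions, not the agent's code: ManagedProcess and health_pass are hypothetical names, probe stands in for the HTTP call to /health, and restart stands in for re-fetching the InferenceService and calling ensureProcess().

```python
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass
class ManagedProcess:
    name: str
    namespace: str
    healthy: bool = True
    restarts: int = 0   # mirrors the process_restarts_total counter


def health_pass(processes: Iterable[ManagedProcess],
                probe: Callable[[ManagedProcess], bool],
                restart: Callable[[ManagedProcess], None]) -> None:
    """Run one health-check pass over all managed processes."""
    for proc in processes:
        if probe(proc):
            proc.healthy = True    # recovery is picked up automatically
            continue
        proc.healthy = False       # process_healthy gauge would go to 0
        restart(proc)              # re-fetch spec and restart llama-server
        proc.restarts += 1


# A real loop would call health_pass(...) and then sleep for 30 seconds.
```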

Scraping with Prometheus

The health server binds to 127.0.0.1 only. If Prometheus runs on the same Mac, scrape it directly:

scrape_configs:
  - job_name: 'llmkube-metal-agent'
    static_configs:
      - targets: ['localhost:9090']
        labels:
          instance: 'metal-agent'

For remote Prometheus, open an SSH tunnel from the Prometheus host: ssh -L 9090:localhost:9090 <your-mac>. Prometheus can then scrape localhost:9090 on its own machine.

Quick verification:

# Check all endpoints
curl http://localhost:9090/healthz   # → "ok"
curl http://localhost:9090/readyz    # → "ready" or "not ready"
curl http://localhost:9090/metrics   # → Prometheus text format

# Check specific metric
curl -s http://localhost:9090/metrics | grep llmkube_metal_agent_managed_processes
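
For scripted checks, the metrics text can be parsed without a Prometheus server. The sketch below is a deliberately small parser that assumes samples carry no trailing timestamp; metric_samples is a hypothetical name, and the sample values in the usage comment are made up.

```python
from typing import Dict


def metric_samples(text: str, name: str) -> Dict[str, float]:
    """Map each series of one metric (with its labels) to its value."""
    samples: Dict[str, float] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        series, _, value = line.rpartition(" ")
        # Match the bare metric name or the name followed by a label set.
        if series == name or series.startswith(name + "{"):
            samples[series] = float(value)
    return samples


# Example against a live agent:
#   from urllib.request import urlopen
#   body = urlopen("http://localhost:9090/metrics").read().decode()
#   print(metric_samples(body, "llmkube_metal_agent_process_healthy"))
```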

Troubleshooting

Agent won't start

# Check logs
cat /tmp/llmkube-metal-agent.log

# Verify llama-server is installed
which llama-server

# Verify the agent binary is on your PATH and runs
llmkube-metal-agent --version

Metal not detected

# Verify GPU info
system_profiler SPDisplaysDataType

# Check for Metal support
system_profiler SPDisplaysDataType | grep "Metal"

Model rejected with InsufficientMemory

The Metal Agent performs a pre-flight memory check before starting each model. If the estimated memory exceeds the budget, the InferenceService status will show InsufficientMemory:

# Check the scheduling status
kubectl get inferenceservices -o wide

# View the detailed message
kubectl get isvc <name> -o jsonpath='{.status.schedulingMessage}'

To resolve:

Use a smaller quantization (e.g. Q4_K_M instead of Q8_0) to reduce model weight size
Reduce context size in the InferenceService spec to lower KV cache requirements
Increase the memory fraction with --memory-fraction 0.9 if this is a dedicated inference machine
Close other applications to free unified memory

Can't connect to Kubernetes

# Verify kubectl can reach your cluster
kubectl get nodes

# Check which context is active
kubectl config current-context

# Check kubeconfig path
echo $KUBECONFIG

# If using minikube locally
minikube status

Remote cluster: pods can't reach llama-server

# Verify --host-ip was set correctly
# The IP must be reachable from the K8s nodes
ping <your-mac-ip>   # run from a K8s node

# Check that the endpoint was registered with the right IP
kubectl get endpoints -l llmkube.dev/accelerator=metal

# Verify firewall isn't blocking the llama-server port (default 8080+)
# macOS may prompt to allow incoming connections on first run

# If using Tailscale / WireGuard, verify the tunnel is up
tailscale status   # or wg show
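
ping only proves that the host is reachable, not that the llama-server port is open. A minimal TCP check, run from a K8s node or a debug pod, can confirm that. This is a sketch: can_connect is a hypothetical helper, and port 8080 in the usage comment is the default starting port mentioned above, so the actual port may differ.

```python
import socket


def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


# Example:
#   print(can_connect("192.168.1.50", 8080))
```

A False result alongside a successful ping usually points to the macOS firewall or to llama-server not listening on that address.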

Uninstallation

# Using Makefile
make uninstall-metal-agent

# Or manually
launchctl unload ~/Library/LaunchAgents/com.llmkube.metal-agent.plist
sudo rm /usr/local/bin/llmkube-metal-agent
rm ~/Library/LaunchAgents/com.llmkube.metal-agent.plist

How It Works

Metal Agent runs as a native macOS process (not in Kubernetes)
Watches for InferenceService resources in Kubernetes
Downloads models from HuggingFace when needed
Validates that the model fits in the system's memory budget
Spawns llama-server processes with Metal acceleration
Registers service endpoints back to Kubernetes
Monitors process health every 30s and auto-restarts on failure
Exposes health checks and Prometheus metrics on port 9090
Pods access the Metal-accelerated inference via Service endpoints

Remote cluster (Recommended)

K8s runs on a Linux server or cloud; the Mac dedicates all resources to inference:

┌──────────────────────────────┐        ┌──────────────────────────────┐
│ Linux Server / Cloud         │        │ macOS (Your Mac)             │
│                              │        │                              │
│  ┌────────────────────────┐  │  LAN/  │  ┌────────────────────────┐  │
│  │ Kubernetes             │  │  VPN/  │  │ Metal Agent            │  │
│  │  LLMKube Operator      │  │  TLS   │  │  --host-ip <mac-ip>    │  │
│  │  InferenceService CRD  │◄─┼────────┼─►│  Watches K8s API       │  │
│  │  Service → Mac IP      │  │        │  │  Spawns llama-server   │  │
│  └────────────────────────┘  │        │  └────────────────────────┘  │
│                              │        │               ↓              │
│                              │        │  ┌────────────────────────┐  │
│                              │        │  │ llama-server (Metal)   │  │
│                              │        │  │  Direct GPU access ✅  │  │
│                              │        │  │  All unified memory    │  │
│                              │        │  └────────────────────────┘  │
└──────────────────────────────┘        └──────────────────────────────┘

Co-located (minikube on same Mac)

Everything on one machine — simpler but minikube consumes resources:

┌─────────────────────────────────────────────────┐
│              macOS (Your Mac)                    │
│                                                  │
│  ┌──────────────────────────────────────────┐   │
│  │   Minikube (Kubernetes in VM)            │   │
│  │   - Creates InferenceService CRD         │   │
│  │   - Service points to host               │   │
│  └──────────────────────────────────────────┘   │
│                     ↓                            │
│  ┌──────────────────────────────────────────┐   │
│  │   Metal Agent (Native Process)           │   │
│  │   - Watches K8s for InferenceService     │   │
│  │   - Spawns llama-server with Metal       │   │
│  └──────────────────────────────────────────┘   │
│                     ↓                            │
│  ┌──────────────────────────────────────────┐   │
│  │   llama-server (Metal Accelerated)       │   │
│  │   - Runs on localhost:8080+              │   │
│  │   - Direct Metal GPU access ✅           │   │
│  └──────────────────────────────────────────┘   │
└─────────────────────────────────────────────────┘

Performance

Expected performance on M4 Max (32 GPU cores):

Llama 3.2 3B: 80-120 tok/s (generation)
Llama 3.1 8B: 40-60 tok/s (generation)
Mistral 7B: 45-65 tok/s (generation)

Actual throughput varies with quantization and context length. Performance is comparable to Ollama on the same hardware, with Kubernetes orchestration on top.

Security

Agent runs as your user (not root)
Models stored in /tmp/llmkube-models (configurable)
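The health and metrics server binds to localhost only; llama-server also runs on localhost in the co-located setup, but with --host-ip it must be reachable from the cluster nodes on the Mac's IP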
Service endpoints use ClusterIP (not exposed externally)

Support

GitHub Issues: https://github.com/defilantech/llmkube/issues
Documentation: https://github.com/defilantech/llmkube#metal-support
