The response will contain the model's reply:
}
```

## NVIDIA NIM Support

Docker Model Runner supports running NVIDIA NIM (NVIDIA Inference Microservices) containers directly. This provides a simplified workflow for deploying NVIDIA's optimized inference containers.

### Prerequisites

- Docker with NVIDIA GPU support (the NVIDIA Container Toolkit, or the legacy nvidia-docker2 runtime); a quick check is sketched below
- An NGC API key (required by some NIM models)
- A Docker login to the nvcr.io registry
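
If you're not sure whether Docker can see your GPU, a quick check is to run `nvidia-smi` inside a CUDA base image (the image tag below is illustrative; any recent `nvidia/cuda` tag works):

```bash
# Should print the nvidia-smi table listing your GPU(s)
docker run --rm --gpus all nvidia/cuda:12.3.1-base-ubuntu22.04 nvidia-smi
```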

### Quick Start

1. **Log in to the NVIDIA Container Registry** (the username is the literal string `$oauthtoken`; the password is your NGC API key):

```bash
docker login nvcr.io
Username: $oauthtoken
Password: <PASTE_API_KEY_HERE>
```

2. **Set NGC API Key (if required by the model):**

```bash
export NGC_API_KEY=<PASTE_API_KEY_HERE>
```

3. **Run a NIM model:**

```bash
docker model run nvcr.io/nim/google/gemma-3-1b-it:latest
```

That's it! The Docker Model Runner will:
- Automatically detect that the image is a NIM image
- Pull the NIM container image
- Configure it with GPU support, shared memory (16GB), and NGC credentials
- Start the container and wait for it to become ready
- Provide an interactive chat interface

A roughly equivalent manual `docker run` is sketched below.
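
This sketch follows NVIDIA's documented NIM quick-start pattern; the exact flags and mount paths Model Runner uses internally may differ:

```bash
# Illustrative manual equivalent of `docker model run` for a NIM image
export NGC_API_KEY=<PASTE_API_KEY_HERE>
export LOCAL_NIM_CACHE=~/.cache/nim
mkdir -p "$LOCAL_NIM_CACHE"

docker run -it --rm \
    --gpus all \
    --shm-size=16GB \
    -e NGC_API_KEY \
    -v "$LOCAL_NIM_CACHE:/opt/nim/.cache" \
    -u $(id -u) \
    -p 127.0.0.1:8000:8000 \
    nvcr.io/nim/google/gemma-3-1b-it:latest
```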

### Features

- **Automatic GPU Detection**: Configures NVIDIA GPU support when a compatible GPU is available
- **Persistent Caching**: Models are cached in `~/.cache/nim` (or in `$LOCAL_NIM_CACHE`, if set)
- **Interactive Chat**: Supports both single-prompt and interactive chat modes
- **Container Reuse**: Existing NIM containers are reused across runs

### Example Usage

**Single prompt:**
```bash
docker model run nvcr.io/nim/google/gemma-3-1b-it:latest "Explain quantum computing"
```

**Interactive chat:**
```bash
docker model run nvcr.io/nim/google/gemma-3-1b-it:latest
> Tell me a joke
...
> /bye
```

### Configuration

- **NGC_API_KEY**: Set this environment variable to authenticate with NVIDIA's services
- **LOCAL_NIM_CACHE**: Overrides the model cache location (default: `~/.cache/nim`); both variables are shown in the example below
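
For example, to authenticate and move the cache onto a larger volume before a run (the cache path here is illustrative):

```bash
export NGC_API_KEY=<PASTE_API_KEY_HERE>
export LOCAL_NIM_CACHE=/mnt/models/nim-cache   # illustrative location
mkdir -p "$LOCAL_NIM_CACHE"
docker model run nvcr.io/nim/google/gemma-3-1b-it:latest
```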

### Technical Details

NIM containers:
- Run on port 8000, bound to localhost only (see the example request after this list)
- Use 16GB of shared memory by default
- Mount `~/.cache/nim` for model caching
- Support NVIDIA GPU acceleration when available
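
Because NIM containers expose an OpenAI-compatible HTTP API, you can also query a running container directly. A minimal sketch, assuming the default port and that the served model name matches the image's model path:

```bash
# Query the NIM container's OpenAI-compatible chat endpoint
curl -s http://localhost:8000/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{
        "model": "google/gemma-3-1b-it",
        "messages": [{"role": "user", "content": "Say hello in one sentence."}]
      }'
```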

## Metrics

The Model Runner exposes [the Prometheus-compatible metrics endpoint](https://github.com/ggml-org/llama.cpp/tree/master/tools/server#get-metrics-prometheus-compatible-metrics-exporter) of the llama.cpp server at `/metrics`. This lets you monitor model performance, request statistics, and resource usage.
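
For example, assuming Model Runner is reachable on its default TCP port (12434 in current releases; adjust for your setup):

```bash
# Scrape Prometheus-compatible metrics from the Model Runner
curl -s http://localhost:12434/metrics
```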