}
```
## NVIDIA NIM Support

Docker Model Runner supports running NVIDIA NIM (NVIDIA Inference Microservices) containers directly. This provides a simplified workflow for deploying NVIDIA's optimized inference containers.
### Prerequisites

- Docker with NVIDIA GPU support (nvidia-docker2 or Docker with the NVIDIA Container Runtime); a quick check is sketched after this list
- An NGC API key (optional; required for some NIM models)
- A Docker login to the nvcr.io registry
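Before pulling a NIM image, it can be worth confirming that Docker can actually see the GPU. A minimal sketch, assuming a recent CUDA base image (the exact tag is only an example):

```bash
# If the NVIDIA runtime is wired up correctly, this prints the usual nvidia-smi GPU table
docker run --rm --gpus all nvidia/cuda:12.4.1-base-ubuntu22.04 nvidia-smi
```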
### Quick Start

1. **Log in to the NVIDIA Container Registry:**

   ```bash
   docker login nvcr.io
   Username: $oauthtoken
   Password: <PASTE_API_KEY_HERE>
   ```

2. **Set the NGC API key (if required by the model):**

   ```bash
   export NGC_API_KEY=<PASTE_API_KEY_HERE>
   ```

3. **Run a NIM model:**

   ```bash
   docker model run nvcr.io/nim/google/gemma-3-1b-it:latest
   ```
That's it! The Docker Model Runner will:

- Automatically detect that this is a NIM image
- Pull the NIM container image
- Configure it with proper GPU support, shared memory (16GB), and NGC credentials
- Start the container and wait for it to be ready (it shows up in `docker ps`, as sketched below)
- Provide an interactive chat interface
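The started container is a regular Docker container, so standard tooling applies. For example, to confirm it is up (plain `docker ps`; the image reference matches the model you ran):

```bash
# Lists the running NIM container started by the Model Runner, if any
docker ps --filter "ancestor=nvcr.io/nim/google/gemma-3-1b-it:latest"
```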
### Features

- **Automatic GPU Detection**: Automatically configures NVIDIA GPU support if available
- **Persistent Caching**: Models are cached in `~/.cache/nim` (or `$LOCAL_NIM_CACHE` if set); inspecting the cache is sketched after this list
- **Interactive Chat**: Supports both single-prompt and interactive chat modes
- **Container Reuse**: Existing NIM containers are reused across runs
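Since the cache is a plain directory on the host, you can inspect it with ordinary shell commands (nothing NIM-specific here):

```bash
# Show how much disk the cached models use, then list the cached entries
du -sh "${LOCAL_NIM_CACHE:-$HOME/.cache/nim}"
ls "${LOCAL_NIM_CACHE:-$HOME/.cache/nim}"
```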
### Example Usage

**Single prompt:**

```bash
docker model run nvcr.io/nim/google/gemma-3-1b-it:latest "Explain quantum computing"
```

**Interactive chat:**

```bash
docker model run nvcr.io/nim/google/gemma-3-1b-it:latest
> Tell me a joke
...
> /bye
```
### Configuration

- **NGC_API_KEY**: Set this environment variable to authenticate with NVIDIA's services
- **LOCAL_NIM_CACHE**: Override the default cache location (default: `~/.cache/nim`); see the combined example below
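For example, to authenticate and keep model weights on a larger disk in one go (a minimal sketch; `/mnt/models/nim-cache` is a hypothetical path):

```bash
export NGC_API_KEY=<PASTE_API_KEY_HERE>
# Hypothetical path; point this at any directory with enough free space
export LOCAL_NIM_CACHE=/mnt/models/nim-cache
docker model run nvcr.io/nim/google/gemma-3-1b-it:latest
```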
### Technical Details

NIM containers:

- Run on port 8000 (localhost only); querying it directly is sketched after this list
- Use 16GB shared memory by default
- Mount `~/.cache/nim` for model caching
- Support NVIDIA GPU acceleration when available
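Because the container listens on localhost:8000, a running model can also be queried over HTTP. A minimal sketch, assuming the OpenAI-compatible `/v1/chat/completions` route that NIM images commonly expose; the `model` field must match the image you started:

```bash
curl -s http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "google/gemma-3-1b-it",
        "messages": [{"role": "user", "content": "Say hello in one sentence."}]
      }'
```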
## Metrics
The Model Runner exposes [the metrics endpoint](https://github.com/ggml-org/llama.cpp/tree/master/tools/server#get-metrics-prometheus-compatible-metrics-exporter) of the llama.cpp server at `/metrics`. This allows you to monitor model performance, request statistics, and resource usage.
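A minimal sketch of scraping it, assuming the Model Runner's TCP endpoint is enabled on its default port 12434 (adjust host and port to your setup):

```bash
# Prometheus-compatible metrics from the underlying llama.cpp server
curl -s http://localhost:12434/metrics
```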