Skip to content

RaidanAi/bring-llms-to-nim

 
 

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

37 Commits
 
 
 
 
 
 
 
 
 
 

Repository files navigation

NVIDIA AI Blueprint: Bring Your LLM to NIM

Deploy your own Large Language Model (LLM) using NVIDIA NIM - a powerful service that automatically optimizes your model for maximum performance on NVIDIA GPUs.

What is NIM?

NVIDIA NIM is a containerized solution that:

  • Automatically optimizes your LLM for the best performance
  • Handles all the complexity of model deployment and serving
  • Provides an OpenAI-compatible API for easy integration
  • Supports multiple model formats without manual conversion

What You'll Learn

This repository contains three hands-on notebooks showing different deployment approaches:

Best for: Quick deployment of popular models from HuggingFace

  • Deploy models directly from HuggingFace with a single command
  • Download and deploy from local storage for offline use
  • Choose between TensorRT-LLM and vLLM backends

Best for: Maximum performance with custom optimization

  • Convert HuggingFace models to TensorRT-LLM format
  • Compile optimized engines for production deployment
  • Fine-tune performance parameters

Best for: Memory-efficient deployment with quantized models

  • Deploy pre-quantized GGUF models (4-bit, 8-bit, etc.)
  • Reduce GPU memory requirements significantly
  • Handle external configuration requirements

Important

NVIDIA cannot guarantee the security of any models hosted on non-NVIDIA systems such as HuggingFace. Malicious or insecure models can result in serious security risks up to and including full remote code execution. We strongly recommend that before attempting to load it you manually verify the safety of any model not provided by NVIDIA, through such mechanisms as a) ensuring that the model weights are serialized using the safetensors format, b) conducting a manual review of any model or inference code to ensure that it is free of obfuscated or malicious code, and c) validating the signature of the model, if available, to ensure that it comes from a trusted source and has not been modified.

What You'll Learn

  • What is NIM? NVIDIA Inference Microservice (NIM) is a containerized solution that automatically optimizes LLM deployment
  • Why use NIM? Handles model optimization, backend selection, and scaling automatically
  • What you need to know: Basic Docker commands and Python. No prior LLM deployment experience required.

Quick Start: Deploy Your First Model in 3 Minutes

Step 1: Pull the NIM container

docker pull nvcr.io/nim/nvidia/llm-nim:latest

Step 2: Deploy a small model

docker run -it --rm --name=my-first-nim \
  --runtime=nvidia --gpus all \
  -p 8000:8000 \
  -e NIM_MODEL_NAME="hf://Qwen/Qwen2.5-0.5B" \
  nvcr.io/nim/nvidia/llm-nim:latest

Step 3: Test it

curl http://localhost:8000/v1/completions \
  -d '{"model": "Qwen/Qwen2.5-0.5B", "prompt": "Hello world"}'

Overview

This blueprint demonstrates how to deploy almost any Large Language Model using NVIDIA NIM (NVIDIA Inference Microservice). NIM provides a streamlined way to deploy and serve LLMs with optimized performance, handling model analysis, backend selection, and configuration automatically.

Key capabilities:

  • Deploy models directly from Hugging Face Hub
  • Deploy models from local storage
  • Multiple inference backends (TensorRT-LLM, vLLM)
  • Customizable deployment parameters
  • OpenAI-compatible REST API

Software Components

  • NVIDIA NIM LLM Container: nvcr.io/nim/nvidia/llm-nim:latest
  • Supported Inference Backends:
    • TensorRT-LLM (high performance)
    • vLLM (versatile deployment)
  • Container Runtime: Docker with NVIDIA Container Runtime
  • Model Sources: Hugging Face Hub, Local filesystem

Hardware Requirements

  • GPU: NVIDIA GPU with compute capability 7.0+ (V100, T4, A10, A100, H100, etc.)
  • GPU Memory: Varies by model size (minimum 8GB for small models, 40GB+ for large models)
  • System Memory: 16GB+ RAM recommended
  • Storage: Sufficient disk space for model weights and cache
  • NVIDIA Driver: Version 535+ required

API Definition

The deployed NIM service provides OpenAI-compatible REST API endpoints:

  • Health Check: GET /v1/health/ready
  • Completions: POST /v1/completions
  • Chat Completions: POST /v1/chat/completions
  • Models: GET /v1/models

Example completion request:

{
  "model": "model-name",
  "prompt": "Your prompt here",
  "max_tokens": 250,
  "temperature": 0.7
}

Deployment

Prerequisites

  • NVIDIA Developer License: Access to NVIDIA Container Registry
  • NGC API Key: Required for pulling NIM containers (Generate here)
  • Hugging Face Token: Required for downloading models from Hugging Face Hub
  • Docker: With NVIDIA Container Runtime installed

About

Nvidia

Resources

License

Security policy

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors

Languages

  • Jupyter Notebook 100.0%