Commit baa747d: Add NVIDIA TensorRT-LLM optimization guide for GPT-OSS models

articles/run-nvidia.md (1 file changed, 123 additions, 0 deletions)

# Optimizing OpenAI GPT-OSS Models with NVIDIA TensorRT-LLM

This notebook provides a step-by-step guide on how to optimize `gpt-oss` models using NVIDIA's TensorRT-LLM for high-performance inference. TensorRT-LLM provides an easy-to-use Python API to define Large Language Models (LLMs) and supports state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to create Python and C++ runtimes that orchestrate the inference execution in a performant way.

TensorRT-LLM supports both models:
- `gpt-oss-20b`
- `gpt-oss-120b`

In this guide, we will run `gpt-oss-20b`. If you want to try the larger model or need more customization, refer to [this](https://github.com/NVIDIA/TensorRT-LLM/tree/main/docs/source/blogs/tech_blog) deployment guide.

Note: It’s important to ensure that your input prompts follow the [harmony response](http://cookbook.openai.com/articles/openai-harmony) format, as the model will not function correctly otherwise. This is not needed for the examples in this guide.
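
For reference, here is a minimal sketch of rendering a conversation into harmony-format tokens with the `openai_harmony` package from the linked article (installable via `pip install openai-harmony`); the class and method names below come from that article and are not part of TensorRT-LLM itself:

```python
# Minimal harmony-format sketch based on the openai-harmony package referenced
# above; not needed for the completion-style prompts used later in this guide.
from openai_harmony import (
    Conversation,
    HarmonyEncodingName,
    Message,
    Role,
    load_harmony_encoding,
)

encoding = load_harmony_encoding(HarmonyEncodingName.HARMONY_GPT_OSS)
conversation = Conversation.from_messages(
    [Message.from_role_and_content(Role.USER, "What is the capital of France?")]
)

# Token IDs that can be fed to the model as the prompt for an assistant completion.
prefill_token_ids = encoding.render_conversation_for_completion(conversation, Role.ASSISTANT)
print(prefill_token_ids[:16])
```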

## Prerequisites

### Hardware
To run the 20B model and the TensorRT-LLM build process, you will need an NVIDIA GPU with at least 16 GB of VRAM.

> Recommended GPUs: NVIDIA RTX 50 Series (e.g. RTX 5090), NVIDIA H100, or L40S.

### Software
- CUDA Toolkit 12.8 or later
- Python 3.12 or later
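
As a quick sanity check before installing anything, you can confirm that a suitable GPU is visible. The sketch below assumes PyTorch is already available (it ships inside the TensorRT-LLM containers used later in this guide):

```python
# Sanity-check the visible GPU and its VRAM (assumes PyTorch is installed).
import torch

assert torch.cuda.is_available(), "No CUDA-capable GPU detected"
props = torch.cuda.get_device_properties(0)
vram_gib = props.total_memory / 1024**3  # should be roughly 16 GiB or more for gpt-oss-20b
print(f"GPU: {props.name}, VRAM: {vram_gib:.1f} GiB, "
      f"compute capability: sm_{props.major}{props.minor}")
```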

## Installing TensorRT-LLM

There are various ways to install TensorRT-LLM. In this guide, we will either use the pre-built Docker container from NVIDIA NGC or build the container from source.

## Using NGC

Pull the pre-built TensorRT-LLM container for GPT-OSS from NVIDIA NGC. This is the easiest way to get started and ensures all dependencies are included.

```bash
docker pull nvcr.io/nvidia/tensorrt-llm/release:gpt-oss-dev
docker run --gpus all -it --rm -v $(pwd):/workspace nvcr.io/nvidia/tensorrt-llm/release:gpt-oss-dev
```

## Using Docker (build from source)

Alternatively, you can build the TensorRT-LLM container from source. This is useful if you want to modify the source code or use a custom branch. See the official instructions here: https://github.com/NVIDIA/TensorRT-LLM/tree/feat/gpt-oss/docker

The following commands install the required dependencies, clone the repository, check out the GPT-OSS feature branch, and build the Docker container:

```bash
# Update package lists and install required system packages
sudo apt-get update && sudo apt-get -y install git git-lfs build-essential cmake

# Initialize Git LFS (Large File Storage) for handling large model files
git lfs install

# Clone the TensorRT-LLM repository
git clone https://github.com/NVIDIA/TensorRT-LLM.git
cd TensorRT-LLM

# Check out the branch with GPT-OSS support
git checkout feat/gpt-oss

# Initialize and update submodules (required for the build)
git submodule update --init --recursive

# Pull large files (e.g., model weights) managed by Git LFS
git lfs pull

# Build the release Docker image
make -C docker release_build

# Run the built Docker container
make -C docker release_run
```

TensorRT-LLM will be available through pip soon.

> Note on GPU Architecture: The first time you run the model, TensorRT-LLM will build an optimized engine for your specific GPU architecture (e.g., Hopper, Ada, or Blackwell). If you see warnings about your GPU's CUDA capability (e.g., sm_90, sm_120) not being compatible with the PyTorch installation, ensure you have the latest NVIDIA drivers and a matching CUDA Toolkit version for your version of PyTorch.
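
To check the compatibility described in the note above, you can compare the GPU's compute capability with the architectures your PyTorch build was compiled for; this is a diagnostic sketch, not an official TensorRT-LLM tool:

```python
# Compare the GPU's compute capability with the architectures baked into this
# PyTorch build; a mismatch here is what produces the sm_90/sm_120 warnings above.
import torch

print("PyTorch version:", torch.__version__)
print("CUDA version (PyTorch build):", torch.version.cuda)
print("GPU:", torch.cuda.get_device_name(0))
print("GPU compute capability:", torch.cuda.get_device_capability(0))
print("Architectures supported by this build:", torch.cuda.get_arch_list())
```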

# Verifying TensorRT-LLM Installation

```python
# If this import succeeds, TensorRT-LLM and its Python bindings are installed correctly.
from tensorrt_llm import LLM, SamplingParams
```
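
If the import succeeds, you can also print the installed version to confirm which build you are running (assuming the package exposes the usual `__version__` attribute):

```python
import tensorrt_llm

# Print the installed TensorRT-LLM version.
print(tensorrt_llm.__version__)
```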

# Utilizing the TensorRT-LLM Python API

In the next code cell, we will demonstrate how to use the TensorRT-LLM Python API to:
1. Download the specified model weights from Hugging Face.
2. Automatically build the TensorRT engine for your GPU architecture if it does not already exist.
3. Load the model and prepare it for inference.
4. Run a simple text generation example to verify everything is working.

**Note**: The first run may take several minutes as it downloads the model and builds the engine. Subsequent runs will be much faster, as the engine will be cached.

```python
# Download gpt-oss-20b from the Hugging Face Hub and build/load the optimized engine.
llm = LLM(model="openai/gpt-oss-20b")
```

```python
# Run batched generation over a couple of prompts to verify the engine works.
prompts = ["Hello, my name is", "The capital of France is"]
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)
for output in llm.generate(prompts, sampling_params):
    print(f"Prompt: {output.prompt!r}, Generated text: {output.outputs[0].text!r}")
```

# Conclusion and Next Steps
Congratulations! You have successfully optimized and run a large language model using the TensorRT-LLM Python API.

In this notebook, you have learned how to:
- Set up your environment with the necessary dependencies.
- Use the `tensorrt_llm.LLM` API to download a model from the Hugging Face Hub.
- Automatically build a high-performance TensorRT engine tailored to your GPU.
- Run inference with the optimized model.

You can explore more advanced features to further improve performance and efficiency:

- Benchmarking: Try running a [benchmark](https://nvidia.github.io/TensorRT-LLM/performance/performance-tuning-guide/benchmarking-default-performance.html#benchmarking-with-trtllm-bench) to compare the latency and throughput of the TensorRT-LLM engine against the original Hugging Face model. You can do this by iterating over a larger number of prompts and measuring the execution time (see the sketch after this list).

- Quantization: TensorRT-LLM [supports](https://github.com/NVIDIA/TensorRT-Model-Optimizer) various quantization techniques (such as INT8 or FP8) to reduce model size and accelerate inference with minimal impact on accuracy. This is a powerful feature for deploying models on resource-constrained hardware.

- Deploy with NVIDIA Dynamo: For production environments, you can deploy your TensorRT-LLM engine with [NVIDIA Dynamo](https://docs.nvidia.com/dynamo/latest/) for robust, scalable, and multi-model serving.
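
Below is a minimal timing sketch for the benchmarking idea above. It reuses the `llm` and `SamplingParams` objects from earlier cells; the batch size, `max_tokens` value, and the `token_ids` field on the outputs are assumptions about your setup and TensorRT-LLM version, so prefer `trtllm-bench` (linked above) for rigorous numbers:

```python
# Rough throughput measurement over a batch of prompts; not a rigorous benchmark.
import time

bench_prompts = ["Write a haiku about GPUs."] * 32  # arbitrary small batch
bench_params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=128)

start = time.perf_counter()
outputs = llm.generate(bench_prompts, bench_params)
elapsed = time.perf_counter() - start

# Count generated tokens via the token_ids field on each completion output.
generated = sum(len(o.outputs[0].token_ids) for o in outputs if o.outputs)
print(f"{len(bench_prompts)} prompts in {elapsed:.2f} s "
      f"({generated / elapsed:.1f} generated tokens/s)")
```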
