
Commit 1f7a931

Convert NVIDIA TensorRT guide to Jupyter notebook format
1 parent baa747d commit 1f7a931

File tree

4 files changed: +233 −123 lines


articles/run-nvidia.ipynb

Lines changed: 219 additions & 0 deletions
@@ -0,0 +1,219 @@
1+
{
2+
"cells": [
3+
{
4+
"cell_type": "markdown",
5+
"metadata": {},
6+
"source": [
7+
"# Optimizing OpenAI GPT-OSS Models with NVIDIA TensorRT-LLM"
8+
]
9+
},
10+
{
11+
"cell_type": "markdown",
12+
"metadata": {},
13+
"source": [
14+
"This notebook provides a step-by-step guide on how to optimizing `gpt-oss` models using NVIDIA's TensorRT-LLM for high-performance inference. TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and support state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to create Python and C++ runtimes that orchestrate the inference execution in performant way.\n",
15+
"\n",
16+
"\n",
17+
"TensorRT-LLM supports both models:\n",
18+
"- `gpt-oss-20b`\n",
19+
"- `gpt-oss-120b`\n",
20+
"\n",
21+
"In this guide, we will run `gpt-oss-20b`, if you want to try the larger model or want more customization refer to [this](https://github.com/NVIDIA/TensorRT-LLM/tree/main/docs/source/blogs/tech_blog) deployment guide."
22+
]
23+
},
24+
{
25+
"cell_type": "markdown",
26+
"metadata": {},
27+
"source": [
28+
"## Prerequisites"
29+
]
30+
},
31+
{
32+
"cell_type": "markdown",
33+
"metadata": {},
34+
"source": [
35+
"### Hardware\n",
36+
"To run the 20B model and the TensorRT-LLM build process, you will need an NVIDIA GPU with at least 20 GB of VRAM.\n",
37+
"\n",
38+
"> Recommended GPUs: NVIDIA RTX 50 Series (e.g.RTX 5090), NVIDIA H100, or L40S.\n",
39+
"\n",
40+
"### Software\n",
41+
"- CUDA Toolkit 12.8 or later\n",
42+
"- Python 3.12 or later\n",
43+
"- Access to the Orangina model checkpoint from Hugging Face"
44+
]
45+
},
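{
"cell_type": "markdown",
"metadata": {},
"source": [
"Before installing anything, you can run a quick sanity check that your GPU, driver, and Python interpreter meet the requirements above. This is an optional diagnostic, not part of the TensorRT-LLM setup itself; it assumes the `nvidia-smi` tool that ships with the NVIDIA driver is on your PATH."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Optional sanity check: list visible GPUs, driver version, and available VRAM\n",
"!nvidia-smi\n",
"\n",
"# Confirm the Python version meets the 3.12+ requirement\n",
"import sys\n",
"print(sys.version)"
]
},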
46+
{
47+
"cell_type": "markdown",
48+
"metadata": {},
49+
"source": [
50+
"## Installling TensorRT-LLM"
51+
]
52+
},
53+
{
54+
"cell_type": "markdown",
55+
"metadata": {},
56+
"source": [
57+
"## Using NGC\n",
58+
"\n",
59+
"Pull the pre-built TensorRT-LLM container for GPT-OSS from NVIDIA NGC.\n",
60+
"This is the easiest way to get started and ensures all dependencies are included.\n",
61+
"\n",
62+
"`docker pull nvcr.io/nvidia/tensorrt-llm/release:gpt-oss-dev`\n",
63+
"`docker run --gpus all -it --rm -v $(pwd):/workspace nvcr.io/nvidia/tensorrt-llm/release:gpt-oss-dev`\n",
64+
"\n",
65+
"## Using Docker (build from source)\n",
66+
"\n",
67+
"Alternatively, you can build the TensorRT-LLM container from source.\n",
68+
"This is useful if you want to modify the source code or use a custom branch.\n",
69+
"See the official instructions here: https://github.com/NVIDIA/TensorRT-LLM/tree/feat/gpt-oss/docker\n",
70+
"\n",
71+
"The following commands will install required dependencies, clone the repository,\n",
72+
"check out the GPT-OSS feature branch, and build the Docker container:\n",
73+
" ```\n",
74+
"#Update package lists and install required system packages\n",
75+
"sudo apt-get update && sudo apt-get -y install git git-lfs build-essential cmake\n",
76+
"\n",
77+
"# Initialize Git LFS (Large File Storage) for handling large model files\n",
78+
"git lfs install\n",
79+
"\n",
80+
"# Clone the TensorRT-LLM repository\n",
81+
"git clone https://github.com/NVIDIA/TensorRT-LLM.git\n",
82+
"cd TensorRT-LLM\n",
83+
"\n",
84+
"# Check out the branch with GPT-OSS support\n",
85+
"git checkout feat/gpt-oss\n",
86+
"\n",
87+
"# Initialize and update submodules (required for build)\n",
88+
"git submodule update --init --recursive\n",
89+
"\n",
90+
"# Pull large files (e.g., model weights) managed by Git LFS\n",
91+
"git lfs pull\n",
92+
"\n",
93+
"# Build the release Docker image\n",
94+
"make -C docker release_build\n",
95+
"\n",
96+
"# Run the built Docker container\n",
97+
"make -C docker release_run \n",
98+
"```"
99+
]
100+
},
101+
{
102+
"cell_type": "markdown",
103+
"metadata": {},
104+
"source": [
105+
"TensorRT-LLM will be available through pip soon"
106+
]
107+
},
108+
{
109+
"cell_type": "markdown",
110+
"metadata": {},
111+
"source": [
112+
"> Note on GPU Architecture: The first time you run the model, TensorRT-LLM will build an optimized engine for your specific GPU architecture (e.g., Hopper, Ada, or Blackwell). If you see warnings about your GPU's CUDA capability (e.g., sm_90, sm_120) not being compatible with the PyTorch installation, ensure you have the latest NVIDIA drivers and a matching CUDA Toolkit version for your version of PyTorch."
113+
]
114+
},
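{
"cell_type": "markdown",
"metadata": {},
"source": [
"The snippet below is an optional diagnostic for the note above: it prints your GPU's compute capability (e.g. sm_90) and the CUDA version your PyTorch build was compiled against. It assumes PyTorch is already installed in the container and is not required for the rest of the notebook."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Optional diagnostic: report the GPU compute capability and the CUDA version PyTorch was built against\n",
"import torch\n",
"\n",
"if torch.cuda.is_available():\n",
"    major, minor = torch.cuda.get_device_capability(0)\n",
"    print(f\"GPU: {torch.cuda.get_device_name(0)} (sm_{major}{minor})\")\n",
"    print(f\"PyTorch built with CUDA {torch.version.cuda}\")\n",
"else:\n",
"    print(\"No CUDA device visible to PyTorch\")"
]
},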
115+
{
116+
"cell_type": "markdown",
117+
"metadata": {},
118+
"source": [
119+
"# Verifying TensorRT-LLM Installation"
120+
]
121+
},
122+
{
123+
"cell_type": "code",
124+
"execution_count": null,
125+
"metadata": {},
126+
"outputs": [],
127+
"source": [
128+
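"# If this cell runs without errors, TensorRT-LLM and its dependencies are installed correctly\n",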
"from tensorrt_llm import LLM, SamplingParams"
129+
]
130+
},
131+
{
132+
"cell_type": "markdown",
133+
"metadata": {},
134+
"source": [
135+
"# Utilizing TensorRT-LLM Python API"
136+
]
137+
},
138+
{
139+
"cell_type": "markdown",
140+
"metadata": {},
141+
"source": [
142+
"In the next code cell, we will demonstrate how to use the TensorRT-LLM Python API to:\n",
143+
"1. Download the specified model weights from Hugging Face (using your HF_TOKEN for authentication).\n",
144+
"2. Automatically build the TensorRT engine for your GPU architecture if it does not already exist.\n",
145+
"3. Load the model and prepare it for inference.\n",
146+
"4. Run a simple text generation example to verify everything is working.\n",
147+
"\n",
148+
"**Note**: The first run may take several minutes as it downloads the model and builds the engine.\n",
149+
"Subsequent runs will be much faster, as the engine will be cached."
150+
]
151+
},
152+
{
153+
"cell_type": "code",
154+
"execution_count": null,
155+
"metadata": {},
156+
"outputs": [],
157+
"source": [
158+
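"# First run: downloads the gpt-oss-20b weights from Hugging Face (set HF_TOKEN if needed) and builds/loads the optimized engine\n",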
"llm = LLM(model=\"openai/gpt-oss-20b\")"
159+
]
160+
},
161+
{
162+
"cell_type": "code",
163+
"execution_count": null,
164+
"metadata": {},
165+
"outputs": [],
166+
"source": [
167+
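"# Batched generation: all prompts are processed with the same sampling settings\n",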
"prompts = [\"Hello, my name is\", \"The capital of France is\"]\n",
168+
"sampling_params = SamplingParams(temperature=0.8, top_p=0.95)\n",
169+
"for output in llm.generate(prompts, sampling_params):\n",
170+
" print(f\"Prompt: {output.prompt!r}, Generated text: {output.outputs[0].text!r}\")"
171+
]
172+
},
173+
{
174+
"cell_type": "markdown",
175+
"metadata": {},
176+
"source": [
177+
"# Conclusion and Next Steps\n",
178+
"Congratulations! You have successfully optimized and run a large language model using the TensorRT-LLM Python API.\n",
179+
"\n",
180+
"In this notebook, you have learned how to:\n",
181+
"- Set up your environment with the necessary dependencies.\n",
182+
"- Use the `tensorrt_llm.LLM` API to download a model from the Hugging Face Hub.\n",
183+
"- Automatically build a high-performance TensorRT engine tailored to your GPU.\n",
184+
"- Run inference with the optimized model.\n",
185+
"\n",
186+
"\n",
187+
"You can explore more advanced features to further improve performance and efficiency:\n",
188+
"\n",
189+
"- Benchmarking: Try running a [benchmark](https://nvidia.github.io/TensorRT-LLM/performance/performance-tuning-guide/benchmarking-default-performance.html#benchmarking-with-trtllm-bench) to compare the latency and throughput of the TensorRT-LLM engine against the original Hugging Face model. You can do this by iterating over a larger number of prompts and measuring the execution time.\n",
190+
"\n",
191+
"- Quantization: TensorRT-LLM [supports](https://github.com/NVIDIA/TensorRT-Model-Optimizer) various quantization techniques (like INT8 or FP8) to reduce model size and accelerate inference with minimal impact on accuracy. This is a powerful feature for deploying models on resource-constrained hardware.\n",
192+
"\n",
193+
"- Deploy with NVIDIA Dynamo: For production environments, you can deploy your TensorRT-LLM engine using the [NVIDIA Dynamo](https://docs.nvidia.com/dynamo/latest/) for robust, scalable, and multi-model serving.\n",
194+
"\n"
195+
]
196+
}
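,
{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a starting point for the benchmarking suggestion above, the cell below is a minimal latency/throughput sketch: it times `llm.generate` over a small synthetic batch and reports prompts per second. It reuses the `llm` object created earlier and assumes `SamplingParams` accepts a `max_tokens` cap; for rigorous measurements, prefer the `trtllm-bench` tool linked above."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Minimal benchmark sketch: time a batch of generations and report throughput\n",
"import time\n",
"\n",
"bench_prompts = [\"Summarize the benefits of GPU-accelerated inference.\"] * 32\n",
"bench_params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=128)\n",
"\n",
"start = time.perf_counter()\n",
"results = llm.generate(bench_prompts, bench_params)\n",
"elapsed = time.perf_counter() - start\n",
"\n",
"print(f\"{len(results)} completions in {elapsed:.2f}s ({len(bench_prompts) / elapsed:.2f} prompts/s)\")"
]
}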
197+
],
198+
"metadata": {
199+
"kernelspec": {
200+
"display_name": "Python 3 (ipykernel)",
201+
"language": "python",
202+
"name": "python3"
203+
},
204+
"language_info": {
205+
"codemirror_mode": {
206+
"name": "ipython",
207+
"version": 3
208+
},
209+
"file_extension": ".py",
210+
"mimetype": "text/x-python",
211+
"name": "python",
212+
"nbconvert_exporter": "python",
213+
"pygments_lexer": "ipython3",
214+
"version": "3.12.3"
215+
}
216+
},
217+
"nbformat": 4,
218+
"nbformat_minor": 4
219+
}

articles/run-nvidia.md

Lines changed: 0 additions & 123 deletions
This file was deleted.

authors.yaml

Lines changed: 5 additions & 0 deletions
@@ -2,6 +2,11 @@
22

33
# You can optionally customize how your information shows up cookbook.openai.com over here.
44
# If your information is not present here, it will be pulled from your GitHub profile.
5+
jayrodge:
6+
name: "Jay Rodge"
7+
website: "https://www.linkedin.com/in/jayrodge/"
8+
avatar: "https://developer-blogs.nvidia.com/wp-content/uploads/2024/05/Jay-Rodge.png"
9+
510
rajpathak-openai:
611
name: "Raj Pathak"
712
website: "https://www.linkedin.com/in/rajpathakopenai/"

registry.yaml

Lines changed: 9 additions & 0 deletions
@@ -4,6 +4,15 @@
44
# should build pages for, and indicates metadata such as tags, creation date and
55
# authors for each page.
66

7+
- title: Using NVIDIA TensorRT-LLM to run the 20B model
8+
path: examples/articles/run-nvidia.ipynb
9+
date: 2025-08-05
10+
authors:
11+
- jayrodge
12+
tags:
13+
- nvidia
14+
- tensorrt-llm
15+
716
- title: Temporal Agents with Knowledge Graphs
817
path: examples/partners/temporal_agents_with_knowledge_graphs/temporal_agents_with_knowledge_graphs.ipynb
918
date: 2025-07-22
