# Optimizing OpenAI GPT-OSS Models with NVIDIA TensorRT-LLM

This notebook provides a step-by-step guide on how to optimize `gpt-oss` models using NVIDIA's TensorRT-LLM for high-performance inference. TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and supports state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to create Python and C++ runtimes that orchestrate inference execution in a performant way.

TensorRT-LLM supports both models:
- `gpt-oss-20b`
- `gpt-oss-120b`

In this guide, we will run `gpt-oss-20b`. If you want to try the larger model or need more customization, refer to [this](https://github.com/NVIDIA/TensorRT-LLM/tree/main/docs/source/blogs/tech_blog) deployment guide.

Note: It’s important to ensure that your input prompts follow the [harmony response](http://cookbook.openai.com/articles/openai-harmony) format, as the model will not function correctly otherwise. Manually applying the harmony format is not needed for the simple example in this guide.
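
If you ever need to construct harmony-formatted prompts yourself (for example, to inspect the exact tokens the model expects), the `openai_harmony` package can render a conversation for you. The snippet below is a minimal, optional sketch and assumes the `openai-harmony` package is installed (`pip install openai-harmony`); see the harmony guide linked above for the full API.

```python
# Optional sketch: rendering a harmony-formatted prompt with the openai_harmony package.
from openai_harmony import (
    Conversation,
    HarmonyEncodingName,
    Message,
    Role,
    load_harmony_encoding,
)

encoding = load_harmony_encoding(HarmonyEncodingName.HARMONY_GPT_OSS)

# Build a simple single-turn conversation and render the token IDs the model expects.
convo = Conversation.from_messages(
    [Message.from_role_and_content(Role.USER, "What is the capital of France?")]
)
prefill_ids = encoding.render_conversation_for_completion(convo, Role.ASSISTANT)
print(prefill_ids[:20])
```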

## Prerequisites

### Hardware
To run the 20B model and the TensorRT-LLM build process, you will need an NVIDIA GPU with at least 16 GB of VRAM.

> Recommended GPUs: NVIDIA RTX 50 Series (e.g. RTX 5090), NVIDIA H100, or L40S.

### Software
- CUDA Toolkit 12.8 or later
- Python 3.12 or later

## Installing TensorRT-LLM

There are various ways to install TensorRT-LLM. In this guide, we will cover two options: using the pre-built Docker container from NVIDIA NGC, or building it from source.

## Using NGC

Pull the pre-built TensorRT-LLM container for GPT-OSS from NVIDIA NGC.
This is the easiest way to get started and ensures all dependencies are included.

```bash
docker pull nvcr.io/nvidia/tensorrt-llm/release:gpt-oss-dev
docker run --gpus all -it --rm -v $(pwd):/workspace nvcr.io/nvidia/tensorrt-llm/release:gpt-oss-dev
```

## Using Docker (build from source)

Alternatively, you can build the TensorRT-LLM container from source.
This is useful if you want to modify the source code or use a custom branch.
See the official instructions here: https://github.com/NVIDIA/TensorRT-LLM/tree/feat/gpt-oss/docker

The following commands will install required dependencies, clone the repository,
check out the GPT-OSS feature branch, and build the Docker container:

```bash
# Update package lists and install required system packages
sudo apt-get update && sudo apt-get -y install git git-lfs build-essential cmake

# Initialize Git LFS (Large File Storage) for handling large model files
git lfs install

# Clone the TensorRT-LLM repository
git clone https://github.com/NVIDIA/TensorRT-LLM.git
cd TensorRT-LLM

# Check out the branch with GPT-OSS support
git checkout feat/gpt-oss

# Initialize and update submodules (required for build)
git submodule update --init --recursive

# Pull large files (e.g., model weights) managed by Git LFS
git lfs pull

# Build the release Docker image
make -C docker release_build

# Run the built Docker container
make -C docker release_run
```

TensorRT-LLM will be available through pip soon.

> Note on GPU Architecture: The first time you run the model, TensorRT-LLM will build an optimized engine for your specific GPU architecture (e.g., Hopper, Ada, or Blackwell). If you see warnings about your GPU's CUDA capability (e.g., sm_90, sm_120) not being compatible with the PyTorch installation, ensure you have the latest NVIDIA drivers and a matching CUDA Toolkit version for your version of PyTorch.
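
To quickly check which GPU and CUDA build PyTorch sees inside the container, you can run a short sanity check like the one below. This is a minimal sketch using standard PyTorch calls; it assumes PyTorch is installed, which the TensorRT-LLM containers include.

```python
# Sanity check: report the GPU and CUDA versions visible to PyTorch.
import torch

print("CUDA available:    ", torch.cuda.is_available())
if torch.cuda.is_available():
    print("Device:            ", torch.cuda.get_device_name(0))
    # Compute capability, e.g. (9, 0) for Hopper (sm_90) or (12, 0) for RTX 50 Series (sm_120)
    print("Compute capability:", torch.cuda.get_device_capability(0))
print("PyTorch CUDA build:", torch.version.cuda)
```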

# Verifying TensorRT-LLM Installation

```python
from tensorrt_llm import LLM, SamplingParams
```
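
If the import succeeds, the installation is working. You can also print the installed version, which is useful when comparing against release notes or reporting issues:

```python
# Optional: print the TensorRT-LLM version to confirm which build is installed.
import tensorrt_llm

print(tensorrt_llm.__version__)
```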

# Utilizing the TensorRT-LLM Python API

In the next code cell, we will demonstrate how to use the TensorRT-LLM Python API to:
1. Download the specified model weights from Hugging Face.
2. Automatically build the TensorRT engine for your GPU architecture if it does not already exist.
3. Load the model and prepare it for inference.
4. Run a simple text generation example to verify everything is working.

**Note**: The first run may take several minutes as it downloads the model and builds the engine.
Subsequent runs will be much faster, as the engine will be cached.

```python
llm = LLM(model="openai/gpt-oss-20b")
```
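
If you have multiple GPUs and want to try the larger `gpt-oss-120b` model instead, the same API accepts a tensor-parallel setting. The snippet below is an optional sketch, assuming a multi-GPU machine; the right `tensor_parallel_size` depends on your hardware (see the deployment guide linked earlier for tuned configurations).

```python
# Optional sketch: shard the larger model across multiple GPUs with tensor parallelism.
# Assumes a multi-GPU node (e.g., 2x H100); adjust tensor_parallel_size to your setup.
llm_120b = LLM(
    model="openai/gpt-oss-120b",
    tensor_parallel_size=2,
)
```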

```python
prompts = ["Hello, my name is", "The capital of France is"]
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)
for output in llm.generate(prompts, sampling_params):
    print(f"Prompt: {output.prompt!r}, Generated text: {output.outputs[0].text!r}")
```
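
If you want to cap how many tokens are generated per prompt (handy for quick smoke tests), `SamplingParams` also accepts a `max_tokens` limit. This is a small optional variation on the example above:

```python
# Optional: limit each completion to at most 64 new tokens.
sampling_params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=64)
for output in llm.generate(prompts, sampling_params):
    print(f"Prompt: {output.prompt!r}, Generated text: {output.outputs[0].text!r}")
```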

# Conclusion and Next Steps
Congratulations! You have successfully optimized and run a large language model using the TensorRT-LLM Python API.

In this notebook, you have learned how to:
- Set up your environment with the necessary dependencies.
- Use the `tensorrt_llm.LLM` API to download a model from the Hugging Face Hub.
- Automatically build a high-performance TensorRT engine tailored to your GPU.
- Run inference with the optimized model.


You can explore more advanced features to further improve performance and efficiency:

- Benchmarking: Try running a [benchmark](https://nvidia.github.io/TensorRT-LLM/performance/performance-tuning-guide/benchmarking-default-performance.html#benchmarking-with-trtllm-bench) to compare the latency and throughput of the TensorRT-LLM engine against the original Hugging Face model. You can do this by iterating over a larger number of prompts and measuring the execution time, as in the simple timing sketch after this list.

- Quantization: TensorRT-LLM [supports](https://github.com/NVIDIA/TensorRT-Model-Optimizer) various quantization techniques (like INT8 or FP8) to reduce model size and accelerate inference with minimal impact on accuracy. This is a powerful feature for deploying models on resource-constrained hardware.

- Deploy with NVIDIA Dynamo: For production environments, you can deploy your TensorRT-LLM engine using [NVIDIA Dynamo](https://docs.nvidia.com/dynamo/latest/) for robust, scalable, and multi-model serving.
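
As a rough starting point for the benchmarking idea above, you can time a batch of prompts end to end with the `llm` object from this notebook. This is a minimal, illustrative sketch (the prompt list and character-based length proxy are placeholders); use `trtllm-bench` for rigorous latency and throughput numbers.

```python
import time

# Illustrative batch of prompts; replace with a workload representative of your use case.
bench_prompts = ["Summarize the benefits of GPU inference."] * 32
bench_params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=128)

start = time.perf_counter()
outputs = llm.generate(bench_prompts, bench_params)
elapsed = time.perf_counter() - start

# Crude length proxy based on the generated text; trtllm-bench reports exact token counts.
total_chars = sum(len(o.outputs[0].text) for o in outputs)
print(f"Total time:   {elapsed:.2f} s")
print(f"Requests/s:   {len(bench_prompts) / elapsed:.2f}")
print(f"Generated characters/s: {total_chars / elapsed:.0f}")
```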