The table below contains `trtllm-serve` commands for quickly deploying popular models, including DeepSeek-R1, gpt-oss, Llama 4, Qwen3, and more.
We maintain LLM API configuration files for these models containing recommended performance settings in two locations:
- Curated Examples: `examples/configs/curated` - hand-picked configurations for common scenarios.
- Comprehensive Database: `examples/configs/database` - a broader set of known-good configurations for various GPUs and traffic patterns.
The TensorRT LLM Docker container makes these config files available at `/app/tensorrt_llm/examples/configs/curated` and `/app/tensorrt_llm/examples/configs/database`, respectively. You can reference them as needed:
```bash
export TRTLLM_DIR="/app/tensorrt_llm" # path to the TensorRT LLM repo in your local environment
```

This table is designed to provide a straightforward starting point; for detailed model-specific deployment guides, check out the guides below.
| Model Name | GPU | Inference Scenario | Config | Command |
|---|---|---|---|---|
| Nemotron v3 Super (NVFP4) | B200, GB200 | Max Throughput | nemotron-3-super-throughput.yaml | trtllm-serve nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4 --config ${TRTLLM_DIR}/examples/configs/curated/nemotron-3-super-throughput.yaml |
| DeepSeek-R1 | H100, H200 | Max Throughput | deepseek-r1-throughput.yaml | trtllm-serve deepseek-ai/DeepSeek-R1-0528 --config ${TRTLLM_DIR}/examples/configs/curated/deepseek-r1-throughput.yaml |
| DeepSeek-R1 | B200, GB200 | Max Throughput | deepseek-r1-deepgemm.yaml | trtllm-serve deepseek-ai/DeepSeek-R1-0528 --config ${TRTLLM_DIR}/examples/configs/curated/deepseek-r1-deepgemm.yaml |
| DeepSeek-R1 (NVFP4) | B200, GB200 | Max Throughput | deepseek-r1-throughput.yaml | trtllm-serve nvidia/DeepSeek-R1-FP4 --config ${TRTLLM_DIR}/examples/configs/curated/deepseek-r1-throughput.yaml |
| DeepSeek-R1 (NVFP4) | B200, GB200 | Min Latency | deepseek-r1-latency.yaml | trtllm-serve nvidia/DeepSeek-R1-FP4-v2 --config ${TRTLLM_DIR}/examples/configs/curated/deepseek-r1-latency.yaml |
| gpt-oss-120b | Any | Max Throughput | gpt-oss-120b-throughput.yaml | trtllm-serve openai/gpt-oss-120b --config ${TRTLLM_DIR}/examples/configs/curated/gpt-oss-120b-throughput.yaml |
| gpt-oss-120b | Any | Min Latency | gpt-oss-120b-latency.yaml | trtllm-serve openai/gpt-oss-120b --config ${TRTLLM_DIR}/examples/configs/curated/gpt-oss-120b-latency.yaml |
| Qwen3-Next-80B-A3B-Thinking | Any | Max Throughput | qwen3-next.yaml | trtllm-serve Qwen/Qwen3-Next-80B-A3B-Thinking --config ${TRTLLM_DIR}/examples/configs/curated/qwen3-next.yaml |
| Qwen3 family (e.g. Qwen3-30B-A3B) | Any | Max Throughput | qwen3.yaml | trtllm-serve Qwen/Qwen3-30B-A3B --config ${TRTLLM_DIR}/examples/configs/curated/qwen3.yaml (swap to another Qwen3 model name as needed) |
| Llama-3.3-70B (FP8) | Any | Max Throughput | llama-3.3-70b.yaml | trtllm-serve nvidia/Llama-3.3-70B-Instruct-FP8 --config ${TRTLLM_DIR}/examples/configs/curated/llama-3.3-70b.yaml |
| Llama 4 Scout (FP8) | Any | Max Throughput | llama-4-scout.yaml | trtllm-serve nvidia/Llama-4-Scout-17B-16E-Instruct-FP8 --config ${TRTLLM_DIR}/examples/configs/curated/llama-4-scout.yaml |
| Kimi-K2-Thinking (NVFP4) | B200, GB200 | Max Throughput | kimi-k2-thinking.yaml | trtllm-serve nvidia/Kimi-K2-Thinking-NVFP4 --config ${TRTLLM_DIR}/examples/configs/curated/kimi-k2-thinking.yaml |
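Each config file in the table is a YAML file of LLM API options tuned for its model and scenario. As an illustrative sketch only (the option names below are LLM API settings, but the values are placeholders, not the tuned values shipped in any curated file), a throughput-oriented config might look like:

```yaml
# Illustrative sketch of an LLM API config file.
# Values are placeholders, not the tuned settings from the curated configs.
tensor_parallel_size: 8          # shard model weights across 8 GPUs
max_batch_size: 256              # larger batches favor throughput over latency
max_num_tokens: 8192             # scheduler token budget per iteration
kv_cache_config:
  free_gpu_memory_fraction: 0.9  # fraction of free GPU memory reserved for KV cache
```

The curated files pin these knobs per GPU and traffic pattern, which is why a "Max Throughput" and a "Min Latency" entry for the same model reference different configs.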
The deployment guides below provide more detailed instructions for serving specific models with TensorRT LLM.
.. toctree::
   :maxdepth: 1
   :name: Deployment Guides

   deployment-guide-for-nemotron-3-super-on-trtllm.md
   deployment-guide-for-deepseek-r1-on-trtllm.md
   deployment-guide-for-llama3.3-70b-on-trtllm.md
   deployment-guide-for-llama4-scout-on-trtllm.md
   deployment-guide-for-gpt-oss-on-trtllm.md
   deployment-guide-for-qwen3-on-trtllm.md
   deployment-guide-for-qwen3-next-on-trtllm.md
   deployment-guide-for-kimi-k2-thinking-on-trtllm.md
The table below lists all available pre-configured model scenarios in the TensorRT LLM configuration database. Each row represents a specific model, GPU, and performance profile combination with recommended request settings.

.. trtllm_config_selector::