This section provides the steps to run the Llama 3.3 70B model in FP8 precision on the PyTorch backend by launching the TensorRT-LLM server and running performance benchmarks.
### Prepare TensorRT-LLM extra configs
```bash
cat >./extra-llm-api-config.yml <<EOF
stream_interval: 2
cuda_graph_config:
  max_batch_size: 1024
  padding_enabled: true
EOF
```
Explanation:
- `stream_interval`: The iteration interval at which responses are generated in streaming mode.
- `cuda_graph_config`: CUDA Graph config.
  - `max_batch_size`: Max CUDA graph batch size to capture.
  - `padding_enabled`: Whether to enable CUDA graph padding.
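
Before launching the server, it can be useful to sanity-check that the file parses with the nesting shown above. A minimal sketch, assuming `python3` with PyYAML is available (as it is in the TensorRT-LLM container):

```bash
# Verify the YAML parses and the keys are nested as expected.
# Expected output:
# {'stream_interval': 2, 'cuda_graph_config': {'max_batch_size': 1024, 'padding_enabled': True}}
python3 -c "import yaml; print(yaml.safe_load(open('extra-llm-api-config.yml')))"
```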
### Launch trtllm-serve OpenAI-compatible API server
TensorRT-LLM supports FP8 checkpoints quantized with the NVIDIA TensorRT Model Optimizer.
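
As an illustration, the server can be launched with the extra config file prepared above. This is a minimal sketch, not a prescribed command: the checkpoint name `nvidia/Llama-3.3-70B-Instruct-FP8`, the 4-GPU tensor-parallel setup, and the batch and memory settings are assumptions, and flag names may differ across TensorRT-LLM versions.

```bash
# Sketch: serve the FP8 checkpoint on the PyTorch backend with the extra
# LLM API options. Checkpoint, tp_size, and batch/memory limits are
# illustrative assumptions; adjust them for your hardware.
trtllm-serve nvidia/Llama-3.3-70B-Instruct-FP8 \
    --backend pytorch \
    --tp_size 4 \
    --max_batch_size 1024 \
    --kv_cache_free_gpu_memory_fraction 0.9 \
    --extra_llm_api_options ./extra-llm-api-config.yml
```

Once the server reports that it is ready, it exposes an OpenAI-compatible API (port 8000 assumed here as the default). A hypothetical smoke test:

```bash
# Send a small chat completion request to confirm the server responds.
curl -s http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{"model": "nvidia/Llama-3.3-70B-Instruct-FP8",
         "messages": [{"role": "user", "content": "Hello!"}],
         "max_tokens": 32}'
```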