
Commit 24dfd4c

Doc: Update llama-3.3-70B guide (NVIDIA#6028)
Signed-off-by: jiahanc <[email protected]>
Parent: dd2491f


examples/models/core/llama/README.md

Lines changed: 52 additions & 0 deletions
@@ -37,6 +37,10 @@ This document shows how to build and run a LLaMA model in TensorRT-LLM on both s
- [Convert Checkpoint to TensorRT-LLM Unified Checkpoint](#convert-checkpoint-to-tensorrt-llm-unified-checkpoint)
- [Build Engine](#build-engine)
- [Run Inference](#run-inference)
- [Run LLaMa-3.3 70B Model on PyTorch Backend](#run-llama-33-70b-model-on-pytorch-backend)
- [Prepare TensorRT-LLM extra configs](#prepare-tensorrt-llm-extra-configs)
- [Launch trtllm-serve OpenAI-compatible API server](#launch-trtllm-serve-openai-compatible-api-server)
- [Run performance benchmarks](#run-performance-benchmarks)

## Overview

@@ -1542,3 +1546,51 @@ bash -c 'python ./examples/mmlu.py --test_trt_llm \
--kv_cache_free_gpu_memory_fraction 0.999 \
--max_tokens_in_paged_kv_cache 65064'
```

## Run LLaMa-3.3 70B Model on PyTorch Backend
This section provides the steps to run the LLaMa-3.3 70B model in FP8 precision on the PyTorch backend by launching the TensorRT-LLM server and running performance benchmarks.

### Prepare TensorRT-LLM extra configs
```bash
cat >./extra-llm-api-config.yml <<EOF
stream_interval: 2
cuda_graph_config:
  max_batch_size: 1024
  padding_enabled: true
EOF
```
Explanation:
- `stream_interval`: The iteration interval at which responses are created in streaming mode.
- `cuda_graph_config`: CUDA graph config.
  - `max_batch_size`: Maximum CUDA graph batch size to capture.
  - `padding_enabled`: Whether to enable CUDA graph padding.
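
Before launching the server, you can optionally parse the file back to confirm the heredoc produced the YAML structure you expect. This is a quick sanity check, assuming PyYAML is available in your Python environment (it normally ships with TensorRT-LLM's dependencies).
```bash
# Optional sanity check: parse and pretty-print the generated config.
# Assumes PyYAML is installed in the active Python environment.
python -c "import yaml, pprint; pprint.pprint(yaml.safe_load(open('extra-llm-api-config.yml')))"
```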

### Launch trtllm-serve OpenAI-compatible API server
TensorRT-LLM supports the NVIDIA TensorRT Model Optimizer quantized FP8 checkpoint.
```bash
trtllm-serve nvidia/Llama-3.3-70B-Instruct-FP8 \
    --backend pytorch \
    --tp_size 8 \
    --max_batch_size 1024 \
    --trust_remote_code \
    --num_postprocess_workers 2 \
    --extra_llm_api_options ./extra-llm-api-config.yml
```
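
Once the server is up, you can smoke-test the OpenAI-compatible endpoint with a standard chat-completion request. This is a minimal sketch that assumes the server is listening on the default `localhost:8000`; adjust the host, port, model name, and prompt as needed.
```bash
# Minimal smoke test against the OpenAI-compatible chat completions endpoint.
# Assumes the default listen address (localhost:8000); adjust if you configured another.
curl -s http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
          "model": "nvidia/Llama-3.3-70B-Instruct-FP8",
          "messages": [{"role": "user", "content": "Summarize what TensorRT-LLM does in one sentence."}],
          "max_tokens": 64
        }'
```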

### Run performance benchmarks
TensorRT-LLM provides a benchmark tool to benchmark `trtllm-serve`.

Open a new terminal and run `benchmark_serving`.
```bash
python -m tensorrt_llm.serve.scripts.benchmark_serving \
    --model nvidia/Llama-3.3-70B-Instruct-FP8 \
    --dataset-name random \
    --ignore-eos \
    --num-prompts 8192 \
    --random-input-len 1024 \
    --random-output-len 2048 \
    --random-ids \
    --max-concurrency 1024
```
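
To see how throughput and latency scale with load, one option is to sweep several concurrency levels with the same random dataset settings. The sketch below only reuses the flags shown above; the concurrency values and per-run prompt counts are arbitrary choices for illustration.
```bash
# Sweep a few concurrency levels with identical dataset settings.
# Concurrency values and prompt counts below are illustrative only.
for concurrency in 64 256 1024; do
    python -m tensorrt_llm.serve.scripts.benchmark_serving \
        --model nvidia/Llama-3.3-70B-Instruct-FP8 \
        --dataset-name random \
        --ignore-eos \
        --num-prompts $((concurrency * 8)) \
        --random-input-len 1024 \
        --random-output-len 2048 \
        --random-ids \
        --max-concurrency "$concurrency"
done
```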
