Description
System Info
Occurs in TRTLLM: v1.2.0rc2
Does not occur in TRTLLM: v1.2.0rc1
CPU: reproduced on 6960P and 6747P
GPU: Occurs on both NVIDIA B200 and NVIDIA B300
Driver: reproduced on both 570.172.08 and 580.105.08
OS: Ubuntu 24.04
Who can help?
Information
- The official example scripts
- My own modified scripts
Tasks
- An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
- My own task or dataset (give details below)
Reproduction
Download the weights for nvidia/DeepSeek-V3-0324-NVFP4. In my case, I ran import huggingface_hub; huggingface_hub.snapshot_download('nvidia/DeepSeek-V3-0324-NVFP4', local_dir='data/deepseek-fp4')
Run the Docker container:
container=nvcr.io/nvidia/tensorrt-llm/release:1.2.0rc2; sudo docker pull $container; sudo docker run -it --gpus all --shm-size 64g -p 8000:8000 -p 8002:8002 -v /data:/data --entrypoint /bin/bash $container
cat > extra.yml << 'EOF'
print_iter_log: true
speculative_config:
  decoding_type: MTP
  num_nextn_predict_layers: 1
enable_iter_perf_stats: true
kv_cache_config:
  dtype: fp8
  free_gpu_memory_fraction: 0.8
moe_config:
  backend: TRTLLM
EOF
Run the server (the original command passed --max_batch_size=192 twice; it is listed once here):
trtllm-serve serve /data/deepseek-fp4 --tp_size=8 --backend=pytorch --host=0.0.0.0 --port=8000 --max_batch_size=192 --max_seq_len=163840 --max_num_tokens=32768 --ep_size=8 --extra_llm_api_options=extra.yml
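Before starting the benchmark, it can help to confirm the server is actually up. A minimal readiness probe, assuming trtllm-serve exposes the OpenAI-compatible /v1/models endpoint on the host/port used above:

```python
import urllib.request

def server_ready(host: str = "localhost", port: int = 8000) -> bool:
    """Return True once the server answers on /v1/models."""
    try:
        with urllib.request.urlopen(
            f"http://{host}:{port}/v1/models", timeout=5
        ) as resp:
            return resp.status == 200
    except OSError:  # connection refused, timeout, DNS failure, etc.
        return False

if __name__ == "__main__":
    print("ready" if server_ready() else "not ready")
```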
Download the following two files:
api_client_bench_lite.py
war_and_peace.txt
Then, from another terminal, run several requests in parallel to measure tokens per second. In this 30-second test, we're looking for the "otps" value from individual requests that finish mid-test, once throughput has reached steady state.
python api_client_bench_lite.py --host localhost --port 8000 --type openai --model deepseek/DeepSeek-R1 --prompt-file war_and_peace.txt --prompt-words 2750 --max-new-tokens 200 --conc 1 --sleep-time 0.25 --test-time 30 --stats-brief;
Make sure to run this twice; the first run's results are skewed by CUDA compilation delay.
To read the output, ignore the first requests, then copy the steady-state "otps" from the individual request lines. In this example, it is about 73.3 otps:
EXAMPLE OUTPUT:
req 49.0 ->
req 39.0 4448 in 200 out in 2.73s bs 10.00 1704.05 iotps 73.32 otps
req 50.0 ->
req 40.0 4450 in 200 out in 2.73s bs 10.00 1702.41 iotps 73.22 otps
req 51.0 ->
req 41.0 4448 in 200 out in 2.73s bs 10.00 1705.38 iotps 73.38 otps
req 52.0 ->
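To automate the read-off, a small parser for the per-request lines can average the steady-state otps. This is a sketch: the field layout is assumed from the example output above, not taken from the api_client_bench_lite.py source.

```python
import re

# Matches the trailing "... <iotps> iotps <otps> otps" fields of a
# completed-request line in the example output.
LINE_RE = re.compile(r"iotps\s+([\d.]+)\s+otps")

def steady_state_otps(lines, skip=2):
    """Average otps over completed-request lines, skipping warm-up requests."""
    vals = [float(m.group(1)) for line in lines if (m := LINE_RE.search(line))]
    if len(vals) <= skip:
        return None
    return sum(vals[skip:]) / len(vals[skip:])

sample = [
    "req 39.0 4448 in 200 out in 2.73s bs 10.00 1704.05 iotps 73.32 otps",
    "req 40.0 4450 in 200 out in 2.73s bs 10.00 1702.41 iotps 73.22 otps",
    "req 41.0 4448 in 200 out in 2.73s bs 10.00 1705.38 iotps 73.38 otps",
]
print(steady_state_otps(sample, skip=0))
```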
Expected behavior
On a B200 system running DeepSeek FP4, I get 114 otps on TensorRT-LLM 1.2.0rc1.
Actual behavior
On the same B200 system running DeepSeek FP4, I get only 63 otps on TensorRT-LLM 1.2.0rc2.
Additional notes
I discovered this while trying to reproduce issue #9218 on a B300 system (also tested on 1.2.0rc2), and then found that the problem occurs on B200 with 1.2.0rc2 as well.
Before submitting a new issue...
- Make sure you already searched for relevant issues, and checked the documentation and examples for answers to frequently asked questions.