
[Bug]: Major DeepSeek performance regression between 1.2.0rc1 and 1.2.0rc2 on Blackwell #9373

@pathorn

Description


System Info

Occurs in TRTLLM: v1.2.0rc2
Does not occur in TRTLLM: v1.2.0rc1

CPU: reproduced on 6960P and 6747P
GPU: Occurs on both NVIDIA B200 and NVIDIA B300
Driver: reproduced on both 570.172.08 and 580.105.08
OS: Ubuntu 24.04

Who can help?

@laikhtewari

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

Download the weights for nvidia/DeepSeek-V3-0324-NVFP4. In my case, I ran:

import huggingface_hub
huggingface_hub.snapshot_download('nvidia/DeepSeek-V3-0324-NVFP4', local_dir='data/deepseek-fp4')

Run from docker
container=nvcr.io/nvidia/tensorrt-llm/release:1.2.0rc2; sudo docker pull $container; sudo docker run -it --gpus all --shm-size 64g -p 8000:8000 -p 8002:8002 -v /data:/data --entrypoint /bin/bash $container

Write the extra options file:

cat > extra.yml <<'EOF'
print_iter_log: true
speculative_config:
  decoding_type: MTP
  num_nextn_predict_layers: 1
enable_iter_perf_stats: true
kv_cache_config:
  dtype: fp8
  free_gpu_memory_fraction: 0.8
moe_config:
  backend: TRTLLM
EOF

Run the server

trtllm-serve serve /data/deepseek-fp4 --tp_size=8 --backend=pytorch --host=0.0.0.0 --port=8000 --max_batch_size=192 --max_seq_len=163840 --max_num_tokens=32768 --ep_size=8 --extra_llm_api_options=extra.yml

Download the following two files:

api_client_bench_lite.py
war_and_peace.txt

Then, from another terminal, run several requests in parallel to measure tokens per second. In this 30-second test, we're looking for the "otps" value after several requests have finished, roughly in the middle of the test.

python api_client_bench_lite.py --host localhost --port 8000 --type openai --model deepseek/DeepSeek-R1 --prompt-file war_and_peace.txt --prompt-words 2750 --max-new-tokens 200 --conc 1 --sleep-time 0.25 --test-time 30 --stats-brief

Make sure to run this twice: the first run's results are skewed by CUDA compilation delay.

To read the output, ignore the first requests, then take the steady-state "otps" value from the individual request lines. In this example, it is about 73.3 otps:
EXAMPLE OUTPUT:
req 49.0 ->
req 39.0 4448 in 200 out in 2.73s bs 10.00 1704.05 iotps 73.32 otps
req 50.0 ->
req 40.0 4450 in 200 out in 2.73s bs 10.00 1702.41 iotps 73.22 otps
req 51.0 ->
req 41.0 4448 in 200 out in 2.73s bs 10.00 1705.38 iotps 73.38 otps
req 52.0 ->
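The steady-state otps can be extracted from lines like these with a short script. A minimal sketch, assuming the completed-request lines always end in "<value> otps" as in the output above (the `parse_otps` helper is illustrative, not part of api_client_bench_lite.py):

```python
import re

def parse_otps(lines):
    """Pull the output-tokens-per-second figure from completed-request lines.

    Completed requests end in "<value> otps"; in-flight lines like
    "req 49.0 ->" carry no otps field and are skipped.
    """
    vals = []
    for line in lines:
        m = re.search(r"(\d+(?:\.\d+)?)\s+otps\s*$", line)
        if m:
            vals.append(float(m.group(1)))
    return vals

sample = [
    "req 49.0 ->",
    "req 39.0 4448 in 200 out in 2.73s bs 10.00 1704.05 iotps 73.32 otps",
    "req 50.0 ->",
    "req 40.0 4450 in 200 out in 2.73s bs 10.00 1702.41 iotps 73.22 otps",
]
vals = parse_otps(sample)
print(f"steady-state otps: {sum(vals) / len(vals):.2f}")  # mean of 73.32 and 73.22
```

Averaging over only the middle-of-test requests avoids both the warm-up skew at the start and the draining batch at the end.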

Expected behavior

On a B200 system running DeepSeek FP4, I get 114 otps on TensorRT-LLM 1.2.0rc1.

Actual behavior

On the same B200 system running DeepSeek FP4, I get only 63 otps on TensorRT-LLM 1.2.0rc2.
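For scale, the two measurements above work out to roughly a 45% throughput drop between the release candidates:

```python
# Steady-state throughput on the same B200 system (figures from this report).
rc1_otps = 114.0  # TensorRT-LLM 1.2.0rc1
rc2_otps = 63.0   # TensorRT-LLM 1.2.0rc2
regression = (rc1_otps - rc2_otps) / rc1_otps
print(f"throughput drop: {regression:.1%}")  # prints "throughput drop: 44.7%"
```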

Additional notes

I discovered this while trying to reproduce issue #9218 on a B300 system (also tested on 1.2.0rc2), and then found that the regression occurs on B200 with 1.2.0rc2 as well.

Before submitting a new issue...

  • Make sure you already searched for relevant issues, and checked the documentation and examples for answers to frequently asked questions.

Metadata

Labels

General perf<NV> — Broad performance issues not specific to a particular component
bug — Something isn't working
