
[Bug]: Major DeepSeek performance regression between 1.2.0rc1 and 1.2.0rc2 on Blackwell #9373

@pathorn

Description


System Info

Occurs in TRTLLM: v1.2.0rc2
Does not occur in TRTLLM: v1.2.0rc1

CPU: reproduced on 6960P and 6747P
GPU: Occurs on both NVIDIA B200 and NVIDIA B300
Driver: reproduced on both 570.172.08 and 580.105.08
OS: Ubuntu 24.04

Who can help?

@laikhtewari

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

Download the weights for nvidia/DeepSeek-V3-0324-NVFP4. In my case, I ran:

import huggingface_hub
huggingface_hub.snapshot_download('nvidia/DeepSeek-V3-0324-NVFP4', local_dir='data/deepseek-fp4')

Run from docker
container=nvcr.io/nvidia/tensorrt-llm/release:1.2.0rc2; sudo docker pull $container; sudo docker run -it --gpus all --shm-size 64g -p 8000:8000 -p 8002:8002 -v /data:/data --entrypoint /bin/bash $container

Write the extra options file:

cat > extra.yml <<'EOF'
print_iter_log: true
speculative_config:
  decoding_type: MTP
  num_nextn_predict_layers: 1
enable_iter_perf_stats: true
kv_cache_config:
  dtype: fp8
  free_gpu_memory_fraction: 0.8
moe_config:
  backend: TRTLLM
EOF

Run the server

trtllm-serve serve /data/deepseek-fp4 --tp_size=8 --backend=pytorch --host=0.0.0.0 --port=8000 --max_batch_size=192 --max_seq_len=163840 --max_num_tokens=32768 --ep_size=8 --extra_llm_api_options=extra.yml

Download the following two files:

api_client_bench_lite.py
war_and_peace.txt

Then, from another terminal, run several requests in parallel to measure tokens per second. In this 30-second test, we're looking for the "otps" value after several requests have finished, roughly in the middle of the test.

python api_client_bench_lite.py --host localhost --port 8000 --type openai --model deepseek/DeepSeek-R1 --prompt-file war_and_peace.txt --prompt-words 2750 --max-new-tokens 200 --conc 1 --sleep-time 0.25 --test-time 30 --stats-brief

Make sure to run this twice: the first run's results are skewed by CUDA compilation delay.

To read the output, ignore the first requests, then take the steady-state "otps" value from the individual request lines. In this example, it is about 73.3 otps:
EXAMPLE OUTPUT:
req 49.0 ->
req 39.0 4448 in 200 out in 2.73s bs 10.00 1704.05 iotps 73.32 otps
req 50.0 ->
req 40.0 4450 in 200 out in 2.73s bs 10.00 1702.41 iotps 73.22 otps
req 51.0 ->
req 41.0 4448 in 200 out in 2.73s bs 10.00 1705.38 iotps 73.38 otps
req 52.0 ->
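The steady-state otps can be extracted from lines like these with a short script. A minimal sketch, assuming the completed-request lines always end in "<value> otps" as in the output above (the `parse_otps` helper is illustrative, not part of api_client_bench_lite.py):

```python
import re

def parse_otps(lines):
    """Pull the output-tokens-per-second figure from completed-request lines.

    Completed requests end in "<value> otps"; in-flight lines like
    "req 49.0 ->" carry no otps field and are skipped.
    """
    vals = []
    for line in lines:
        m = re.search(r"(\d+(?:\.\d+)?)\s+otps\s*$", line)
        if m:
            vals.append(float(m.group(1)))
    return vals

sample = [
    "req 49.0 ->",
    "req 39.0 4448 in 200 out in 2.73s bs 10.00 1704.05 iotps 73.32 otps",
    "req 50.0 ->",
    "req 40.0 4450 in 200 out in 2.73s bs 10.00 1702.41 iotps 73.22 otps",
]
vals = parse_otps(sample)
print(f"steady-state otps: {sum(vals) / len(vals):.2f}")  # mean of 73.32 and 73.22
```

Averaging over only the middle-of-test requests avoids both the warm-up skew at the start and the draining batch at the end.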

Expected behavior

On a B200 system running DeepSeek FP4, I get 114 otps on TensorRT-LLM 1.2.0rc1.

Actual behavior

On the same B200 system running DeepSeek FP4, I get only 63 otps on TensorRT-LLM 1.2.0rc2.
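For scale, the two measurements above work out to roughly a 45% throughput drop between the release candidates:

```python
# Steady-state throughput on the same B200 system (figures from this report).
rc1_otps = 114.0  # TensorRT-LLM 1.2.0rc1
rc2_otps = 63.0   # TensorRT-LLM 1.2.0rc2
regression = (rc1_otps - rc2_otps) / rc1_otps
print(f"throughput drop: {regression:.1%}")  # prints "throughput drop: 44.7%"
```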

Additional notes

I discovered this while trying to reproduce issue #9218 on a B300 system (also tested on 1.2.0rc2), and then found that the regression occurs on B200 with 1.2.0rc2 as well.

Before submitting a new issue...

  • Make sure you already searched for relevant issues, and checked the documentation and examples for answers to frequently asked questions.

Metadata

Labels

General perf<NV> — Broad performance issues not specific to a particular component
bug — Something isn't working
