Description
I want to reproduce the DeepSeek-R1-FP4 deployment on B200 to match this blog post: https://developer.nvidia.com/blog/nvidia-blackwell-delivers-world-record-deepseek-r1-inference-performance
However, I only get about 40 output tokens per second per user, compared with the 253 mentioned in the blog. That is a huge gap.
Here is how I deployed it on B200:
1. The latest official image (nvcr.io/nvidia/tritonserver:25.02-trtllm-python-py3) does not support the DeepSeek-V3 model, so I had to download the latest source of the main branch from https://github.com/triton-inference-server/tensorrtllm_backend.git and build the image from scratch with:

   ```bash
   DOCKER_BUILDKIT=1 docker build -t tritonserver_trtllm -f dockerfile/Dockerfile.triton.trt_llm_backend .
   ```

2. Set up the docker container (a rough container-launch sketch follows this list) and launch trtllm-serve inside it:

   ```bash
   echo -e "enable_attention_dp: true\npytorch_backend_config:\n enable_overlap_scheduler: true\n print_iter_log: true\n use_cuda_graph: true\n cuda_graph_padding_enabled: true\n cuda_graph_batch_sizes: [1, 512]" > extra-llm-api-config.yml

   trtllm-serve nvidia/DeepSeek-R1-FP4 --backend pytorch --max_batch_size 512 --max_num_tokens 1560 --tp_size 8 --pp_size 1 --ep_size 8 --kv_cache_free_gpu_memory_fraction 0.90 --extra_llm_api_options ./extra-llm-api-config.yml
   ```
   With the engine up, I constructed hundreds of requests with input length 1000 and output length 1000 and sent them to the engine at different batch sizes (a rough version of the timing client is sketched after this list); the average output speed is about 40 tokens per second per request.
3. I also ran trtllm-bench following the official doc: https://github.com/NVIDIA/TensorRT-LLM/blob/main/examples/deepseek_v3/README.md#running-the-benchmark . That run only reached about 7 output tokens per second per user.
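
For reference, the container launch in step 2 looked roughly like the sketch below. This is a minimal sketch, not the exact command: the GPU, shared-memory, port, and volume flags are assumptions for a typical 8-GPU node, and 8000 is assumed to be the default trtllm-serve port.

```bash
# Sketch only (assumed flags for an 8xB200 node): expose all GPUs, raise shared-memory
# limits for NCCL, publish the assumed default trtllm-serve port, and reuse cached
# Hugging Face weights; the image is the one built in step 1.
docker run --rm -it --gpus all \
  --ipc=host --ulimit memlock=-1 --ulimit stack=67108864 \
  -p 8000:8000 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  tritonserver_trtllm bash
```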
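
The per-request speed in step 2 was measured along the lines of the sketch below: time one request against the OpenAI-compatible /v1/completions route that trtllm-serve exposes and divide the completion-token count by the wall-clock time. This is a minimal single-request sketch, not my actual client; the host/port and the presence of the standard usage.completion_tokens field in the response are assumptions, and the real test used ~1000-token prompts with hundreds of concurrent requests.

```bash
# Minimal timing sketch (assumed default localhost:8000 and standard OpenAI-style
# "usage" field in the response); the real test ran many such requests concurrently.
start=$(date +%s.%N)
curl -s http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "nvidia/DeepSeek-R1-FP4", "prompt": "Explain the history of GPUs.", "max_tokens": 1000}' \
  -o response.json
end=$(date +%s.%N)
# Completion tokens divided by wall-clock time: a rough per-request tokens/s estimate
# (it includes prefill time, so it slightly understates pure decode speed).
tokens=$(python3 -c "import json; print(json.load(open('response.json'))['usage']['completion_tokens'])")
python3 -c "print(f'output tokens/s for this request: {$tokens / ($end - $start):.1f}')"
```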
Is my method wrong? What is the correct method and configuration to serve DeepSeek-R1-FP4 on 8x B200? Any help is appreciated!
