How to achieve 253 tok/sec with DeepSeek-R1-FP4 on 8xB200 #3058

@jeffye-dev

I want to reproduce the DeepSeek-R1-FP4 deployment on B200 described in this blog post: https://developer.nvidia.com/blog/nvidia-blackwell-delivers-world-record-deepseek-r1-inference-performance

However, I only get about 40 output tokens per second per user, compared with the 253 mentioned in the blog. That is a huge gap.
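To make the comparison concrete, here is a minimal sketch of the arithmetic, assuming "output tokens per second per user" means one request's output tokens divided by that request's wall-clock time. The helper name `per_user_tps` and the 25-second duration are hypothetical, chosen only to illustrate a 1000-token response at the measured speed; the blog may define its metric differently.

```shell
# Hypothetical helper: output speed of a single request, in tokens/sec.
# Assumption: "per user" = one request's output tokens / its wall-clock time.
per_user_tps() {
  # $1 = output tokens generated for one request, $2 = seconds it took
  awk -v tokens="$1" -v seconds="$2" 'BEGIN { printf "%.1f\n", tokens / seconds }'
}

# Illustrative values: 1000 output tokens finishing in 25 s
per_user_tps 1000 25    # prints 40.0

# Ratio between the blog's figure (253) and the measured one (40)
awk 'BEGIN { printf "%.1fx\n", 253 / 40 }'    # prints 6.3x
```

Computed this way, the measured speed is roughly a sixth of the published figure.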
Here is my method to deploy on B200:

  1. The latest official image (nvcr.io/nvidia/tritonserver:25.02-trtllm-python-py3) does not support the DeepSeek V3 model, so I downloaded the latest source from the main branch of https://github.com/triton-inference-server/tensorrtllm_backend.git and built it from scratch with this command:
DOCKER_BUILDKIT=1 docker build -t tritonserver_trtllm -f dockerfile/Dockerfile.triton.trt_llm_backend .
  2. Set up the Docker container and launch trtllm-serve inside it:
echo -e "enable_attention_dp: true\npytorch_backend_config:\n enable_overlap_scheduler: true\n print_iter_log: true\n use_cuda_graph: true\n cuda_graph_padding_enabled: true\n cuda_graph_batch_sizes: [1, 512]" > extra-llm-api-config.yml
trtllm-serve nvidia/DeepSeek-R1-FP4 --backend pytorch --max_batch_size 512 --max_num_tokens 1560 --tp_size 8 --pp_size 1 --ep_size 8 --kv_cache_free_gpu_memory_fraction 0.90 --extra_llm_api_options ./extra-llm-api-config.yml

With the engine up, I constructed hundreds of requests with an input length of 1000 and an output length of 1000 and sent them to the engine at different batch sizes. The average output speed is about 40 tokens per second per request.
  3. I also ran trtllm-bench following the official doc: https://github.com/NVIDIA/TensorRT-LLM/blob/main/examples/deepseek_v3/README.md#running-the-benchmark . There I only got 7 output tokens per second per user.
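For readability, the options file created by the `echo -e` one-liner in the trtllm-serve step can also be written with a heredoc. This is only a sketch of the same file: the keys and values are copied from that command, and the two-space indentation under `pytorch_backend_config` is an assumption (the one-liner uses a single space, which YAML parses the same way).

```shell
# Write the extra LLM API options file.
# Keys and values are taken verbatim from the echo -e command in the post;
# the two-space indent is an assumption made for readability.
cat > extra-llm-api-config.yml <<'EOF'
enable_attention_dp: true
pytorch_backend_config:
  enable_overlap_scheduler: true
  print_iter_log: true
  use_cuda_graph: true
  cuda_graph_padding_enabled: true
  cuda_graph_batch_sizes: [1, 512]
EOF

# Print the file back to check the nesting before passing it to trtllm-serve
cat extra-llm-api-config.yml
```

Written this way, it is easier to see that `cuda_graph_batch_sizes` lists only the batch sizes 1 and 512, and that every option except `enable_attention_dp` is nested under `pytorch_backend_config`.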

Is my method wrong? What is the correct method and configuration for serving DeepSeek-R1-FP4 on 8xB200? I would appreciate any help!
