Description
I want to reproduce the DeepSeek-R1-FP4 deployment on B200 to match this blog post: https://developer.nvidia.com/blog/nvidia-blackwell-delivers-world-record-deepseek-r1-inference-performance
However, I only get about 40 output tokens per second per user, compared with the 253 mentioned in the blog. That is a huge gap.
Here is how I deployed it on B200:
1. The latest official image (nvcr.io/nvidia/tritonserver:25.02-trtllm-python-py3) does not support the DeepSeek-V3 model, so I had to download the latest source of the main branch from https://github.com/triton-inference-server/tensorrtllm_backend.git and build the image from scratch with:

   ```bash
   DOCKER_BUILDKIT=1 docker build -t tritonserver_trtllm -f dockerfile/Dockerfile.triton.trt_llm_backend .
   ```

2. Set up the docker container (a rough container-launch sketch follows this list) and launch trtllm-serve inside it:

   ```bash
   echo -e "enable_attention_dp: true\npytorch_backend_config:\n enable_overlap_scheduler: true\n print_iter_log: true\n use_cuda_graph: true\n cuda_graph_padding_enabled: true\n cuda_graph_batch_sizes: [1, 512]" > extra-llm-api-config.yml

   trtllm-serve nvidia/DeepSeek-R1-FP4 --backend pytorch --max_batch_size 512 --max_num_tokens 1560 --tp_size 8 --pp_size 1 --ep_size 8 --kv_cache_free_gpu_memory_fraction 0.90 --extra_llm_api_options ./extra-llm-api-config.yml
   ```
   With the engine up, I constructed hundreds of requests with input length 1000 and output length 1000 and sent them to the engine at different batch sizes (a rough version of the timing client is sketched after this list); the average output speed is about 40 tokens per second per request.
3. I also ran trtllm-bench following the official doc: https://github.com/NVIDIA/TensorRT-LLM/blob/main/examples/deepseek_v3/README.md#running-the-benchmark . That run only reached about 7 output tokens per second per user.
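
For reference, the container launch in step 2 looked roughly like the sketch below. This is a minimal sketch, not the exact command: the GPU, shared-memory, port, and volume flags are assumptions for a typical 8-GPU node, and 8000 is assumed to be the default trtllm-serve port.

```bash
# Sketch only (assumed flags for an 8xB200 node): expose all GPUs, raise shared-memory
# limits for NCCL, publish the assumed default trtllm-serve port, and reuse cached
# Hugging Face weights; the image is the one built in step 1.
docker run --rm -it --gpus all \
  --ipc=host --ulimit memlock=-1 --ulimit stack=67108864 \
  -p 8000:8000 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  tritonserver_trtllm bash
```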
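
The per-request speed in step 2 was measured along the lines of the sketch below: time one request against the OpenAI-compatible /v1/completions route that trtllm-serve exposes and divide the completion-token count by the wall-clock time. This is a minimal single-request sketch, not my actual client; the host/port and the presence of the standard usage.completion_tokens field in the response are assumptions, and the real test used ~1000-token prompts with hundreds of concurrent requests.

```bash
# Minimal timing sketch (assumed default localhost:8000 and standard OpenAI-style
# "usage" field in the response); the real test ran many such requests concurrently.
start=$(date +%s.%N)
curl -s http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "nvidia/DeepSeek-R1-FP4", "prompt": "Explain the history of GPUs.", "max_tokens": 1000}' \
  -o response.json
end=$(date +%s.%N)
# Completion tokens divided by wall-clock time: a rough per-request tokens/s estimate
# (it includes prefill time, so it slightly understates pure decode speed).
tokens=$(python3 -c "import json; print(json.load(open('response.json'))['usage']['completion_tokens'])")
python3 -c "print(f'output tokens/s for this request: {$tokens / ($end - $start):.1f}')"
```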
Is my method wrong? What is the correct method and configuration to serve DeepSeek-R1-FP4 on 8x B200? Any help is appreciated!
