DeepSeek-R1-0528 is a reasoning-focused Mixture-of-Experts (MoE) large language model developed by DeepSeek. It features Multi-head Latent Attention (MLA) with low-rank compressed Q/KV projections and Multi-Token Prediction (MTP) for speculative decoding. ATOM provides built-in support for both the BF16 and MXFP4-quantized variants.
Pull the latest Docker image from https://hub.docker.com/r/rocm/atom/:

```bash
docker pull rocm/atom:latest
```

All the operations below will be executed inside the container.
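If you have not started the container yet, a typical invocation looks like the sketch below, using the standard ROCm device flags; the `--shm-size` value is an assumption, so check the image documentation for the recommended options:

```shell
# Start the ATOM container with GPU access (standard ROCm device flags).
docker run -it --rm \
  --device=/dev/kfd --device=/dev/dri \
  --group-add video --ipc=host \
  --shm-size 32g \
  rocm/atom:latest
```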
To serve the BF16 model:

```bash
python -m atom.entrypoints.openai_server \
    --model deepseek-ai/DeepSeek-R1-0528 \
    --kv_cache_dtype fp8 -tp 8
```

MTP provides a ~60% throughput improvement with 3 speculative tokens:

```bash
python -m atom.entrypoints.openai_server \
    --model deepseek-ai/DeepSeek-R1-0528 \
    --kv_cache_dtype fp8 -tp 8 \
    --method mtp --num-speculative-tokens 3
```

To serve the MXFP4-quantized variant:

```bash
python -m atom.entrypoints.openai_server \
    --model amd/DeepSeek-R1-0528-MXFP4 \
    --kv_cache_dtype fp8 -tp 8
```

MXFP4 with MTP:

```bash
python -m atom.entrypoints.openai_server \
    --model amd/DeepSeek-R1-0528-MXFP4 \
    --kv_cache_dtype fp8 -tp 8 \
    --method mtp --num-speculative-tokens 3
```

Tips on server configuration:

- Always use `--kv_cache_dtype fp8` for better memory efficiency.
- MTP with `--num-speculative-tokens 3` provides the best throughput/latency tradeoff; `--num-speculative-tokens 1` is more conservative, with lower overhead per step.
- Set `AITER_LOG_LEVEL=WARNING` before starting the server to suppress aiter kernel log noise.
- Clear the compile cache before restarting: `rm -rf /root/.cache/atom/*`
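As rough intuition for the ~60% MTP gain (an illustrative calculation, not part of ATOM): with `k` speculative tokens and a per-token acceptance rate `a`, speculative decoding emits on average `1 + a + ... + a^k` tokens per verification step. The 0.7 acceptance rate below is an assumed value for illustration:

```python
def expected_tokens_per_step(k: int, accept_rate: float) -> float:
    """Expected tokens emitted per decode step with k speculative tokens,
    assuming each draft token is accepted independently with probability
    `accept_rate` (standard speculative-decoding approximation): the
    accepted draft prefix plus the one token the verifier always emits."""
    return sum(accept_rate ** i for i in range(k + 1))

baseline = expected_tokens_per_step(0, 0.7)  # 1.0: no speculation
with_mtp = expected_tokens_per_step(3, 0.7)  # 1 + 0.7 + 0.49 + 0.343
print(f"tokens per step: {with_mtp:.2f} (upper bound on speedup)")
```

The result is an upper bound: each verification step also carries the cost of computing the draft tokens, which is why the realized gain (~60%) is well below the raw tokens-per-step ratio.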
The following script can be used to benchmark performance (set `ISL`, `OSL`, and `CONC` to the desired input length, output length, and request concurrency):

```bash
python -m atom.benchmarks.benchmark_serving \
    --model=deepseek-ai/DeepSeek-R1-0528 --backend=vllm --base-url=http://localhost:8000 \
    --dataset-name=random \
    --random-input-len=${ISL} --random-output-len=${OSL} \
    --random-range-ratio=0.8 \
    --num-prompts=$(( $CONC * 10 )) \
    --max-concurrency=$CONC \
    --request-rate=inf --ignore-eos \
    --save-result --percentile-metrics="ttft,tpot,itl,e2el"
```

Performance on 8x MI300X GPUs with the following environment:

- Docker image: rocm/atom:latest
- ATOM: main branch
Baseline (no MTP):

| ISL | OSL | Concurrency | Output Throughput (tok/s) | Total Throughput (tok/s) | Mean TPOT (ms) |
|---|---|---|---|---|---|
| 1024 | 1024 | 128 | 4,274 | 8,558 | 28.8 |
| 1024 | 1024 | 256 | 6,039 | 12,071 | 40.8 |

With MTP (3 speculative tokens):

| ISL | OSL | Concurrency | Output Throughput (tok/s) | Total Throughput (tok/s) | Mean TPOT (ms) |
|---|---|---|---|---|---|
| 1024 | 1024 | 128 | 6,913 | 13,856 | 17.5 |
| 1024 | 1024 | 256 | 7,284 | 14,583 | 33.0 |
Live performance tracking: rocm.github.io/ATOM/benchmark-dashboard
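As a quick sanity check on the tables above (illustrative only): in steady state, output throughput is bounded by concurrency divided by mean TPOT, and the reported numbers sit a few percent below this bound because of TTFT and scheduling gaps:

```python
def output_throughput_bound(concurrency: int, mean_tpot_ms: float) -> float:
    """Steady-state upper bound: each of `concurrency` streams emits
    one token every `mean_tpot_ms` milliseconds."""
    return concurrency / (mean_tpot_ms / 1000.0)

# (concurrency, mean TPOT ms, reported output tok/s) from the tables above
rows = [(128, 28.8, 4274), (256, 40.8, 6039), (128, 17.5, 6913), (256, 33.0, 7284)]
for conc, tpot, reported in rows:
    bound = output_throughput_bound(conc, tpot)
    print(f"conc={conc}: bound {bound:.0f} tok/s vs reported {reported}")
```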
We verified lm_eval accuracy on the GSM8K dataset with the following command:

```bash
lm_eval \
    --model local-completions \
    --model_args model=deepseek-ai/DeepSeek-R1-0528,base_url=http://localhost:8000/v1/completions,num_concurrent=64,max_retries=3,tokenized_requests=False \
    --tasks gsm8k \
    --num_fewshot 5
```

Reference accuracy on 8 GPUs (BF16, FP8 KV cache):
|Tasks|Version| Filter |n-shot| Metric | |Value | |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k| 3|flexible-extract| 5|exact_match|↑ |0.9553|± |0.0057|
| | |strict-match | 5|exact_match|↑ |0.9538|± |0.0058|
CI accuracy threshold: flexible-extract ≥ 0.94 (BF16), ≥ 0.93 (MXFP4).
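A minimal sketch of such a CI gate, assuming lm_eval's JSON output format (v0.4-style `"metric,filter"` result keys, written via `--output_path`; verify against your harness version):

```python
# Thresholds from the CI policy above.
THRESHOLDS = {"bf16": 0.94, "mxfp4": 0.93}

def gsm8k_passes(results: dict, variant: str) -> bool:
    """Check the flexible-extract exact_match score against the
    per-variant threshold. The "metric,filter" key format follows
    lm-evaluation-harness v0.4 output (assumed here)."""
    score = results["results"]["gsm8k"]["exact_match,flexible-extract"]
    return score >= THRESHOLDS[variant]

# Reference BF16 run from the table above.
report = {"results": {"gsm8k": {"exact_match,flexible-extract": 0.9553}}}
print(gsm8k_passes(report, "bf16"))  # True: 0.9553 >= 0.94
```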