120 changes: 120 additions & 0 deletions DeepSeek/AMD_GPU/README.md
@@ -0,0 +1,120 @@
## AMD GPU Installation and Benchmarking Guide
#### Support Matrix

Review comment (severity: medium):

This line has trailing whitespace. Several other lines in this file do as well (e.g., 14, 37, 55, 85, 89). Please remove it to improve formatting consistency.

Suggested change
#### Support Matrix
#### Support Matrix


##### GPU TYPE
MI300X
##### DATA TYPE
FP8

Comment on lines +4 to +7 (severity: medium):

The current formatting for the support matrix is a bit difficult to read and inconsistent. Using a markdown table would make this much clearer and more standard.

Suggested change
##### GPU TYPE
MI300X
##### DATA TYPE
FP8
| GPU TYPE | DATA TYPE |
|----------|-----------|
| MI300X | FP8 |


#### Step by Step Guide
Please follow the steps below to install and run DeepSeek-R1 models on AMD MI300X GPUs.
The model requires 8 × MI300X GPUs.

#### Step 1
Verify the GPU environment with `rocm-smi`:
```shell
================================================== ROCm System Management Interface ==================================================
============================================================ Concise Info ============================================================
Device Node IDs Temp Power Partitions SCLK MCLK Fan Perf PwrCap VRAM% GPU%
(DID, GUID) (Junction) (Socket) (Mem, Compute, ID)
======================================================================================================================================
0 9 0x74b5, 21947 51.0°C 163.0W NPS1, SPX, 0 144Mhz 900Mhz 0% perf_determinism 750.0W 0% 0%
1 8 0x74b5, 37820 45.0°C 154.0W NPS1, SPX, 0 141Mhz 900Mhz 0% perf_determinism 750.0W 0% 0%
2 7 0x74b5, 39350 46.0°C 163.0W NPS1, SPX, 0 142Mhz 900Mhz 0% perf_determinism 750.0W 0% 0%
3 6 0x74b5, 24497 53.0°C 172.0W NPS1, SPX, 0 142Mhz 900Mhz 0% perf_determinism 750.0W 0% 0%
4 5 0x74b5, 36258 51.0°C 169.0W NPS1, SPX, 0 145Mhz 900Mhz 0% perf_determinism 750.0W 0% 0%
5 4 0x74b5, 19365 44.0°C 158.0W NPS1, SPX, 0 148Mhz 900Mhz 0% perf_determinism 750.0W 0% 0%
6 3 0x74b5, 16815 53.0°C 167.0W NPS1, SPX, 0 141Mhz 900Mhz 0% perf_determinism 750.0W 0% 0%
7 2 0x74b5, 34728 46.0°C 165.0W NPS1, SPX, 0 141Mhz 900Mhz 0% perf_determinism 750.0W 0% 0%
======================================================================================================================================
```
Lock the GPU frequency:
```shell
rocm-smi --setperfdeterminism 1900
```
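To confirm the setting took effect, re-run the status query; the Perf column should read `perf_determinism`, as in the sample output above.
```shell
rocm-smi
```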

### Step 2

Review comment (severity: medium):

The heading level for 'Step 2' is ###, which is inconsistent with the other steps (e.g., 'Step 1', 'Step 3') that use ####. For consistency, please use #### for all steps.

Suggested change
### Step 2
#### Step 2

Launch the ROCm vLLM Docker container:
```shell
docker run -it --rm \
--cap-add=SYS_PTRACE \
-e SHELL=/bin/bash \
--network=host \
--security-opt seccomp=unconfined \
--device=/dev/kfd \
--device=/dev/dri \
-v /:/workspace \
--group-add video \
--ipc=host \
--name vllm_DS \
rocm/vllm:latest
```
Log in to Hugging Face:
```shell
pip install -U "huggingface_hub[cli]"
huggingface-cli login
```

Comment on lines +54 to +55 (severity: medium):

The commands in this code block have unnecessary indentation, which can be confusing. It's best to remove it for clarity.

Suggested change
pip install -U "huggingface_hub[cli]"
huggingface-cli login
pip install -U "huggingface_hub[cli]"
huggingface-cli login
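Optionally, the model weights can be pre-fetched before starting the server so the first launch does not block on the download. A minimal sketch, assuming the default Hugging Face cache location and enough free disk space (the FP8 checkpoint is several hundred GB):
```shell
huggingface-cli download deepseek-ai/DeepSeek-R1
```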
#### Step 3
##### FP8

Run the vLLM online serving.
Sample command:
```shell
NCCL_MIN_NCHANNELS=112 SAFETENSORS_FAST_GPU=1 VLLM_USE_V1=1 VLLM_ROCM_USE_AITER=1 VLLM_ROCM_USE_AITER_RMSNORM=0 VLLM_ROCM_USE_AITER_MHA=0 VLLM_V1_USE_PREFILL_DECODE_ATTENTION=1 \
vllm serve deepseek-ai/DeepSeek-R1 \
--tensor-parallel-size 8 \
--max-model-len 65536 \
--max-num-seqs 1024 \
--max-num-batched-tokens 32768 \
--disable-log-requests \
--block-size 1 \
--compilation-config '{"full_cuda_graph":false}' \
--trust-remote-code
```
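Once the server reports it is ready, a quick request against the OpenAI-compatible endpoint can confirm that serving works. A minimal sketch, assuming the default port 8000:
```shell
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "deepseek-ai/DeepSeek-R1",
        "messages": [{"role": "user", "content": "Say hello."}],
        "max_tokens": 32
      }'
```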

##### Tips: Users may modify the following parameters as needed.
--max-model-len=65536: A good sweet spot in most cases; preserves memory while still allowing long context.

--max-num-batched-tokens=32768: Balances throughput with manageable memory/latency.

If OOM errors or sluggish performance occur → decrease max-model-len (e.g., 32k or 8k) or reduce max-num-batched-tokens (e.g., 16k or 8k).For low latency needs, consider reducing max-num-batched-tokens.To maximize throughput and you have available VRAM, keep it high—but stay aware of latency trade-offs.

Review comment (severity: medium):

This paragraph is dense and hard to read. It would be better to use bullet points or line breaks to separate the different pieces of advice for tuning max-model-len and max-num-batched-tokens.

Suggested change
If OOM errors or sluggish performance occur → decrease max-model-len (e.g., 32k or 8k) or reduce max-num-batched-tokens (e.g., 16k or 8k).For low latency needs, consider reducing max-num-batched-tokens.To maximize throughput and you have available VRAM, keep it high—but stay aware of latency trade-offs.
If OOM errors or sluggish performance occur:
- Decrease `max-model-len` (e.g., to 32k or 8k).
- Reduce `max-num-batched-tokens` (e.g., to 16k or 8k).
For low latency needs, consider reducing `max-num-batched-tokens`.
To maximize throughput with available VRAM, keep `max-num-batched-tokens` high, but stay aware of latency trade-offs.


--max-num-seqs=1024: It affects throughput vs latency trade-offs:Higher values yield better throughput (more parallel requests) but may raise memory pressure and latency.Lower values reduce GPU memory footprint and latency, at the cost of throughput.

Review comment (severity: medium):

This paragraph about max-num-seqs is also quite dense. Using a list would make the trade-offs between higher and lower values much clearer to the user.

Suggested change
--max-num-seqs=1024: It affects throughput vs latency trade-offs:Higher values yield better throughput (more parallel requests) but may raise memory pressure and latency.Lower values reduce GPU memory footprint and latency, at the cost of throughput.
--max-num-seqs=1024: Affects throughput vs. latency trade-offs:
- **Higher values**: Yield better throughput (more parallel requests) but may increase memory pressure and latency.
- **Lower values**: Reduce GPU memory footprint and latency, at the cost of throughput.
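
Taken together, a lower-memory variant of the serve command above might look like the following sketch; the values are illustrative and should be tuned to the workload:
```shell
NCCL_MIN_NCHANNELS=112 SAFETENSORS_FAST_GPU=1 VLLM_USE_V1=1 VLLM_ROCM_USE_AITER=1 VLLM_ROCM_USE_AITER_RMSNORM=0 VLLM_ROCM_USE_AITER_MHA=0 VLLM_V1_USE_PREFILL_DECODE_ATTENTION=1 \
vllm serve deepseek-ai/DeepSeek-R1 \
    --tensor-parallel-size 8 \
    --max-model-len 32768 \
    --max-num-seqs 256 \
    --max-num-batched-tokens 16384 \
    --disable-log-requests \
    --block-size 1 \
    --compilation-config '{"full_cuda_graph":false}' \
    --trust-remote-code
```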



#### Step 4
Open a new terminal, access the running Docker container, and execute the online serving benchmark script as follows:

```shell
docker exec -it vllm_DS /bin/bash
python3 /app/vllm/benchmarks/benchmark_serving.py --model deepseek-ai/DeepSeek-R1 --dataset-name random --ignore-eos --num-prompts 500 --max-concurrency 256 --random-input-len 3200 --random-output-len 800 --percentile-metrics ttft,tpot,itl,e2el
```
```shell
Maximum request concurrency: 256
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 500/500 [03:54<00:00, 2.14it/s]
============ Serving Benchmark Result ============
Successful requests: 500
Benchmark duration (s): 234.00
Total input tokens: 1597574
Total generated tokens: 400000
Request throughput (req/s): 2.14
Output token throughput (tok/s): 1709.39
Total Token throughput (tok/s): 8536.59
---------------Time to First Token----------------
Mean TTFT (ms): 18547.34
Median TTFT (ms): 5711.21
P99 TTFT (ms): 59776.29
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 124.24
Median TPOT (ms): 140.70
P99 TPOT (ms): 144.12
---------------Inter-token Latency----------------
Mean ITL (ms): 124.24
Median ITL (ms): 71.91
P99 ITL (ms): 2290.11
----------------End-to-end Latency----------------
Mean E2EL (ms): 117819.02
Median E2EL (ms): 118451.88
P99 E2EL (ms): 174508.24
==================================================
```
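To see how concurrency affects the latency and throughput figures above, the same script can be re-run at several `--max-concurrency` levels. A simple sweep sketch:
```shell
for conc in 64 128 256; do
  python3 /app/vllm/benchmarks/benchmark_serving.py \
    --model deepseek-ai/DeepSeek-R1 \
    --dataset-name random --ignore-eos \
    --num-prompts 500 --max-concurrency "$conc" \
    --random-input-len 3200 --random-output-len 800 \
    --percentile-metrics ttft,tpot,itl,e2el
done
```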
5 changes: 4 additions & 1 deletion README.md
@@ -18,6 +18,9 @@ This repo intends to host community maintained common recipes to run vLLM answer
### Qwen <img src="https://qwenlm.github.io/favicon.png" alt="Qwen" width="16" height="16" style="vertical-align:middle;">
- [Qwen3-Coder-480B-A35B](Qwen/Qwen3-Coder-480B-A35B.md)

### AMD GPU Support
For the user guide,kindly review the AMD-GPU repository within the model directory.

Comment on lines +21 to +22 (severity: high):

This new section for AMD GPU support is a bit vague and seems misplaced. Since the guide is specific to running DeepSeek models on AMD GPUs, it would be more organized if it were listed under the existing ### DeepSeek section.

However, if it's intended to be a separate section, the title and link should be more descriptive. The current text also has a typo and grammatical issue (guide,kindly).

Here is a suggestion to make it clearer while keeping it as a separate section:

Suggested change
### AMD GPU Support
For the user guide,kindly review the AMD-GPU repository within the model directory.
### DeepSeek on AMD GPU
- [DeepSeek-R1 Performance Guide for AMD GPU](DeepSeek/AMD_GPU/README.md)


## Contributing
Please feel free to contribute by adding a new recipe or improving an existing one, just send us a PR!

@@ -31,4 +34,4 @@ uv run mkdocs serve
```

## License
This project is licensed under the Apache License 2.0 - see the [LICENSE](LICENSE) file for details.
This project is licensed under the Apache License 2.0 - see the [LICENSE](LICENSE) file for details.