Update the Deepseek-R1 performance guide for the AMD GPU. #17
base: main
Conversation
Summary of Changes
Hello @haic0, I'm Gemini Code Assist[^1]! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!
This pull request significantly enhances the documentation by adding a dedicated performance guide for running DeepSeek-R1 models on AMD MI300X GPUs. My aim is to provide clear, actionable steps for users to achieve optimal performance, including environment setup, vLLM configuration, and benchmarking. Additionally, I've updated the main project README to ensure this new guide is easily accessible, improving overall user experience and documentation clarity.
Highlights
- **New Deepseek-R1 Performance Guide for AMD GPUs**: I've introduced a new `DeepSeek/AMD_GPU/README.md` file. This guide provides detailed, step-by-step instructions for setting up and running DeepSeek-R1 models specifically on AMD MI300X GPUs. It covers everything from verifying the GPU environment and locking the GPU frequency to launching the ROCm vLLM Docker container and executing vLLM online serving commands with optimized parameters (a rough sketch of such a command follows this list). The guide also includes valuable tips for adjusting parameters like `max-model-len`, `max-num-batched-tokens`, and `max-num-seqs` to fine-tune performance for throughput, memory, and latency, along with a sample benchmark serving script and its expected output.
- **Main README Update for AMD GPU Support**: To ensure discoverability of the new guide, I've updated the main `README.md` file. A new section titled "AMD GPU Support" has been added, which includes a direct link to the `DeepSeek/AMD_GPU/README.md` file. This makes it easy for users to find the relevant performance information for AMD GPUs.
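For orientation, here is a minimal sketch of what such an online-serving invocation might look like; the model ID, tensor-parallel size, and flag values are illustrative assumptions, not copied from the new guide:

```bash
# Illustrative only: serve DeepSeek-R1 with vLLM on 8x MI300X.
# Flag values are examples; the guide's README is the source of truth.
vllm serve deepseek-ai/DeepSeek-R1 \
  --tensor-parallel-size 8 \
  --max-model-len 32768 \
  --max-num-batched-tokens 32768 \
  --max-num-seqs 1024 \
  --trust-remote-code
```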
Using Gemini Code Assist
The full guide for Gemini Code Assist can be found on our documentation page; here are some quick tips.
Invoking Gemini
You can request assistance from Gemini at any point in your pull request by creating an issue comment (i.e. a comment on the pull request page) using either `/gemini <command>` or `@gemini-code-assist <command>`. Below is a summary of the supported commands.
| Feature | Command | Description |
|---|---|---|
| Code Review | `/gemini review` | Performs a code review for the current pull request in its current state. |
| Pull Request Summary | `/gemini summary` | Provides a summary of the current pull request in its current state. |
| Comment | `@gemini-code-assist` | Responds in comments when explicitly tagged, both in issue comments and review comments. |
| Help | `/gemini help` | Displays a list of available commands. |
Customization
To customize the Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a `.gemini/` folder in the base of the repository. Detailed instructions can be found here.
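As a purely illustrative sketch (the `.gemini/` file name and its contents below are assumptions, not taken from this repository):

```bash
# Hypothetical example: add a custom code review style guide for Gemini Code Assist.
mkdir -p .gemini
cat > .gemini/styleguide.md <<'EOF'
# Review style guide
- Follow PEP-8 for Python code.
- Prefer explicit, descriptive variable names.
EOF
```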
Limitations & Feedback
Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments or fill out our survey to provide feedback.
You can also get AI-powered code generation, chat, and code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.
Footnotes
[^1]: Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.
Code Review
This pull request adds a new performance guide for running DeepSeek-R1 on AMD GPUs and links to it from the main README. The new guide is detailed and provides useful commands and benchmark results. My feedback focuses on improving the formatting and readability of the new markdown file for a better user experience, and on restructuring the link in the main README to be more consistent with the existing document structure.
```
### AMD GPU Support
For the user guide,kindly review the AMD-GPU repository within the model directory.
```
This new section for AMD GPU support is a bit vague and seems misplaced. Since the guide is specific to running DeepSeek models on AMD GPUs, it would be more organized if it were listed under the existing `### DeepSeek` section.
However, if it's intended to be a separate section, the title and link should be more descriptive. The current text also has a typo and grammatical issue (`guide,kindly`).
Here is a suggestion to make it clearer while keeping it as a separate section:
```diff
-### AMD GPU Support
-For the user guide,kindly review the AMD-GPU repository within the model directory.
+### DeepSeek on AMD GPU
+- [DeepSeek-R1 Performance Guide for AMD GPU](DeepSeek/AMD_GPU/README.md)
```
```
@@ -0,0 +1,120 @@
## AMD GPU Installation and Benchmarking Guide
#### Support Matrix
```
```
##### GPU TYPE
MI300X
##### DATA TYPE
FP8
```
````
rocm-smi --setperfdeterminism 1900
```

### Step 2
````
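Not part of the diff, but as a sanity check after locking the frequency, rocm-smi can display the current clocks (flag availability depends on the ROCm version in the container):

```bash
# Optional sanity check (not in the diff): confirm clocks after setting perf determinism.
rocm-smi --showclocks
```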
```
pip install -U "huggingface_hub[cli]"
huggingface-cli login
```
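The hunk shows only the authentication step; a typical follow-up (not shown in the diff; the model ID and target directory here are assumptions) would be to pull the weights locally:

```bash
# After logging in, download the DeepSeek-R1 weights from the Hugging Face Hub.
huggingface-cli download deepseek-ai/DeepSeek-R1 --local-dir /models/DeepSeek-R1
```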
```
--max-num-batched-tokens=32768: Balances throughput with manageable memory/latency.

If OOM errors or sluggish performance occur → decrease max-model-len (e.g., 32k or 8k) or reduce max-num-batched-tokens (e.g., 16k or 8k).For low latency needs, consider reducing max-num-batched-tokens.To maximize throughput and you have available VRAM, keep it high—but stay aware of latency trade-offs.
```
This paragraph is dense and hard to read. It would be better to use bullet points or line breaks to separate the different pieces of advice for tuning `max-model-len` and `max-num-batched-tokens`.
```diff
-If OOM errors or sluggish performance occur → decrease max-model-len (e.g., 32k or 8k) or reduce max-num-batched-tokens (e.g., 16k or 8k).For low latency needs, consider reducing max-num-batched-tokens.To maximize throughput and you have available VRAM, keep it high—but stay aware of latency trade-offs.
+If OOM errors or sluggish performance occur:
+- Decrease `max-model-len` (e.g., to 32k or 8k).
+- Reduce `max-num-batched-tokens` (e.g., to 16k or 8k).
+For low latency needs, consider reducing `max-num-batched-tokens`.
+To maximize throughput with available VRAM, keep `max-num-batched-tokens` high, but stay aware of latency trade-offs.
```
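To make the fallback concrete, a reduced-memory launch might look like the following; the values are illustrative assumptions, not taken from the guide:

```bash
# Illustrative OOM fallback: shrink the context window and the batched-token budget.
vllm serve deepseek-ai/DeepSeek-R1 \
  --tensor-parallel-size 8 \
  --max-model-len 8192 \
  --max-num-batched-tokens 8192
```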
```
--max-num-seqs=1024: It affects throughput vs latency trade-offs:Higher values yield better throughput (more parallel requests) but may raise memory pressure and latency.Lower values reduce GPU memory footprint and latency, at the cost of throughput.
```
This paragraph about `max-num-seqs` is also quite dense. Using a list would make the trade-offs between higher and lower values much clearer to the user.
```diff
---max-num-seqs=1024: It affects throughput vs latency trade-offs:Higher values yield better throughput (more parallel requests) but may raise memory pressure and latency.Lower values reduce GPU memory footprint and latency, at the cost of throughput.
+--max-num-seqs=1024: Affects throughput vs. latency trade-offs:
+- **Higher values**: Yield better throughput (more parallel requests) but may increase memory pressure and latency.
+- **Lower values**: Reduce GPU memory footprint and latency, at the cost of throughput.
```
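For comparison, the trade-off could also be illustrated with two hypothetical launch profiles (values assumed, not taken from the guide):

```bash
# Throughput-oriented profile: more concurrent sequences, higher memory pressure.
vllm serve deepseek-ai/DeepSeek-R1 --tensor-parallel-size 8 --max-num-seqs 1024

# Latency-oriented profile: fewer concurrent sequences, lower memory footprint.
vllm serve deepseek-ai/DeepSeek-R1 --tensor-parallel-size 8 --max-num-seqs 128
```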