Update the Deepseek-R1 performance guide for the AMD GPU. #17
base: main
Conversation
Summary of Changes
Hello @haic0, I'm Gemini Code Assist[^1]! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!
This pull request significantly enhances the documentation by adding a dedicated performance guide for running DeepSeek-R1 models on AMD MI300X GPUs. My aim is to provide clear, actionable steps for users to achieve optimal performance, including environment setup, vLLM configuration, and benchmarking. Additionally, I've updated the main project README to ensure this new guide is easily accessible, improving overall user experience and documentation clarity.
Highlights
- **New Deepseek-R1 Performance Guide for AMD GPUs**: I've introduced a new `DeepSeek/AMD_GPU/README.md` file. This guide provides detailed, step-by-step instructions for setting up and running DeepSeek-R1 models specifically on AMD MI300X GPUs. It covers everything from verifying the GPU environment and locking the GPU frequency to launching the ROCm vLLM Docker container and executing vLLM online serving commands with optimized parameters (a rough sketch of such a command follows this list). The guide also includes valuable tips for adjusting parameters like `max-model-len`, `max-num-batched-tokens`, and `max-num-seqs` to fine-tune performance for throughput, memory, and latency, along with a sample benchmark serving script and its expected output.
- **Main README Update for AMD GPU Support**: To ensure discoverability of the new guide, I've updated the main `README.md` file. A new section titled "AMD GPU Support" has been added, which includes a direct link to the `DeepSeek/AMD_GPU/README.md` file. This makes it easy for users to find the relevant performance information for AMD GPUs.
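For orientation, here is a minimal sketch of what such an online-serving invocation might look like; the model ID, tensor-parallel size, and flag values are illustrative assumptions, not copied from the new guide:

```bash
# Illustrative only: serve DeepSeek-R1 with vLLM on 8x MI300X.
# Flag values are examples; the guide's README is the source of truth.
vllm serve deepseek-ai/DeepSeek-R1 \
  --tensor-parallel-size 8 \
  --max-model-len 32768 \
  --max-num-batched-tokens 32768 \
  --max-num-seqs 1024 \
  --trust-remote-code
```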
Using Gemini Code Assist
The full guide for Gemini Code Assist can be found on our documentation page; here are some quick tips.
Invoking Gemini
You can request assistance from Gemini at any point in your pull request by creating an issue comment (i.e. a comment on the pull request page) using either `/gemini <command>` or `@gemini-code-assist <command>`. Below is a summary of the supported commands.
| Feature | Command | Description |
|---|---|---|
| Code Review | `/gemini review` | Performs a code review for the current pull request in its current state. |
| Pull Request Summary | `/gemini summary` | Provides a summary of the current pull request in its current state. |
| Comment | `@gemini-code-assist` | Responds in comments when explicitly tagged, both in issue comments and review comments. |
| Help | `/gemini help` | Displays a list of available commands. |
Customization
To customize the Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a `.gemini/` folder in the base of the repository. Detailed instructions can be found here.
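As a purely illustrative sketch (the `.gemini/` file name and its contents below are assumptions, not taken from this repository):

```bash
# Hypothetical example: add a custom code review style guide for Gemini Code Assist.
mkdir -p .gemini
cat > .gemini/styleguide.md <<'EOF'
# Review style guide
- Follow PEP-8 for Python code.
- Prefer explicit, descriptive variable names.
EOF
```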
Limitations & Feedback
Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments or fill out our survey to provide feedback.
You can also get AI-powered code generation, chat, and code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.
Footnotes
[^1]: Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.
Code Review
This pull request adds a new performance guide for running DeepSeek-R1 on AMD GPUs and links to it from the main README. The new guide is detailed and provides useful commands and benchmark results. My feedback focuses on improving the formatting and readability of the new markdown file for a better user experience, and on restructuring the link in the main README to be more consistent with the existing document structure.
```
### AMD GPU Support
For the user guide,kindly review the AMD-GPU repository within the model directory.
```
This new section for AMD GPU support is a bit vague and seems misplaced. Since the guide is specific to running DeepSeek models on AMD GPUs, it would be more organized if it were listed under the existing `### DeepSeek` section.
However, if it's intended to be a separate section, the title and link should be more descriptive. The current text also has a typo and grammatical issue (`guide,kindly`).
Here is a suggestion to make it clearer while keeping it as a separate section:
```diff
-### AMD GPU Support
-For the user guide,kindly review the AMD-GPU repository within the model directory.
+### DeepSeek on AMD GPU
+- [DeepSeek-R1 Performance Guide for AMD GPU](DeepSeek/AMD_GPU/README.md)
```
```
@@ -0,0 +1,120 @@
## AMD GPU Installation and Benchmarking Guide
#### Support Matrix
```
```
##### GPU TYPE
MI300X
##### DATA TYPE
FP8
```
````
rocm-smi --setperfdeterminism 1900
```

### Step 2
````
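Not part of the diff, but as a sanity check after locking the frequency, rocm-smi can display the current clocks (flag availability depends on the ROCm version in the container):

```bash
# Optional sanity check (not in the diff): confirm clocks after setting perf determinism.
rocm-smi --showclocks
```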
```
pip install -U "huggingface_hub[cli]"
huggingface-cli login
```
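The hunk shows only the authentication step; a typical follow-up (not shown in the diff; the model ID and target directory here are assumptions) would be to pull the weights locally:

```bash
# After logging in, download the DeepSeek-R1 weights from the Hugging Face Hub.
huggingface-cli download deepseek-ai/DeepSeek-R1 --local-dir /models/DeepSeek-R1
```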
```
--max-num-batched-tokens=32768: Balances throughput with manageable memory/latency.

If OOM errors or sluggish performance occur → decrease max-model-len (e.g., 32k or 8k) or reduce max-num-batched-tokens (e.g., 16k or 8k).For low latency needs, consider reducing max-num-batched-tokens.To maximize throughput and you have available VRAM, keep it high—but stay aware of latency trade-offs.
```
This paragraph is dense and hard to read. It would be better to use bullet points or line breaks to separate the different pieces of advice for tuning `max-model-len` and `max-num-batched-tokens`.
```diff
-If OOM errors or sluggish performance occur → decrease max-model-len (e.g., 32k or 8k) or reduce max-num-batched-tokens (e.g., 16k or 8k).For low latency needs, consider reducing max-num-batched-tokens.To maximize throughput and you have available VRAM, keep it high—but stay aware of latency trade-offs.
+If OOM errors or sluggish performance occur:
+- Decrease `max-model-len` (e.g., to 32k or 8k).
+- Reduce `max-num-batched-tokens` (e.g., to 16k or 8k).
+For low latency needs, consider reducing `max-num-batched-tokens`.
+To maximize throughput with available VRAM, keep `max-num-batched-tokens` high, but stay aware of latency trade-offs.
```
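To make the fallback concrete, a reduced-memory launch might look like the following; the values are illustrative assumptions, not taken from the guide:

```bash
# Illustrative OOM fallback: shrink the context window and the batched-token budget.
vllm serve deepseek-ai/DeepSeek-R1 \
  --tensor-parallel-size 8 \
  --max-model-len 8192 \
  --max-num-batched-tokens 8192
```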
```
--max-num-seqs=1024: It affects throughput vs latency trade-offs:Higher values yield better throughput (more parallel requests) but may raise memory pressure and latency.Lower values reduce GPU memory footprint and latency, at the cost of throughput.
```
This paragraph about `max-num-seqs` is also quite dense. Using a list would make the trade-offs between higher and lower values much clearer to the user.
```diff
---max-num-seqs=1024: It affects throughput vs latency trade-offs:Higher values yield better throughput (more parallel requests) but may raise memory pressure and latency.Lower values reduce GPU memory footprint and latency, at the cost of throughput.
+--max-num-seqs=1024: Affects throughput vs. latency trade-offs:
+- **Higher values**: Yield better throughput (more parallel requests) but may increase memory pressure and latency.
+- **Lower values**: Reduce GPU memory footprint and latency, at the cost of throughput.
```
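For comparison, the trade-off could also be illustrated with two hypothetical launch profiles (values assumed, not taken from the guide):

```bash
# Throughput-oriented profile: more concurrent sequences, higher memory pressure.
vllm serve deepseek-ai/DeepSeek-R1 --tensor-parallel-size 8 --max-num-seqs 1024

# Latency-oriented profile: fewer concurrent sequences, lower memory footprint.
vllm serve deepseek-ai/DeepSeek-R1 --tensor-parallel-size 8 --max-num-seqs 128
```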