[Docs] Update Llama3/4 and GPT-OSS recipe for NVIDIA GPUs #42
base: main
Conversation
Summary of Changes
Hello @nvpohanh, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!
This pull request updates the documentation for Llama3/4 and GPT-OSS recipes, specifically tailored for NVIDIA GPUs. The changes focus on improving the deployment experience by simplifying Docker image acquisition, addressing NCCL initialization issues on Blackwell architecture, and refining vLLM server launch configurations. Additionally, the GPT-OSS recipe now includes extensive guidance on performance benchmarking and tuning, providing users with clearer instructions and insights into optimizing model serving for various throughput and latency requirements.
Highlights
- Streamlined Docker Image Acquisition: The documentation now guides users to pull pre-built vLLM Docker images from a specific ECR repository instead of building them locally. This significantly simplifies the setup process by removing the need for local compilation and dependency management (see the sketch after this list).
- Critical NCCL Version Installation: A new step has been added to install a specific version of NCCL (`nvidia-nccl-cu12==2.26.2.post1`). This addresses known issues with long NCCL initialization times on NVIDIA Blackwell architecture, ensuring smoother and faster model deployment.
- Optimized vLLM Server Launch Commands: The `vllm serve` commands across all recipes have been updated with refined compilation configurations and flag usage. This includes removing deprecated environment variables and introducing new flags like `NO_PREFIX_CACHING_FLAG` for more consistent and optimized performance measurements.
- Enhanced Performance Benchmarking and Tuning Guidance: The GPT-OSS recipe now includes comprehensive sections on performance benchmarking, detailing how to use `vllm bench serve` and how to interpret its output. It also provides in-depth guidance on tunable parameters and recommended configurations for balancing throughput and latency on NVIDIA GPUs.
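The overall flow these highlights describe can be sketched roughly as below. This is a hedged illustration only: the registry path and image tag are placeholders (the recipes name the exact ECR image), the model name and tensor-parallel size are illustrative assumptions, and `NO_PREFIX_CACHING_FLAG` is presumed to expand to vLLM's `--no-enable-prefix-caching` option; only the NCCL version pin is taken verbatim from this PR.

```bash
# Pull the pre-built vLLM image instead of building locally.
# <ecr-registry> and <tag> are placeholders; the recipes specify the real ones.
docker pull <ecr-registry>/vllm/vllm-openai:<tag>

# Inside the container: pin NCCL to avoid long initialization times on Blackwell.
# This exact version is the one named in this PR.
pip install nvidia-nccl-cu12==2.26.2.post1

# Launch the server. Disabling prefix caching keeps benchmark runs comparable;
# the model name and TP size here are illustrative, not the recipes' values.
NO_PREFIX_CACHING_FLAG="--no-enable-prefix-caching"
vllm serve openai/gpt-oss-120b \
    --tensor-parallel-size 8 \
    $NO_PREFIX_CACHING_FLAG
```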
Using Gemini Code Assist
The full guide for Gemini Code Assist can be found on our documentation page; here are some quick tips.
Invoking Gemini
You can request assistance from Gemini at any point in your pull request by creating an issue comment (i.e. a comment on the pull request page) using either `/gemini <command>` or `@gemini-code-assist <command>`. Below is a summary of the supported commands.
| Feature | Command | Description |
|---|---|---|
| Code Review | `/gemini review` | Performs a code review for the current pull request in its current state. |
| Pull Request Summary | `/gemini summary` | Provides a summary of the current pull request in its current state. |
| Comment | `@gemini-code-assist` | Responds in comments when explicitly tagged, both in issue comments and review comments. |
| Help | `/gemini help` | Displays a list of available commands. |
Customization
To customize Gemini Code Assist for GitHub, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by adding files to a `.gemini/` folder in the base of the repository. Detailed instructions can be found here.
Limitations & Feedback
Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here.
You can also get AI-powered code generation, chat, and code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.
Footnotes
¹ Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double-check it and use code with caution.
Code Review
This pull request updates the documentation for Llama 3.3, Llama 4, and GPT-OSS models to provide clearer and more up-to-date instructions for running them on NVIDIA GPUs, particularly the new Blackwell architecture. Key changes include switching from building a Docker image to pulling a pre-built one, adding a necessary step to update NCCL for Blackwell, and providing more detailed server launch commands with explanations for various parameters. The documentation for performance benchmarking and balancing throughput vs. latency has also been significantly expanded and standardized across the recipes. The changes are well-structured and greatly improve the usability and clarity of the guides. I've found a minor typo and a couple of small inconsistencies in the documentation, which I've commented on.
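For readers unfamiliar with the expanded benchmarking sections, a typical `vllm bench serve` invocation against an already-running server looks something like the sketch below. The flag names mirror vLLM's serving benchmark; the concrete values (model, token lengths, request count, concurrency) are illustrative assumptions, not the recipes' recommended settings.

```bash
# Benchmark a running vLLM server with synthetic random prompts.
# Values are illustrative; the recipes document the recommended sweeps.
vllm bench serve \
    --model openai/gpt-oss-120b \
    --dataset-name random \
    --random-input-len 1024 \
    --random-output-len 1024 \
    --num-prompts 512 \
    --max-concurrency 64
```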
Force-pushed from 6ea9d90 to 658e615
/gemini review
Code Review
This pull request updates the documentation for Llama3, Llama4, and GPT-OSS models on NVIDIA GPUs. The changes include switching from building a Docker image from source to pulling a pre-built image, which simplifies the setup process. It also adds instructions for installing a newer version of NCCL required for Blackwell architecture and updates various vLLM flags and commands to align with recent changes. The documentation for GPT-OSS is significantly expanded with detailed examples, parameter explanations, and comprehensive sections on performance benchmarking and tuning.
My review identifies a couple of minor issues in OpenAI/GPT-OSS.md: a formatting inconsistency in the parameter list and a typo at the end of the file. Applying these suggestions will improve the clarity and consistency of the documentation. Overall, this is a valuable update that makes the recipes more user-friendly and informative.
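As a rough illustration of the kind of throughput-versus-latency tunables the expanded GPT-OSS section covers (the specific flags and values below are assumptions for illustration, not quotes from the recipe): vLLM's `--max-num-seqs` and `--max-num-batched-tokens` trade per-request latency against aggregate throughput.

```bash
# Throughput-leaning sketch: larger batches amortize per-step overhead,
# at the cost of higher per-token latency. Values are illustrative only.
vllm serve openai/gpt-oss-120b \
    --max-num-seqs 512 \
    --max-num-batched-tokens 8192

# Latency-leaning sketch: smaller batches keep time-per-output-token low.
vllm serve openai/gpt-oss-120b \
    --max-num-seqs 64 \
    --max-num-batched-tokens 2048
```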
Force-pushed from b5c8d62 to 5b6aa93
/gemini review
Code Review
This pull request updates the documentation for Llama and GPT-OSS models, shifting from building Docker images from source to using pre-built images and updating configurations for NVIDIA's Blackwell and Hopper GPUs. This simplifies the setup process and provides more current instructions. My review has identified a few critical errors in the provided Docker commands and some typos in the documentation that could lead to user confusion. Addressing these will improve the quality and usability of the recipes.
Signed-off-by: Po-Han Huang <[email protected]>
Force-pushed from 5b6aa93 to 0b14830
/gemini review
Code Review
This pull request updates the documentation for running Llama3, Llama4, and GPT-OSS models on NVIDIA GPUs. The changes simplify the setup process by switching to a pre-built Docker image, add necessary steps for installing updated dependencies like NCCL and FlashInfer, and refine the server launch configurations for improved performance on newer hardware architectures. The instructions are generally clear and the updates are consistent across the different model recipes. I've identified one issue in the Llama4-Scout.md recipe where a command uses an incorrect model name, which would cause it to fail. I have provided a suggestion to correct this.
@heheda12345 Could you review and merge this? Thanks!