
Conversation

[Docs] Update Llama3/4 and GPT-OSS recipe for NVIDIA GPUs

nvpohanh (Contributor)

@gemini-code-assist bot left a comment

Summary of Changes

Hello @nvpohanh, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request updates the documentation for the Llama3/4 and GPT-OSS recipes, specifically tailored for NVIDIA GPUs. The changes focus on improving the deployment experience by simplifying Docker image acquisition, addressing NCCL initialization issues on the Blackwell architecture, and refining vLLM server launch configurations. Additionally, the GPT-OSS recipe now includes extensive guidance on performance benchmarking and tuning, giving users clearer instructions and insights into optimizing model serving for various throughput and latency requirements.

Highlights

  • Streamlined Docker Image Acquisition: The documentation now guides users to pull pre-built vLLM Docker images from a specific ECR repository instead of building them locally, which significantly simplifies setup by removing the need for local compilation and dependency management (see the pull-and-run sketch after this list).
  • Critical NCCL Version Installation: A new step installs a specific NCCL version (nvidia-nccl-cu12==2.26.2.post1) to address known long NCCL initialization times on the NVIDIA Blackwell architecture, ensuring smoother and faster model deployment (install command below).
  • Optimized vLLM Server Launch Commands: The vllm serve commands across all recipes have been updated with refined compilation configurations and flag usage, including removing deprecated environment variables and introducing conventions such as NO_PREFIX_CACHING_FLAG for more consistent and optimized performance measurements (see the launch sketch below).
  • Enhanced Performance Benchmarking and Tuning Guidance: The GPT-OSS recipe now includes comprehensive sections on performance benchmarking, detailing how to use vllm bench serve and interpret its output, along with in-depth guidance on tunable parameters and recommended configurations for balancing throughput and latency on NVIDIA GPUs (see the benchmark sketch below).
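A rough sketch of the pull-and-run workflow described above. The ECR registry path and image tag are placeholders (the recipes define the real values), and opening a shell with --entrypoint is an assumption made here so the NCCL step below can run before the server starts:

```bash
# Pull the pre-built vLLM image instead of building it locally.
# <ecr-registry> and <tag> are placeholders; use the values from the recipe.
docker pull <ecr-registry>/vllm/vllm-openai:<tag>

# Start a shell in the container with GPU access; mounting the host's
# Hugging Face cache avoids re-downloading model weights on every run.
docker run --gpus all -it --rm \
  --entrypoint /bin/bash \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  -p 8000:8000 \
  <ecr-registry>/vllm/vllm-openai:<tag>
```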
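Inside the container, the NCCL pin is a one-liner. This sketch assumes pip is the installer (a recipe may use a different one); the package and version are the ones named above:

```bash
# Pin NCCL to the version called out above to avoid the long
# initialization times seen on the Blackwell architecture.
pip install nvidia-nccl-cu12==2.26.2.post1
```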
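A minimal launch sketch. The model name and tuning values below are illustrative rather than the recipes' authoritative settings, and the recipes' compilation configs are omitted; NO_PREFIX_CACHING_FLAG is shown as a shell variable following the convention the highlights mention:

```bash
# Set the variable to disable prefix caching for reproducible benchmark
# runs; leave it empty for normal serving.
NO_PREFIX_CACHING_FLAG="--no-enable-prefix-caching"

vllm serve openai/gpt-oss-120b \
  --tensor-parallel-size 1 \
  --max-num-batched-tokens 8192 \
  ${NO_PREFIX_CACHING_FLAG}
```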
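And a benchmarking sketch against the running server. The traffic shape (input/output lengths, prompt count, concurrency) is made up for illustration; these are exactly the knobs the tuning guidance trades off:

```bash
# Benchmark the server with synthetic random prompts.
# All request-shape values below are example numbers only.
vllm bench serve \
  --model openai/gpt-oss-120b \
  --dataset-name random \
  --random-input-len 1024 \
  --random-output-len 1024 \
  --num-prompts 256 \
  --max-concurrency 64
```

The run reports request and output-token throughput plus latency percentiles (TTFT, TPOT, inter-token latency); raising concurrency generally improves throughput at the cost of per-request latency.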
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page; here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point in your pull request by creating an issue comment (i.e., a comment on the pull request page) using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands.

  • Code Review (/gemini review): Performs a code review for the current pull request in its current state.
  • Pull Request Summary (/gemini summary): Provides a summary of the current pull request in its current state.
  • Comment (@gemini-code-assist): Responds in comments when explicitly tagged, both in issue comments and review comments.
  • Help (/gemini help): Displays a list of available commands.

Customization

To customize the Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or a custom code review style guide (such as PEP-8 for Python) by adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.
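For instance, a hypothetical sketch (the linked instructions are authoritative for file names and supported options):

```bash
# Add a custom style guide for the reviewer to take into account.
mkdir -p .gemini
cat > .gemini/styleguide.md <<'EOF'
# Review style guide
- Python code should follow PEP-8.
- Shell snippets in docs must be copy-paste runnable.
EOF
```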

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here.

You can also get AI-powered code generation, chat, and code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double-check its output and use code with caution.

@gemini-code-assist bot left a comment

Code Review

This pull request updates the documentation for Llama 3.3, Llama 4, and GPT-OSS models to provide clearer and more up-to-date instructions for running them on NVIDIA GPUs, particularly the new Blackwell architecture. Key changes include switching from building a Docker image to pulling a pre-built one, adding a necessary step to update NCCL for Blackwell, and providing more detailed server launch commands with explanations for various parameters. The documentation for performance benchmarking and balancing throughput vs. latency has also been significantly expanded and standardized across the recipes. The changes are well-structured and greatly improve the usability and clarity of the guides. I've found a minor typo and a couple of small inconsistencies in the documentation, which I've commented on.

@nvpohanh force-pushed the dev/nvpohanh/llama-gptoss-v0.4 branch from 6ea9d90 to 658e615 on August 26, 2025 at 08:36
@nvpohanh (Contributor, Author)

/gemini review

@gemini-code-assist bot left a comment

Code Review

This pull request updates the documentation for Llama3, Llama4, and GPT-OSS models on NVIDIA GPUs. The changes include switching from building a Docker image from source to pulling a pre-built image, which simplifies the setup process. It also adds instructions for installing a newer version of NCCL required for Blackwell architecture and updates various vLLM flags and commands to align with recent changes. The documentation for GPT-OSS is significantly expanded with detailed examples, parameter explanations, and comprehensive sections on performance benchmarking and tuning.

My review identifies a couple of minor issues in OpenAI/GPT-OSS.md: a formatting inconsistency in the parameter list and a typo at the end of the file. Applying these suggestions will improve the clarity and consistency of the documentation. Overall, this is a valuable update that makes the recipes more user-friendly and informative.

@nvpohanh force-pushed the dev/nvpohanh/llama-gptoss-v0.4 branch 2 times, most recently from b5c8d62 to 5b6aa93 on September 1, 2025 at 06:12
@nvpohanh (Contributor, Author) commented on Sep 1, 2025

/gemini review

@gemini-code-assist bot left a comment

Code Review

This pull request updates the documentation for Llama and GPT-OSS models, shifting from building Docker images from source to using pre-built images and updating configurations for NVIDIA's Blackwell and Hopper GPUs. This simplifies the setup process and provides more current instructions. My review has identified a few critical errors in the provided Docker commands and some typos in the documentation that could lead to user confusion. Addressing these will improve the quality and usability of the recipes.

@nvpohanh force-pushed the dev/nvpohanh/llama-gptoss-v0.4 branch from 5b6aa93 to 0b14830 on September 1, 2025 at 08:34
@nvpohanh (Contributor, Author) commented on Sep 1, 2025

/gemini review

@gemini-code-assist bot left a comment

Code Review

This pull request updates the documentation for running Llama3, Llama4, and GPT-OSS models on NVIDIA GPUs. The changes simplify the setup process by switching to a pre-built Docker image, add necessary steps for installing updated dependencies like NCCL and FlashInfer, and refine the server launch configurations for improved performance on newer hardware architectures. The instructions are generally clear and the updates are consistent across the different model recipes. I've identified one issue in the Llama4-Scout.md recipe where a command uses an incorrect model name, which would cause it to fail. I have provided a suggestion to correct this.

@nvpohanh (Contributor, Author) commented on Sep 3, 2025

@heheda12345 Could you review this and merge this? Thanks!
