[NVIDIA] Update Llama3/4/GPT-OSS recipes to vLLM v0.11.2 #123
Conversation
Changes:
- Update vLLM version to v0.11.2.
- Remove `custom_ops` and `cudagraph_mode` from compilation-config, as they are no longer needed.
- Remove the `VLLM_FLASHINFER_ALLREDUCE_FUSION_THRESHOLDS_MB` env var, which is no longer needed.
- Add `stream-interval: 20` for GPT-OSS to avoid being bottlenecked by host overheads in the max-throughput scenario.
- Disable Attn+Q fusion on Llama4, since it no longer works.

Signed-off-by: Po-Han Huang <[email protected]>
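For orientation, here is a minimal sketch of what the updated GPT-OSS serving settings look like after these changes. It uses only keys and values that appear in the review snippets quoted below; treat it as illustrative rather than the exact recipe contents:

```yaml
# Sketch of the post-update GPT-OSS serving settings (illustrative; see the quoted recipe below).
# Note what is absent: no custom_ops / cudagraph_mode entries under compilation-config, and
# VLLM_FLASHINFER_ALLREDUCE_FUSION_THRESHOLDS_MB is no longer set anywhere.
no-enable-prefix-caching: true
max-cudagraph-capture-size: 2048
max-num-batched-tokens: 8192
stream-interval: 20   # batches streamed responses so host-side overhead stops bottlenecking max throughput
```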
Summary of Changes

Hello @nvpohanh, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed. This pull request focuses on updating and optimizing the vLLM recipes for Llama3, Llama4, and GPT-OSS models to align with the vLLM v0.11.2 release. The changes streamline compilation settings, remove obsolete environment variables, and introduce new parameters to enhance performance, particularly for GPT-OSS models, by improving streaming throughput and enabling speculative decoding. These updates ensure the recipes leverage the latest vLLM capabilities for efficient model serving.
Code Review
This pull request updates the Llama and GPT-OSS recipes for vLLM v0.11.2. The changes include updating the Docker image version, removing obsolete configuration parameters, and adding new ones like `stream-interval` for GPT-OSS. Overall, the updates are well-aligned with the pull request's goals. However, I've identified a few issues: a critical syntax error in a YAML configuration example within the documentation, a likely typo in a command-line argument that is used inconsistently across files, and some documentation text that may be confusing because it refers to parameters not used in the specific recipe. Addressing these points will improve the correctness and clarity of the recipes.
    no-enable-prefix-caching: true
    max-cudagraph-capture-size: 2048
    max-num-batched-tokens: 8192
    stream-interval 20
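The last quoted line is presumably the YAML syntax error called out in the review above: a mapping entry needs a colon between the key and the value, matching the `stream-interval: 20` form described in the PR summary:

```yaml
stream-interval: 20
```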
      --tensor-parallel-size 1 \
    - --max-num-seqs 512 &
    + --max-num-seqs 512 \
    + --max-model-length 10240 &
The command-line argument `--max-model-length` appears to be a typo. The correct argument in vLLM is typically `--max-model-len`. This is used correctly in OpenAI/GPT-OSS.md in this same PR. Using the wrong argument name could cause the server to fail to start or to ignore this important parameter.
Suggested change:

    - --max-model-length 10240 &
    + --max-model-len 10240 &
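For recipes that use a YAML config file rather than raw CLI flags, the corrected setting would presumably appear as a `max-model-len` key. This is a sketch assuming the config keys mirror the flag names, as the other keys quoted in this review do:

```yaml
# Config-file form of the corrected flag (key names assumed to mirror the CLI flags).
max-num-seqs: 512
max-model-len: 10240
```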
    - `Median Inter-Token Latency (ITL)`: The typical time delay between a response for the completion of one output token (or output tokens) and the next response for the completion of token(s).
    - If the `--stream-interval 20` flag is added in the server command, the ITL will be the completion time for every 20 output tokens.
Changes:
- Remove `custom_ops` and `cudagraph_mode` from compilation-config, as they are no longer needed.
- Remove the `VLLM_FLASHINFER_ALLREDUCE_FUSION_THRESHOLDS_MB` env var, which is no longer needed.
- Add `stream-interval: 20` for GPT-OSS to avoid being bottlenecked by host overheads in the max-throughput scenario.
- Rename `cuda-graph-sizes` to `max-cudagraph-capture-size`.
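The last bullet is a key rename in the recipe configs. A small before/after sketch, where the old spelling is inferred from the change note and the 2048 value is taken from the snippet quoted earlier in this review:

```yaml
# Before (older recipes; key spelling inferred from the change note above):
# cuda-graph-sizes: 2048

# After this PR:
max-cudagraph-capture-size: 2048
```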