upgrade vllm inference demo to use 0.7.0 and VLLM_USE_V1 #1064

Closed

2timesjay wants to merge 1 commit into modal-labs:main from
Conversation
Collaborator

Thanks for the PR! cc @jackcook
Looks good to me! I did some benchmarking today to look at the effects of the new V1 engine, and we're seeing similar improvements internally as well.
There's a fair amount of things not implemented yet on V1; it could be worth adding a disclaimer.
Collaborator

Thanks for sharing this @2timesjay, and for your comment @bhaktatejas922! We incorporated the changes and feedback into #1076 and #1078.
I upgraded to the newest version of vLLM (0.7.0), which includes an alpha version of their substantially faster V1 engine and a refactor of model configuration. If people are reusing these examples for demos and projects, this should be helpful.

Big speedup, especially at high concurrency. Here are some numbers from testing with Llama3-70B-fp8 on 1 H100:
- vllm==0.7.0, VLLM_USE_V1=1
- vllm==0.6.3.post1
see full results: https://gist.github.com/2timesjay/ebc7773aa8fb01115172f37dae86bc47
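For readers adapting the example: the new engine is opted into via the `VLLM_USE_V1` environment variable, which vLLM reads at startup. A minimal sketch of how this might look (the model name is illustrative, not from this PR; the variable must be set in the process environment before vLLM initializes):

```python
import os

# Opt into the alpha V1 engine shipped with vllm==0.7.0.
# vLLM reads this flag from the environment at startup,
# so it must be set before the engine is constructed.
os.environ["VLLM_USE_V1"] = "1"

# Illustrative usage, assuming vllm==0.7.0 is installed:
# from vllm import LLM
# llm = LLM(model="meta-llama/Meta-Llama-3-70B-Instruct")

print(os.environ["VLLM_USE_V1"])
```

In a Modal app, the same effect can be achieved by baking the variable into the container image's environment rather than setting it in Python code.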
Type of Change
Checklist
(all of these are satisfied by keeping the changes to a minimum)
- `lambda-test: false` is added to example frontmatter (`---`), or `modal run` or an alternative `cmd` is provided in the example frontmatter (e.g. `cmd: ["modal", "deploy"]`)
- `args` are provided in the example frontmatter (e.g. `args: ["--prompt", "Formula for room temperature superconductor:"]`)
- Example uses the latest `python_version` for the base image, if it is used
- Dependencies are pinned to `~=x.y.z` or `==x.y`; dependencies with version < 1 are pinned to patch version, `==0.y.z`

Outside contributors
Jacob Jensen (2timesjay)