
upgrade vllm inference demo to use 0.7.0 and VLLM_USE_V1. #1064

Closed
2timesjay wants to merge 1 commit into modal-labs:main from 2timesjay:2timesjay/upgrade-vllm-inference-demo

Conversation

2timesjay commented Feb 3, 2025

I upgraded to the newest version of vLLM (0.7.0), which includes an alpha version of its substantially faster V1 engine and a refactor of model configuration. Since people reuse these examples for demos and projects, this should be helpful.

Big speedup, especially at high concurrency. Here are some numbers from testing with Llama3-70B-fp8 on 1 H100:

vllm==0.7.0, VLLM_USE_V1=1

| Max Parallelism | Number of Prompts | Average Latency (s) | p95 Latency (s) | Throughput (requests/s) |
|---|---|---|---|---|
| 8 | 32 | 3.3245 | 3.5712 | 2.3654 |
| 16 | 32 | 3.7085 | 3.8151 | 4.2802 |
| 32 | 32 | 4.5872 | 4.6662 | 6.8342 |
| 64 | 64 | 5.8669 | 6.0471 | 10.4833 |
| 128 | 128 | 8.6023 | 8.8094 | 14.3457 |
| 256 | 256 | 14.7483 | 18.9714 | 13.2442 |

vllm==0.6.3.post1

| Max Parallelism | Number of Prompts | Average Latency (s) | p95 Latency (s) | Throughput (requests/s) |
|---|---|---|---|---|
| 8 | 32 | 4.1822 | 4.3813 | 1.9079 |
| 16 | 32 | 4.6502 | 5.1558 | 3.1282 |
| 32 | 32 | 6.9919 | 9.2724 | 3.4463 |
| 64 | 64 | 11.4092 | 18.2382 | 3.4930 |
| 128 | 128 | 75.9388 | 90.1596 | 1.4170 |
| 256 | 256 | 93.1698 | 123.8264 | 2.0409 |
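Comparing the throughput columns of the two tables gives roughly a 1.2x gain at 8-way parallelism and around a 10x gain at 128-way parallelism. A quick sanity check of those ratios (values copied from the tables above):

```python
# Throughput (requests/s) from the two tables above, keyed by max parallelism.
v1_throughput = {8: 2.3654, 16: 4.2802, 32: 6.8342, 64: 10.4833, 128: 14.3457, 256: 13.2442}
v0_throughput = {8: 1.9079, 16: 3.1282, 32: 3.4463, 64: 3.4930, 128: 1.4170, 256: 2.0409}

# Per-parallelism speedup of vllm==0.7.0 (V1 engine) over vllm==0.6.3.post1.
speedup = {k: round(v1_throughput[k] / v0_throughput[k], 2) for k in v1_throughput}
print(speedup)
```

The gap widens sharply past 64-way parallelism, where the 0.6.3 engine's throughput collapses while V1 keeps scaling.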

See full results: https://gist.github.com/2timesjay/ebc7773aa8fb01115172f37dae86bc47
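The full benchmark harness is in the gist above. As a rough sketch of how the tables' columns can be derived from raw per-request timings (the function and variable names here are illustrative, not taken from the gist):

```python
import statistics

def summarize(latencies, wall_time_s):
    """Reduce per-request latencies (seconds) to the columns used in the tables above."""
    avg = statistics.mean(latencies)
    # p95 is the last of the 20-quantile cut points; "inclusive" treats the
    # sample as the whole population rather than a draw from a larger one.
    p95 = statistics.quantiles(latencies, n=20, method="inclusive")[-1]
    throughput = len(latencies) / wall_time_s  # completed requests per second
    return avg, p95, throughput

# Toy example: four requests completing over a 10-second window.
avg, p95, rps = summarize([3.0, 3.2, 3.4, 3.6], wall_time_s=10.0)
```

Note that average latency and throughput can move in opposite directions under load, which is why both appear in the tables.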

Type of Change

  • New example
  • Example updates (Bug fixes, new features, etc.)
  • Other (changes to the codebase, but not to examples)

Checklist

(all of these are satisfied by keeping the changes to a minimum)

  • Example is testable in synthetic monitoring system, or lambda-test: false is added to example frontmatter (---)
    • Example is tested by executing with modal run or an alternative cmd is provided in the example frontmatter (e.g. cmd: ["modal", "deploy"])
    • Example is tested by running with no arguments or the args are provided in the example frontmatter (e.g. args: ["--prompt", "Formula for room temperature superconductor:"])
  • Example is documented with comments throughout, in a Literate Programming style.
  • Example does not require third-party dependencies to be installed locally
  • Example pins its dependencies
    • Example pins container images to a stable tag, not a dynamic tag like latest
    • Example specifies a python_version for the base image, if it is used
    • Example pins all dependencies to at least minor version, ~=x.y.z or ==x.y
    • Example dependencies with version < 1 are pinned to patch version, ==0.y.z

Outside contributors

Jacob Jensen (2timesjay)

@charlesfrye (Collaborator)

Thanks for the PR! cc @jackcook

@jackcook commented Feb 4, 2025

Looks good to me! I did some benchmarking today to look at the effects of the new V1 engine and we're seeing similar improvements internally as well.

@bhaktatejas922

There's a fair amount of functionality not yet implemented in V1; it could be worth adding a disclaimer.

@charlesfrye (Collaborator)

Thanks for sharing this, @2timesjay, and for your comment, @bhaktatejas922! We incorporated the changes and feedback into #1076 and #1078.

github-actions bot deleted the 2timesjay/upgrade-vllm-inference-demo branch January 28, 2026 15:44
