
upgrade vllm inference demo to use 0.7.0 and VLLM_USE_V1. #1064

Closed
2timesjay wants to merge 1 commit into modal-labs:main from 2timesjay:2timesjay/upgrade-vllm-inference-demo

Conversation

2timesjay commented Feb 3, 2025

I upgraded to the newest version of vLLM (0.7.0), which includes an alpha version of its substantially faster V1 engine and a refactor of model configuration. Since people reuse these examples for demos and projects, this should be helpful.

Big speedup, especially at high concurrency. Here are some numbers from testing with Llama3-70B-fp8 on 1 H100:

vllm==0.7.0, VLLM_USE_V1=1

| Max Parallelism | Number of Prompts | Average Latency (s) | p95 Latency (s) | Throughput (requests/s) |
|---|---|---|---|---|
| 8 | 32 | 3.3245 | 3.5712 | 2.3654 |
| 16 | 32 | 3.7085 | 3.8151 | 4.2802 |
| 32 | 32 | 4.5872 | 4.6662 | 6.8342 |
| 64 | 64 | 5.8669 | 6.0471 | 10.4833 |
| 128 | 128 | 8.6023 | 8.8094 | 14.3457 |
| 256 | 256 | 14.7483 | 18.9714 | 13.2442 |

vllm==0.6.3.post1

| Max Parallelism | Number of Prompts | Average Latency (s) | p95 Latency (s) | Throughput (requests/s) |
|---|---|---|---|---|
| 8 | 32 | 4.1822 | 4.3813 | 1.9079 |
| 16 | 32 | 4.6502 | 5.1558 | 3.1282 |
| 32 | 32 | 6.9919 | 9.2724 | 3.4463 |
| 64 | 64 | 11.4092 | 18.2382 | 3.4930 |
| 128 | 128 | 75.9388 | 90.1596 | 1.4170 |
| 256 | 256 | 93.1698 | 123.8264 | 2.0409 |
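Comparing the throughput columns of the two tables gives roughly a 1.2x gain at 8-way parallelism and around a 10x gain at 128-way parallelism. A quick sanity check of those ratios (values copied from the tables above):

```python
# Throughput (requests/s) from the two tables above, keyed by max parallelism.
v1_throughput = {8: 2.3654, 16: 4.2802, 32: 6.8342, 64: 10.4833, 128: 14.3457, 256: 13.2442}
v0_throughput = {8: 1.9079, 16: 3.1282, 32: 3.4463, 64: 3.4930, 128: 1.4170, 256: 2.0409}

# Per-parallelism speedup of vllm==0.7.0 (V1 engine) over vllm==0.6.3.post1.
speedup = {k: round(v1_throughput[k] / v0_throughput[k], 2) for k in v1_throughput}
print(speedup)
```

The gap widens sharply past 64-way parallelism, where the 0.6.3 engine's throughput collapses while V1 keeps scaling.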

See full results: https://gist.github.com/2timesjay/ebc7773aa8fb01115172f37dae86bc47
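The full benchmark harness is in the gist above. As a rough sketch of how the tables' columns can be derived from raw per-request timings (the function and variable names here are illustrative, not taken from the gist):

```python
import statistics

def summarize(latencies, wall_time_s):
    """Reduce per-request latencies (seconds) to the columns used in the tables above."""
    avg = statistics.mean(latencies)
    # p95 is the last of the 20-quantile cut points; "inclusive" treats the
    # sample as the whole population rather than a draw from a larger one.
    p95 = statistics.quantiles(latencies, n=20, method="inclusive")[-1]
    throughput = len(latencies) / wall_time_s  # completed requests per second
    return avg, p95, throughput

# Toy example: four requests completing over a 10-second window.
avg, p95, rps = summarize([3.0, 3.2, 3.4, 3.6], wall_time_s=10.0)
```

Note that average latency and throughput can move in opposite directions under load, which is why both appear in the tables.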

Type of Change

  • New example
  • Example updates (Bug fixes, new features, etc.)
  • Other (changes to the codebase, but not to examples)

Checklist

(all of these are satisfied by keeping the changes to a minimum)

  • Example is testable in synthetic monitoring system, or lambda-test: false is added to example frontmatter (---)
    • Example is tested by executing with modal run or an alternative cmd is provided in the example frontmatter (e.g. cmd: ["modal", "deploy"])
    • Example is tested by running with no arguments or the args are provided in the example frontmatter (e.g. args: ["--prompt", "Formula for room temperature superconductor:"])
  • Example is documented with comments throughout, in a Literate Programming style.
  • Example does not require third-party dependencies to be installed locally
  • Example pins its dependencies
    • Example pins container images to a stable tag, not a dynamic tag like latest
    • Example specifies a python_version for the base image, if it is used
    • Example pins all dependencies to at least minor version, ~=x.y.z or ==x.y
    • Example dependencies with version < 1 are pinned to patch version, ==0.y.z

Outside contributors

Jacob Jensen (2timesjay)

@charlesfrye (Collaborator)

Thanks for the PR! cc @jackcook

@jackcook commented Feb 4, 2025

Looks good to me! I did some benchmarking today to look at the effects of the new V1 engine and we're seeing similar improvements internally as well.

@bhaktatejas922

There's a fair amount of functionality not yet implemented in V1; it could be worth adding a disclaimer.

@charlesfrye (Collaborator)

Thanks for sharing this, @2timesjay, and for your comment, @bhaktatejas922! We incorporated the changes and feedback into #1076 and #1078.

github-actions bot deleted the 2timesjay/upgrade-vllm-inference-demo branch January 28, 2026 15:44
