
Commit 0b924ad

Simon edits
Signed-off-by: simon-mo <[email protected]>
1 parent fc6e1dc commit 0b924ad

_posts/2025-01-10-vllm-2024-wrapped-2025-vision.md

Lines changed: 31 additions & 30 deletions
@@ -7,13 +7,13 @@ image: /assets/figures/vllm-2024-wrapped-2025-roadmap/model-architecture-serving

 The vLLM community has achieved remarkable growth in 2024, evolving from a specialized inference engine to becoming the de facto serving solution for the open-source AI ecosystem. This transformation is reflected in our growth metrics, which tell a story of rapid adoption and expanding impact:

-* GitHub stars grew from 14,000 to 32,600 (2.3x)
-* Contributors expanded from 190 to 740 (3.8x)
-* Monthly downloads surged from 6,000 to 27,000 (4.5x)
-* We've seen approximately 10x growth in GPU hours over the last six months.
+* GitHub stars grew from 14,000 to 32,600 (2.3x)
+* Contributors expanded from 190 to 740 (3.8x)
+* Monthly downloads surged from 6,000 to 27,000 (4.5x)
+* We've seen approximately 10x growth in GPU hours over the last six months.
 * You can explore more of our usage data at [https://2024.vllm.ai](https://2024.vllm.ai).

-This transformation has established vLLM as the Linux/Kubernetes/PyTorch of **LLM inference infrastructure**, with large adoption for production applications (e.g. powering Amazon Rufus and Linkedin AI Features). Our bi-monthly meetups have become strategic gatherings for partnerships with industry leaders like IBM, AWS, and NVIDIA, marking our progress toward becoming the universal serving solution for the open-source AI ecosystem. Read on for more details on vLLM’s 2024 achievements and 2025 roadmap.
+This transformation has established vLLM as the leading open-source LLM serving and inference engine, with large adoption for production applications (e.g. powering Amazon Rufus and Linkedin AI Features). Our bi-monthly meetups have become strategic gatherings for partnerships with industry leaders like IBM, AWS, and NVIDIA, marking our progress toward becoming the universal serving solution for the open-source AI ecosystem. Read on for more details on vLLM’s 2024 achievements and 2025 roadmap!

 *This blog is based off of the 16th session of the bi-weekly [vLLM Office Hours](https://hubs.li/Q02TFDTT0) session. Watch the recording [here](https://www.youtube.com/watch?v=xmz8lHsrbGM).*

@@ -32,10 +32,10 @@ vLLM Main Contributor Groups (by Commits)

 It’s been a great 2024 for vLLM! Our contribution community has expanded dramatically, now including:

-* 15+ full-time contributors spanning 6+ organizations
-* 20+ active organizations as key stakeholders and sponsors
-* Contributions from top institutions including UC Berkeley, Neural Magic, AnyScale, Roblox, IBM, AMD, Intel, and NVIDIA, as well as individual developers worldwide.
-* A thriving ecosystem bridging model creators, hardware vendors, and optimization developers
+* 15+ full-time contributors spanning 6+ organizations
+* 20+ active organizations as key stakeholders and sponsors
+* Contributions from top institutions including UC Berkeley, Neural Magic, Anyscale, Roblox, IBM, AMD, Intel, and NVIDIA, as well as individual developers worldwide.
+* A thriving ecosystem bridging model creators, hardware vendors, and optimization developers
 * Well-attended bi-weekly office hours facilitating transparency, community growth, and strategic partnerships

 These numbers reflect more than just growth \- they demonstrate vLLM's increasing role as critical infrastructure in the AI ecosystem, supporting everything from research prototypes to production systems serving millions of users.
@@ -49,7 +49,7 @@ Usage by Model Architecture in Serving
 </figcaption>
 </figure>

-At the beginning of 2024, vLLM supported only a handful of models. By year’s end, the project had evolved to support performant inference for almost [**100 model architectures**](https://docs.vllm.ai/en/latest/models/supported_models.html): spanning nearly every prominent open-source large language model (LLM), including multimodal (image, audio, video), encoder-decoder, speculative decoding, classification, embedding, and reward models. Notably, vLLM introduced support for state-space language models, marking a significant technical milestone.
+At the beginning of 2024, vLLM supported only a handful of models. By year’s end, the project had evolved to support performant inference for almost [**100 model architectures**](https://docs.vllm.ai/en/latest/models/supported_models.html): spanning nearly every prominent open-source large language model (LLM), multimodal (image, audio, video), encoder-decoder, speculative decoding, classification, embedding, and reward models. Notably, vLLM introduced production support for state-space language models, exploring the future of non-transformer language models.

 ### Broadening Hardware Compatibility

@@ -62,14 +62,14 @@ GPU Hours Breakdown by Hardware Vendor

 From the initial hardware target of NVIDIA A100 GPUs, vLLM has expanded to support:

-* **NVIDIA GPUs:** First-class optimizations for H100, with support for every NVIDIA GPU from V100 and newer.
-* **AMD GPUs:** Support for MI200, MI300, and Radeon RX 7900 series \- with rapidly growing adoption for MI300X.
-* **Google TPUs:** Support for TPU v4, v5p, v5e, and the latest v6e.
-* **AWS Inferentia and Trainium:** Supports for trn1/inf2 instances.
-* **Intel Gaudi (HPU) and GPU (XPU):** Leveraging Intel GPU and Gaudi architectures for AI workloads.
+* **NVIDIA GPUs:** First-class optimizations for H100, with support for every NVIDIA GPU from V100 and newer.
+* **AMD GPUs:** Support for MI200, MI300, and Radeon RX 7900 series \- with rapidly growing adoption for MI300X.
+* **Google TPUs:** Support for TPU v4, v5p, v5e, and the latest v6e.
+* **AWS Inferentia and Trainium:** Supports for trn1/inf2 instances.
+* **Intel Gaudi (HPU) and GPU (XPU):** Leveraging Intel GPU and Gaudi architectures for AI workloads.
 * **CPUs:** Featuring support for a growing list of ISAs \- x86, ARM, and PowerPC.

-vLLM's hardware compatibility has broadened to address diverse user requirements while incorporating performance improvements.
+vLLM's hardware compatibility has broadened to address diverse user requirements while incorporating performance improvements. Importantly, vLLM is on the path to ensure that all models work on all hardware platforms, with all the optimizations enabled.

 ### Delivering Key Features

@@ -82,22 +82,23 @@ Increasing Percentage of vLLM Deployments with Quantization

 vLLM’s 2024 development roadmap emphasized performance, scalability, and usability:

-* **Weight and Activation Quantization:** Prioritized support for a wide variety of quantization methods and kernels, enabling efficient inference on various hardware platforms. Some integrated methods of note are: activation quantization for FP8+INT8, Marlin+Machete kernels for GPTQ/AWQ/wNa16, FP8 KV Cache, AQLM, QQQ, HQQ, bitsandbytes, and GGUF. More than 20% of vLLM deployments now use quantization.
-* **Automatic Prefix Caching:** Reducing costs and improving latency for context-heavy applications.
-* **Speculative Decoding:** Accelerating token generation by predicting multiple tokens simultaneously for the model to validate. We added support for draft models, matching n-grams in the prompt, and MLP speculators like Medusa or EAGLE.
-* **Structured Outputs:** Providing high-performance, structured output capabilities for applications requiring specific formats like JSON or pydantic schemas.
-* **Tool Calling:** Models with supported chat templates can generate its own tool calls when it deems appropriate, enabling data processing and agentic flows.
+* **Weight and Activation Quantization:** Prioritized support for a wide variety of quantization methods and kernels, enabling efficient inference on various hardware platforms. Some integrated methods of note are: activation quantization for FP8+INT8, Marlin+Machete kernels for GPTQ/AWQ/wNa16, FP8 KV Cache, AQLM, QQQ, HQQ, bitsandbytes, and GGUF. More than 20% of vLLM deployments now use quantization.
+* **Automatic Prefix Caching:** Reducing costs and improving latency for context-heavy applications.
+* **Chunked Prefill:** Improving stability of inter-token latency for interactive applications.
+* **Speculative Decoding:** Accelerating token generation by predicting multiple tokens simultaneously for the model to validate. We added support for draft models, matching n-grams in the prompt, and MLP speculators like Medusa or EAGLE.
+* **Structured Outputs:** Providing high-performance, structured output capabilities for applications requiring specific formats like JSON or pydantic schemas.
+* **Tool Calling:** Models with supported chat templates can generate its own tool calls when it deems appropriate, enabling data processing and agentic flows.
 * **Distributed Inference:** Introducing pipeline parallelism and disaggregated prefill to scale workloads across GPUs and nodes effectively.

 ---

-## 2025 Vision: The Next Frontier in AI Inference
+## Our 2025 Vision

-In 2025, we anticipate a significant push in the boundaries of AI model scaling, with AGI models being trained on clusters of 100,000+ GPUs. However, we're seeing an exciting counter-trend: open-source models are rapidly catching up to proprietary ones, and through distillation, these massive models are becoming smaller, more intelligent, and more practical for production deployment.
+In 2025, we anticipate a significant push in the boundaries of scaling for both pretraining and inference-time scaling. We believe that open-source models are rapidly catching up to proprietary ones, and through distillation, these massive models are becoming smaller, more intelligent, and more practical for production deployment.

-### Emerging Model Capabilities: GPT-4 Class Models on Consumer Hardware
+### Emerging Model Capabilities: GPT-4o Class Models served on single node

-Our vision is ambitious yet concrete: enabling GPT-4 level performance on a single GPU, GPT-4o on a single node, and GPT-5 scale capabilities on a modest cluster. To achieve this, we're focusing on three key optimization frontiers:
+Our vision is ambitious yet concrete: enabling GPT-4o level performance on a single GPU, GPT-4o on a single node, and next generation scale capabilities on a modest cluster. To achieve this, we're focusing on three key optimization frontiers:

 * KV cache and attention optimization with sliding windows, cross-layer attention, and native quantization

@@ -107,7 +108,7 @@ Our vision is ambitious yet concrete: enabling GPT-4 level performance on a sing

 Beyond raw performance, we're tailoring vLLM for specialized vertical applications. Each use case demands specific optimizations: reasoning applications need custom tokens and flexible reasoning steps, coding requires fill-in-the-middle capabilities and prompt lookup decoding, agent frameworks benefit from tree-based caching, and creative applications need diverse sampling strategies including beam search variants and contrastive decode.

-We're also expanding vLLM's role in the model training process. Recent adoption by prominent researchers like John Schulman signals our growing importance in post-training workflows. We'll provide tight integration with data curation and RLHF processes, making vLLM an essential tool across the full AI development lifecycle.
+We're also expanding vLLM's role in the model training process. Recent adoption by prominent researchers like John Schulman signals our growing importance in post-training workflows. We'll provide tight integration with data curation and post-training processes, making vLLM an essential tool across the full AI development lifecycle.

 ### Practical Scale: Powering Thousands of Production Clusters

@@ -125,13 +126,13 @@ Our commitment to openness extends beyond just code. We're introducing:

 * Pluggable architectures for seamless integration of new models, hardware backends, and custom extensions

-* First-class torch.compile support, enabling custom operation fusion passes and rapid experimentation
+* First-class `torch.compile` support, enabling custom operation fusion passes and rapid experimentation

 * A flexible component system that supports private extensions while maintaining core stability

 We're doubling down on community development, coordinating engineering efforts across organizations while celebrating ecosystem projects. This includes growing our core team through a clear recruitment process and organizational structure. The goal isn't just to make vLLM the best choice technically – it's to ensure that everyone who invests in vLLM finds themselves better off for having done so.

-Our architecture is more than just a technical choice; it's a commitment to creating "stickiness" through extensibility and modification rather than lock-in. By making vLLM both powerful and customizable, we ensure its place at the heart of the AI inference ecosystem.
+Our architecture is more than just a technical choice; it's a commitment to creating a connected ecosystem through extensibility and modification rather than lock-in. By making vLLM both powerful and customizable, we ensure its place at the heart of the AI inference ecosystem.

 ---

@@ -163,8 +164,8 @@ vLLM’s 2024 journey reflects the transformative potential of open-source colla

 As vLLM enters 2025, we continue to encourage the community to participate in its growth. Opportunities include:

-* **Contributing Code:** Help refine vLLM’s core functionality or extend its capabilities. There are many RFCs and features that could use more hands.
-* **Providing Feedback:** Share insights on features and use cases to shape vLLM’s roadmap. Find us on Github, Slack, Discord, or at events.
+* **Contributing Code:** Help refine vLLM’s core functionality or extend its capabilities. There are many RFCs and features that could use more hands.
+* **Providing Feedback:** Share insights on features and use cases to shape vLLM’s roadmap. Find us on Github, Slack, Discord, or at events.
 * **Building with vLLM:** Adopt the platform in your projects, develop your personal knowledge, and share your experience.

 Join the [vLLM Developer Slack](https://slack.vllm.ai/) to get mentored by project leaders, offering an exciting opportunity to work at the forefront of AI inference innovation.
