Commit 4577c6a

retro add images
1 parent 7588b8b commit 4577c6a

6 files changed (+6, −0 lines)

_posts/2023-06-20-vllm.md

Lines changed: 1 addition & 0 deletions
@@ -3,6 +3,7 @@ layout: post
title: "vLLM: Easy, Fast, and Cheap LLM Serving with PagedAttention"
author: "Woosuk Kwon*, Zhuohan Li*, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Yu, Joey Gonzalez, Hao Zhang, and Ion Stoica (* Equal Contribution)"
extra: "<br><p align=\"center\"><picture><img src=\"/assets/logos/vllm-logo-text-light.png\" width=\"65%\"></picture></p><br>"
+image: /assets/logos/vllm-logo-text-light.png
---
<p align="center" style="margin-top:-15px">
<a href="https://github.com/vllm-project/vllm"><b>GitHub</b></a> | <a href="https://vllm.readthedocs.io/en/latest/"><b>Documentation</b></a> | <a href="https://arxiv.org/pdf/2309.06180.pdf"><b>Paper</b></a>

_posts/2023-11-14-notes-vllm-vs-deepspeed.md

Lines changed: 1 addition & 0 deletions
@@ -2,6 +2,7 @@
layout: post
title: "Notes on vLLM v.s. DeepSpeed-FastGen"
author: "vLLM Team"
+image: /assets/figures/notes-vllm-vs-deepspeed/s2.png
---

---

_posts/2024-07-23-llama31.md

Lines changed: 1 addition & 0 deletions
@@ -2,6 +2,7 @@
layout: post
title: "Announcing Llama 3.1 Support in vLLM"
author: "vLLM Team"
+image: /assets/figures/llama31/perf_llama3.png
---

Today, the vLLM team is excited to partner with Meta to announce the support for the Llama 3.1 model series. Llama 3.1 comes with exciting new features with longer context length (up to 128K tokens), larger model size (up to 405B parameters), and more advanced model capabilities. The vLLM community has added many enhancements to make sure the longer, larger Llamas run smoothly on vLLM, which includes chunked prefill, FP8 quantization, and pipeline parallelism. We will introduce these new enhancements in this blogpost.

_posts/2024-07-25-lfai-perf.md

Lines changed: 1 addition & 0 deletions
@@ -2,6 +2,7 @@
layout: post
title: "vLLM’s Open Governance and Performance Roadmap"
author: "vLLM Team"
+image: /assets/figures/lfai/vllm-lfai-light.png
---

_posts/2024-09-05-perf-update.md

Lines changed: 1 addition & 0 deletions
@@ -2,6 +2,7 @@
layout: post
title: "vLLM v0.6.0: 2.7x Throughput Improvement and 5x Latency Reduction"
author: "vLLM Team"
+image: /assets/figures/perf-v060/llama8B_comparison.png
---

**TL;DR:** vLLM achieves 2.7x higher throughput and 5x faster TPOT (time per output token) on Llama 8B model, and 1.8x higher throughput and 2x less TPOT on Llama 70B model.

_posts/2024-10-17-spec-decode.md

Lines changed: 1 addition & 0 deletions
@@ -2,6 +2,7 @@
layout: post
title: "How Speculative Decoding Boosts vLLM Performance by up to 2.8x"
author: "vLLM Team"
+image: /assets/figures/spec-decode/figure9.png
---

Speculative decoding in vLLM is a powerful technique that accelerates token generation by leveraging both small and large models in tandem. In this blog, we’ll break down speculative decoding in vLLM, how it works, and the performance improvements it brings.

0 commit comments

Comments
 (0)