
Commit fcfd1eb

[Doc] Remove vLLM prefix and add citation for PagedAttention (#21910)
Signed-off-by: DarkLight1337 <[email protected]>
Parent: d979dd6

File tree: 10 files changed (+22, −11 lines)
7 files renamed without changes (the paged-attention diagram images referenced in the diff below).

docs/design/paged_attention.md

Lines changed: 20 additions & 9 deletions
@@ -1,7 +1,7 @@
-# vLLM Paged Attention
+# Paged Attention
 
 !!! warning
-    This document is being kept in the vLLM documentation for historical purposes.
+    This is a historical document based on the [original paper for vLLM](https://arxiv.org/abs/2309.06180).
     It no longer describes the code used in vLLM today.
 
 Currently, vLLM utilizes its own implementation of a multi-head query
@@ -140,7 +140,7 @@ const scalar_t* q_ptr = q + seq_idx * q_stride + head_idx * HEAD_SIZE;
 ```
 
 <figure markdown="span">
-![](../../assets/kernel/query.png){ align="center" alt="query" width="70%" }
+![](../assets/design/paged_attention/query.png){ align="center" alt="query" width="70%" }
 </figure>
 
 Each thread defines its own `q_ptr` which points to the assigned
@@ -149,7 +149,7 @@ and `HEAD_SIZE` is 128, the `q_ptr` points to data that contains
 total of 128 elements divided into 128 / 4 = 32 vecs.
 
 <figure markdown="span">
-![](../../assets/kernel/q_vecs.png){ align="center" alt="q_vecs" width="70%" }
+![](../assets/design/paged_attention/q_vecs.png){ align="center" alt="q_vecs" width="70%" }
 </figure>
 
 ```cpp
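The context above describes each thread's `q_ptr` being carved into 16-byte vecs that the threads of one group read in an interleaved pattern. As a rough host-side illustration of that indexing, not the kernel itself, using the example values from the text (`HEAD_SIZE` 128, `THREAD_GROUP_SIZE` 2, `VEC_SIZE` 4) and a simplified loop structure:

```cpp
// Sketch only: mimics the q_vecs indexing described in the document.
// Constants follow the text's example values; names are illustrative.
#include <cstdio>

constexpr int HEAD_SIZE = 128;        // elements per attention head
constexpr int THREAD_GROUP_SIZE = 2;  // threads cooperating on one query token
constexpr int VEC_SIZE = 4;           // fp16 elements per 16-byte fetch
constexpr int NUM_VECS_PER_THREAD = HEAD_SIZE / (THREAD_GROUP_SIZE * VEC_SIZE);

int main() {
  // Each thread reads every THREAD_GROUP_SIZE-th vec, so the two threads of a
  // group together cover all 128 / 4 = 32 vecs of one query token.
  for (int thread_group_offset = 0; thread_group_offset < THREAD_GROUP_SIZE;
       ++thread_group_offset) {
    std::printf("thread %d reads vecs:", thread_group_offset);
    for (int i = 0; i < NUM_VECS_PER_THREAD; ++i) {
      const int vec_idx = thread_group_offset + i * THREAD_GROUP_SIZE;
      std::printf(" %d", vec_idx);  // element offset = vec_idx * VEC_SIZE
    }
    std::printf("\n");
  }
  return 0;
}
```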
@@ -188,7 +188,7 @@ points to key token data based on `k_cache` at assigned block,
 assigned head and assigned token.
 
 <figure markdown="span">
-![](../../assets/kernel/key.png){ align="center" alt="key" width="70%" }
+![](../assets/design/paged_attention/key.png){ align="center" alt="key" width="70%" }
 </figure>
 
 The diagram above illustrates the memory layout for key data. It
@@ -203,7 +203,7 @@ elements for one token) that will be processed by 2 threads (one
 thread group) separately.
 
 <figure markdown="span">
-![](../../assets/kernel/k_vecs.png){ align="center" alt="k_vecs" width="70%" }
+![](../assets/design/paged_attention/k_vecs.png){ align="center" alt="k_vecs" width="70%" }
 </figure>
 
 ```cpp
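The context here describes key data living in a paged cache addressed by physical block, KV head, 16-byte chunk, and token slot within the block. A minimal sketch of that flat-offset arithmetic, assuming a `[num_blocks, num_kv_heads, HEAD_SIZE / X, BLOCK_SIZE, X]` layout; `NUM_KV_HEADS`, `BLOCK_SIZE`, and the helper `key_offset` are illustrative values and names, not vLLM's:

```cpp
// Sketch only: flat offset of one 16-byte key chunk in a paged KV cache.
#include <cstddef>
#include <cstdint>
#include <cstdio>

constexpr int HEAD_SIZE = 128;                    // as in the text
constexpr int BLOCK_SIZE = 16;                    // tokens per block (example)
constexpr int NUM_KV_HEADS = 8;                   // illustrative value
constexpr int X = 16 / sizeof(std::uint16_t);     // fp16 elements per 16 bytes

std::size_t key_offset(std::size_t block_number, std::size_t kv_head_idx,
                       std::size_t chunk_idx, std::size_t block_offset) {
  const std::size_t kv_head_stride = (HEAD_SIZE / X) * BLOCK_SIZE * X;
  const std::size_t kv_block_stride = NUM_KV_HEADS * kv_head_stride;
  return block_number * kv_block_stride + kv_head_idx * kv_head_stride +
         chunk_idx * BLOCK_SIZE * X + block_offset * X;
}

int main() {
  // Key chunk 1 of the token at slot 3 in physical block 5, KV head 2.
  std::printf("flat offset = %zu\n", key_offset(5, 2, 1, 3));
  return 0;
}
```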
@@ -362,15 +362,15 @@ later steps. Now, it should store the normalized softmax result of
 ## Value
 
 <figure markdown="span">
-![](../../assets/kernel/value.png){ align="center" alt="value" width="70%" }
+![](../assets/design/paged_attention/value.png){ align="center" alt="value" width="70%" }
 </figure>
 
 <figure markdown="span">
-![](../../assets/kernel/logits_vec.png){ align="center" alt="logits_vec" width="50%" }
+![](../assets/design/paged_attention/logits_vec.png){ align="center" alt="logits_vec" width="50%" }
 </figure>
 
 <figure markdown="span">
-![](../../assets/kernel/v_vec.png){ align="center" alt="v_vec" width="70%" }
+![](../assets/design/paged_attention/v_vec.png){ align="center" alt="v_vec" width="70%" }
 </figure>
 
 Now we need to retrieve the value data and perform dot multiplication
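This passage, together with the `NUM_ROWS_PER_THREAD` loop visible in the final hunk below, describes multiplying the normalized softmax logits for a run of tokens against the value entries for each head position a thread owns, accumulating one dot product per row. A minimal sketch of that pattern; the sizes and sample data are made up for illustration:

```cpp
// Sketch only: dot-multiply logits_vec with each row's v_vec and accumulate.
#include <cstdio>

constexpr int V_VEC_SIZE = 8;           // tokens handled per inner step (example)
constexpr int NUM_ROWS_PER_THREAD = 4;  // head positions owned by one thread

int main() {
  // Normalized softmax logits for 8 tokens (placeholder values summing to 1).
  float logits_vec[V_VEC_SIZE] = {0.1f, 0.2f, 0.05f, 0.15f,
                                  0.1f, 0.1f, 0.2f,  0.1f};
  float v_vecs[NUM_ROWS_PER_THREAD][V_VEC_SIZE];  // value data, placeholder
  float accs[NUM_ROWS_PER_THREAD] = {};           // one accumulator per row

  // Fill the value vectors with a recognizable pattern.
  for (int i = 0; i < NUM_ROWS_PER_THREAD; ++i)
    for (int j = 0; j < V_VEC_SIZE; ++j)
      v_vecs[i][j] = float(i + 1);

  for (int i = 0; i < NUM_ROWS_PER_THREAD; ++i) {
    float acc = 0.f;
    for (int j = 0; j < V_VEC_SIZE; ++j)
      acc += logits_vec[j] * v_vecs[i][j];  // dot(logits_vec, v_vec)
    accs[i] += acc;  // in the real kernel this accumulates across blocks
    std::printf("accs[%d] = %f\n", i, accs[i]);
  }
  return 0;
}
```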
@@ -499,3 +499,14 @@ for (int i = 0; i < NUM_ROWS_PER_THREAD; i++) {
 Finally, we need to iterate over different assigned head positions
 and write out the corresponding accumulated result based on the
 `out_ptr`.
+
+## Citation
+
+```bibtex
+@inproceedings{kwon2023efficient,
+    title={Efficient Memory Management for Large Language Model Serving with PagedAttention},
+    author={Woosuk Kwon and Zhuohan Li and Siyuan Zhuang and Ying Sheng and Lianmin Zheng and Cody Hao Yu and Joseph E. Gonzalez and Hao Zhang and Ion Stoica},
+    booktitle={Proceedings of the ACM SIGOPS 29th Symposium on Operating Systems Principles},
+    year={2023}
+}
+```

docs/design/plugin_system.md

Lines changed: 1 addition & 1 deletion
@@ -1,4 +1,4 @@
-# vLLM's Plugin System
+# Plugin System
 
 The community frequently requests the ability to extend vLLM with custom features. To facilitate this, vLLM includes a plugin system that allows users to add custom features without modifying the vLLM codebase. This document explains how plugins work in vLLM and how to create a plugin for vLLM.
 

docs/design/torch_compile.md

Lines changed: 1 addition & 1 deletion
@@ -1,4 +1,4 @@
-# vLLM's `torch.compile` integration
+# `torch.compile` integration
 
 In vLLM's V1 architecture, `torch.compile` is enabled by default and is a critical part of the framework. This document gives a simple walk-through example to show how to understand the `torch.compile` usage.
 
