
Commit fcfd1eb

[Doc] Remove vLLM prefix and add citation for PagedAttention (#21910)
Signed-off-by: DarkLight1337 <[email protected]>
Parent: d979dd6

File tree: 10 files changed (+22, −11 lines)
7 files renamed without changes (the paged-attention diagram images referenced in the diff below).

docs/design/paged_attention.md

Lines changed: 20 additions & 9 deletions
@@ -1,7 +1,7 @@
-# vLLM Paged Attention
+# Paged Attention
 
 !!! warning
-    This document is being kept in the vLLM documentation for historical purposes.
+    This is a historical document based on the [original paper for vLLM](https://arxiv.org/abs/2309.06180).
     It no longer describes the code used in vLLM today.
 
 Currently, vLLM utilizes its own implementation of a multi-head query
@@ -140,7 +140,7 @@ const scalar_t* q_ptr = q + seq_idx * q_stride + head_idx * HEAD_SIZE;
 ```
 
 <figure markdown="span">
-![](../../assets/kernel/query.png){ align="center" alt="query" width="70%" }
+![](../assets/design/paged_attention/query.png){ align="center" alt="query" width="70%" }
 </figure>
 
 Each thread defines its own `q_ptr` which points to the assigned
@@ -149,7 +149,7 @@ and `HEAD_SIZE` is 128, the `q_ptr` points to data that contains
 total of 128 elements divided into 128 / 4 = 32 vecs.
 
 <figure markdown="span">
-![](../../assets/kernel/q_vecs.png){ align="center" alt="q_vecs" width="70%" }
+![](../assets/design/paged_attention/q_vecs.png){ align="center" alt="q_vecs" width="70%" }
 </figure>
 
 ```cpp
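The context above describes each thread's `q_ptr` being carved into 16-byte vecs that the threads of one group read in an interleaved pattern. As a rough host-side illustration of that indexing, not the kernel itself, using the example values from the text (`HEAD_SIZE` 128, `THREAD_GROUP_SIZE` 2, `VEC_SIZE` 4) and a simplified loop structure:

```cpp
// Sketch only: mimics the q_vecs indexing described in the document.
// Constants follow the text's example values; names are illustrative.
#include <cstdio>

constexpr int HEAD_SIZE = 128;        // elements per attention head
constexpr int THREAD_GROUP_SIZE = 2;  // threads cooperating on one query token
constexpr int VEC_SIZE = 4;           // fp16 elements per 16-byte fetch
constexpr int NUM_VECS_PER_THREAD = HEAD_SIZE / (THREAD_GROUP_SIZE * VEC_SIZE);

int main() {
  // Each thread reads every THREAD_GROUP_SIZE-th vec, so the two threads of a
  // group together cover all 128 / 4 = 32 vecs of one query token.
  for (int thread_group_offset = 0; thread_group_offset < THREAD_GROUP_SIZE;
       ++thread_group_offset) {
    std::printf("thread %d reads vecs:", thread_group_offset);
    for (int i = 0; i < NUM_VECS_PER_THREAD; ++i) {
      const int vec_idx = thread_group_offset + i * THREAD_GROUP_SIZE;
      std::printf(" %d", vec_idx);  // element offset = vec_idx * VEC_SIZE
    }
    std::printf("\n");
  }
  return 0;
}
```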
@@ -188,7 +188,7 @@ points to key token data based on `k_cache` at assigned block,
 assigned head and assigned token.
 
 <figure markdown="span">
-![](../../assets/kernel/key.png){ align="center" alt="key" width="70%" }
+![](../assets/design/paged_attention/key.png){ align="center" alt="key" width="70%" }
 </figure>
 
 The diagram above illustrates the memory layout for key data. It
@@ -203,7 +203,7 @@ elements for one token) that will be processed by 2 threads (one
 thread group) separately.
 
 <figure markdown="span">
-![](../../assets/kernel/k_vecs.png){ align="center" alt="k_vecs" width="70%" }
+![](../assets/design/paged_attention/k_vecs.png){ align="center" alt="k_vecs" width="70%" }
 </figure>
 
 ```cpp
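The context here describes key data living in a paged cache addressed by physical block, KV head, 16-byte chunk, and token slot within the block. A minimal sketch of that flat-offset arithmetic, assuming a `[num_blocks, num_kv_heads, HEAD_SIZE / X, BLOCK_SIZE, X]` layout; `NUM_KV_HEADS`, `BLOCK_SIZE`, and the helper `key_offset` are illustrative values and names, not vLLM's:

```cpp
// Sketch only: flat offset of one 16-byte key chunk in a paged KV cache.
#include <cstddef>
#include <cstdint>
#include <cstdio>

constexpr int HEAD_SIZE = 128;                    // as in the text
constexpr int BLOCK_SIZE = 16;                    // tokens per block (example)
constexpr int NUM_KV_HEADS = 8;                   // illustrative value
constexpr int X = 16 / sizeof(std::uint16_t);     // fp16 elements per 16 bytes

std::size_t key_offset(std::size_t block_number, std::size_t kv_head_idx,
                       std::size_t chunk_idx, std::size_t block_offset) {
  const std::size_t kv_head_stride = (HEAD_SIZE / X) * BLOCK_SIZE * X;
  const std::size_t kv_block_stride = NUM_KV_HEADS * kv_head_stride;
  return block_number * kv_block_stride + kv_head_idx * kv_head_stride +
         chunk_idx * BLOCK_SIZE * X + block_offset * X;
}

int main() {
  // Key chunk 1 of the token at slot 3 in physical block 5, KV head 2.
  std::printf("flat offset = %zu\n", key_offset(5, 2, 1, 3));
  return 0;
}
```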
@@ -362,15 +362,15 @@ later steps. Now, it should store the normalized softmax result of
 ## Value
 
 <figure markdown="span">
-![](../../assets/kernel/value.png){ align="center" alt="value" width="70%" }
+![](../assets/design/paged_attention/value.png){ align="center" alt="value" width="70%" }
 </figure>
 
 <figure markdown="span">
-![](../../assets/kernel/logits_vec.png){ align="center" alt="logits_vec" width="50%" }
+![](../assets/design/paged_attention/logits_vec.png){ align="center" alt="logits_vec" width="50%" }
 </figure>
 
 <figure markdown="span">
-![](../../assets/kernel/v_vec.png){ align="center" alt="v_vec" width="70%" }
+![](../assets/design/paged_attention/v_vec.png){ align="center" alt="v_vec" width="70%" }
 </figure>
 
 Now we need to retrieve the value data and perform dot multiplication
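This passage, together with the `NUM_ROWS_PER_THREAD` loop visible in the final hunk below, describes multiplying the normalized softmax logits for a run of tokens against the value entries for each head position a thread owns, accumulating one dot product per row. A minimal sketch of that pattern; the sizes and sample data are made up for illustration:

```cpp
// Sketch only: dot-multiply logits_vec with each row's v_vec and accumulate.
#include <cstdio>

constexpr int V_VEC_SIZE = 8;           // tokens handled per inner step (example)
constexpr int NUM_ROWS_PER_THREAD = 4;  // head positions owned by one thread

int main() {
  // Normalized softmax logits for 8 tokens (placeholder values summing to 1).
  float logits_vec[V_VEC_SIZE] = {0.1f, 0.2f, 0.05f, 0.15f,
                                  0.1f, 0.1f, 0.2f,  0.1f};
  float v_vecs[NUM_ROWS_PER_THREAD][V_VEC_SIZE];  // value data, placeholder
  float accs[NUM_ROWS_PER_THREAD] = {};           // one accumulator per row

  // Fill the value vectors with a recognizable pattern.
  for (int i = 0; i < NUM_ROWS_PER_THREAD; ++i)
    for (int j = 0; j < V_VEC_SIZE; ++j)
      v_vecs[i][j] = float(i + 1);

  for (int i = 0; i < NUM_ROWS_PER_THREAD; ++i) {
    float acc = 0.f;
    for (int j = 0; j < V_VEC_SIZE; ++j)
      acc += logits_vec[j] * v_vecs[i][j];  // dot(logits_vec, v_vec)
    accs[i] += acc;  // in the real kernel this accumulates across blocks
    std::printf("accs[%d] = %f\n", i, accs[i]);
  }
  return 0;
}
```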
@@ -499,3 +499,14 @@ for (int i = 0; i < NUM_ROWS_PER_THREAD; i++) {
 Finally, we need to iterate over different assigned head positions
 and write out the corresponding accumulated result based on the
 `out_ptr`.
+
+## Citation
+
+```bibtex
+@inproceedings{kwon2023efficient,
+    title={Efficient Memory Management for Large Language Model Serving with PagedAttention},
+    author={Woosuk Kwon and Zhuohan Li and Siyuan Zhuang and Ying Sheng and Lianmin Zheng and Cody Hao Yu and Joseph E. Gonzalez and Hao Zhang and Ion Stoica},
+    booktitle={Proceedings of the ACM SIGOPS 29th Symposium on Operating Systems Principles},
+    year={2023}
+}
+```

docs/design/plugin_system.md

Lines changed: 1 addition & 1 deletion
@@ -1,4 +1,4 @@
-# vLLM's Plugin System
+# Plugin System
 
 The community frequently requests the ability to extend vLLM with custom features. To facilitate this, vLLM includes a plugin system that allows users to add custom features without modifying the vLLM codebase. This document explains how plugins work in vLLM and how to create a plugin for vLLM.
 

docs/design/torch_compile.md

Lines changed: 1 addition & 1 deletion
@@ -1,4 +1,4 @@
-# vLLM's `torch.compile` integration
+# `torch.compile` integration
 
 In vLLM's V1 architecture, `torch.compile` is enabled by default and is a critical part of the framework. This document gives a simple walk-through example to show how to understand the `torch.compile` usage.
 
