
Commit 18194ad

Merge branch 'develop' into feature_rerope_new2
2 parents f71d5ea + 374568d commit 18194ad

File tree

29 files changed (+823, -259 lines)


.github/CODEOWNERS

Lines changed: 12 additions & 12 deletions
@@ -2,37 +2,37 @@
 # for more info about CODEOWNERS file
 
 * @mag1c-h @ygwpz @FangRun2 @Tarrei
-/.github @Wwwzff @hek14 @ygwpz @mag1c-h @FangRun2 @Tarrei
+/.github @Wwwzff @Infinite666 @ygwpz @mag1c-h @FangRun2 @Tarrei
 
-/ucm/sparse @wuhuxiao @wangwenxin0312 @hek14 @ygwpz @mag1c-h
-/ucm/sparse/cache_blend @wuhuxiao @hek14 @ygwpz @mag1c-h
-/ucm/sparse/esa @wangwenxin0312 @hek14 @ygwpz @mag1c-h
-/ucm/sparse/gsa @Zbm1996 @zbb200819 @yxkyong @HaoLi980405 @wuhuxiao @hek14 @ygwpz @mag1c-h
-/ucm/sparse/kvcomp @leideng @pengwwang @wuhuxiao @hek14 @ygwpz @mag1c-h
-/ucm/sparse/kvstar @saki-daisuki @summer-ai007 @xwLearnsLLM @wuhuxiao @hek14 @ygwpz @mag1c-h
+/ucm/sparse @wuhuxiao @wangwenxin0312 @Infinite666 @ygwpz @mag1c-h
+/ucm/sparse/cache_blend @wuhuxiao @Infinite666 @ygwpz @mag1c-h
+/ucm/sparse/esa @wangwenxin0312 @Infinite666 @ygwpz @mag1c-h
+/ucm/sparse/gsa @Zbm1996 @zbb200819 @yxkyong @HaoLi980405 @wuhuxiao @Infinite666 @ygwpz @mag1c-h
+/ucm/sparse/kvcomp @leideng @pengwwang @wuhuxiao @Infinite666 @ygwpz @mag1c-h
+/ucm/sparse/kvstar @saki-daisuki @summer-ai007 @xwLearnsLLM @wuhuxiao @Infinite666 @ygwpz @mag1c-h
 
 /ucm/store @mag1c-h @ygwpz
 /ucm/store/dramstore @harrisonyhq @mag1c-h @ygwpz
 /ucm/store/localstore @mag1c-h @ygwpz
 /ucm/store/mooncakestore @chinesezyc @mag1c-h @ygwpz
 /ucm/store/nfsstore @mag1c-h @ygwpz
 
-/ucm/integration @qyh111 @harrisonyhq @ygwpz @mag1c-h @hek14
+/ucm/integration @qyh111 @harrisonyhq @ygwpz @mag1c-h @Infinite666
 
 /ucm/pd @flesher0813 @ygwpz @mag1c-h
 
-/ucm/sandbox @Wwwzff @hek14 @ygwpz @mag1c-h @FangRun2 @Tarrei
+/ucm/sandbox @Wwwzff @Infinite666 @ygwpz @mag1c-h @FangRun2 @Tarrei
 
 /benchmarks @flesher0813 @ygwpz @mag1c-h
 
 /docker @harrisonyhq @ygwpz @mag1c-h
 
-/docs @flesher0813 @ygwpz @mag1c-h @FangRun2 @Tarrei @hek14
-/docs/source/user-guide/sparse-attention/esa.md @wangwenxin0312 @hek14 @flesher0813 @ygwpz @mag1c-h @FangRun2 @Tarrei
+/docs @flesher0813 @ygwpz @mag1c-h @FangRun2 @Tarrei @Infinite666
+/docs/source/user-guide/sparse-attention/esa.md @wangwenxin0312 @Infinite666 @flesher0813 @ygwpz @mag1c-h @FangRun2 @Tarrei
 /docs/source/user-guide/sparse-attention/gsa.md @Zbm1996 @zbb200819 @yxkyong @HaoLi980405 @flesher0813 @ygwpz @mag1c-h @FangRun2 @Tarrei
 /docs/source/user-guide/sparse-attention/kvcomp.md @leideng @pengwwang @flesher0813 @ygwpz @mag1c-h @FangRun2 @Tarrei
 /docs/source/user-guide/sparse-attention/kvstar.md @saki-daisuki @summer-ai007 @flesher0813 @ygwpz @mag1c-h @FangRun2 @Tarrei
 
-/examples @harrisonyhq @ygwpz @mag1c-h @hek14
+/examples @harrisonyhq @ygwpz @mag1c-h @Infinite666
 
 /test @Wwwzff @ygwpz @mag1c-h
File renamed without changes.

docs/source/index.md

Lines changed: 1 addition & 0 deletions
@@ -57,6 +57,7 @@ user-guide/prefix-cache/index
 user-guide/sparse-attention/index
 user-guide/pd-disaggregation/index
 user-guide/metrics/metrics
+user-guide/rerope/rerope
 :::
 
 :::{toctree}

docs/source/user-guide/triton-rerope/rerope.md renamed to docs/source/user-guide/rerope/rerope.md

Lines changed: 24 additions & 9 deletions
@@ -1,26 +1,34 @@
-# Rectified Rotary Position Embeddings (ReRoPE)
+# Rectified Rotary Position Embeddings
 
-Using ReRoPE, we can more effectively extend the context length of LLM without the need for fine-tuning. This is about the Triton implementation of ReRoPE and its integration into the vLLM inference framework.
+Using Rectified Rotary Position Embeddings (ReRoPE), we can more effectively extend the context length of LLMs without the need for fine-tuning. This page covers the Triton implementation of ReRoPE and its integration into the vLLM inference framework.
+
+<div align="center">
 
 **🚀 ReRoPE | 📄 blog [https://kexue.fm/archives/9708] [https://normxu.github.io/Rethinking-Rotary-Position-Embedding-3]**
 
+
 [![License](https://img.shields.io/badge/License-MIT-green.svg)](https://github.com/ModelEngine-Group/unified-cache-management/blob/main/LICENSE)
 [![Python](https://img.shields.io/badge/Python-3.10+-blue.svg)](https://python.org)
 
+</div>
 
 ## 🌟 What is ReRoPE?
 
+<div align="center">
+
 <img src="https://raw.githubusercontent.com/bojone/rerope/main/idea.png" width=750>
 
+</div>
+
 This approach combines direct extrapolation with position interpolation. A window size $w$ is established, where a position interval of $1$ is used within the window, and an interval of $\frac{1}{k}$ is applied outside. As $k \to \infty$, this simplifies to the form illustrated above. Under this scheme, the position encoding range never exceeds $w$ regardless of input length, potentially enabling support for arbitrarily long contexts.
 
 The attention score calculation formulas are as follows,
 
 $$
-\begin{align}
+\begin{aligned}
 score_{ij}^{1} &= (q_iR_i)(k_jR_j)^T, && i-j<w \\
 score_{ij}^{2} &= (q_iR_w)(k_j)^T, && i-j\ge w
-\end{align}
+\end{aligned}
 $$
 
 ReRoPE extends context length effectively but requires double attention—local within w and global compressed—significantly reducing throughput. Despite this overhead, it remains valuable for training-free long contexts, especially when combined with local attention windows to balance efficiency.
@@ -37,7 +45,14 @@ ReRoPE extends context length effectively but requires double attention—local
 
 ## 🏆 Results
 
-![alt text](results.png)
+<div align="center">
+
+### The Experiment Results
+![ReRoPE Results](../../_static/images/rerope_performace.png)
+
+The experiment is based on a hybrid Transformer-GAU (Gated Attention Unit) model with 100M parameters. $\log n$ indicates we add the scale factor $\log n$ at the pretraining stage; $\log n^{*}$ denotes we apply the scale factor to the attention matrix only for text exceeding the max sequence length, without any pretraining; $w256$ denotes the ReRoPE window $w=256$.
+
+</div>
 
 ## 🚀 Quick Start
 
@@ -46,12 +61,12 @@ ReRoPE extends context length effectively but requires double attention—local
 For installation instructions, please refer to the UCM's top-level README. Once UCM is installed, ReRoPE is naturally supported by running the following example python scripts.
 
 ```python
-export VLLM_ATTENTION_BACKEND = TRITON_ATTN_VLLM_V1
-export VLLM_USE_REROPE = true
+export VLLM_ATTENTION_BACKEND=TRITON_ATTN_VLLM_V1
+export VLLM_USE_REROPE=true
 export DATA_DIR=/home/data/kv_cache
 export MODEL_PATH=/home/models/Qwen2.5-14B-Instruct
-export REROPE_WINDOW = 32768
-export TRAINING_LENGTH = 32768
+export REROPE_WINDOW=32768
+export TRAINING_LENGTH=32768
 
 python examples/offline_inference_rerope.py
 ```
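The two-branch score definition added in this document maps directly onto a masked combination of two attention matrices. Below is a minimal PyTorch sketch of that combination, assuming single-head tensors and an interleaved-pair RoPE layout; the function names and shapes are illustrative, and this is not the Triton kernel shipped with UCM/vLLM.

```python
# Illustrative sketch only: dense two-branch ReRoPE scores for a single head.
# Assumes interleaved-pair RoPE; not the fused Triton kernel used by UCM/vLLM.
import torch


def rope_rotate(x: torch.Tensor, pos: torch.Tensor, base: float = 10000.0) -> torch.Tensor:
    """Rotate each row of x (shape [n, d]) by its position given in pos (shape [n])."""
    d = x.shape[-1]
    inv_freq = 1.0 / (base ** (torch.arange(0, d, 2, dtype=torch.float32) / d))
    ang = pos[:, None].to(torch.float32) * inv_freq[None, :]  # [n, d/2]
    cos, sin = ang.cos(), ang.sin()
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = torch.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out


def rerope_scores(q: torch.Tensor, k: torch.Tensor, w: int) -> torch.Tensor:
    """score1 for i-j < w: rotate q and k by their own positions (plain RoPE);
    score2 for i-j >= w: rotate q by the fixed position w and leave k unrotated."""
    n = q.shape[0]
    pos = torch.arange(n)
    score1 = rope_rotate(q, pos) @ rope_rotate(k, pos).T
    score2 = rope_rotate(q, torch.full((n,), w)) @ k.T
    i, j = torch.meshgrid(pos, pos, indexing="ij")
    return torch.where((i - j) < w, score1, score2)


if __name__ == "__main__":
    q, k = torch.randn(16, 64), torch.randn(16, 64)
    print(rerope_scores(q, k, w=4).shape)  # torch.Size([16, 16])
```

Materializing both score matrices makes the "double attention" cost mentioned in the doc explicit; the actual Triton implementation fuses the two branches rather than computing them densely.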

docs/source/user-guide/sparse-attention/cacheblend.md

Lines changed: 7 additions & 16 deletions
@@ -3,7 +3,7 @@
 
 ![blend_scheme.jpg](../../_static/images/blend_scheme.jpg)
 
-**🚀 Knowledge Cached Fusion Algorithm | 📄 EuroSys 2025 Paper **
+**🚀 Knowledge Cached Fusion Algorithm | 📄 EuroSys 2025 Paper**
 
 [![License](https://img.shields.io/badge/License-MIT-green.svg)](https://github.com/ModelEngine-Group/unified-cache-management/blob/main/LICENSE)
 [![Python](https://img.shields.io/badge/Python-3.10+-blue.svg)](https://python.org)
@@ -28,11 +28,11 @@ CacheBlend reduces TTFT by 2.2 ~ 3.3× and increases throughput by 2.8 ~ 5× und
 ## 🧠 Ucm Implementation
 
 ### Native Block-Wise Chunk KV Cache Dump, Load, PostProcess and Recompute
-1. **🔐 Chunk Hash Encoding**: Similar as prefix hash encoder, hash all blocks in each chunk from the same hash meta beginning.
-2. **⚡ Combine Prefix Cache and Chunk Cache**: Since chunk cache and native prefix cache share the same hash space, ucm first performs prefix cache lookup to fetch fully reused cache and then conduct chunk cache lookup to fetch the candidate cache for blending.
-3. **🎯 Delta-Rope PostProcess**: Rectify loaded chunk cache according to their position in the new request.
-3. **🔍 Integrate Cache Blend and First Token Generation**: Construct compute mask and attention meta according to HKVD tokens, cache miss tokens and suffix tokens, then compute their kv cache in a single model forward stage.
-4. **🚀 Comprehensive Hook for LLM Forward Pipeline**: Based on ucm sparse module, blend module sparse the prefill tokens not only in attention stage but also in ffn, layer stage.
+1. **🔐 Chunk Hash Encoding**: As with prefix hashing, the blend connector encodes the blocks of each chunk starting from the same hash-meta beginning.
+2. **⚡ Combine Prefix Cache and Chunk Cache**: Since the chunk cache and the native prefix cache share the same hash space, they can be stored and shared in a single store. When looking up chunk cache, the blend connector first performs a prefix-cache lookup for the fully reused part, then a chunk-cache lookup to fetch the candidate cache for blending.
+3. **🎯 Delta-Rope PostProcess**: Rectify the loaded chunk cache according to its position in the new request.
+4. **🔍 Integrate Cache Blend and First Token Generation**: Construct the compute mask of the HKVD tokens, cache-miss tokens and suffix tokens, then modify the attention metadata to support the combination of chunk-cache blending, missing-cache recomputation and first-token generation.
+5. **🚀 Comprehensive Hook for LLM Forward Pipeline**: Based on and extended from ucm sparse, the blend sparse module reduces the input tokens for every computation kernel, not just the attention kernel.
 
 ## 🚀 Quick Start
 
@@ -97,13 +97,4 @@ Llama-based models and Qwen-based models now are available
 pages={94--109},
 year={2025}
 }
-```
-
-
----
-
-<div align="center">
-
-**🌟 Star [UCM](https://github.com/ModelEngine-Group/unified-cache-management) repository if you find KvComp useful!**
-
-</div>
+```
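To make step 2 of the CacheBlend implementation notes above concrete (prefix cache and chunk cache sharing one hash space, with a prefix lookup followed by a chunk lookup), here is a small sketch. The chained hashing and the in-memory store below are assumptions for illustration, not the actual UCM blend connector API.

```python
# Illustrative sketch of the two-stage lookup: the hashing scheme and the
# in-memory "store" stand in for UCM's real connector and storage backend.
import hashlib


def chain_hashes(blocks: list[tuple[int, ...]], meta: str = "") -> list[str]:
    """Hash blocks from the same hash-meta beginning, each hash chained on the previous one."""
    hashes, prev = [], meta
    for block in blocks:
        prev = hashlib.sha256((prev + "|" + ",".join(map(str, block))).encode()).hexdigest()
        hashes.append(prev)
    return hashes


def blend_lookup(store: set[str], chunks: list[list[tuple[int, ...]]]) -> tuple[int, list[str]]:
    """chunks: the request split into chunks (e.g. documents).
    Returns (number of fully reused prefix blocks, chunk-cache candidates for blending)."""
    # Stage 1: prefix-cache lookup over the whole request, hashed as one chain.
    all_blocks = [b for chunk in chunks for b in chunk]
    n_prefix = 0
    for h in chain_hashes(all_blocks):
        if h not in store:
            break
        n_prefix += 1
    # Stage 2: each chunk is re-hashed from the same hash-meta beginning, so a chunk
    # dumped by an earlier request matches even when its position in this prompt differs.
    candidates = [h for chunk in chunks for h in chain_hashes(chunk) if h in store]
    return n_prefix, candidates


if __name__ == "__main__":
    doc = [(1, 2, 3, 4), (5, 6, 7, 8)]   # a previously cached document chunk
    store = set(chain_hashes(doc))       # its blocks were dumped under a fresh hash chain
    request = [[(9, 9, 9, 9)], doc]      # new prompt: a new question chunk + the reused doc
    print(blend_lookup(store, request))  # -> (0, [the two matching hashes of the doc chunk])
```

The fully reused prefix is loaded as-is, while the chunk-cache candidates go through the Delta-RoPE postprocess and partial recompute described in steps 3 and 4.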

docs/source/user-guide/sparse-attention/index.md

Lines changed: 1 addition & 0 deletions
@@ -41,4 +41,5 @@ esa
 gsa
 kvcomp
 kvstar
+cacheblend
 :::

examples/offline_inference_blend.py

Lines changed: 1 addition & 1 deletion
@@ -186,7 +186,7 @@ def main():
     # choose one data row in LongBenchV1 (wikimqa)
     assert os.path.isfile(
         path_to_dataset
-    ), f"Incorrect dataset path. Please specify the dataset path by `export DATASET_PATH=/path/to/longbench/multifieldqa_zh.jsonl`"
+    ), f"Incorrect dataset path. Please specify the dataset path by `export DATASET_PATH=/home/data/Longbench/data/2wikimqa.jsonl`"
     with open(path_to_dataset, "r") as f:
         lines = f.readlines()
         dataset_row = json.loads(lines[0])

examples/ucm_config_example.yaml

Lines changed: 1 addition & 1 deletion
@@ -31,7 +31,7 @@ load_only_first_rank: false
 # Or for GSA:
 # GSA: {}
 # Or for KvCompOnDevice:
-# GSA:
+# KvCompOnDevice:
 # "kvcompOnDevice_config_path": "workspace/unified-cache-management/ucm/sparse/kvcomp/configs/kvcomp_qwen3_32B_config.json"

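As a side note on the corrected section name: a config like this is typically consumed by picking whichever sparse-method key is present and forwarding its options. The loader below is a hypothetical sketch, with key names taken only from the commented examples above; it is not UCM's actual configuration code.

```python
# Hypothetical sketch of consuming a ucm_config_example.yaml-style file;
# the loader and the set of method keys are assumptions for illustration.
import yaml  # PyYAML


SPARSE_METHOD_KEYS = {"GSA", "KvCompOnDevice"}  # extend with other supported methods


def select_sparse_method(path: str) -> tuple[str, dict]:
    """Return (method name, its options), e.g.
    ('KvCompOnDevice', {'kvcompOnDevice_config_path': '.../kvcomp_qwen3_32B_config.json'})."""
    with open(path, "r") as f:
        cfg = yaml.safe_load(f) or {}
    found = [key for key in cfg if key in SPARSE_METHOD_KEYS]
    if len(found) != 1:
        raise ValueError(f"expected exactly one sparse-method section, found: {found}")
    return found[0], cfg[found[0]] or {}
```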