
Commit 18194ad

Merge branch 'develop' into feature_rerope_new2
2 parents f71d5ea + 374568d commit 18194ad

File tree

29 files changed (+823, -259 lines)


.github/CODEOWNERS

Lines changed: 12 additions & 12 deletions
@@ -2,37 +2,37 @@
 # for more info about CODEOWNERS file
 
 * @mag1c-h @ygwpz @FangRun2 @Tarrei
-/.github @Wwwzff @hek14 @ygwpz @mag1c-h @FangRun2 @Tarrei
+/.github @Wwwzff @Infinite666 @ygwpz @mag1c-h @FangRun2 @Tarrei
 
-/ucm/sparse @wuhuxiao @wangwenxin0312 @hek14 @ygwpz @mag1c-h
-/ucm/sparse/cache_blend @wuhuxiao @hek14 @ygwpz @mag1c-h
-/ucm/sparse/esa @wangwenxin0312 @hek14 @ygwpz @mag1c-h
-/ucm/sparse/gsa @Zbm1996 @zbb200819 @yxkyong @HaoLi980405 @wuhuxiao @hek14 @ygwpz @mag1c-h
-/ucm/sparse/kvcomp @leideng @pengwwang @wuhuxiao @hek14 @ygwpz @mag1c-h
-/ucm/sparse/kvstar @saki-daisuki @summer-ai007 @xwLearnsLLM @wuhuxiao @hek14 @ygwpz @mag1c-h
+/ucm/sparse @wuhuxiao @wangwenxin0312 @Infinite666 @ygwpz @mag1c-h
+/ucm/sparse/cache_blend @wuhuxiao @Infinite666 @ygwpz @mag1c-h
+/ucm/sparse/esa @wangwenxin0312 @Infinite666 @ygwpz @mag1c-h
+/ucm/sparse/gsa @Zbm1996 @zbb200819 @yxkyong @HaoLi980405 @wuhuxiao @Infinite666 @ygwpz @mag1c-h
+/ucm/sparse/kvcomp @leideng @pengwwang @wuhuxiao @Infinite666 @ygwpz @mag1c-h
+/ucm/sparse/kvstar @saki-daisuki @summer-ai007 @xwLearnsLLM @wuhuxiao @Infinite666 @ygwpz @mag1c-h
 
 /ucm/store @mag1c-h @ygwpz
 /ucm/store/dramstore @harrisonyhq @mag1c-h @ygwpz
 /ucm/store/localstore @mag1c-h @ygwpz
 /ucm/store/mooncakestore @chinesezyc @mag1c-h @ygwpz
 /ucm/store/nfsstore @mag1c-h @ygwpz
 
-/ucm/integration @qyh111 @harrisonyhq @ygwpz @mag1c-h @hek14
+/ucm/integration @qyh111 @harrisonyhq @ygwpz @mag1c-h @Infinite666
 
 /ucm/pd @flesher0813 @ygwpz @mag1c-h
 
-/ucm/sandbox @Wwwzff @hek14 @ygwpz @mag1c-h @FangRun2 @Tarrei
+/ucm/sandbox @Wwwzff @Infinite666 @ygwpz @mag1c-h @FangRun2 @Tarrei
 
 /benchmarks @flesher0813 @ygwpz @mag1c-h
 
 /docker @harrisonyhq @ygwpz @mag1c-h
 
-/docs @flesher0813 @ygwpz @mag1c-h @FangRun2 @Tarrei @hek14
-/docs/source/user-guide/sparse-attention/esa.md @wangwenxin0312 @hek14 @flesher0813 @ygwpz @mag1c-h @FangRun2 @Tarrei
+/docs @flesher0813 @ygwpz @mag1c-h @FangRun2 @Tarrei @Infinite666
+/docs/source/user-guide/sparse-attention/esa.md @wangwenxin0312 @Infinite666 @flesher0813 @ygwpz @mag1c-h @FangRun2 @Tarrei
 /docs/source/user-guide/sparse-attention/gsa.md @Zbm1996 @zbb200819 @yxkyong @HaoLi980405 @flesher0813 @ygwpz @mag1c-h @FangRun2 @Tarrei
 /docs/source/user-guide/sparse-attention/kvcomp.md @leideng @pengwwang @flesher0813 @ygwpz @mag1c-h @FangRun2 @Tarrei
 /docs/source/user-guide/sparse-attention/kvstar.md @saki-daisuki @summer-ai007 @flesher0813 @ygwpz @mag1c-h @FangRun2 @Tarrei
 
-/examples @harrisonyhq @ygwpz @mag1c-h @hek14
+/examples @harrisonyhq @ygwpz @mag1c-h @Infinite666
 
 /test @Wwwzff @ygwpz @mag1c-h
File renamed without changes.

docs/source/index.md

Lines changed: 1 addition & 0 deletions
@@ -57,6 +57,7 @@ user-guide/prefix-cache/index
 user-guide/sparse-attention/index
 user-guide/pd-disaggregation/index
 user-guide/metrics/metrics
+user-guide/rerope/rerope
 :::
 
 :::{toctree}

docs/source/user-guide/triton-rerope/rerope.md renamed to docs/source/user-guide/rerope/rerope.md

Lines changed: 24 additions & 9 deletions
@@ -1,26 +1,34 @@
-# Rectified Rotary Position Embeddings (ReRoPE)
+# Rectified Rotary Position Embeddings
 
-Using ReRoPE, we can more effectively extend the context length of LLM without the need for fine-tuning. This is about the Triton implementation of ReRoPE and its integration into the vLLM inference framework.
+Using Rectified Rotary Position Embeddings (ReRoPE), we can more effectively extend the context length of LLMs without the need for fine-tuning. This page covers the Triton implementation of ReRoPE and its integration into the vLLM inference framework.
+
+<div align="center">
 
 **🚀 ReRoPE | 📄 blog [https://kexue.fm/archives/9708] [https://normxu.github.io/Rethinking-Rotary-Position-Embedding-3]**
 
+
 [![License](https://img.shields.io/badge/License-MIT-green.svg)](https://github.com/ModelEngine-Group/unified-cache-management/blob/main/LICENSE)
 [![Python](https://img.shields.io/badge/Python-3.10+-blue.svg)](https://python.org)
 
+</div>
 
 ## 🌟 What is ReRoPE?
 
+<div align="center">
+
 <img src="https://raw.githubusercontent.com/bojone/rerope/main/idea.png" width=750>
 
+</div>
+
 This approach combines direct extrapolation with position interpolation. A window size $w$ is established, where a position interval of $1$ is used within the window, and an interval of $\frac{1}{k}$ is applied outside. As $k \to \infty$, this simplifies to the form illustrated above. Under this scheme, the position encoding range never exceeds $w$ regardless of input length, potentially enabling support for arbitrarily long contexts.
 
 The attention score calculation formulas are as follows,
 
 $$
-\begin{align}
+\begin{aligned}
 score_{ij}^{1} &= (q_iR_i)(k_jR_j)^T, && i-j<w \\
 score_{ij}^{2} &= (q_iR_w)(k_j)^T, && i-j\ge w
-\end{align}
+\end{aligned}
 $$
 
 ReRoPE extends context length effectively but requires double attention—local within w and global compressed—significantly reducing throughput. Despite this overhead, it remains valuable for training-free long contexts, especially when combined with local attention windows to balance efficiency.
@@ -37,7 +45,14 @@ ReRoPE extends context length effectively but requires double attention—local
 
 ## 🏆 Results
 
-![alt text](results.png)
+<div align="center">
+
+### The Experiment Results
+![ReRoPE Results](../../_static/images/rerope_performace.png)
+
+The experiment is based on a hybrid Transformer-GAU (Gated Attention Unit) model with 100M parameters. $\log n$ indicates we add the scale factor $\log n$ at the pretraining stage; $\log n^{*}$ denotes we apply the scale factor to the attention matrix only for text exceeding the max sequence length, without any pretraining; $w256$ denotes the ReRoPE window $w=256$.
+
+</div>
 
 ## 🚀 Quick Start
 
@@ -46,12 +61,12 @@ ReRoPE extends context length effectively but requires double attention—local
 For installation instructions, please refer to the UCM's top-level README. Once UCM is installed, ReRoPE is naturally supported by running the following example python scripts.
 
 ```python
-export VLLM_ATTENTION_BACKEND = TRITON_ATTN_VLLM_V1
-export VLLM_USE_REROPE = true
+export VLLM_ATTENTION_BACKEND=TRITON_ATTN_VLLM_V1
+export VLLM_USE_REROPE=true
 export DATA_DIR=/home/data/kv_cache
 export MODEL_PATH=/home/models/Qwen2.5-14B-Instruct
-export REROPE_WINDOW = 32768
-export TRAINING_LENGTH = 32768
+export REROPE_WINDOW=32768
+export TRAINING_LENGTH=32768
 
 python examples/offline_inference_rerope.py
 ```
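The two-branch score definition added in this document maps directly onto a masked combination of two attention matrices. Below is a minimal PyTorch sketch of that combination, assuming single-head tensors and an interleaved-pair RoPE layout; the function names and shapes are illustrative, and this is not the Triton kernel shipped with UCM/vLLM.

```python
# Illustrative sketch only: dense two-branch ReRoPE scores for a single head.
# Assumes interleaved-pair RoPE; not the fused Triton kernel used by UCM/vLLM.
import torch


def rope_rotate(x: torch.Tensor, pos: torch.Tensor, base: float = 10000.0) -> torch.Tensor:
    """Rotate each row of x (shape [n, d]) by its position given in pos (shape [n])."""
    d = x.shape[-1]
    inv_freq = 1.0 / (base ** (torch.arange(0, d, 2, dtype=torch.float32) / d))
    ang = pos[:, None].to(torch.float32) * inv_freq[None, :]  # [n, d/2]
    cos, sin = ang.cos(), ang.sin()
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = torch.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out


def rerope_scores(q: torch.Tensor, k: torch.Tensor, w: int) -> torch.Tensor:
    """score1 for i-j < w: rotate q and k by their own positions (plain RoPE);
    score2 for i-j >= w: rotate q by the fixed position w and leave k unrotated."""
    n = q.shape[0]
    pos = torch.arange(n)
    score1 = rope_rotate(q, pos) @ rope_rotate(k, pos).T
    score2 = rope_rotate(q, torch.full((n,), w)) @ k.T
    i, j = torch.meshgrid(pos, pos, indexing="ij")
    return torch.where((i - j) < w, score1, score2)


if __name__ == "__main__":
    q, k = torch.randn(16, 64), torch.randn(16, 64)
    print(rerope_scores(q, k, w=4).shape)  # torch.Size([16, 16])
```

Materializing both score matrices makes the "double attention" cost mentioned in the doc explicit; the actual Triton implementation fuses the two branches rather than computing them densely.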

docs/source/user-guide/sparse-attention/cacheblend.md

Lines changed: 7 additions & 16 deletions
@@ -3,7 +3,7 @@
 
 ![blend_scheme.jpg](../../_static/images/blend_scheme.jpg)
 
-**🚀 Knowledge Cached Fusion Algorithm | 📄 EuroSys 2025 Paper **
+**🚀 Knowledge Cached Fusion Algorithm | 📄 EuroSys 2025 Paper**
 
 [![License](https://img.shields.io/badge/License-MIT-green.svg)](https://github.com/ModelEngine-Group/unified-cache-management/blob/main/LICENSE)
 [![Python](https://img.shields.io/badge/Python-3.10+-blue.svg)](https://python.org)
@@ -28,11 +28,11 @@ CacheBlend reduces TTFT by 2.2 ~ 3.3× and increases throughput by 2.8 ~ 5× und
 ## 🧠 Ucm Implementation
 
 ### Native Block-Wise Chunk KV Cache Dump, Load, PostProcess and Recompute
-1. **🔐 Chunk Hash Encoding**: Similar as prefix hash encoder, hash all blocks in each chunk from the same hash meta beginning.
-2. **⚡ Combine Prefix Cache and Chunk Cache**: Since chunk cache and native prefix cache share the same hash space, ucm first performs prefix cache lookup to fetch fully reused cache and then conduct chunk cache lookup to fetch the candidate cache for blending.
-3. **🎯 Delta-Rope PostProcess**: Rectify loaded chunk cache according to their position in the new request.
-3. **🔍 Integrate Cache Blend and First Token Generation**: Construct compute mask and attention meta according to HKVD tokens, cache miss tokens and suffix tokens, then compute their kv cache in a single model forward stage.
-4. **🚀 Comprehensive Hook for LLM Forward Pipeline**: Based on ucm sparse module, blend module sparse the prefill tokens not only in attention stage but also in ffn, layer stage.
+1. **🔐 Chunk Hash Encoding**: As with prefix hashing, the blend connector encodes the blocks of each chunk starting from the same hash-meta beginning.
+2. **⚡ Combine Prefix Cache and Chunk Cache**: Since the chunk cache and the native prefix cache share the same hash space, they can be stored and shared in a single store. When looking up chunk cache, the blend connector first performs a prefix-cache lookup for the fully reused part, then a chunk-cache lookup to fetch the candidate cache for blending.
+3. **🎯 Delta-Rope PostProcess**: Rectify the loaded chunk cache according to its position in the new request.
+4. **🔍 Integrate Cache Blend and First Token Generation**: Construct the compute mask of the HKVD tokens, cache-miss tokens and suffix tokens, then modify the attention metadata to support the combination of chunk-cache blending, missing-cache recomputation and first-token generation.
+5. **🚀 Comprehensive Hook for LLM Forward Pipeline**: Based on and extended from ucm sparse, the blend sparse module reduces the input tokens for every computation kernel, not just the attention kernel.
 
 ## 🚀 Quick Start
 
@@ -97,13 +97,4 @@ Llama-based models and Qwen-based models now are available
 pages={94--109},
 year={2025}
 }
-```
-
-
----
-
-<div align="center">
-
-**🌟 Star [UCM](https://github.com/ModelEngine-Group/unified-cache-management) repository if you find KvComp useful!**
-
-</div>
+```
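To make step 2 of the CacheBlend implementation notes above concrete (prefix cache and chunk cache sharing one hash space, with a prefix lookup followed by a chunk lookup), here is a small sketch. The chained hashing and the in-memory store below are assumptions for illustration, not the actual UCM blend connector API.

```python
# Illustrative sketch of the two-stage lookup: the hashing scheme and the
# in-memory "store" stand in for UCM's real connector and storage backend.
import hashlib


def chain_hashes(blocks: list[tuple[int, ...]], meta: str = "") -> list[str]:
    """Hash blocks from the same hash-meta beginning, each hash chained on the previous one."""
    hashes, prev = [], meta
    for block in blocks:
        prev = hashlib.sha256((prev + "|" + ",".join(map(str, block))).encode()).hexdigest()
        hashes.append(prev)
    return hashes


def blend_lookup(store: set[str], chunks: list[list[tuple[int, ...]]]) -> tuple[int, list[str]]:
    """chunks: the request split into chunks (e.g. documents).
    Returns (number of fully reused prefix blocks, chunk-cache candidates for blending)."""
    # Stage 1: prefix-cache lookup over the whole request, hashed as one chain.
    all_blocks = [b for chunk in chunks for b in chunk]
    n_prefix = 0
    for h in chain_hashes(all_blocks):
        if h not in store:
            break
        n_prefix += 1
    # Stage 2: each chunk is re-hashed from the same hash-meta beginning, so a chunk
    # dumped by an earlier request matches even when its position in this prompt differs.
    candidates = [h for chunk in chunks for h in chain_hashes(chunk) if h in store]
    return n_prefix, candidates


if __name__ == "__main__":
    doc = [(1, 2, 3, 4), (5, 6, 7, 8)]   # a previously cached document chunk
    store = set(chain_hashes(doc))       # its blocks were dumped under a fresh hash chain
    request = [[(9, 9, 9, 9)], doc]      # new prompt: a new question chunk + the reused doc
    print(blend_lookup(store, request))  # -> (0, [the two matching hashes of the doc chunk])
```

The fully reused prefix is loaded as-is, while the chunk-cache candidates go through the Delta-RoPE postprocess and partial recompute described in steps 3 and 4.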

docs/source/user-guide/sparse-attention/index.md

Lines changed: 1 addition & 0 deletions
@@ -41,4 +41,5 @@ esa
 gsa
 kvcomp
 kvstar
+cacheblend
 :::

examples/offline_inference_blend.py

Lines changed: 1 addition & 1 deletion
@@ -186,7 +186,7 @@ def main():
     # choose one data row in LongBenchV1 (wikimqa)
     assert os.path.isfile(
         path_to_dataset
-    ), f"Incorrect dataset path. Please specify the dataset path by `export DATASET_PATH=/path/to/longbench/multifieldqa_zh.jsonl`"
+    ), f"Incorrect dataset path. Please specify the dataset path by `export DATASET_PATH=/home/data/Longbench/data/2wikimqa.jsonl`"
     with open(path_to_dataset, "r") as f:
         lines = f.readlines()
         dataset_row = json.loads(lines[0])

examples/ucm_config_example.yaml

Lines changed: 1 addition & 1 deletion
@@ -31,7 +31,7 @@ load_only_first_rank: false
 # Or for GSA:
 # GSA: {}
 # Or for KvCompOnDevice:
-# GSA:
+# KvCompOnDevice:
 # "kvcompOnDevice_config_path": "workspace/unified-cache-management/ucm/sparse/kvcomp/configs/kvcomp_qwen3_32B_config.json"

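As a side note on the corrected section name: a config like this is typically consumed by picking whichever sparse-method key is present and forwarding its options. The loader below is a hypothetical sketch, with key names taken only from the commented examples above; it is not UCM's actual configuration code.

```python
# Hypothetical sketch of consuming a ucm_config_example.yaml-style file;
# the loader and the set of method keys are assumptions for illustration.
import yaml  # PyYAML


SPARSE_METHOD_KEYS = {"GSA", "KvCompOnDevice"}  # extend with other supported methods


def select_sparse_method(path: str) -> tuple[str, dict]:
    """Return (method name, its options), e.g.
    ('KvCompOnDevice', {'kvcompOnDevice_config_path': '.../kvcomp_qwen3_32B_config.json'})."""
    with open(path, "r") as f:
        cfg = yaml.safe_load(f) or {}
    found = [key for key in cfg if key in SPARSE_METHOD_KEYS]
    if len(found) != 1:
        raise ValueError(f"expected exactly one sparse-method section, found: {found}")
    return found[0], cfg[found[0]] or {}
```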