Commit 6899198: feature for triton rerope (#497)
1 parent ea76b36

8 files changed: +3805 −0 lines changed

Lines changed: 92 additions & 0 deletions
# Rectified Rotary Position Embeddings (ReRoPE)

ReRoPE extends the context length of an LLM effectively and without fine-tuning. This page covers the Triton implementation of ReRoPE and its integration into the vLLM inference framework.

**🚀 ReRoPE | 📄 Blog: [kexue.fm/archives/9708](https://kexue.fm/archives/9708) · [Rethinking Rotary Position Embedding](https://normxu.github.io/Rethinking-Rotary-Position-Embedding-3)**
[![License](https://img.shields.io/badge/License-MIT-green.svg)](https://github.com/ModelEngine-Group/unified-cache-management/blob/main/LICENSE)
[![Python](https://img.shields.io/badge/Python-3.10+-blue.svg)](https://python.org)
## 🌟 What is ReRoPE?

<img src="https://raw.githubusercontent.com/bojone/rerope/main/idea.png" width=750>

This approach combines direct extrapolation with position interpolation. A window size $w$ is established, where a position interval of $1$ is used within the window, and an interval of $\frac{1}{k}$ is applied outside. As $k \to \infty$, this simplifies to the form illustrated above. Under this scheme, the position encoding range never exceeds $w$ regardless of input length, potentially enabling support for arbitrarily long contexts.
The attention scores are computed as follows:

$$
\begin{align}
score_{ij}^{1} &= (q_iR_i)(k_jR_j)^T, && i-j<w \\
score_{ij}^{2} &= (q_iR_w)(k_j)^T, && i-j\ge w
\end{align}
$$
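For intuition, here is a minimal Python sketch of the position mapping described above (illustrative only; the function name and signature are not from this repository):

```python
def rerope_position(i: int, j: int, w: int, k: float = float("inf")) -> float:
    """Relative position assigned to the (i, j) attention pair.

    Interval 1 inside the window, 1/k outside; as k -> infinity this
    reduces to min(i - j, w), which yields the two score paths above.
    """
    rel = i - j
    if rel < w:
        return rel
    return w + (rel - w) / k

# With w = 4 and the default k = inf, distances beyond the window clip to w:
print([rerope_position(8, j, w=4) for j in range(8, 3, -1)])
# [0, 1, 2, 3, 4.0]
```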
ReRoPE extends context length effectively, but it requires computing attention twice (once locally within the window $w$ and once globally with compressed positions), which significantly reduces throughput. Despite this overhead, it remains valuable for training-free long-context inference, especially when combined with local attention windows to balance efficiency.

## 🧠 Triton ReRoPE Implementation

- Load Data

  Compared to the Triton RoPE implementation, data loading additionally requires passing `query2`, which uses the alternative rotary embedding position (the fixed window position $w$), and the unrotated `key2`.

- Construct ReRoPE Mask

  During attention computation, the choice between the two attention score paths depends on the relative distance between the query and key positions, which requires constructing a ReRoPE mask; see the sketch after this list.
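The following is a minimal PyTorch sketch of how the mask selects between the two score paths. It is an illustration, not the actual Triton kernel; the tensor names and the function signature are assumptions, and causal masking and softmax scaling are omitted for brevity.

```python
import torch

def rerope_scores(q_rot: torch.Tensor, q_rot_w: torch.Tensor,
                  k_rot: torch.Tensor, k_unrot: torch.Tensor,
                  window: int) -> torch.Tensor:
    """Combine the two ReRoPE score paths with a relative-distance mask.

    q_rot   : [L, d] queries rotated at their true positions (q_i R_i)
    q_rot_w : [L, d] queries rotated at the fixed window position (q_i R_w)
    k_rot   : [L, d] keys rotated at their true positions (k_j R_j)
    k_unrot : [L, d] unrotated keys (k_j)
    """
    scores1 = q_rot @ k_rot.T      # score^1_ij, used when i - j < window
    scores2 = q_rot_w @ k_unrot.T  # score^2_ij, used when i - j >= window
    pos = torch.arange(q_rot.shape[0], device=q_rot.device)
    rerope_mask = (pos[:, None] - pos[None, :]) < window  # True -> path 1
    return torch.where(rerope_mask, scores1, scores2)
```

The Triton kernel applies the same per-element selection, which is why both the standard and the alternative query/key tensors must be loaded.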
## 🏆 Results

![ReRoPE results](results.png)
## 🚀 Quick Start

### Installation

For installation instructions, please refer to UCM's top-level README. Once UCM is installed, ReRoPE is supported out of the box by running the following example script.
```bash
export VLLM_ATTENTION_BACKEND=TRITON_ATTN_VLLM_V1
export VLLM_USE_REROPE=true
export DATA_DIR=/home/data/kv_cache
export MODEL_PATH=/home/models/Qwen2.5-14B-Instruct
export REROPE_WINDOW=32768
export TRAINING_LENGTH=32768

python examples/offline_inference_rerope.py
```

### Basic Usage

The model's `max_position_embeddings` needs to be overridden according to the input length of the prompts, as shown below.

```python
llm_args = EngineArgs(
    model=model,
    kv_transfer_config=ktc,
    hf_overrides={
        # Extend the positional range beyond the model's training length.
        "max_position_embeddings": 327680,
    },
    gpu_memory_utilization=0.9,
    max_num_batched_tokens=8192,
    block_size=16,
    enforce_eager=True,
    tensor_parallel_size=2,
)
```

## 📊 Supported Models

Qwen-based models are currently supported.
## 🎓 Cite

```
@misc{rerope2023,
  title={Rectified Rotary Position Embeddings},
  author={Jianlin Su},
  year={2023},
  howpublished={\url{https://github.com/bojone/rerope}},
}
```
Lines changed: 193 additions & 0 deletions
import contextlib
import json
import os
import sys
import time
from dataclasses import asdict

from transformers import AutoTokenizer

# Setting for ReRoPE (placed before the vLLM imports so it takes effect).
os.environ["VLLM_USE_REROPE"] = "true"

# Third Party
from vllm import LLM, SamplingParams
from vllm.config import KVTransferConfig
from vllm.engine.arg_utils import EngineArgs

from ucm.logger import init_logger

logger = init_logger(__name__)


def setup_environment_variables():
    os.environ["VLLM_USE_V1"] = "1"
    os.environ["PYTHONHASHSEED"] = "123456"

    os.environ["VLLM_ATTENTION_BACKEND"] = "TRITON_ATTN_VLLM_V1"
    os.environ["REROPE_WINDOW"] = "32768"
    os.environ["TRAINING_LENGTH"] = "32768"

    global data_dir
    data_dir = os.getenv("DATA_DIR", "/home/data/kv_cache")
    if not os.path.isdir(data_dir):
        data_dir = input(
            "Enter the directory for UCMStore to save kv cache, e.g. /home/data/kv_cache: "
        )
        create = input(f"Directory {data_dir} does not exist. Create it? (Y/n): ")
        if create.lower() in ("", "y"):  # default to yes on empty input
            os.makedirs(data_dir, exist_ok=True)
        else:
            print("Exiting. Directory not created.")
            sys.exit(1)


@contextlib.contextmanager
def build_llm_with_uc(module_path: str, name: str, model: str):
    ktc = KVTransferConfig(
        kv_connector=name,
        kv_connector_module_path=module_path,
        kv_role="kv_both",
        kv_connector_extra_config={
            "ucm_connectors": [
                {
                    "ucm_connector_name": "UcmNfsStore",
                    "ucm_connector_config": {
                        "storage_backends": data_dir,
                        "use_direct": False,
                    },
                }
            ],
        },
    )

    llm_args = EngineArgs(
        model=model,
        kv_transfer_config=ktc,
        hf_overrides={
            # Extend the positional range beyond the model's training length.
            "max_position_embeddings": 327680,
        },
        gpu_memory_utilization=0.9,
        max_num_batched_tokens=8192,
        block_size=16,
        enforce_eager=True,
        tensor_parallel_size=2,
    )

    llm = LLM(**asdict(llm_args))
    try:
        yield llm
    finally:
        logger.info("LLM engine is exiting.")


def print_output(
    llm: LLM,
    prompt: list[str],
    sampling_params: SamplingParams,
    req_str: str,
):
    start = time.time()
    outputs = llm.generate(prompt, sampling_params)
    print("-" * 50)
    for output in outputs:
        generated_text = output.outputs[0].text
        print(f"Generated text: {generated_text!r}")
    print(f"Generation took {time.time() - start:.2f} seconds, {req_str} request done.")
    print("-" * 50)


def main():
    module_path = "ucm.integration.vllm.ucm_connector"
    name = "UCMConnector"
    model = os.getenv("MODEL_PATH", "/home/models/Qwen2.5-14B-Instruct")
    if not os.path.isdir(model):
        model = input("Enter path to model, e.g. /home/models/Qwen2.5-14B-Instruct: ")
        if not os.path.isdir(model):
            print("Exiting. Incorrect model path.")
            sys.exit(1)

    tokenizer = AutoTokenizer.from_pretrained(model)
    setup_environment_variables()

    with build_llm_with_uc(module_path, name, model) as llm:

        data_all = []
        path_to_dataset = os.getenv(
            "DATASET_PATH", "/home/data/Longbench/data/multifieldqa_zh.jsonl"
        )
        if not os.path.isfile(path_to_dataset):
            path_to_dataset = input(
                "Enter the path to one of the LongBench datasets, e.g. /home/data/Longbench/data/multifieldqa_zh.jsonl: "
            )
            if not os.path.isfile(path_to_dataset):
                print("Exiting. Incorrect dataset path.")
                sys.exit(1)
        with open(path_to_dataset, "r", encoding="utf-8") as f:
            for line in f:
                data_all.append(json.loads(line))

        materials = []
        questions = []
        references = []
        batch_size = 30
        num_batch = 2
        for idx in range(num_batch):
            data = data_all[idx * batch_size : (idx + 1) * batch_size]

            # Label each context block; "语料" means "corpus material".
            materials.append(
                "\n\n".join(
                    [
                        f"【语料{i+1}】\n{item.get('context', '')}"
                        for i, item in enumerate(data)
                    ]
                )
            )
            questions.append(
                "\n".join(
                    [
                        f"{i+1}. {item.get('input', '')}"
                        for i, item in enumerate(data[:15])
                    ]
                )
            )
            references.append(
                [
                    f"{i+1}. {item.get('answers', '')}"
                    for i, item in enumerate(data[:15])
                ]
            )

        # System prompt: "You are an AI assistant; answer questions based on
        # the following materials."
        system_prompt = "你是一个AI助手,请根据以下材料回答问题。"
        tokenized_inputs = []
        for material, question in zip(materials, questions):
            # User content: "Answer the questions based on the text below",
            # with the material wrapped in begin/end markers.
            content = (
                "请根据以下文本内容回答后面的问题:\n\n"
                "【文本内容开始】\n"
                f"{material}\n"
                "【文本内容结束】\n\n"
                "请直接回答以下问题:\n"
                f"{question}"
            )

            messages = [
                {"role": "system", "content": system_prompt},
                {"role": "user", "content": content},
            ]
            inputs = tokenizer.apply_chat_template(
                messages,
                add_generation_prompt=True,
                tokenize=False,
            )
            tokenized_inputs.append(inputs)

        sampling_params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=2048)

        for req in range(num_batch):
            print_output(
                llm, tokenized_inputs[req], sampling_params, "request_" + str(req)
            )


if __name__ == "__main__":
    main()
