Conversation

@xinSky00
Contributor

Purpose

This PR implements a Triton kernel for ReRoPE (Rectified Rotary Position Embeddings) to enable efficient context-length extension in vLLM. The ReRoPE-based approach delivers a 3-5x speedup on long sequences.

The kernel computes attention segment-wise: full rotary embeddings are applied within a fixed window, while positional encodings beyond the window boundary are held constant at the window size. This lets models handle sequences longer than their pre-training length without fine-tuning.
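For intuition, here is a minimal sketch of the position mapping ReRoPE applies. This is not the PR's Triton kernel; `rerope_window` corresponds to the parameter described under Modifications below, and the function name is ours:

```python
# Minimal illustration of the ReRoPE position mapping (not the PR's Triton
# kernel). Inside the window, keys keep their true relative distance from
# the query; beyond the window, the distance is clamped to the window size,
# so positions never exceed what the model saw during pre-training.
def rerope_relative_position(q_pos: int, k_pos: int, rerope_window: int) -> int:
    rel = q_pos - k_pos           # causal attention: k_pos <= q_pos
    return min(rel, rerope_window)

# Example: with a window of 512, a key 10,000 tokens back is encoded
# as if it were exactly 512 tokens back.
assert rerope_relative_position(10_000, 0, 512) == 512
assert rerope_relative_position(100, 90, 512) == 10
```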

Modifications

  1. New environment variable: VLLM_ATTENTION_BACKEND
  • Must be set to TRITON_ATTN_VLLM_V1 to enable the Triton-based backend
  • Usage: export VLLM_ATTENTION_BACKEND="TRITON_ATTN_VLLM_V1"
  • Default: FLASH_ATTN (the standard FlashAttention backend)
  2. Modified model configuration parameter: max_position_embeddings
  • Users must adjust this parameter via --hf-overrides to match their target input length
  • This ensures the RoPE embeddings are computed for the full extended sequence
  • Example: --hf-overrides '{"max_position_embeddings": 327680}'
  3. ReRoPE-specific parameters: rerope_window and training_length must be configured based on the model's original pre-training length
  • These values determine the segment boundaries for attention computation and must align with the model's original training configuration (a combined launch sketch follows this list)
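
Putting these settings together, a hedged launch sketch follows. The `hf_overrides` keyword is vLLM's programmatic counterpart of --hf-overrides, and VLLM_USE_REROPE is the switch used in the test below; exactly how rerope_window and training_length are plumbed through is defined by this PR, so they appear only as a comment:

```python
import os

# Select the Triton backend and enable ReRoPE before the engine is built.
os.environ["VLLM_ATTENTION_BACKEND"] = "TRITON_ATTN_VLLM_V1"
os.environ["VLLM_USE_REROPE"] = "true"

from vllm import LLM, SamplingParams

llm = LLM(
    model="Qwen/Qwen2.5-14B-Instruct",
    # Match max_position_embeddings to the target input length.
    hf_overrides={"max_position_embeddings": 327680},
    # rerope_window / training_length must align with the model's original
    # pre-training length; see this PR for where they are configured.
)

outputs = llm.generate(["<long prompt here>"], SamplingParams(max_tokens=64))
print(outputs[0].outputs[0].text)
```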

Test

Run offline_inference_rerope.py with the following settings:

  • os.environ["VLLM_ATTENTION_BACKEND"] = "TRITON_ATTN_VLLM_V1"
  • os.environ["VLLM_USE_REROPE"] = "true"
  • Model: Qwen2.5-14B-Instruct
  • Dataset: multifieldqa_zh.jsonl
  • Results
    • prompt length: about 130k tokens (results screenshot)
    • prompt length: about 315k tokens (results screenshot)

hek14 previously approved these changes Dec 23, 2025
wangwenxin0312 previously approved these changes Dec 25, 2025
@xinSky00 force-pushed the feature_rerope_clean branch 3 times, most recently from 28fcfb3 to 0891d99 on December 26, 2025 at 08:23
@xinSky00 force-pushed the feature_rerope_clean branch from 0891d99 to 543c0a8 on December 26, 2025 at 08:29
@wangwenxin0312 requested a review from hek14 on December 26, 2025 at 12:50
@hek14 merged commit 6899198 into ModelEngine-Group:develop on Dec 26, 2025 (3 checks passed)
xinSky00 added a commit to xinSky00/unified-cache-management that referenced this pull request Dec 31, 2025