Conversation

@xinSky00
Contributor

Purpose

This PR implements a Triton kernel for ReRoPE (Rectified Rotary Position Embeddings) to enable efficient context-length extension in vLLM. The ReRoPE-based approach delivers a 3-5x speedup on long sequences.

The kernel computes attention segment-wise: full rotary embeddings are applied within a fixed window, while positional encodings beyond the window boundary are held constant at the window size. This lets models handle sequences longer than their pre-training length without fine-tuning.
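For intuition, here is a minimal sketch of the position mapping ReRoPE applies. This is not the PR's Triton kernel; `rerope_window` corresponds to the parameter described under Modifications below, and the function name is ours:

```python
# Minimal illustration of the ReRoPE position mapping (not the PR's Triton
# kernel). Inside the window, keys keep their true relative distance from
# the query; beyond the window, the distance is clamped to the window size,
# so positions never exceed what the model saw during pre-training.
def rerope_relative_position(q_pos: int, k_pos: int, rerope_window: int) -> int:
    rel = q_pos - k_pos           # causal attention: k_pos <= q_pos
    return min(rel, rerope_window)

# Example: with a window of 512, a key 10,000 tokens back is encoded
# as if it were exactly 512 tokens back.
assert rerope_relative_position(10_000, 0, 512) == 512
assert rerope_relative_position(100, 90, 512) == 10
```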

Modifications

  1. New environment variable: VLLM_ATTENTION_BACKEND
  • Must be set to TRITON_ATTN_VLLM_V1 to enable the Triton-based backend
  • Usage: export VLLM_ATTENTION_BACKEND="TRITON_ATTN_VLLM_V1"
  • Default: FLASH_ATTN (the standard FlashAttention backend)
  2. Modified model configuration parameter: max_position_embeddings
  • Users must adjust this parameter via --hf-overrides to match their target input length
  • This ensures the RoPE embeddings are computed for the full extended sequence
  • Example: --hf-overrides '{"max_position_embeddings": 327680}'
  3. ReRoPE-specific parameters: rerope_window and training_length must be configured based on the model's original pre-training length
  • These values determine the segment boundaries for attention computation and must align with the model's original training configuration (a combined launch sketch follows this list)
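
Putting these settings together, a hedged launch sketch follows. The `hf_overrides` keyword is vLLM's programmatic counterpart of --hf-overrides, and VLLM_USE_REROPE is the switch used in the test below; exactly how rerope_window and training_length are plumbed through is defined by this PR, so they appear only as a comment:

```python
import os

# Select the Triton backend and enable ReRoPE before the engine is built.
os.environ["VLLM_ATTENTION_BACKEND"] = "TRITON_ATTN_VLLM_V1"
os.environ["VLLM_USE_REROPE"] = "true"

from vllm import LLM, SamplingParams

llm = LLM(
    model="Qwen/Qwen2.5-14B-Instruct",
    # Match max_position_embeddings to the target input length.
    hf_overrides={"max_position_embeddings": 327680},
    # rerope_window / training_length must align with the model's original
    # pre-training length; see this PR for where they are configured.
)

outputs = llm.generate(["<long prompt here>"], SamplingParams(max_tokens=64))
print(outputs[0].outputs[0].text)
```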

Test

Run offline_inference_rerope.py with the following settings:

  • os.environ["VLLM_ATTENTION_BACKEND"] = "TRITON_ATTN_VLLM_V1"
  • os.environ["VLLM_USE_REROPE"] = "true"
  • Model: Qwen2.5-14B-Instruct
  • Dataset: multifieldqa_zh.jsonl
  • Results
    • prompt length: about 130k tokens (results screenshot)
    • prompt length: about 315k tokens (results screenshot)

hek14 previously approved these changes Dec 23, 2025
wangwenxin0312 previously approved these changes Dec 25, 2025
@xinSky00 force-pushed the feature_rerope_clean branch 3 times, most recently from 28fcfb3 to 0891d99 on December 26, 2025 at 08:23
@xinSky00 force-pushed the feature_rerope_clean branch from 0891d99 to 543c0a8 on December 26, 2025 at 08:29
@wangwenxin0312 requested a review from hek14 on December 26, 2025 at 12:50
@hek14 merged commit 6899198 into ModelEngine-Group:develop on Dec 26, 2025 (3 checks passed)
xinSky00 added a commit to xinSky00/unified-cache-management that referenced this pull request Dec 31, 2025