Question about input tensor splitting in RoPE #790
Hi, I would like to clarify why the input tensor is split into two halves in `apply_rope`:

```python
import torch

def apply_rope(x, cos, sin):
    # x: (batch_size, num_heads, seq_len, head_dim)
    batch_size, num_heads, seq_len, head_dim = x.shape
    assert head_dim % 2 == 0, "Head dimension must be even"

    # Split x into first half and second half
    x1 = x[..., : head_dim // 2]  # First half
    x2 = x[..., head_dim // 2 :]  # Second half

    # Adjust sin and cos shapes
    cos = cos[:seq_len, :].unsqueeze(0).unsqueeze(0)  # Shape: (1, 1, seq_len, head_dim)
    sin = sin[:seq_len, :].unsqueeze(0).unsqueeze(0)

    # Apply the rotary transformation
    rotated = torch.cat((-x2, x1), dim=-1)
    x_rotated = (x * cos) + (rotated * sin)

    # It's ok to use lower precision after applying cos and sin rotation
    return x_rotated.to(dtype=x.dtype)
```

Thanks so much for your help.
Hello,

My understanding is that there are multiple variants for computing RoPE, and since Sebastian is loading/using weights from Hugging Face, he also has to match their way of computing RoPE (the two-halves variant). Otherwise you'd be rotating the wrong features together (in K and Q) and you'd end up with a nasty silent bug: it would still run, but performance would be subpar.

In theory you could do the pairing any way you want; if you train from scratch, it wouldn't matter. But if you use pretrained weights, you have to be consistent: either change your RoPE computation, or reorder/permute the weights in the pretrained Q_w and K_w to match your own pairing variant (see the sketches below).

There was a thread on the HF repo highlighting this: huggingface/transformers#25199
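To make the "pairing variant" point concrete, here is a minimal sketch of the other common convention, where adjacent (even, odd) features are rotated together (as in the original RoPE formulation) instead of features sitting half a head dimension apart. This is for contrast only, not the implementation used in the book; the function name and the assumption that `cos`/`sin` have shape `(seq_len, head_dim // 2)` (one value per rotation pair, not the duplicated layout used in the code above) are mine.

```python
import torch

def apply_rope_interleaved(x, cos, sin):
    # Sketch of the "interleaved pairs" RoPE variant, for contrast with the
    # "two halves" variant quoted in the question. Rotation pairs here are
    # (x[..., 0], x[..., 1]), (x[..., 2], x[..., 3]), ...
    # instead of (x[..., i], x[..., i + head_dim // 2]).
    # Assumes cos and sin have shape (seq_len, head_dim // 2).
    batch_size, num_heads, seq_len, head_dim = x.shape
    assert head_dim % 2 == 0, "Head dimension must be even"

    x_even = x[..., 0::2]  # even-indexed features, shape (..., head_dim // 2)
    x_odd = x[..., 1::2]   # odd-indexed features,  shape (..., head_dim // 2)

    cos = cos[:seq_len, :].unsqueeze(0).unsqueeze(0)  # (1, 1, seq_len, head_dim // 2)
    sin = sin[:seq_len, :].unsqueeze(0).unsqueeze(0)

    out = torch.empty_like(x)
    out[..., 0::2] = x_even * cos - x_odd * sin
    out[..., 1::2] = x_even * sin + x_odd * cos
    return out.to(dtype=x.dtype)
```

Applied to the same pretrained Q/K projections, this variant and the two-halves variant rotate different feature pairs, which is exactly the silent mismatch described above.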
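To illustrate the second option (reordering the pretrained weights rather than changing the RoPE code), here is a rough sketch of the kind of row permutation involved for a Q or K projection matrix. The helper name and signature are hypothetical; the linked transformers issue discusses the exact permutation used in the Llama conversion script, so treat this as a sketch rather than a drop-in converter.

```python
import torch

def permute_qk_rows(w, n_heads):
    # Illustrative sketch: reorder the output rows of a Q or K projection
    # weight of shape (n_heads * head_dim, in_dim) so that features stored as
    # adjacent pairs (interleaved RoPE) end up split into two halves per head
    # (two-halves RoPE). Within each head, rows (2i, 2i + 1) move to rows
    # (i, i + head_dim // 2).
    out_dim, in_dim = w.shape
    head_dim = out_dim // n_heads
    return (
        w.view(n_heads, head_dim // 2, 2, in_dim)  # (head, freq, pair, in)
         .transpose(1, 2)                          # (head, pair, freq, in)
         .reshape(out_dim, in_dim)
    )
```

With the rows reordered like this, the two-halves `apply_rope` pairs the same features that the interleaved variant would have paired, so a checkpoint trained under one convention can be used with the other.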