Question about input tensor splitting in RoPE #790
Hi, I would like to clarify why the input tensor is split into two halves in `apply_rope`:

```python
import torch

def apply_rope(x, cos, sin):
    # x: (batch_size, num_heads, seq_len, head_dim)
    batch_size, num_heads, seq_len, head_dim = x.shape
    assert head_dim % 2 == 0, "Head dimension must be even"

    # Split x into first half and second half
    x1 = x[..., : head_dim // 2]  # First half
    x2 = x[..., head_dim // 2 :]  # Second half

    # Adjust sin and cos shapes
    cos = cos[:seq_len, :].unsqueeze(0).unsqueeze(0)  # Shape: (1, 1, seq_len, head_dim)
    sin = sin[:seq_len, :].unsqueeze(0).unsqueeze(0)

    # Apply the rotary transformation
    rotated = torch.cat((-x2, x1), dim=-1)
    x_rotated = (x * cos) + (rotated * sin)

    # It's ok to use lower precision after applying cos and sin rotation
    return x_rotated.to(dtype=x.dtype)
```

Thanks so much for your help.
Hello,

My understanding is that there are multiple variants for computing RoPE, and since Sebastian is loading/using weights from Hugging Face, he also has to match their way of computing RoPE (the two-halves variant). Otherwise you'd be rotating the wrong features together (in K and Q) and you'd end up with a nasty silent bug: it would still run, but performance would be subpar.

In theory you could do the pairing any way you want; if you train from scratch, it wouldn't matter. But if you use pretrained weights, you have to be consistent: either change your RoPE computation, or reorder/permute the weights in the pretrained Q_w and K_w to match your own pairing variant (see the sketches below).

There was a thread on the HF repo highlighting this: huggingface/transformers#25199
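To make the "pairing variant" point concrete, here is a minimal sketch of the other common convention, where adjacent (even, odd) features are rotated together (as in the original RoPE formulation) instead of features sitting half a head dimension apart. This is for contrast only, not the implementation used in the book; the function name and the assumption that `cos`/`sin` have shape `(seq_len, head_dim // 2)` (one value per rotation pair, not the duplicated layout used in the code above) are mine.

```python
import torch

def apply_rope_interleaved(x, cos, sin):
    # Sketch of the "interleaved pairs" RoPE variant, for contrast with the
    # "two halves" variant quoted in the question. Rotation pairs here are
    # (x[..., 0], x[..., 1]), (x[..., 2], x[..., 3]), ...
    # instead of (x[..., i], x[..., i + head_dim // 2]).
    # Assumes cos and sin have shape (seq_len, head_dim // 2).
    batch_size, num_heads, seq_len, head_dim = x.shape
    assert head_dim % 2 == 0, "Head dimension must be even"

    x_even = x[..., 0::2]  # even-indexed features, shape (..., head_dim // 2)
    x_odd = x[..., 1::2]   # odd-indexed features,  shape (..., head_dim // 2)

    cos = cos[:seq_len, :].unsqueeze(0).unsqueeze(0)  # (1, 1, seq_len, head_dim // 2)
    sin = sin[:seq_len, :].unsqueeze(0).unsqueeze(0)

    out = torch.empty_like(x)
    out[..., 0::2] = x_even * cos - x_odd * sin
    out[..., 1::2] = x_even * sin + x_odd * cos
    return out.to(dtype=x.dtype)
```

Applied to the same pretrained Q/K projections, this variant and the two-halves variant rotate different feature pairs, which is exactly the silent mismatch described above.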
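To illustrate the second option (reordering the pretrained weights rather than changing the RoPE code), here is a rough sketch of the kind of row permutation involved for a Q or K projection matrix. The helper name and signature are hypothetical; the linked transformers issue discusses the exact permutation used in the Llama conversion script, so treat this as a sketch rather than a drop-in converter.

```python
import torch

def permute_qk_rows(w, n_heads):
    # Illustrative sketch: reorder the output rows of a Q or K projection
    # weight of shape (n_heads * head_dim, in_dim) so that features stored as
    # adjacent pairs (interleaved RoPE) end up split into two halves per head
    # (two-halves RoPE). Within each head, rows (2i, 2i + 1) move to rows
    # (i, i + head_dim // 2).
    out_dim, in_dim = w.shape
    head_dim = out_dim // n_heads
    return (
        w.view(n_heads, head_dim // 2, 2, in_dim)  # (head, freq, pair, in)
         .transpose(1, 2)                          # (head, pair, freq, in)
         .reshape(out_dim, in_dim)
    )
```

With the rows reordered like this, the two-halves `apply_rope` pairs the same features that the interleaved variant would have paired, so a checkpoint trained under one convention can be used with the other.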