
Add fused RoPE kernel using CuTe DSL#1

Draft
Copilot wants to merge 3 commits into main from
copilot/add-fused-rope-implementation

Conversation


Copilot AI commented Feb 28, 2026

Implements a fused Rotary Position Embedding (RoPE) operation using NVIDIA's CuTe DSL (CUTLASS), fusing Q and K rotation into a single kernel launch.

Kernel (rope_sm90.cu)

  • One CTA per (token, head) pair; threads stride over head_dim/2
  • cute::make_tensor / make_gmem_ptr / make_layout for tensor views
  • Compile-time HEAD_DIM specialisation via cute::Int<N> (64 / 128 / 256)
  • Supports fp16 and bf16; accumulates in float32
  • rotate_half convention: x_out = x * cos + rotate_half(x) * sin
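The rotate_half convention above can be sketched numerically. This is a NumPy mock of the math only (not the repository's kernel or its PyTorch reference); shapes follow the usage section below.

```python
import numpy as np

def rotate_half(x):
    # (x1, x2) -> (-x2, x1) along the head dimension
    x1, x2 = np.split(x, 2, axis=-1)
    return np.concatenate((-x2, x1), axis=-1)

def rope_reference(x, cos, sin):
    # x:        [total_tokens, num_heads, head_dim]
    # cos, sin: [total_tokens, head_dim // 2]; both halves of head_dim
    #           see the same angles, so duplicate before broadcasting
    cos = np.concatenate((cos, cos), axis=-1)[:, None, :]
    sin = np.concatenate((sin, sin), axis=-1)[:, None, :]
    return x * cos + rotate_half(x) * sin
```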

Python interface (rope.py)

  • Falls back to a pure-PyTorch reference when the CUDA extension is not compiled
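A hedged sketch of that fallback dispatch. The extension module name `_C` and the helper `_rope_fallback` are assumptions, and NumPy stands in for the PyTorch reference to keep the sketch self-contained.

```python
import numpy as np

try:
    from rynn_scale import _C  # compiled CUDA extension; name is an assumption
    _HAS_EXT = True
except ImportError:
    _HAS_EXT = False

def _rope_fallback(x, cos, sin):
    # Pure reference path: rotate_half convention with duplicated
    # cos/sin angles, accumulating in float32
    x1, x2 = np.split(x.astype(np.float32), 2, axis=-1)
    c, s = cos[:, None, :], sin[:, None, :]
    return np.concatenate((x1 * c - x2 * s, x2 * c + x1 * s), axis=-1)

def fused_rope(q, k, cos, sin):
    # Prefer the fused kernel; otherwise rotate q and k separately
    if _HAS_EXT:
        return _C.fused_rope_forward(q, k, cos, sin)
    return _rope_fallback(q, cos, sin), _rope_fallback(k, cos, sin)
```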

Build

  • rope_sm90.cu added to the CUDA extension sources in setup.py
  • fused_rope_forward registered in bindings.cpp and exported from rynn_scale.ops
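The build wiring might look roughly like the following setup.py fragment (a sketch, not the repository's actual file: the extension module name, source paths, and nvcc flags are assumptions beyond rope_sm90.cu itself).

```python
from setuptools import setup
from torch.utils.cpp_extension import BuildExtension, CUDAExtension

setup(
    name="rynn_scale",
    ext_modules=[
        CUDAExtension(
            name="rynn_scale._C",     # assumed extension module name
            sources=[
                "csrc/bindings.cpp",  # registers fused_rope_forward
                "csrc/rope_sm90.cu",  # the fused RoPE kernel
            ],
            # SM90 target for the Hopper-specific kernel (assumed flags)
            extra_compile_args={
                "nvcc": ["-gencode=arch=compute_90a,code=sm_90a"],
            },
        )
    ],
    cmdclass={"build_ext": BuildExtension},
)
```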

Usage

from rynn_scale.ops import fused_rope

# q, k: [total_tokens, num_{q,k}_heads, head_dim]  fp16/bf16
# cos, sin: [total_tokens, head_dim/2]              float32
q_out, k_out = fused_rope(q, k, cos, sin)
Original prompt

Write me a fused RoPE using the CuTe DSL. Requirements:

Created from VS Code



@CLAassistant

CLA assistant check
Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you sign our Contributor License Agreement before we can accept your contribution.

Copilot AI and others added 2 commits February 28, 2026 06:51
- rope_sm90.cu: CUDA kernel with CuTe make_tensor/make_layout DSL.
  One CTA per (token, head) pair; supports fp16/bf16, head_dim 64/128/256.
  Fuses Q and K rotation in a single kernel launch.
- bindings.cpp: Add fused_rope_forward PyBind11 binding.
- rope.py: Python wrapper with pure-PyTorch fallback.
- ops/__init__.py: Export fused_rope.
- setup.py: Add rope_sm90.cu to CUDA extension sources.

Co-authored-by: lkhl <78654844+lkhl@users.noreply.github.com>
Copilot AI changed the title from "[WIP] Add implementation for fused rope using cutedsl" to "Add fused RoPE kernel using CuTe DSL" on Feb 28, 2026
