
Add fused RoPE kernel using CuTe DSL#1

Draft
Copilot wants to merge 3 commits into main from
copilot/add-fused-rope-implementation

Conversation


Copilot AI commented Feb 28, 2026

Implements a fused Rotary Position Embedding (RoPE) operation using NVIDIA's CuTe DSL (CUTLASS), fusing Q and K rotation into a single kernel launch.

Kernel (rope_sm90.cu)

  • One CTA per (token, head) pair; threads stride over head_dim/2
  • cute::make_tensor / make_gmem_ptr / make_layout for tensor views
  • Compile-time HEAD_DIM specialisation via cute::Int<N> (64 / 128 / 256)
  • Supports fp16 and bf16; accumulates in float32
  • rotate_half convention: x_out = x * cos + rotate_half(x) * sin
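The rotate_half convention above can be sketched numerically. This is a NumPy mock of the math only (not the repository's kernel or its PyTorch reference); shapes follow the usage section below.

```python
import numpy as np

def rotate_half(x):
    # (x1, x2) -> (-x2, x1) along the head dimension
    x1, x2 = np.split(x, 2, axis=-1)
    return np.concatenate((-x2, x1), axis=-1)

def rope_reference(x, cos, sin):
    # x:        [total_tokens, num_heads, head_dim]
    # cos, sin: [total_tokens, head_dim // 2]; both halves of head_dim
    #           see the same angles, so duplicate before broadcasting
    cos = np.concatenate((cos, cos), axis=-1)[:, None, :]
    sin = np.concatenate((sin, sin), axis=-1)[:, None, :]
    return x * cos + rotate_half(x) * sin
```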

Python interface (rope.py)

  • Falls back to a pure-PyTorch reference when the CUDA extension is not compiled
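A hedged sketch of that fallback dispatch. The extension module name `_C` and the helper `_rope_fallback` are assumptions, and NumPy stands in for the PyTorch reference to keep the sketch self-contained.

```python
import numpy as np

try:
    from rynn_scale import _C  # compiled CUDA extension; name is an assumption
    _HAS_EXT = True
except ImportError:
    _HAS_EXT = False

def _rope_fallback(x, cos, sin):
    # Pure reference path: rotate_half convention with duplicated
    # cos/sin angles, accumulating in float32
    x1, x2 = np.split(x.astype(np.float32), 2, axis=-1)
    c, s = cos[:, None, :], sin[:, None, :]
    return np.concatenate((x1 * c - x2 * s, x2 * c + x1 * s), axis=-1)

def fused_rope(q, k, cos, sin):
    # Prefer the fused kernel; otherwise rotate q and k separately
    if _HAS_EXT:
        return _C.fused_rope_forward(q, k, cos, sin)
    return _rope_fallback(q, cos, sin), _rope_fallback(k, cos, sin)
```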

Build

  • rope_sm90.cu added to the CUDA extension sources in setup.py
  • fused_rope_forward registered in bindings.cpp and exported from rynn_scale.ops
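The build wiring might look roughly like the following setup.py fragment (a sketch, not the repository's actual file: the extension module name, source paths, and nvcc flags are assumptions beyond rope_sm90.cu itself).

```python
from setuptools import setup
from torch.utils.cpp_extension import BuildExtension, CUDAExtension

setup(
    name="rynn_scale",
    ext_modules=[
        CUDAExtension(
            name="rynn_scale._C",     # assumed extension module name
            sources=[
                "csrc/bindings.cpp",  # registers fused_rope_forward
                "csrc/rope_sm90.cu",  # the fused RoPE kernel
            ],
            # SM90 target for the Hopper-specific kernel (assumed flags)
            extra_compile_args={
                "nvcc": ["-gencode=arch=compute_90a,code=sm_90a"],
            },
        )
    ],
    cmdclass={"build_ext": BuildExtension},
)
```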

Usage

from rynn_scale.ops import fused_rope

# q, k: [total_tokens, num_{q,k}_heads, head_dim]  fp16/bf16
# cos, sin: [total_tokens, head_dim/2]              float32
q_out, k_out = fused_rope(q, k, cos, sin)
Original prompt

Write me a fused RoPE using the CuTe DSL. Requirements:

Created from VS Code



@CLAassistant

CLA assistant check
Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you sign our Contributor License Agreement before we can accept your contribution.

Copilot AI and others added 2 commits February 28, 2026 06:51
- rope_sm90.cu: CUDA kernel with CuTe make_tensor/make_layout DSL.
  One CTA per (token, head) pair; supports fp16/bf16, head_dim 64/128/256.
  Fuses Q and K rotation in a single kernel launch.
- bindings.cpp: Add fused_rope_forward PyBind11 binding.
- rope.py: Python wrapper with pure-PyTorch fallback.
- ops/__init__.py: Export fused_rope.
- setup.py: Add rope_sm90.cu to CUDA extension sources.

Co-authored-by: lkhl <78654844+lkhl@users.noreply.github.com>
Copilot AI changed the title from "[WIP] Add implementation for fused rope using cutedsl" to "Add fused RoPE kernel using CuTe DSL" on Feb 28, 2026
