
[Cherry-Pick][Optimization][#7687] optimize tritonmoe_preprocess op #7786

Merged
yongqiangma merged 4 commits into PaddlePaddle:release/2.6 from yongqiangma:moe_align_26 on May 14, 2026

Conversation

@yongqiangma
Collaborator

Motivation

The kernel embedded in tritonmoe_preprocess hard-codes a whitelist of num_experts values (2/8/32/64/128/160/256), so arbitrary expert counts are unsupported, and performing sorting and alignment as two separate steps is inefficient. This refactor extracts the alignment logic into a standalone moe_align_kernel.cu and uses a three-way dispatch strategy to broaden applicability and improve performance.
Test platform: NVIDIA A100-SXM4-80GB, Driver Version: 535.230.02, CUDA Version: 12.9

| num_tokens | num_experts | old (ms) | new (ms) | improvement |
|-----------:|------------:|---------:|---------:|------------:|
| 256 | 64 | 0.0208 | 0.0209 | -0.48% |
| 512 | 64 | 0.0212 | 0.0209 | 1.44% |
| 1024 | 64 | 0.0196 | 0.0195 | 0.51% |
| 2048 | 64 | 0.0211 | 0.0205 | 2.93% |
| 4096 | 64 | 0.0232 | 0.0227 | 2.20% |
| 8192 | 64 | 0.0281 | 0.0263 | 6.84% |
| 16384 | 64 | 0.0374 | 0.0285 | 31.23% |
| 32768 | 64 | 0.0557 | 0.0288 | 93.40% |
| 65536 | 64 | 0.0917 | 0.0291 | 215.12% |
| 163840 | 64 | 0.2026 | 0.0306 | 562.09% |
| 256 | 256 | 0.026 | 0.0209 | 24.40% |
| 512 | 256 | 0.0262 | 0.0197 | 32.99% |
| 1024 | 256 | 0.0245 | 0.0197 | 24.37% |
| 2048 | 256 | 0.0257 | 0.0207 | 24.15% |
| 4096 | 256 | 0.0284 | 0.0229 | 24.02% |
| 8192 | 256 | 0.0323 | 0.0264 | 22.35% |
| 16384 | 256 | 0.0411 | 0.041 | 0.24% |
| 32768 | 256 | 0.0594 | 0.0415 | 43.13% |
| 65536 | 256 | 0.096 | 0.0417 | 130.22% |
| 163840 | 256 | 0.1985 | 0.0433 | 358.43% |



Modifications

  • Added custom_ops/gpu_ops/moe/moe_align_kernel.cu: implements the three-way-dispatch moe_align_block_size function template and the corresponding CUDA kernels (small-batch single-block kernel, large-batch cooperative kernel, general two-kernel path); see the sketch after this list
  • custom_ops/gpu_ops/helper.h: moved the CEILDIV utility macro up into this shared header
  • custom_ops/gpu_ops/moe/tritonmoe_preprocess.cu: removed the old embedded kernel in favor of declaring and calling moe_align_block_size; supports arbitrary num_experts; fixed the max_num_tokens_padded computation in the small-token branch; renamed the output from expert_ids to experts_ids
  • custom_ops/setup_ops.py: added the new file to both compilation source lists
  • tests/operators/test_tritonmoe_preprocess.py: added correctness tests covering each branch scenario (empty input, single token, uniform distribution, large batch, etc.)
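A minimal host-side sketch of the three-way dispatch described above, assuming the small-batch threshold visible in the diff hunks later in this thread (numel < 1024 and num_experts <= 64); the helper name, the probe logic, and the kernel names are illustrative, not the PR's exact code:

    #include <cuda_runtime.h>
    #include <cstdint>

    // Sketch only: pick one of the three alignment paths. Kernel launches are
    // elided; the real signatures live in moe_align_kernel.cu.
    template <typename scalar_t>
    void moe_align_block_size_sketch(int64_t numel, int64_t num_experts,
                                     cudaStream_t stream) {
      // Path 1: small batch — counts, scan, and scatter fit in one block's
      // shared memory, so a single-block kernel does everything in one launch.
      const bool small_batch = (numel < 1024) && (num_experts <= 64);
      if (small_batch) {
        // small_batch_kernel<scalar_t><<<1, threads, smem, stream>>>(...)
        return;
      }

      // Path 2: large batch — one grid-wide cooperative kernel, where
      // grid.sync() replaces the kernel boundary between count and scatter.
      int device = 0, coop_supported = 0;
      cudaGetDevice(&device);
      cudaDeviceGetAttribute(&coop_supported, cudaDevAttrCooperativeLaunch,
                             device);
      if (coop_supported) {
        // cudaLaunchCooperativeKernel(...); if the launch succeeds, return —
        // otherwise fall through to the general path below.
      }

      // Path 3: general fallback — two ordinary kernels: one builds the
      // per-expert cumsum, the second scatters token ids into padded blocks.
      // count_and_scan_kernel<<<grid1, threads, smem, stream>>>(...)
      // scatter_kernel<<<grid2, threads, 0, stream>>>(...)
    }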

Usage or Command

N/A

Accuracy Tests

N/A

Checklist

  • Add at least a tag in the PR title.
    • Tag list: [[FDConfig],[APIServer],[Engine], [Scheduler], [PD Disaggregation], [Executor], [Graph Optimization], [Speculative Decoding], [RL], [Models], [Quantization], [Loader], [OP], [KVCache], [DataProcessor], [BugFix], [Docs], [CI], [Optimization], [Feature], [Benchmark], [Others], [XPU], [HPU], [GCU], [DCU], [Iluvatar], [Metax]]
    • You can add new tags based on the PR content, but the semantics must be clear.
  • Format your code, run pre-commit before commit.
  • Add unit tests. Please write the reason in this PR if no unit tests.
  • Provide accuracy results.
  • If the current PR is submitting to the release branch, make sure the PR has been submitted to the develop branch, then cherry-pick it to the release branch with the [Cherry-Pick] PR tag.

Copilot AI review requested due to automatic review settings May 12, 2026 06:19
@paddle-bot

paddle-bot Bot commented May 12, 2026

Thanks for your contribution!

Contributor

Copilot AI left a comment


Pull request overview

This PR refactors and optimizes the alignment/sorting logic of tritonmoe_preprocess: it removes the old kernel's num_experts whitelist restriction, extracts the alignment logic into a standalone CUDA implementation, and adds more comprehensive correctness tests to broaden applicability and improve performance.

Changes:

  • Added moe_align_kernel.cu, which implements the three-way dispatch strategy (small-batch single block / large-batch cooperative / general two-kernel) for MoE alignment and token sorting.
  • tritonmoe_preprocess.cu now calls moe_align_block_size, computes numel for inputs of arbitrary rank, and corrects how the padded size and block count are computed in the small-token branch.
  • Expanded test_tritonmoe_preprocess.py to cover more shapes and branch scenarios; setup_ops.py adds the new file to the build; helper.h hosts the relocated CEILDIV macro.

Note: "optmize" in the PR title should be corrected to "optimize" for consistency with the repository's title conventions and for searchability.

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 8 comments.

| File | Description |
|------|-------------|
| custom_ops/gpu_ops/moe/moe_align_kernel.cu | New MoE alignment and token-sorting CUDA kernels (multi-strategy dispatch) |
| custom_ops/gpu_ops/moe/tritonmoe_preprocess.cu | Removes the old embedded kernel, calls moe_align_block_size instead, and adjusts shape derivation / output naming |
| custom_ops/gpu_ops/helper.h | Promotes CEILDIV to a shared macro for reuse across CUDA sources |
| custom_ops/setup_ops.py | Adds the new moe_align_kernel.cu to the compilation source lists |
| tests/operators/test_tritonmoe_preprocess.py | New/refactored correctness tests covering more input shapes and edge cases |

Comment on lines +76 to +80

    int32_t total_vecs = (max_num_tokens_padded + VEC_SIZE - 1) / VEC_SIZE;
    Vec* out_ptr = reinterpret_cast<Vec*>(sorted_token_ids);
    for (int32_t i = threadIdx.x; i < total_vecs; i += blockDim.x) {
      out_ptr[i] = fill_vec;
    }

Comment on lines +274 to +279

    int32_t total_vecs = (max_num_tokens_padded + VEC_SIZE - 1) / VEC_SIZE;
    Vec* out_ptr = reinterpret_cast<Vec*>(sorted_token_ids);
    for (int32_t i = bid * nthreads + tid; i < total_vecs;
         i += nblocks * nthreads) {
      out_ptr[i] = fill_vec;
    }
Comment on lines +553 to +571

    // Original 2-kernel approach (for medium inputs or cooperative fallback)
    auto align_kernel = moe_align_block_size_kernel<scalar_t>;

    const size_t scan_size = next_pow_2(num_experts);
    const size_t shared_mem_size =
        (num_experts + (num_experts + 1) + scan_size + WARP_SIZE) *
        sizeof(int32_t);
    align_kernel<<<2, threads, shared_mem_size, stream>>>(
        topk_ids.data<scalar_t>(),
        sorted_token_ids.data<int32_t>(),
        experts_ids.data<int32_t>(),
        num_tokens_post_pad.data<int32_t>(),
        num_experts,
        block_size,
        numel,
        cumsum_buffer.data<int32_t>(),
        pad_sorted_token_ids,
        scan_size,
        max_num_tokens_padded);
Comment on lines +467 to +475

    bool small_batch_expert_mode = (numel < 1024) && (num_experts <= 64);

    if (small_batch_expert_mode) {
      const int32_t expert_threads = max((int32_t)num_experts, WARP_SIZE);
      constexpr int32_t fill_threads = 256;
      const int32_t shared_mem_size =
          ((expert_threads + 1) * num_experts + (num_experts + 1)) *
          sizeof(int32_t);
Comment on lines +267 to +275

    def setUp(self):
        if not _AVAILABLE:
            self.skipTest("CUDA or fastdeploy not available")

    def test_docstring_example(self):
        """Reproduce the example from the function docstring."""
        topk_ids = paddle.to_tensor([[2, 3, 4], [1, 2, 4], [1, 3, 4], [1, 2, 3]], dtype="int64")
        _verify(topk_ids, block_size=4, num_experts=5, label="docstring_example")


    DEVICE = "gpu"

    # Print full tensors only for small cases; above this threshold print only a summary
Comment on lines +447 to +486

    if not _AVAILABLE:
        print("SKIP: CUDA or fastdeploy not available.")
    else:
        basic = TestTritonMoePreprocessBasic()
        basic.test_docstring_example()
        basic.test_single_token_single_expert()
        basic.test_all_tokens_same_expert()
        basic.test_uniform_1d()
        basic.test_topk_equals_num_experts()
        basic.test_num_tokens_less_than_num_experts()
        basic.test_exact_block_boundary()
        basic.test_block_size_1()

        edge = TestTritonMoePreprocessEdgeCases()
        edge.test_empty_topk_ids()
        edge.test_one_expert()
        edge.test_large_block_size()
        edge.test_int64_dtype()

        real = TestTritonMoePreprocessRealistic()
        for num_tokens, num_experts, block_size in [
            (256, 8, 16),
            (1024, 16, 16),
            (4096, 64, 16),
            (8192, 64, 32),
            (8192, 128, 64),
            (16384, 256, 128),
        ]:
            real._run_uniform_distribution(num_tokens, num_experts, block_size)
        for num_tokens, top_k, num_experts, block_size in [
            (512, 2, 8, 16),
            (1024, 4, 16, 16),
            (2048, 8, 64, 16),
        ]:
            real._run_topk_2d(num_tokens, top_k, num_experts, block_size)
        for alpha in [0.5, 1.2, 2.0]:
            real._run_zipf_distribution(alpha)
        real.test_deterministic_with_fixed_seed()

        print("\n*** All direct-run tests passed ***")
Comment on lines +48 to +64

    DEVICE = "gpu"

    # Print full tensors only for small cases; above this threshold print only a summary
    _PRINT_TENSOR_NUMEL_LIMIT = 64


    def _fmt_tensor(t: paddle.Tensor, name: str) -> str:
        t_cpu = t.cpu()
        if t_cpu.numel() <= _PRINT_TENSOR_NUMEL_LIMIT:
            return f"{name}{list(t_cpu.shape)} = {t_cpu.tolist()}"
        return (
            f"{name}{list(t_cpu.shape)} | "
            f"min={int(t_cpu.min())} max={int(t_cpu.max())} "
            f"mean={float(t_cpu.cast('float32').mean()):.2f} numel={t_cpu.numel()}"
        )

@PaddlePaddle-bot

PaddlePaddle-bot commented May 12, 2026

🤖 Paddle-CI-Agent | ci_status_monitor | 2026-05-14 18:29:57

CI report generated from the code below (refreshed every 30 minutes):


1 Task Overview

⏳ CI has not finished yet: 2 required tasks are running and 2 required tasks are pending; no required task has failed so far.

| Total runs (reruns) | Total tasks | ✅ Passed | ❌ Failed | ⏳ Running | ⏸️ Pending | Skipped |
|---|---|---|---|---|---|---|
| 36 (0) | 36 | 29 | 2 | 2 | 3 | 0 |

2 Task Status Summary

2.1 Required tasks: 6/10 passed

Required tasks block merging; failures must be addressed first.

| Status | Task | Duration | Root cause | Suggested fix | Log | Rerun |
|---|---|---|---|---|---|---|
| ⏳ | Run FastDeploy Unit Tests and Coverage / run_tests_with_coverage | - | running | - | - | - |
| ⏳ | xpu_4cards_case_test / run_xpu_4cards_cases | - | running | - | - | - |
| ⏸️ | Extracted partial CE model tasks to run in CI. / run_ce_cases | - | pending | - | - | - |
| ⏸️ | Run Four Cards Tests / run_4_cards_tests | - | pending | - | - | - |
| | The remaining 6 required tasks passed | - | - | - | - | - |

2.2 Optional tasks — 23/26 passed

Optional tasks do not block merging; their failures are informational only.

| Status | Task | Duration | Log | Rerun |
|---|---|---|---|---|
| ❌ | Run iluvatar Tests / run_iluvatar_cases | 18m51s | Job | - |
| ❌ | Trigger Jenkins for PR | 50s | Job | - |
| ⏸️ | CI_HPU | - | - | - |
| | The remaining 23 optional tasks passed | - | - | - |

3 Failure Details (required only)

No required tasks failed.

@codecov-commenter

codecov-commenter commented May 12, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
⚠️ Please upload report for BASE (release/2.6@df637af). Learn more about missing BASE report.

Additional details and impacted files
@@              Coverage Diff               @@
##             release/2.6    #7786   +/-   ##
==============================================
  Coverage               ?   72.46%           
==============================================
  Files                  ?      381           
  Lines                  ?    54139           
  Branches               ?     8456           
==============================================
  Hits                   ?    39230           
  Misses                 ?    12152           
  Partials               ?     2757           
| Flag | Coverage Δ |
|---|---|
| GPU | 72.46% <ø> (?) |

Flags with carried forward coverage won't be shown.

☔ View full report in Codecov by Sentry.

Copilot AI review requested due to automatic review settings May 14, 2026 06:23
@yongqiangma yongqiangma changed the title from "[Cherry-Pick][Optimization] optmize tritonmoe_preprocess op" to "[Cherry-Pick][Optimization][#7687] optimize tritonmoe_preprocess op" on May 14, 2026
Contributor

Copilot AI left a comment


Pull request overview

Copilot reviewed 5 out of 5 changed files in this pull request and generated 6 comments.

Comments suppressed due to low confidence (3)

custom_ops/gpu_ops/moe/moe_align_kernel.cu:279

  • The cooperative kernel also fills sorted_token_ids with int4 writes over total_vecs = ceil(max_num_tokens_padded / 4); when max_num_tokens_padded is not a multiple of 4, the final write lands past the end of the buffer. Consider aligning the length or handling the final sub-VEC_SIZE tail separately to avoid the out-of-bounds write. (A tail-handling sketch follows below.)

    if (pad_sorted_token_ids) {
      Vec fill_vec;
      fill_vec.x = fill_vec.y = fill_vec.z = fill_vec.w =
          static_cast<int32_t>(numel);
      int32_t total_vecs = (max_num_tokens_padded + VEC_SIZE - 1) / VEC_SIZE;
      Vec* out_ptr = reinterpret_cast<Vec*>(sorted_token_ids);
      for (int32_t i = bid * nthreads + tid; i < total_vecs;
           i += nblocks * nthreads) {
        out_ptr[i] = fill_vec;
      }
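A minimal device-side sketch of one tail-safe alternative, reusing the variable names from the hunk above (the Vec/VEC_SIZE stand-ins are added here for self-containment); the PR's actual fix may differ:

    #include <cstdint>

    struct Vec { int32_t x, y, z, w; };  // stand-in for int4
    constexpr int32_t VEC_SIZE = 4;

    // Vectorized fill with a scalar tail, so the last partial int4 never
    // writes past the end of the buffer.
    __device__ void fill_sorted_token_ids(int32_t* sorted_token_ids,
                                          int32_t max_num_tokens_padded,
                                          int32_t fill_value,
                                          int32_t gtid, int32_t gstride) {
      Vec fill_vec;
      fill_vec.x = fill_vec.y = fill_vec.z = fill_vec.w = fill_value;
      // Full vectors only: floor, not ceil, so no write crosses the end.
      const int32_t full_vecs = max_num_tokens_padded / VEC_SIZE;
      Vec* out_ptr = reinterpret_cast<Vec*>(sorted_token_ids);
      for (int32_t i = gtid; i < full_vecs; i += gstride) {
        out_ptr[i] = fill_vec;
      }
      // Scalar tail covers the remaining 0..VEC_SIZE-1 elements.
      for (int32_t i = full_vecs * VEC_SIZE + gtid; i < max_num_tokens_padded;
           i += gstride) {
        sorted_token_ids[i] = fill_value;
      }
    }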

tests/operators/test_tritonmoe_preprocess.py:323

  • Same as above: the EdgeCases group's setUp does not call paddle.set_device("gpu"), so most cases (all except empty_topk_ids) create tensors that default to the CPU, which can make the GPU op call fail or make the tests order-dependent. Consider calling set_device("gpu") explicitly in setUp, or migrating tensors uniformly inside _verify.

    def setUp(self):
        if not _AVAILABLE:
            self.skipTest("CUDA or fastdeploy not available")

tests/operators/test_tritonmoe_preprocess.py:356

  • Same as above: the Realistic group's setUp does not call paddle.set_device("gpu"), so paddle.randint/to_tensor and similar calls produce CPU tensors by default, causing tritonmoe_preprocess calls to fail or making the cases depend on execution order. Consider calling set_device("gpu") explicitly in setUp, ideally with a device_count check so CUDA builds without a GPU do not error out.

    def setUp(self):
        if not _AVAILABLE:
            self.skipTest("CUDA or fastdeploy not available")

Comment on lines +76 to +80

    int32_t total_vecs = (max_num_tokens_padded + VEC_SIZE - 1) / VEC_SIZE;
    Vec* out_ptr = reinterpret_cast<Vec*>(sorted_token_ids);
    for (int32_t i = threadIdx.x; i < total_vecs; i += blockDim.x) {
      out_ptr[i] = fill_vec;
    }

Comment on lines +91 to +96

    const size_t tid = threadIdx.x;
    const size_t stride = blockDim.x;

    if (tid < num_experts) {
      shared_counts[tid] = 0;
    }

Comment on lines +469 to +472

    if (small_batch_expert_mode) {
      const int32_t expert_threads = max((int32_t)num_experts, WARP_SIZE);
      constexpr int32_t fill_threads = 256;
      const int32_t shared_mem_size =

    if (err == cudaSuccess) {
      return;
    }
    // Fall through to original path if cooperative launch failed

    def setUp(self):
        if not _AVAILABLE:
            self.skipTest("CUDA or fastdeploy not available")

Comment on lines +161 to +162

    print(f"\n{tag}ALL CHECKS PASSED")
    print(sep)


@PaddlePaddle-bot PaddlePaddle-bot left a comment


🤖 Paddle-CI-Agent | pr_review | 2026-05-14 18:11:12

📋 Review Summary

PR overview: refactors the tritonmoe_preprocess alignment kernel into a standalone moe_align_kernel.cu, introduces a three-way dispatch strategy (small-batch single block / cooperative / two-kernel), removes the hard-coded num_experts whitelist, and supports arbitrary expert counts.
Scope of changes: custom_ops/gpu_ops/moe/, custom_ops/gpu_ops/helper.h, custom_ops/setup_ops.py, tests/operators/
Impact tags: [OP] [Optimization]

📝 PR Convention Check

The title does not follow the Cherry-Pick convention: [#7687] should appear at the end of the description as (#7687), not be inserted in the middle as a standalone bracketed tag.

Suggested title (ready to copy):

  • [Cherry-Pick][Optimization] optimize tritonmoe_preprocess op (#7687)

The description structure (Motivation / Modifications / Usage or Command / Accuracy Tests / Checklist) is complete and compliant; no changes needed.

Issues

| Level | File | Summary |
|---|---|---|
| ❓ Question | custom_ops/gpu_ops/helper.h:76 | The CEILDIV macro parameters are unparenthesized, and a + b - 1 can overflow when a is near INT_MAX |
| ❓ Question | custom_ops/gpu_ops/moe/moe_align_kernel.cu:454 | A static variable caches the SM count, which can cache the wrong device's properties in a multi-device process |

Overall Assessment

The refactoring approach is clear: the three-way dispatch brings significant gains for large batches (the benchmark data is thorough), and the new tests cover the edge cases and every branch. Two implementation details (macro safety and the multi-device risk of the static cache) are worth the author's confirmation, but neither blocks the merge.

    using json = nlohmann::json;
    #endif

    #define CEILDIV(a, b) (((a + b - 1) / b))

❓ Question: the CEILDIV macro has two potential problems:

  1. Integer overflow: a + b - 1 overflows when a is close to INT_MAX (e.g., in int32_t contexts).
  2. Unparenthesized macro parameters: if an expression is passed (e.g., CEILDIV(x+1, y)), a expands without parentheses, which can cause operator-precedence errors.

Suggested fix:

    #define CEILDIV(a, b) (((a) + (b) - 1) / (b))

Or use an inline function to avoid macro pitfalls:

    template <typename T>
    inline T ceildiv(T a, T b) { return (a + b - 1) / b; }
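To make the precedence hazard concrete, here is a small hypothetical demonstration (the variable names are illustrative, not from the PR):

    #include <cassert>

    #define CEILDIV_BAD(a, b) (((a + b - 1) / b))
    #define CEILDIV_GOOD(a, b) (((a) + (b) - 1) / (b))

    int main() {
      int n = 10, d = 2;
      // Passing an expression as the divisor exposes the bug:
      // CEILDIV_BAD(n, d + 1) expands to (((n + d + 1 - 1) / d + 1)),
      // i.e. (12 / 2) + 1 = 7, while ceil(10 / 3) should be 4.
      assert(CEILDIV_BAD(n, d + 1) == 7);   // wrong: precedence error
      assert(CEILDIV_GOOD(n, d + 1) == 4);  // correct ceiling division
      return 0;
    }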

    template <typename scalar_t>
    void moe_align_block_size(const paddle::Tensor& topk_ids,
                              int64_t num_experts,
                              int64_t block_size,

❓ Question: cached_max_blocks_per_sm and cached_num_sms are function-level static variables, which carries two potential risks:

  1. Wrong device cached: if the process uses multiple GPUs (e.g., a test environment switching cudaSetDevice), the first call caches one device's SM count; later calls on a different device then use the wrong SM count, so coop_blocks can exceed the target device's limit and the cooperative launch fails.
  2. Thread safety: concurrent first calls race on double initialization (the written values are usually identical, but cudaGetDevice may return different device IDs).

Consider caching per device_id, or simply querying before each call (the overhead is tiny):

    int device_id;
    cudaGetDevice(&device_id);
    int max_blocks_per_sm, num_sms;
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&max_blocks_per_sm, ...);
    cudaDeviceGetAttribute(&num_sms, cudaDevAttrMultiProcessorCount, device_id);
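A minimal per-device caching sketch along the lines of the first suggestion; the map-based cache and the helper name are illustrative, not the PR's code:

    #include <cuda_runtime.h>
    #include <mutex>
    #include <unordered_map>

    // Cache the SM count per device id, so switching devices with
    // cudaSetDevice never reuses another device's properties.
    static int num_sms_for_current_device() {
      static std::mutex mu;
      static std::unordered_map<int, int> cache;  // device_id -> SM count
      int device_id = 0;
      cudaGetDevice(&device_id);
      std::lock_guard<std::mutex> lock(mu);
      auto it = cache.find(device_id);
      if (it != cache.end()) return it->second;
      int num_sms = 0;
      cudaDeviceGetAttribute(&num_sms, cudaDevAttrMultiProcessorCount,
                             device_id);
      cache.emplace(device_id, num_sms);
      return num_sms;
    }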

@yongqiangma yongqiangma merged commit 72beb9e into PaddlePaddle:release/2.6 May 14, 2026
35 of 38 checks passed