[Cherry-Pick][Optimization][#7687] optimize tritonmoe_preprocess op #7786
Conversation
Thanks for your contribution!
Pull request overview
This PR refactors and optimizes the alignment/sorting logic of tritonmoe_preprocess: it removes the old kernel's whitelist restriction on num_experts, extracts the alignment logic into a standalone CUDA implementation, and adds more comprehensive correctness tests, improving both applicability and performance.
Changes:
- Added moe_align_kernel.cu, which implements a three-way dispatch strategy (small-batch single block / large-batch cooperative / generic two-kernel) for MoE alignment and token sorting.
- tritonmoe_preprocess.cu now calls moe_align_block_size, supports numel computation for inputs of arbitrary dimensions, and fixes how the padded size and block count are computed on the small-token branch.
- Expanded test_tritonmoe_preprocess.py to cover more shapes and branch scenarios; setup_ops.py adds a build entry for the new file; helper.h hoists the CEILDIV macro.

Note: the "optmize" typo in the PR title should be corrected to "optimize", for consistency with the repository's title conventions and for searchability.
Reviewed changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 8 comments.
| File | Description |
|---|---|
| custom_ops/gpu_ops/moe/moe_align_kernel.cu | New MoE alignment and token-sorting CUDA kernels (multi-strategy dispatch) |
| custom_ops/gpu_ops/moe/tritonmoe_preprocess.cu | Removes the old inline kernel, calls moe_align_block_size instead, and adjusts shape inference / output naming |
| custom_ops/gpu_ops/helper.h | Hoists CEILDIV into a shared macro for reuse across CUDA code |
| custom_ops/setup_ops.py | Adds the new moe_align_kernel.cu to the compilation source list |
| tests/operators/test_tritonmoe_preprocess.py | New/refactored correctness tests covering more input shapes and edge cases |
```cuda
int32_t total_vecs = (max_num_tokens_padded + VEC_SIZE - 1) / VEC_SIZE;
Vec* out_ptr = reinterpret_cast<Vec*>(sorted_token_ids);
for (int32_t i = threadIdx.x; i < total_vecs; i += blockDim.x) {
  out_ptr[i] = fill_vec;
}
```
```cuda
int32_t total_vecs = (max_num_tokens_padded + VEC_SIZE - 1) / VEC_SIZE;
Vec* out_ptr = reinterpret_cast<Vec*>(sorted_token_ids);
for (int32_t i = bid * nthreads + tid; i < total_vecs;
     i += nblocks * nthreads) {
  out_ptr[i] = fill_vec;
}
```
```cuda
// Original 2-kernel approach (for medium inputs or cooperative fallback)
auto align_kernel = moe_align_block_size_kernel<scalar_t>;

const size_t scan_size = next_pow_2(num_experts);
const size_t shared_mem_size =
    (num_experts + (num_experts + 1) + scan_size + WARP_SIZE) *
    sizeof(int32_t);
align_kernel<<<2, threads, shared_mem_size, stream>>>(
    topk_ids.data<scalar_t>(),
    sorted_token_ids.data<int32_t>(),
    experts_ids.data<int32_t>(),
    num_tokens_post_pad.data<int32_t>(),
    num_experts,
    block_size,
    numel,
    cumsum_buffer.data<int32_t>(),
    pad_sorted_token_ids,
    scan_size,
    max_num_tokens_padded);
```
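For context, `next_pow_2` itself is not part of this hunk; the sketch below shows one plausible definition (an assumption, not the PR's code) together with a worked instance of the shared-memory sizing formula above, assuming WARP_SIZE == 32:

```cpp
// Hypothetical helper (body assumed): round n up to the next power of two,
// so the scan buffer covers a power-of-two extent for a Blelloch-style scan.
static inline size_t next_pow_2(size_t n) {
  size_t p = 1;
  while (p < n) p <<= 1;
  return p;
}

// Worked example of the sizing formula above, for num_experts = 64:
//   scan_size       = next_pow_2(64) = 64
//   shared_mem_size = (64 + 65 + 64 + 32) * sizeof(int32_t) = 900 bytes
```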
```cuda
bool small_batch_expert_mode = (numel < 1024) && (num_experts <= 64);

if (small_batch_expert_mode) {
  const int32_t expert_threads = max((int32_t)num_experts, WARP_SIZE);
  constexpr int32_t fill_threads = 256;
  const int32_t shared_mem_size =
      ((expert_threads + 1) * num_experts + (num_experts + 1)) *
      sizeof(int32_t);
```
```python
def setUp(self):
    if not _AVAILABLE:
        self.skipTest("CUDA or fastdeploy not available")
```
```python
def test_docstring_example(self):
    """Reproduce the example from the function docstring."""
    topk_ids = paddle.to_tensor([[2, 3, 4], [1, 2, 4], [1, 3, 4], [1, 2, 3]], dtype="int64")
    _verify(topk_ids, block_size=4, num_experts=5, label="docstring_example")
```
```python
if not _AVAILABLE:
    print("SKIP: CUDA or fastdeploy not available.")
else:
    basic = TestTritonMoePreprocessBasic()
    basic.test_docstring_example()
    basic.test_single_token_single_expert()
    basic.test_all_tokens_same_expert()
    basic.test_uniform_1d()
    basic.test_topk_equals_num_experts()
    basic.test_num_tokens_less_than_num_experts()
    basic.test_exact_block_boundary()
    basic.test_block_size_1()

    edge = TestTritonMoePreprocessEdgeCases()
    edge.test_empty_topk_ids()
    edge.test_one_expert()
    edge.test_large_block_size()
    edge.test_int64_dtype()

    real = TestTritonMoePreprocessRealistic()
    for num_tokens, num_experts, block_size in [
        (256, 8, 16),
        (1024, 16, 16),
        (4096, 64, 16),
        (8192, 64, 32),
        (8192, 128, 64),
        (16384, 256, 128),
    ]:
        real._run_uniform_distribution(num_tokens, num_experts, block_size)
    for num_tokens, top_k, num_experts, block_size in [
        (512, 2, 8, 16),
        (1024, 4, 16, 16),
        (2048, 8, 64, 16),
    ]:
        real._run_topk_2d(num_tokens, top_k, num_experts, block_size)
    for alpha in [0.5, 1.2, 2.0]:
        real._run_zipf_distribution(alpha)
    real.test_deterministic_with_fixed_seed()

    print("\n*** All direct-run tests passed ***")
```
| DEVICE = "gpu" | ||
|
|
||
| # 仅对小规模 case 打印详细 tensor,超过此阈值只打印统计摘要 | ||
| _PRINT_TENSOR_NUMEL_LIMIT = 64 | ||
|
|
||
|
|
||
| def _fmt_tensor(t: paddle.Tensor, name: str) -> str: | ||
| t_cpu = t.cpu() | ||
| if t_cpu.numel() <= _PRINT_TENSOR_NUMEL_LIMIT: | ||
| return f"{name}{list(t_cpu.shape)} = {t_cpu.tolist()}" | ||
| return ( | ||
| f"{name}{list(t_cpu.shape)} | " | ||
| f"min={int(t_cpu.min())} max={int(t_cpu.max())} " | ||
| f"mean={float(t_cpu.cast('float32').mean()):.2f} numel={t_cpu.numel()}" | ||
| ) | ||
|
|
||
|
|
CI report generated from the code above (updated every 30 minutes):

1. Task overview: ⏳ CI has not finished yet: 2 required tasks are running and 2 required tasks are pending; no required task is currently failing.
2. Task status summary
   2.1 Required tasks: 6/10 passed
   2.2 Optional tasks: 23/26 passed
3. Failure details (required only): no required task failed.
Codecov Report

✅ All modified and coverable lines are covered by tests.

Additional details and impacted files:

```
@@            Coverage Diff             @@
##           release/2.6   #7786   +/- ##
==============================================
  Coverage             ?   72.46%
==============================================
  Files                ?      381
  Lines                ?    54139
  Branches             ?     8456
==============================================
  Hits                 ?    39230
  Misses               ?    12152
  Partials             ?     2757
```

Flags with carried forward coverage won't be shown.

☔ View full report in Codecov by Sentry.
Pull request overview
Copilot reviewed 5 out of 5 changed files in this pull request and generated 6 comments.
Comments suppressed due to low confidence (3)
custom_ops/gpu_ops/moe/moe_align_kernel.cu:279

- The cooperative kernel likewise fills sorted_token_ids with int4 stores using total_vecs = ceil(max_num_tokens_padded / 4); when max_num_tokens_padded is not a multiple of 4, the last write lands past the end of the buffer. Consider aligning the length or handling the final sub-VEC_SIZE tail separately to avoid the OOB write.

```cuda
if (pad_sorted_token_ids) {
  Vec fill_vec;
  fill_vec.x = fill_vec.y = fill_vec.z = fill_vec.w =
      static_cast<int32_t>(numel);
  int32_t total_vecs = (max_num_tokens_padded + VEC_SIZE - 1) / VEC_SIZE;
  Vec* out_ptr = reinterpret_cast<Vec*>(sorted_token_ids);
  for (int32_t i = bid * nthreads + tid; i < total_vecs;
       i += nblocks * nthreads) {
    out_ptr[i] = fill_vec;
  }
```
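One possible shape for the suggested fix, sketched below with the same identifiers as the hunk (it assumes the enclosing kernel's variables; the vector-loop-plus-scalar-tail split is the reviewer's suggestion, not the PR's code):

```cuda
// Tail-safe fill: full int4 vectors are written with vectorized stores; the
// remainder (when max_num_tokens_padded % VEC_SIZE != 0) is written
// element-wise, so the last store never reaches past sorted_token_ids' end.
const int32_t full_vecs = max_num_tokens_padded / VEC_SIZE;  // rounds down
Vec* out_ptr = reinterpret_cast<Vec*>(sorted_token_ids);
for (int32_t i = bid * nthreads + tid; i < full_vecs;
     i += nblocks * nthreads) {
  out_ptr[i] = fill_vec;
}
// Scalar tail: at most VEC_SIZE - 1 elements.
for (int32_t i = full_vecs * VEC_SIZE + bid * nthreads + tid;
     i < max_num_tokens_padded;
     i += nblocks * nthreads) {
  sorted_token_ids[i] = static_cast<int32_t>(numel);
}
```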
tests/operators/test_tritonmoe_preprocess.py:323

- Same as above: this EdgeCases group's setUp does not call paddle.set_device("gpu"), so most cases (everything except empty_topk_ids) create tensors that default to CPU, which can make the GPU op call fail or make the tests order-dependent. Consider calling set_device("gpu") explicitly in setUp, or migrating tensors uniformly inside _verify.

```python
def setUp(self):
    if not _AVAILABLE:
        self.skipTest("CUDA or fastdeploy not available")
```
tests/operators/test_tritonmoe_preprocess.py:356

- Same as above: this Realistic group's setUp does not call paddle.set_device("gpu"), so paddle.randint / to_tensor produce CPU tensors by default, causing the tritonmoe_preprocess call to fail or making the cases depend on execution order. Consider calling set_device("gpu") explicitly in setUp, and optionally add a device_count check to avoid errors on CUDA builds without a GPU.

```python
def setUp(self):
    if not _AVAILABLE:
        self.skipTest("CUDA or fastdeploy not available")
```
```cuda
int32_t total_vecs = (max_num_tokens_padded + VEC_SIZE - 1) / VEC_SIZE;
Vec* out_ptr = reinterpret_cast<Vec*>(sorted_token_ids);
for (int32_t i = threadIdx.x; i < total_vecs; i += blockDim.x) {
  out_ptr[i] = fill_vec;
}
```
```cuda
const size_t tid = threadIdx.x;
const size_t stride = blockDim.x;

if (tid < num_experts) {
  shared_counts[tid] = 0;
}
```
```cuda
if (small_batch_expert_mode) {
  const int32_t expert_threads = max((int32_t)num_experts, WARP_SIZE);
  constexpr int32_t fill_threads = 256;
  const int32_t shared_mem_size =
```
```cuda
if (err == cudaSuccess) {
  return;
}
// Fall through to original path if cooperative launch failed
```
```python
def setUp(self):
    if not _AVAILABLE:
        self.skipTest("CUDA or fastdeploy not available")
```
```python
print(f"\n{tag}ALL CHECKS PASSED")
print(sep)
```
PaddlePaddle-bot
left a comment
🤖 Paddle-CI-Agent | pr_review |
2026-05-14 18:11:12
📋 Review Summary
PR overview: refactors the tritonmoe_preprocess alignment kernel into a standalone moe_align_kernel.cu, introduces a three-way dispatch strategy (small-batch single block / cooperative / two-kernel), removes the hardcoded num_experts whitelist, and supports arbitrary expert counts.
Scope of changes: custom_ops/gpu_ops/moe/, custom_ops/gpu_ops/helper.h, custom_ops/setup_ops.py, tests/operators/
Impact tags: [OP] [Optimization]
📝 PR convention check
The title format does not follow the Cherry-Pick convention: [#7687] should appear at the end of the description as (#7687), not as a standalone bracket tag in the middle.
Suggested title (copy-paste ready):
[Cherry-Pick][Optimization] optimize tritonmoe_preprocess op(#7687)
The description structure (Motivation / Modifications / Usage or Command / Accuracy Tests / Checklist) is complete and compliant; no changes needed.
Issues
| Level | File | Summary |
|---|---|---|
| ❓ Question | custom_ops/gpu_ops/helper.h:76 | The CEILDIV macro parameters are not parenthesized, and a + b - 1 risks overflow when a is close to INT_MAX |
| ❓ Question | custom_ops/gpu_ops/moe/moe_align_kernel.cu:454 | A static variable caches the SM count, which may cache the wrong device's properties when one process uses multiple devices |
Overall assessment
The refactoring is cleanly structured, the three-way dispatch strategy delivers a significant performance gain on large batches (the benchmark data is thorough), and the new tests cover the edge cases and each dispatch branch. Two implementation details (macro safety, multi-device risk of the static cache) should be confirmed by the author; neither blocks the merge.
```cpp
using json = nlohmann::json;
#endif

#define CEILDIV(a, b) (((a + b - 1) / b))
```
❓ Question: the CEILDIV macro has two potential problems:
- Integer overflow: a + b - 1 overflows when a is close to INT_MAX (e.g., in an int32_t context).
- Unparenthesized macro parameters: if an expression is passed (e.g., CEILDIV(x+1, y)), the expansions of a and b have no parenthesis protection, which can cause operator-precedence errors.

Suggested fix:

```cpp
#define CEILDIV(a, b) (((a) + (b) - 1) / (b))
```

or use an inline function to avoid macro side effects:

```cpp
template <typename T>
inline T ceildiv(T a, T b) { return (a + b - 1) / b; }
```
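To make the precedence hazard concrete, here is a minimal, self-contained illustration (CEILDIV_UNSAFE stands in for the current macro; the example values are chosen for demonstration):

```cpp
#include <cassert>

#define CEILDIV_UNSAFE(a, b) (((a + b - 1) / b))
#define CEILDIV_SAFE(a, b) (((a) + (b) - 1) / (b))

int main() {
  int n = 10, x = 1;
  // CEILDIV_UNSAFE(n, x + 1) expands to (((10 + 1 + 1 - 1) / 1 + 1)) == 12,
  // because '/' binds tighter than '+': the divisor expression is split apart.
  assert(CEILDIV_UNSAFE(n, x + 1) == 12);  // wrong: expected ceil(10/2) == 5
  assert(CEILDIV_SAFE(n, x + 1) == 5);     // parenthesized version is correct
  return 0;
}
```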
```cpp
template <typename scalar_t>
void moe_align_block_size(const paddle::Tensor& topk_ids,
                          int64_t num_experts,
                          int64_t block_size,
```
❓ Question: cached_max_blocks_per_sm and cached_num_sms are function-level static variables, which carries two potential risks:
- Wrong cache across devices: if the same process uses multiple GPUs (e.g., a test environment switching cudaSetDevice), the first call caches one device's SM count, and later calls on a different device use the wrong value, so coop_blocks can exceed the target device's limit and the cooperative launch fails.
- Thread safety: concurrent first calls have a double-init race (the written values would match, but cudaGetDevice may return different device IDs).

Consider caching per device_id, or simply querying before each call (the overhead is tiny):

```cpp
int device_id;
cudaGetDevice(&device_id);
int max_blocks_per_sm, num_sms;
cudaOccupancyMaxActiveBlocksPerMultiprocessor(&max_blocks_per_sm, ...);
cudaDeviceGetAttribute(&num_sms, cudaDevAttrMultiProcessorCount, device_id);
```
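A minimal sketch of the per-device-cache alternative (the function name, the fixed array bound, and the fallback path are assumptions for illustration, not code from the PR):

```cpp
#include <cuda_runtime.h>

// Per-device cache of the SM count. Assumes a small fixed upper bound on the
// device count (kMaxDevices); devices beyond it fall back to a direct query.
// The first-call race on cached[dev] is benign: racing writers store the same
// value for the same device.
static int get_num_sms_cached() {
  constexpr int kMaxDevices = 16;        // assumption: enough for one node
  static int cached[kMaxDevices] = {0};  // 0 means "not cached yet"
  int dev = 0;
  cudaGetDevice(&dev);
  if (dev < 0 || dev >= kMaxDevices) {   // out of cache range: query directly
    int sms = 0;
    cudaDeviceGetAttribute(&sms, cudaDevAttrMultiProcessorCount, dev);
    return sms;
  }
  if (cached[dev] == 0) {
    cudaDeviceGetAttribute(&cached[dev], cudaDevAttrMultiProcessorCount, dev);
  }
  return cached[dev];
}
```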
Motivation

The old tritonmoe_preprocess inline kernel hardcoded a num_experts whitelist (2/8/32/64/128/160/256), so it did not support arbitrary expert counts, and it performed sorting and alignment as two separate steps, which was inefficient. This refactor extracts the alignment logic into a standalone moe_align_kernel.cu and uses a three-way dispatch strategy to broaden applicability and improve performance. A hedged sketch of the dispatch structure follows.

Test platform: NVIDIA A100-SXM4-80GB, Driver Version: 535.230.02, CUDA Version: 12.9
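The sketch below shows the host-side shape of the three-way dispatch described above. The small-batch predicate is quoted from the diff earlier in this thread; the function name, the cooperative_supported flag, and the commented-out launches are illustrative assumptions, not the PR's exact code.

```cpp
#include <cstdint>

// Dispatch structure only; kernel bodies and launch parameters are elided.
void moe_align_dispatch_sketch(int64_t numel, int64_t num_experts,
                               bool cooperative_supported) {
  const bool small_batch_expert_mode = (numel < 1024) && (num_experts <= 64);
  if (small_batch_expert_mode) {
    // 1) Small batch: a single block counts, scans, and fills entirely in
    //    shared memory, so no inter-block synchronization is needed.
    // small_batch_kernel<<<1, expert_threads, shared_mem_size>>>(...);
    return;
  }
  if (cooperative_supported) {
    // 2) Large batch: a cooperative launch lets the whole grid synchronize
    //    between the count/scan phase and the scatter phase. If the launch
    //    does not return cudaSuccess, fall through to the generic path.
    // if (cudaLaunchCooperativeKernel(...) == cudaSuccess) return;
  }
  // 3) Generic path: the original two-kernel approach, an align kernel
  //    followed by a separate sort/scatter step.
  // align_kernel<<<2, threads, shared_mem_size, stream>>>(...);
}
```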
Modifications

- custom_ops/gpu_ops/moe/moe_align_kernel.cu: implements the three-way-dispatch moe_align_block_size function template and the corresponding CUDA kernels (small-batch single-block kernel, large-batch cooperative kernel, generic two-kernel path)
- custom_ops/gpu_ops/helper.h: hoists the CEILDIV macro utility into the shared header
- custom_ops/gpu_ops/moe/tritonmoe_preprocess.cu: removes the old inline kernel and instead declares and calls moe_align_block_size; supports arbitrary num_experts; fixes the max_num_tokens_padded computation on the small-token branch; renames the output from expert_ids to experts_ids
- custom_ops/setup_ops.py: adds the new file to both compilation source lists
- tests/operators/test_tritonmoe_preprocess.py: adds correctness tests covering each branch scenario (empty input, single token, uniform distribution, large batch, etc.)

Usage or Command
N/A
Accuracy Tests
N/A
Checklist

- PR title tag list: [[FDConfig],[APIServer],[Engine],[Scheduler],[PD Disaggregation],[Executor],[Graph Optimization],[Speculative Decoding],[RL],[Models],[Quantization],[Loader],[OP],[KVCache],[DataProcessor],[BugFix],[Docs],[CI],[Optimization],[Feature],[Benchmark],[Others],[XPU],[HPU],[GCU],[DCU],[Iluvatar],[Metax]]
- Run pre-commit before commit.
- If the PR targets a release branch, make sure the PR has been submitted to the develop branch, then cherry-pick it to the release branch with the [Cherry-Pick] PR tag.