[Cherry-Pick][Optimization][#7687] optimize tritonmoe_preprocess op #7786
Conversation
Thanks for your contribution!
Pull request overview
This PR refactors and optimizes the alignment/sorting logic of tritonmoe_preprocess: it removes the old kernel's whitelist restriction on num_experts, extracts the alignment logic into a standalone CUDA implementation, and adds more comprehensive correctness tests, improving both applicability and performance.
Changes:
- Added moe_align_kernel.cu, which implements a three-way dispatch strategy (small-batch single block / large-batch cooperative / generic two-kernel) for MoE alignment and token sorting.
- tritonmoe_preprocess.cu now calls moe_align_block_size, supports numel computation for inputs of arbitrary dimensions, and fixes how the padded size and block count are computed on the small-token branch.
- Expanded test_tritonmoe_preprocess.py to cover more shapes and branch scenarios; setup_ops.py adds a build entry for the new file; helper.h hoists the CEILDIV macro.

Note: the "optmize" typo in the PR title should be corrected to "optimize", for consistency with the repository's title conventions and for searchability.
Reviewed changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 8 comments.
| File | Description |
|---|---|
| custom_ops/gpu_ops/moe/moe_align_kernel.cu | New MoE alignment and token-sorting CUDA kernels (multi-strategy dispatch) |
| custom_ops/gpu_ops/moe/tritonmoe_preprocess.cu | Removes the old inline kernel, calls moe_align_block_size instead, and adjusts shape inference / output naming |
| custom_ops/gpu_ops/helper.h | Hoists CEILDIV into a shared macro for reuse across CUDA code |
| custom_ops/setup_ops.py | Adds the new moe_align_kernel.cu to the compilation source list |
| tests/operators/test_tritonmoe_preprocess.py | New/refactored correctness tests covering more input shapes and edge cases |
```cuda
int32_t total_vecs = (max_num_tokens_padded + VEC_SIZE - 1) / VEC_SIZE;
Vec* out_ptr = reinterpret_cast<Vec*>(sorted_token_ids);
for (int32_t i = threadIdx.x; i < total_vecs; i += blockDim.x) {
  out_ptr[i] = fill_vec;
}
```
```cuda
int32_t total_vecs = (max_num_tokens_padded + VEC_SIZE - 1) / VEC_SIZE;
Vec* out_ptr = reinterpret_cast<Vec*>(sorted_token_ids);
for (int32_t i = bid * nthreads + tid; i < total_vecs;
     i += nblocks * nthreads) {
  out_ptr[i] = fill_vec;
}
```
```cuda
// Original 2-kernel approach (for medium inputs or cooperative fallback)
auto align_kernel = moe_align_block_size_kernel<scalar_t>;

const size_t scan_size = next_pow_2(num_experts);
const size_t shared_mem_size =
    (num_experts + (num_experts + 1) + scan_size + WARP_SIZE) *
    sizeof(int32_t);
align_kernel<<<2, threads, shared_mem_size, stream>>>(
    topk_ids.data<scalar_t>(),
    sorted_token_ids.data<int32_t>(),
    experts_ids.data<int32_t>(),
    num_tokens_post_pad.data<int32_t>(),
    num_experts,
    block_size,
    numel,
    cumsum_buffer.data<int32_t>(),
    pad_sorted_token_ids,
    scan_size,
    max_num_tokens_padded);
```
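For context, `next_pow_2` itself is not part of this hunk; the sketch below shows one plausible definition (an assumption, not the PR's code) together with a worked instance of the shared-memory sizing formula above, assuming WARP_SIZE == 32:

```cpp
// Hypothetical helper (body assumed): round n up to the next power of two,
// so the scan buffer covers a power-of-two extent for a Blelloch-style scan.
static inline size_t next_pow_2(size_t n) {
  size_t p = 1;
  while (p < n) p <<= 1;
  return p;
}

// Worked example of the sizing formula above, for num_experts = 64:
//   scan_size       = next_pow_2(64) = 64
//   shared_mem_size = (64 + 65 + 64 + 32) * sizeof(int32_t) = 900 bytes
```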
```cuda
bool small_batch_expert_mode = (numel < 1024) && (num_experts <= 64);

if (small_batch_expert_mode) {
  const int32_t expert_threads = max((int32_t)num_experts, WARP_SIZE);
  constexpr int32_t fill_threads = 256;
  const int32_t shared_mem_size =
      ((expert_threads + 1) * num_experts + (num_experts + 1)) *
      sizeof(int32_t);
```
```python
def setUp(self):
    if not _AVAILABLE:
        self.skipTest("CUDA or fastdeploy not available")
```
```python
def test_docstring_example(self):
    """Reproduce the example from the function docstring."""
    topk_ids = paddle.to_tensor([[2, 3, 4], [1, 2, 4], [1, 3, 4], [1, 2, 3]], dtype="int64")
    _verify(topk_ids, block_size=4, num_experts=5, label="docstring_example")
```
```python
if not _AVAILABLE:
    print("SKIP: CUDA or fastdeploy not available.")
else:
    basic = TestTritonMoePreprocessBasic()
    basic.test_docstring_example()
    basic.test_single_token_single_expert()
    basic.test_all_tokens_same_expert()
    basic.test_uniform_1d()
    basic.test_topk_equals_num_experts()
    basic.test_num_tokens_less_than_num_experts()
    basic.test_exact_block_boundary()
    basic.test_block_size_1()

    edge = TestTritonMoePreprocessEdgeCases()
    edge.test_empty_topk_ids()
    edge.test_one_expert()
    edge.test_large_block_size()
    edge.test_int64_dtype()

    real = TestTritonMoePreprocessRealistic()
    for num_tokens, num_experts, block_size in [
        (256, 8, 16),
        (1024, 16, 16),
        (4096, 64, 16),
        (8192, 64, 32),
        (8192, 128, 64),
        (16384, 256, 128),
    ]:
        real._run_uniform_distribution(num_tokens, num_experts, block_size)
    for num_tokens, top_k, num_experts, block_size in [
        (512, 2, 8, 16),
        (1024, 4, 16, 16),
        (2048, 8, 64, 16),
    ]:
        real._run_topk_2d(num_tokens, top_k, num_experts, block_size)
    for alpha in [0.5, 1.2, 2.0]:
        real._run_zipf_distribution(alpha)
    real.test_deterministic_with_fixed_seed()

    print("\n*** All direct-run tests passed ***")
```
| DEVICE = "gpu" | ||
|
|
||
| # 仅对小规模 case 打印详细 tensor,超过此阈值只打印统计摘要 | ||
| _PRINT_TENSOR_NUMEL_LIMIT = 64 | ||
|
|
||
|
|
||
| def _fmt_tensor(t: paddle.Tensor, name: str) -> str: | ||
| t_cpu = t.cpu() | ||
| if t_cpu.numel() <= _PRINT_TENSOR_NUMEL_LIMIT: | ||
| return f"{name}{list(t_cpu.shape)} = {t_cpu.tolist()}" | ||
| return ( | ||
| f"{name}{list(t_cpu.shape)} | " | ||
| f"min={int(t_cpu.min())} max={int(t_cpu.max())} " | ||
| f"mean={float(t_cpu.cast('float32').mean()):.2f} numel={t_cpu.numel()}" | ||
| ) | ||
|
|
||
|
|
CI report generated from the code above (updated every 30 minutes):

1. Task overview: ⏳ CI has not finished yet: 2 required tasks are running and 2 required tasks are pending; no required task is currently failing.
2. Task status summary
   2.1 Required tasks: 6/10 passed
   2.2 Optional tasks: 23/26 passed
3. Failure details (required only): no required task failed.
Codecov Report

✅ All modified and coverable lines are covered by tests.

Additional details and impacted files:

```
@@            Coverage Diff             @@
##           release/2.6   #7786   +/- ##
==============================================
  Coverage             ?   72.46%
==============================================
  Files                ?      381
  Lines                ?    54139
  Branches             ?     8456
==============================================
  Hits                 ?    39230
  Misses               ?    12152
  Partials             ?     2757
```

Flags with carried forward coverage won't be shown.

☔ View full report in Codecov by Sentry.
Pull request overview
Copilot reviewed 5 out of 5 changed files in this pull request and generated 6 comments.
Comments suppressed due to low confidence (3)
custom_ops/gpu_ops/moe/moe_align_kernel.cu:279

- The cooperative kernel likewise fills sorted_token_ids with int4 stores using total_vecs = ceil(max_num_tokens_padded / 4); when max_num_tokens_padded is not a multiple of 4, the last write lands past the end of the buffer. Consider aligning the length or handling the final sub-VEC_SIZE tail separately to avoid the OOB write.

```cuda
if (pad_sorted_token_ids) {
  Vec fill_vec;
  fill_vec.x = fill_vec.y = fill_vec.z = fill_vec.w =
      static_cast<int32_t>(numel);
  int32_t total_vecs = (max_num_tokens_padded + VEC_SIZE - 1) / VEC_SIZE;
  Vec* out_ptr = reinterpret_cast<Vec*>(sorted_token_ids);
  for (int32_t i = bid * nthreads + tid; i < total_vecs;
       i += nblocks * nthreads) {
    out_ptr[i] = fill_vec;
  }
```
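One possible shape for the suggested fix, sketched below with the same identifiers as the hunk (it assumes the enclosing kernel's variables; the vector-loop-plus-scalar-tail split is the reviewer's suggestion, not the PR's code):

```cuda
// Tail-safe fill: full int4 vectors are written with vectorized stores; the
// remainder (when max_num_tokens_padded % VEC_SIZE != 0) is written
// element-wise, so the last store never reaches past sorted_token_ids' end.
const int32_t full_vecs = max_num_tokens_padded / VEC_SIZE;  // rounds down
Vec* out_ptr = reinterpret_cast<Vec*>(sorted_token_ids);
for (int32_t i = bid * nthreads + tid; i < full_vecs;
     i += nblocks * nthreads) {
  out_ptr[i] = fill_vec;
}
// Scalar tail: at most VEC_SIZE - 1 elements.
for (int32_t i = full_vecs * VEC_SIZE + bid * nthreads + tid;
     i < max_num_tokens_padded;
     i += nblocks * nthreads) {
  sorted_token_ids[i] = static_cast<int32_t>(numel);
}
```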
tests/operators/test_tritonmoe_preprocess.py:323

- Same as above: this EdgeCases group's setUp does not call paddle.set_device("gpu"), so most cases (everything except empty_topk_ids) create tensors that default to CPU, which can make the GPU op call fail or make the tests order-dependent. Consider calling set_device("gpu") explicitly in setUp, or migrating tensors uniformly inside _verify.

```python
def setUp(self):
    if not _AVAILABLE:
        self.skipTest("CUDA or fastdeploy not available")
```
tests/operators/test_tritonmoe_preprocess.py:356

- Same as above: this Realistic group's setUp does not call paddle.set_device("gpu"), so paddle.randint / to_tensor produce CPU tensors by default, causing the tritonmoe_preprocess call to fail or making the cases depend on execution order. Consider calling set_device("gpu") explicitly in setUp, and optionally add a device_count check to avoid errors on CUDA builds without a GPU.

```python
def setUp(self):
    if not _AVAILABLE:
        self.skipTest("CUDA or fastdeploy not available")
```
```cuda
int32_t total_vecs = (max_num_tokens_padded + VEC_SIZE - 1) / VEC_SIZE;
Vec* out_ptr = reinterpret_cast<Vec*>(sorted_token_ids);
for (int32_t i = threadIdx.x; i < total_vecs; i += blockDim.x) {
  out_ptr[i] = fill_vec;
}
```
```cuda
const size_t tid = threadIdx.x;
const size_t stride = blockDim.x;

if (tid < num_experts) {
  shared_counts[tid] = 0;
}
```
```cuda
if (small_batch_expert_mode) {
  const int32_t expert_threads = max((int32_t)num_experts, WARP_SIZE);
  constexpr int32_t fill_threads = 256;
  const int32_t shared_mem_size =
```
```cuda
if (err == cudaSuccess) {
  return;
}
// Fall through to original path if cooperative launch failed
```
```python
def setUp(self):
    if not _AVAILABLE:
        self.skipTest("CUDA or fastdeploy not available")
```
```python
print(f"\n{tag}ALL CHECKS PASSED")
print(sep)
```
PaddlePaddle-bot
left a comment
🤖 Paddle-CI-Agent | pr_review |
2026-05-14 18:11:12
📋 Review Summary
PR overview: refactors the tritonmoe_preprocess alignment kernel into a standalone moe_align_kernel.cu, introduces a three-way dispatch strategy (small-batch single block / cooperative / two-kernel), removes the hardcoded num_experts whitelist, and supports arbitrary expert counts.
Scope of changes: custom_ops/gpu_ops/moe/, custom_ops/gpu_ops/helper.h, custom_ops/setup_ops.py, tests/operators/
Impact tags: [OP] [Optimization]
📝 PR convention check
The title format does not follow the Cherry-Pick convention: [#7687] should appear at the end of the description as (#7687), not as a standalone bracket tag in the middle.
Suggested title (copy-paste ready):
[Cherry-Pick][Optimization] optimize tritonmoe_preprocess op(#7687)
The description structure (Motivation / Modifications / Usage or Command / Accuracy Tests / Checklist) is complete and compliant; no changes needed.
Issues
| Level | File | Summary |
|---|---|---|
| ❓ Question | custom_ops/gpu_ops/helper.h:76 | The CEILDIV macro parameters are not parenthesized, and a + b - 1 risks overflow when a is close to INT_MAX |
| ❓ Question | custom_ops/gpu_ops/moe/moe_align_kernel.cu:454 | A static variable caches the SM count, which may cache the wrong device's properties when one process uses multiple devices |
Overall assessment
The refactoring is cleanly structured, the three-way dispatch strategy delivers a significant performance gain on large batches (the benchmark data is thorough), and the new tests cover the edge cases and each dispatch branch. Two implementation details (macro safety, multi-device risk of the static cache) should be confirmed by the author; neither blocks the merge.
```cpp
using json = nlohmann::json;
#endif

#define CEILDIV(a, b) (((a + b - 1) / b))
```
❓ Question: the CEILDIV macro has two potential problems:
- Integer overflow: a + b - 1 overflows when a is close to INT_MAX (e.g., in an int32_t context).
- Unparenthesized macro parameters: if an expression is passed (e.g., CEILDIV(x+1, y)), the expansions of a and b have no parenthesis protection, which can cause operator-precedence errors.

Suggested fix:

```cpp
#define CEILDIV(a, b) (((a) + (b) - 1) / (b))
```

or use an inline function to avoid macro side effects:

```cpp
template <typename T>
inline T ceildiv(T a, T b) { return (a + b - 1) / b; }
```
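To make the precedence hazard concrete, here is a minimal, self-contained illustration (CEILDIV_UNSAFE stands in for the current macro; the example values are chosen for demonstration):

```cpp
#include <cassert>

#define CEILDIV_UNSAFE(a, b) (((a + b - 1) / b))
#define CEILDIV_SAFE(a, b) (((a) + (b) - 1) / (b))

int main() {
  int n = 10, x = 1;
  // CEILDIV_UNSAFE(n, x + 1) expands to (((10 + 1 + 1 - 1) / 1 + 1)) == 12,
  // because '/' binds tighter than '+': the divisor expression is split apart.
  assert(CEILDIV_UNSAFE(n, x + 1) == 12);  // wrong: expected ceil(10/2) == 5
  assert(CEILDIV_SAFE(n, x + 1) == 5);     // parenthesized version is correct
  return 0;
}
```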
```cpp
template <typename scalar_t>
void moe_align_block_size(const paddle::Tensor& topk_ids,
                          int64_t num_experts,
                          int64_t block_size,
```
❓ Question: cached_max_blocks_per_sm and cached_num_sms are function-level static variables, which carries two potential risks:
- Wrong cache across devices: if the same process uses multiple GPUs (e.g., a test environment switching cudaSetDevice), the first call caches one device's SM count, and later calls on a different device use the wrong value, so coop_blocks can exceed the target device's limit and the cooperative launch fails.
- Thread safety: concurrent first calls have a double-init race (the written values would match, but cudaGetDevice may return different device IDs).

Consider caching per device_id, or simply querying before each call (the overhead is tiny):

```cpp
int device_id;
cudaGetDevice(&device_id);
int max_blocks_per_sm, num_sms;
cudaOccupancyMaxActiveBlocksPerMultiprocessor(&max_blocks_per_sm, ...);
cudaDeviceGetAttribute(&num_sms, cudaDevAttrMultiProcessorCount, device_id);
```
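A minimal sketch of the per-device-cache alternative (the function name, the fixed array bound, and the fallback path are assumptions for illustration, not code from the PR):

```cpp
#include <cuda_runtime.h>

// Per-device cache of the SM count. Assumes a small fixed upper bound on the
// device count (kMaxDevices); devices beyond it fall back to a direct query.
// The first-call race on cached[dev] is benign: racing writers store the same
// value for the same device.
static int get_num_sms_cached() {
  constexpr int kMaxDevices = 16;        // assumption: enough for one node
  static int cached[kMaxDevices] = {0};  // 0 means "not cached yet"
  int dev = 0;
  cudaGetDevice(&dev);
  if (dev < 0 || dev >= kMaxDevices) {   // out of cache range: query directly
    int sms = 0;
    cudaDeviceGetAttribute(&sms, cudaDevAttrMultiProcessorCount, dev);
    return sms;
  }
  if (cached[dev] == 0) {
    cudaDeviceGetAttribute(&cached[dev], cudaDevAttrMultiProcessorCount, dev);
  }
  return cached[dev];
}
```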
Motivation

The old tritonmoe_preprocess inline kernel hardcoded a num_experts whitelist (2/8/32/64/128/160/256), so it did not support arbitrary expert counts, and it performed sorting and alignment as two separate steps, which was inefficient. This refactor extracts the alignment logic into a standalone moe_align_kernel.cu and uses a three-way dispatch strategy to broaden applicability and improve performance. A hedged sketch of the dispatch structure follows.

Test platform: NVIDIA A100-SXM4-80GB, Driver Version: 535.230.02, CUDA Version: 12.9
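The sketch below shows the host-side shape of the three-way dispatch described above. The small-batch predicate is quoted from the diff earlier in this thread; the function name, the cooperative_supported flag, and the commented-out launches are illustrative assumptions, not the PR's exact code.

```cpp
#include <cstdint>

// Dispatch structure only; kernel bodies and launch parameters are elided.
void moe_align_dispatch_sketch(int64_t numel, int64_t num_experts,
                               bool cooperative_supported) {
  const bool small_batch_expert_mode = (numel < 1024) && (num_experts <= 64);
  if (small_batch_expert_mode) {
    // 1) Small batch: a single block counts, scans, and fills entirely in
    //    shared memory, so no inter-block synchronization is needed.
    // small_batch_kernel<<<1, expert_threads, shared_mem_size>>>(...);
    return;
  }
  if (cooperative_supported) {
    // 2) Large batch: a cooperative launch lets the whole grid synchronize
    //    between the count/scan phase and the scatter phase. If the launch
    //    does not return cudaSuccess, fall through to the generic path.
    // if (cudaLaunchCooperativeKernel(...) == cudaSuccess) return;
  }
  // 3) Generic path: the original two-kernel approach, an align kernel
  //    followed by a separate sort/scatter step.
  // align_kernel<<<2, threads, shared_mem_size, stream>>>(...);
}
```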
Modifications

- custom_ops/gpu_ops/moe/moe_align_kernel.cu: implements the three-way-dispatch moe_align_block_size function template and the corresponding CUDA kernels (small-batch single-block kernel, large-batch cooperative kernel, generic two-kernel path)
- custom_ops/gpu_ops/helper.h: hoists the CEILDIV macro utility into the shared header
- custom_ops/gpu_ops/moe/tritonmoe_preprocess.cu: removes the old inline kernel and instead declares and calls moe_align_block_size; supports arbitrary num_experts; fixes the max_num_tokens_padded computation on the small-token branch; renames the output from expert_ids to experts_ids
- custom_ops/setup_ops.py: adds the new file to both compilation source lists
- tests/operators/test_tritonmoe_preprocess.py: adds correctness tests covering each branch scenario (empty input, single token, uniform distribution, large batch, etc.)

Usage or Command
N/A
Accuracy Tests
N/A
Checklist

- PR title tag list: [[FDConfig],[APIServer],[Engine],[Scheduler],[PD Disaggregation],[Executor],[Graph Optimization],[Speculative Decoding],[RL],[Models],[Quantization],[Loader],[OP],[KVCache],[DataProcessor],[BugFix],[Docs],[CI],[Optimization],[Feature],[Benchmark],[Others],[XPU],[HPU],[GCU],[DCU],[Iluvatar],[Metax]]
- Run pre-commit before commit.
- If the PR targets a release branch, make sure the PR has been submitted to the develop branch, then cherry-pick it to the release branch with the [Cherry-Pick] PR tag.