CUDA: topk-moe: add optional parameter for gpt-oss #16649

am17an · 2025-10-18T11:24:16Z

While looking at this kernel I realized that it is relatively easy to add it for gpt-oss, which does the softmax after the top-k.
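To make the ordering difference concrete, here is a minimal CPU reference sketch, not the CUDA kernel from this PR. It assumes the existing path applies softmax over all expert logits and then selects the top k, while the gpt-oss path selects the top k logits first and applies softmax over only those. The names `route_top_k` and `late_softmax` are hypothetical. The flag is a template parameter so the choice is fixed at compile time, in the spirit of the "add parameter to avoid runtime branch" commit, though how the actual kernel does this is not shown here.

```cpp
#include <algorithm>
#include <cmath>
#include <numeric>
#include <utility>
#include <vector>

// Numerically stable softmax over a small vector.
static std::vector<float> softmax(const std::vector<float> & x) {
    const float max_val = *std::max_element(x.begin(), x.end());
    std::vector<float> out(x.size());
    float sum = 0.0f;
    for (size_t i = 0; i < x.size(); ++i) {
        out[i] = std::exp(x[i] - max_val);
        sum += out[i];
    }
    for (float & v : out) {
        v /= sum;
    }
    return out;
}

// Hypothetical reference router (illustration only, not llama.cpp code).
//   late_softmax == false: softmax over all experts, then take the top k.
//   late_softmax == true : take the top k logits, then softmax over those k.
// Requires 0 < k <= logits.size().
// Returns (expert index, weight) pairs ordered from highest to lowest score.
template <bool late_softmax>
static std::vector<std::pair<int, float>> route_top_k(const std::vector<float> & logits, int k) {
    // The flag is a compile-time constant, so the compiler can drop the unused path.
    std::vector<float> scores = late_softmax ? logits : softmax(logits);

    // Sort expert indices so that the first k have the highest scores.
    std::vector<int> ids(logits.size());
    std::iota(ids.begin(), ids.end(), 0);
    std::partial_sort(ids.begin(), ids.begin() + k, ids.end(),
                      [&](int a, int b) { return scores[a] > scores[b]; });

    // Gather the selected scores.
    std::vector<float> weights(k);
    for (int i = 0; i < k; ++i) {
        weights[i] = scores[ids[i]];
    }
    if (late_softmax) {
        weights = softmax(weights); // normalize over the selected experts only
    }

    std::vector<std::pair<int, float>> out(k);
    for (int i = 0; i < k; ++i) {
        out[i] = std::make_pair(ids[i], weights[i]);
    }
    return out;
}
```

Both orderings select the same experts, because softmax preserves the ranking of the logits. Only the weights differ: with the late softmax the k weights sum to 1, whereas with the early softmax they sum to less than 1 unless they are renormalized afterwards.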

Performance on a 4090:

| Model | Test | t/s master | t/s cuda_gpt_oss_opt | Speedup |
| --- | --- | --- | --- | --- |
| gpt-oss 20B MXFP4 MoE | tg32 | 170.99 | 177.68 | 1.04 |
| gpt-oss 20B MXFP4 MoE | tg64 | 168.75 | 175.36 | 1.04 |
| gpt-oss 20B MXFP4 MoE | tg128 | 167.01 | 173.33 | 1.04 |

avidwriter · 2025-10-22T12:10:00Z

How do I use this?

am17an · 2025-10-22T12:15:28Z

@avidwriter If you are using the CUDA backend, this should already be included in the latest master.

vulkan: Update topk_moe fusion to handle gpt's late softmax

Based on #16649.

* Add ggml_check_edges
* Add sync logging to show fusion effects
* handle clamp added in #16655
* Update ggml/src/ggml-impl.h

Co-authored-by: Diego Devesa <[email protected]>

am17an requested a review from slaren as a code owner October 18, 2025 11:24

github-actions bot added the labels testing (Everything test related), Nvidia GPU (Issues specific to Nvidia GPUs), and ggml (changes relating to the ggml tensor library for machine learning) Oct 18, 2025

am17an requested a review from JohannesGaessler October 18, 2025 11:28

jeffbolznv mentioned this pull request Oct 18, 2025

vulkan: Update topk_moe fusion to handle gpt's late softmax #16656

Merged

jeffbolznv added a commit to jeffbolznv/llama.cpp that referenced this pull request Oct 21, 2025

vulkan: Update topk_moe fusion to handle gpt's late softmax

34d4122

Based on ggml-org#16649.

am17an added 3 commits October 21, 2025 19:49

CUDA: topk-moe: add optional parameter for gpt-oss

5632159

add parameter to avoid runtime branch

2de54df

use ggml_can_fuse_subgraph

17c3927

am17an force-pushed the cuda_topk_moe_gpt_oss branch from 49a541e to 17c3927 October 21, 2025 11:53

JohannesGaessler approved these changes Oct 21, 2025

am17an merged commit 03792ad into ggml-org:master Oct 21, 2025
70 checks passed

am17an deleted the cuda_topk_moe_gpt_oss branch October 21, 2025 15:21

ye-NX pushed a commit to ye-NX/llama.cpp that referenced this pull request Oct 21, 2025

CUDA: topk-moe: add optional parameter for gpt-oss (ggml-org#16649)

7869ac7

am17an mentioned this pull request Oct 22, 2025

CUDA: General GEMV fusion #16715

Merged

pwilkin pushed a commit to pwilkin/llama.cpp that referenced this pull request Oct 23, 2025

CUDA: topk-moe: add optional parameter for gpt-oss (ggml-org#16649)

8c01a63

jeffbolznv added a commit to jeffbolznv/llama.cpp that referenced this pull request Oct 26, 2025

vulkan: Update topk_moe fusion to handle gpt's late softmax

6cccaef

Based on ggml-org#16649.

am17an mentioned this pull request Oct 28, 2025

CUDA Performance Regression on Jetson AGX Orin #16815

Closed
