@am17an (Collaborator) commented Oct 18, 2025

While looking at this kernel, I realized that it is relatively easy to extend it to gpt-oss, which applies the softmax after the top-k selection rather than before it.
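To make the ordering difference concrete, here is a minimal pure-Python sketch of the two routing orders. It is not the CUDA kernel from this PR; the function names `topk_then_softmax` and `softmax_then_topk` are hypothetical and only illustrate the math on a single row of router logits.

```python
import math

def topk_then_softmax(logits, k):
    """Pick the k largest router logits, then softmax over only those k."""
    # Indices of the k largest logits, highest first (ties go to the lower index).
    top = sorted(range(len(logits)), key=lambda i: (-logits[i], i))[:k]
    # Numerically stable softmax restricted to the selected logits.
    m = max(logits[i] for i in top)
    exps = [math.exp(logits[i] - m) for i in top]
    total = sum(exps)
    return top, [e / total for e in exps]

def softmax_then_topk(logits, k):
    """Softmax over all experts, then keep the k largest probabilities.

    The kept weights are not renormalized in this sketch.
    """
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    top = sorted(range(len(probs)), key=lambda i: (-probs[i], i))[:k]
    return top, [probs[i] for i in top]
```

Both orders select the same experts, since softmax is monotonic. They differ in the weights: with softmax after top-k the k weights always sum to 1, while with softmax before top-k they sum to less than 1 unless a separate normalization step follows.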

Performance on a 4090:

| Model | Test | t/s master | t/s cuda_gpt_oss_opt | Speedup |
| --- | --- | ---: | ---: | ---: |
| gpt-oss 20B MXFP4 MoE | tg32 | 170.99 | 177.68 | 1.04 |
| gpt-oss 20B MXFP4 MoE | tg64 | 168.75 | 175.36 | 1.04 |
| gpt-oss 20B MXFP4 MoE | tg128 | 167.01 | 173.33 | 1.04 |

@am17an am17an requested a review from slaren as a code owner October 18, 2025 11:24
@github-actions github-actions bot added the testing, Nvidia GPU, and ggml labels Oct 18, 2025
jeffbolznv added a commit to jeffbolznv/llama.cpp that referenced this pull request Oct 18, 2025