[Kernel] adopt mxfp8 grouped_gemm and grouped_quant kernel #34381
EdalatiAli wants to merge 2 commits into vllm-project:main
Conversation
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels. Just a reminder: PRs do not trigger a full CI run by default; only a small subset of tests runs automatically, and you can ask your reviewers to trigger select CI tests on top of that. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. To run CI, PR reviewers can add the `ready` label to the PR. If you have any questions, please reach out to us on Slack at https://slack.vllm.ai. 🚀
Code Review
This pull request integrates SGLang's SM100+ MXFP8 blockscaled grouped kernels into vLLM. The changes add new CUDA source files for the kernels, update `CMakeLists.txt` to handle their compilation, and add the corresponding Python wrappers and tests. The implementation appears solid and well integrated. I have one suggestion to improve the clarity of an error message in one of the new CUDA files to prevent potential confusion during debugging.
Signed-off-by: EdalatiAli <[email protected]>
Purpose
To enable serving MXFP8 MoE models, this PR integrates SGLang's SM100+ expert-specialization MXFP8 blockscaled grouped kernels into vLLM so they are built, registered, importable, and test-covered in the vLLM codebase.
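For context, MXFP8 block-scaled quantization groups values into blocks of 32 elements, each block sharing a single power-of-two (E8M0) scale, with the elements themselves stored in FP8 E4M3. The following NumPy sketch illustrates only the scaling scheme; the function name and the clip-instead-of-round simplification are illustrative, not the actual kernel implementation:

```python
import numpy as np

E4M3_MAX = 448.0  # largest finite FP8 e4m3 magnitude

def mxfp8_blockscaled_quant_ref(x: np.ndarray, block: int = 32):
    """Illustrative block-scaled quantization: each block of 32 values
    shares one power-of-two scale; scaled values are clipped to the
    e4m3 range (a real kernel also rounds to e4m3 precision)."""
    xb = x.reshape(-1, block)
    amax = np.abs(xb).max(axis=1, keepdims=True)
    # E8M0 scale: 2^(floor(log2(amax)) - 8), so the block maximum lands
    # near the top of the e4m3 dynamic range.
    exp = np.floor(np.log2(np.maximum(amax, 2.0 ** -126))) - 8.0
    scale = np.exp2(exp)
    q = np.clip(xb / scale, -E4M3_MAX, E4M3_MAX)
    return q, scale

x = np.linspace(-1.0, 1.0, 64, dtype=np.float32)
q, scale = mxfp8_blockscaled_quant_ref(x)
deq = (q * scale).reshape(-1)  # dequantize: multiply each block back
```

Because the scales are exact powers of two, dequantization here recovers the input up to the (omitted) e4m3 rounding step.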
Here is the source PR for the adopted kernels.
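The grouped GEMM at the heart of expert-specialization MoE multiplies each contiguous group of routed tokens by its own expert's weight matrix in a single launch. A minimal pure-NumPy reference of that semantics (the function name and offsets layout are illustrative assumptions; the actual CUDA kernel operates on MXFP8 block-scaled inputs):

```python
import numpy as np

def grouped_gemm_ref(x, w, group_offsets):
    """Reference grouped GEMM: rows of x are pre-sorted by expert;
    group_offsets[e]:group_offsets[e+1] selects expert e's token rows,
    which are multiplied by that expert's weight matrix w[e]."""
    num_experts = w.shape[0]
    out = np.empty((x.shape[0], w.shape[2]), dtype=x.dtype)
    for e in range(num_experts):
        lo, hi = group_offsets[e], group_offsets[e + 1]
        out[lo:hi] = x[lo:hi] @ w[e]
    return out

rng = np.random.default_rng(0)
x = rng.standard_normal((6, 4)).astype(np.float32)     # 6 routed tokens
w = rng.standard_normal((2, 4, 3)).astype(np.float32)  # 2 experts
offsets = [0, 2, 6]  # tokens 0-1 -> expert 0, tokens 2-5 -> expert 1
y = grouped_gemm_ref(x, w, offsets)
```

Batching all experts into one call like this is what lets the kernel avoid launching a separate GEMM per expert.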
This PR:
- Adds the `es_sm100_mxfp8_blockscaled_grouped_mm` and `es_sm100_mxfp8_blockscaled_grouped_quant` kernel sources to vLLM's `_C` build path (CUDA-gated for SM100-compatible targets).
- Registers the ops in `csrc/torch_bindings.cpp` and wires up the CUDA impls already present in the copied kernel files.
- Adds Python wrappers in `vllm/_custom_ops.py`, including guarded `register_fake` hooks for fake/meta tracing compatibility.
- Adds `tests/kernels/moe/test_es_mxfp8_blockscaled_moe.py`.
- Ports the `sgl_kernel` import to `vllm._custom_ops` (ded068a76).

Test Plan
- `python3 -m py_compile vllm/_custom_ops.py`
- `python3 -m py_compile tests/kernels/moe/test_es_mxfp8_blockscaled_moe.py`
- `pytest -q tests/kernels/moe/test_es_mxfp8_blockscaled_moe.py`

Test Result
- `python3 -m py_compile vllm/_custom_ops.py` passed.
- `python3 -m py_compile tests/kernels/moe/test_es_mxfp8_blockscaled_moe.py` passed.
- `pytest -q tests/kernels/moe/test_es_mxfp8_blockscaled_moe.py` passed.

Essential Elements of an Effective PR Description Checklist
- Documentation update, including `supported_models.md` and `examples`, for a new model.