feat: masked layout fp4 gemm using cute-dsl #1331

Merged
yzh119 merged 37 commits into flashinfer-ai:main on Aug 13, 2025

Conversation

yzh119
Collaborator

@yzh119 yzh119 commented Jul 25, 2025

📌 Description

Implement fp4 gemm (w/ masked layout) requested in sgl-project/sglang#7994
Adapted from cutlass's dense_blockscaled_gemm_persistent example, with DeepGEMM style tile-scheduler
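As a rough illustration of what "masked layout" means here, the following is a minimal pure-Python reference, not the kernel from this PR. It assumes each group holds an M x K matrix `a[g]` and an N x K matrix `b[g]` (so the product is `A @ B^T`), and that a per-group count `masked_m[g]` gives the number of valid rows. The function name, the zero-filled padding rows, and the use of plain floats instead of fp4 with block scaling are all assumptions made for the sketch.

```python
def masked_grouped_gemm(a, b, masked_m):
    """Reference semantics for a masked-layout grouped GEMM (plain floats, no fp4).

    a[g] is an M x K matrix and b[g] is an N x K matrix, so C[g] = A[g] @ B[g]^T.
    Only the first masked_m[g] rows of group g are computed. The padded rows
    are left as zeros here; a real kernel may simply leave them untouched.
    """
    out = []
    for a_g, b_g, valid_rows in zip(a, b, masked_m):
        n = len(b_g)
        c_g = [[0.0] * n for _ in a_g]
        for i in range(valid_rows):  # rows at index >= valid_rows are skipped
            for j in range(n):
                c_g[i][j] = sum(x * y for x, y in zip(a_g[i], b_g[j]))
        out.append(c_g)
    return out


# Two groups, M = 3 (padded), K = 2, N = 2.
a = [[[1, 2], [3, 4], [5, 6]], [[1, 0], [0, 1], [9, 9]]]
b = [[[1, 0], [0, 1]], [[2, 2], [3, 3]]]
c = masked_grouped_gemm(a, b, masked_m=[2, 1])
```

With `masked_m=[2, 1]`, the third row of group 0 and the last two rows of group 1 are never read, which is the property the masked layout is meant to exploit.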

🔍 Related Issues

sgl-project/sglang#7994

🚀 Pull Request Checklist

Thank you for contributing to FlashInfer! Before we review your pull request, please make sure the following items are complete.

✅ Pre-commit Checks

  • I have installed pre-commit by running pip install pre-commit (or used your preferred method).
  • I have installed the hooks with pre-commit install.
  • I have run the hooks manually with pre-commit run --all-files and fixed any reported issues.

If you are unsure about how to set up pre-commit, see the pre-commit documentation.

🧪 Tests

  • Tests have been added or updated as needed.
  • All tests are passing (unittest, etc.).

Reviewer Notes

cc @fzyzcjy
Co-authored-by: Avery Huang [email protected]

Contributor

Note

Gemini is unable to generate a summary for this pull request due to the file types involved not being currently supported.

@yyihuang yyihuang self-assigned this Jul 25, 2025
@yzh119 yzh119 marked this pull request as ready for review August 7, 2025 08:47
@yyihuang yyihuang marked this pull request as draft August 11, 2025 00:02
@yyihuang yyihuang marked this pull request as ready for review August 13, 2025 03:35
@yyihuang yyihuang added the ready label Aug 13, 2025
@yzh119 yzh119 changed the title [WIP]: Masked layout fp4 gemm using cute-dsl feat: masked layout fp4 gemm using cute-dsl Aug 13, 2025
@yzh119 yzh119 enabled auto-merge (squash) August 13, 2025 04:08
@yzh119
Collaborator Author

yzh119 commented Aug 13, 2025

There is still a lot of work to be done:

  1. AOT compile.
  2. API design.

I'm leaving these for future PRs; let's unblock users and test functionality first.

return self._num_tiles_executed


"""

I think it's safe to delete this docstring?

a_tensor = cute.make_tensor(
a_ptr,
layout=cute.make_ordered_layout(
(self._m, self._k, self._l),
@fengxie fengxie Aug 13, 2025

This is a non-blocking comment.

This assumes static shapes, since m/k/l are passed as members. Just double-checking: is that what we expect here? To support dynamic shapes, I believe m/k/l must be passed via run_cute_ptr's argument list as Int32.

Collaborator Author

@yzh119 yzh119 Aug 14, 2025

I think it's okay for now to assume static shapes: the number of groups depends on the TP/EP size, and N and K are fixed, so we can compile one kernel for each CUDA graph configuration. For M, we can just set a maximum possible value; the kernel execution time will depend only on the values in the mask_m tensor, not on M.

cc @kaixih for confirmation.
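The reasoning above can be sketched in plain Python, under stated assumptions. `get_kernel`, `TILE_M`, and `TILE_N` are hypothetical names and values, and the "kernel" is a stand-in that only counts the output tiles a masked scheduler would visit; it is not the scheduler from this PR. The sketch shows two things: one specialization is built per static `(max_m, n, k, num_groups)` configuration, and the amount of scheduled work follows `masked_m` rather than the padded `max_m`.

```python
from functools import lru_cache

TILE_M, TILE_N = 128, 128  # hypothetical tile sizes, chosen for illustration


@lru_cache(maxsize=None)
def get_kernel(max_m, n, k, num_groups):
    """Stand-in for JIT compilation: one cached specialization per static shape."""
    n_tiles = -(-n // TILE_N)  # ceil(n / TILE_N), fixed at "compile" time

    def run(masked_m):
        # Returns the number of output tiles scheduled for this launch.
        assert len(masked_m) == num_groups
        assert all(0 <= rows <= max_m for rows in masked_m)
        return sum(-(-rows // TILE_M) * n_tiles for rows in masked_m)

    return run


small = get_kernel(4096, 256, 512, 2)
large = get_kernel(8192, 256, 512, 2)  # a different static shape: compiled separately
tiles_small = small([129, 300])
tiles_large = large([129, 300])
```

Both calls schedule the same number of tiles even though `max_m` differs, and asking for the same configuration again returns the cached specialization instead of compiling a new one.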

Cool. I think that's also one of the advantages of using JIT here: you can selectively choose static shapes, which usually end up with better SASS.

@yzh119 yzh119 merged commit 1e62f1a into flashinfer-ai:main Aug 13, 2025
2 checks passed
3 participants