
Conversation

@yyihuang (Collaborator) commented on Aug 24, 2025

📌 Description

🔍 Related Issues

sgl-project/sglang#9471

🚀 Pull Request Checklist

Thank you for contributing to FlashInfer! Before we review your pull request, please make sure the following items are complete.

✅ Pre-commit Checks

  • I have installed pre-commit by running pip install pre-commit (or used your preferred method).
  • I have installed the hooks with pre-commit install.
  • I have run the hooks manually with pre-commit run --all-files and fixed any reported issues.

If you are unsure about how to set up pre-commit, see the pre-commit documentation.

🧪 Tests

  • Tests have been added or updated as needed.
  • All tests are passing (unittest, etc.).

Reviewer Notes

yyihuang marked this pull request as a draft on August 24, 2025, 22:34
@gemini-code-assist (Contributor, bot) left a comment

Summary of Changes

Hello @yyihuang, I'm Gemini Code Assist^1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces the capability to return the Log-Sum-Exp (LSE) value from the trtllm-gen attention kernels. This enhancement provides access to intermediate attention computation results, which can be crucial for debugging, advanced analysis, or implementing specific loss functions that rely on these values. The changes span both the CUDA kernel implementations and their Python bindings, ensuring that LSE can be optionally returned from both prefill and decode operations.

Highlights

  • CUDA Kernel Update: The core trtllm_paged_attention_launcher CUDA kernel now accepts an lse (Log-Sum-Exp) output parameter, enabling the direct return of these intermediate attention computation values.
  • Python API Enhancement: The Python API functions, specifically trtllm_batch_decode_with_kv_cache, trtllm_batch_context_with_kv_cache, and their MLA (Multi-head Latent Attention) counterparts, have been enhanced to optionally return the lse tensor.
  • Automatic Tensor Allocation: Automatic allocation of the lse tensor is now supported when the return_lse flag is enabled and no lse tensor is explicitly provided, simplifying usage (see the sketch after this list).
  • Comprehensive Test Coverage: Unit tests have been updated across various attention functions to include verification of the returned lse values against established reference implementations, ensuring correctness and numerical stability.
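A minimal sketch of the allocation logic named in the highlights above, assuming an LSE buffer of shape [num_tokens, num_qo_heads] in float32. The helper `maybe_allocate_lse` is hypothetical and for illustration only; it is not FlashInfer's actual code.

```python
import torch
from typing import Optional


def maybe_allocate_lse(
    query: torch.Tensor,          # [num_tokens, num_qo_heads, head_dim]
    lse: Optional[torch.Tensor],  # user-provided LSE buffer, or None
    return_lse: bool,
) -> Optional[torch.Tensor]:
    # If the caller does not request LSE, nothing is allocated or returned.
    if not return_lse:
        return None
    # If return_lse=True and no buffer was supplied, allocate one per
    # (token, head) on the same device as the query (assumed float32).
    if lse is None:
        num_tokens, num_qo_heads, _ = query.shape
        lse = torch.empty(
            num_tokens, num_qo_heads,
            dtype=torch.float32, device=query.device,
        )
    return lse
```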
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page; here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point in your pull request by creating an issue comment (i.e., a comment on the pull request page) using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands.

| Feature | Command | Description |
| --- | --- | --- |
| Code Review | `/gemini review` | Performs a code review for the current pull request in its current state. |
| Pull Request Summary | `/gemini summary` | Provides a summary of the current pull request in its current state. |
| Comment | `@gemini-code-assist` | Responds in comments when explicitly tagged, both in issue comments and review comments. |
| Help | `/gemini help` | Displays a list of available commands. |

Customization

To customize the Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by adding files to a .gemini/ folder in the root of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here.

You can also get AI-powered code generation, chat, and code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

@gemini-code-assist (Contributor, bot) commented

Warning

Gemini encountered an error creating the review. You can try again by commenting /gemini review.

yyihuang added a commit that referenced this pull request on Aug 26, 2025:

…n attention (#1584)


## 📌 Description

**workspace_buffer arrangement**

- on main branch:
  - paged_attention: multiCtasKvCounter or semaphores | multiCtasKvScratch
  - ragged_attention: softmax | multiCtasKvCounter or semaphores | multiCtasKvScratch

- on PR branch:
  - softmax (optional) | multiCtasKvScratch | multiCtasKvCounter or semaphores (last 8 MB of the 128 MB buffer)

The range of the semaphores must stay fixed across executions, since we no longer explicitly clear the buffer with zeros.

Related PR: #1463

And #1566 (WIP) depends on this.
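A rough sketch (not part of the commit) of how the fixed tail region for semaphores could be carved out of the workspace layout described above; the names, sizes, and slicing below are illustrative only.

```python
import torch

# Assumed sizes from the description above: a 128 MB workspace whose last
# 8 MB is reserved for multiCtasKvCounter / semaphores, so that region's
# offset stays fixed across executions.
WORKSPACE_BYTES = 128 * 1024 * 1024
SEMAPHORE_BYTES = 8 * 1024 * 1024

workspace = torch.empty(WORKSPACE_BYTES, dtype=torch.uint8)

# Fixed tail region: counters / semaphores.
semaphores = workspace[WORKSPACE_BYTES - SEMAPHORE_BYTES:]
# Remaining front region: the optional softmax buffer is carved from the
# front, and multiCtasKvScratch takes the rest, per the layout above.
scratch_and_softmax = workspace[: WORKSPACE_BYTES - SEMAPHORE_BYTES]
```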

## 🔍 Related Issues


## 🚀 Pull Request Checklist

Thank you for contributing to FlashInfer! Before we review your pull
request, please make sure the following items are complete.

### ✅ Pre-commit Checks

- [x] I have installed `pre-commit` by running `pip install pre-commit`
(or used your preferred method).
- [x] I have installed the hooks with `pre-commit install`.
- [x] I have run the hooks manually with `pre-commit run --all-files`
and fixed any reported issues.

> If you are unsure about how to set up `pre-commit`, see [the
pre-commit documentation](https://pre-commit.com/).

## 🧪 Tests

- [x] Tests have been added or updated as needed.
- [x] All tests are passing (`unittest`, etc.).

## Reviewer Notes

@zhyncs (Member) commented on Aug 29, 2025

@yzh119 @yyihuang what's the progress on this PR?

@zhyncs (Member) commented on Aug 29, 2025

Can we merge this before v0.3.0? @yzh119

@yzh119 (Collaborator) commented on Sep 5, 2025

@yyihuang fyi, c5822e7 should make the bf16 MLA tests pass (but unfortunately, the fp8 UT will fail).

@yzh119 (Collaborator) commented on Sep 5, 2025

Commit 91e3b83 should fix fp8.

The fundamental reason is that the trtllm kernels use an internal scale for rowsum/rowmax, which might not align with the provided bmm scale. For fp8, the reason we need to subtract log2(448) is that the logits are multiplied by 448 (the maximum value of fp8 e4m3) to improve numerical stability.
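As an illustration of the correction described above, here is a hypothetical helper, assuming the kernel reports a base-2 LSE computed on logits scaled by 448 (the fp8 e4m3 maximum); it is not the actual kernel or binding code.

```python
import math
import torch

FP8_E4M3_MAX = 448.0


def correct_fp8_lse(lse_from_kernel: torch.Tensor) -> torch.Tensor:
    # Scaling the logits by 448 before the softmax reduction shifts the
    # base-2 log-sum-exp by log2(448); subtracting it recovers the LSE of
    # the unscaled logits for comparison with a reference implementation.
    return lse_from_kernel - math.log2(FP8_E4M3_MAX)
```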

@pavanimajety (Contributor) commented

@yyihuang Is there a plan to revive this PR? @yzh119

@yyihuang (Collaborator, Author) commented

> @yyihuang Is there a plan to revive this PR? @yzh119

We can move it to #2332.
