feat(gdn): add padding index guard for bf16 decode kernel #2810
yzh119 merged 4 commits into flashinfer-ai:main
Conversation
Clamp negative slot indices to 0 before passing to the bf16 fast-path kernel to prevent out-of-bounds memory access when padding indices (-1) are present in initial_state_indices. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Summary of Changes (Gemini Code Assist): This pull request introduces a safeguard in the BF16 decode kernel to improve memory safety. By ensuring that all slot indices are non-negative before processing, it mitigates the risk of out-of-bounds memory access caused by padding indices in initial_state_indices.
📝 Walkthrough: Adds clamping so that negative slot indices in initial_state_indices are guarded before reaching the bf16 decode kernel. Estimated code review effort: 🎯 2 (Simple), ⏱️ ~10 minutes.
🚥 Pre-merge checks: 1 passed, 2 failed (1 warning, 1 inconclusive).
This resolves the issue: sgl-project/sglang#20791 @zihaoye PTAL
Code Review
This pull request adds a guard to prevent out-of-bounds memory access in the bf16 decode kernel when padding indices are present. The change clamps negative indices to 0, which successfully prevents crashes.
However, this approach leads to incorrect, non-zero output for padded items, as they are processed using state from index 0. I've left a comment with a suggestion to zero out the output for these padded items to ensure correctness and consistency with other kernels in the repository.
else:
    h_slot_indices = initial_state_indices

h_slot_indices = h_slot_indices.clamp(min=0)  # guard -1 padding
This correctly prevents out-of-bounds memory access for padded indices. However, it causes padded items (with index -1) to be processed using the state of item 0, resulting in non-zero garbage output for these items. Other kernels in this repository handle padding by skipping computation and zeroing out the output for padded items.
To ensure consistent and correct behavior, the output for padded items should be zeroed out. This can be done after the kernel launch, before returning the output tensor.
For example, you could add the following after the kernel call:
if initial_state_indices is not None:
    padding_mask = initial_state_indices < 0
    if padding_mask.any():
        output[padding_mask] = 0
Actionable comments posted: 1
📒 Files selected for processing (1)
flashinfer/gdn_kernels/gdn_decode_bf16_state.py
h_slot_indices = h_slot_indices.clamp(min=0)  # guard -1 padding
output = torch.empty(B, T, HV, V, device=q.device, dtype=q.dtype)
Clamping padding -1 to 0 silently corrupts real state slot 0.
At Line 1997, padding rows are remapped onto a valid slot, so padded entries now participate in gH read/write and can mutate slot 0 state. This fixes OOB but breaks semantics (padding should be no-op). Also, indices >= pool_size remain unchecked.
Please switch to a validity-mask path (process only valid rows, leave padded rows zero/no-op) instead of aliasing pads to slot 0.
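The validity-mask approach the reviewer suggests can be sketched in a few lines. This is a simplified stand-in, not the kernel's actual code: plain Python lists replace tensors, and `decode_step`, `valid`, and the `+ 1.0` update are illustrative names and logic only.

```python
def decode_step(h_slot_indices, pool, pool_size):
    # Validity mask: padding (-1) and out-of-range slots are ignored
    # instead of being aliased onto slot 0.
    valid = [0 <= i < pool_size for i in h_slot_indices]

    output = [0.0] * len(h_slot_indices)  # padded rows stay zero
    for row, (idx, ok) in enumerate(zip(h_slot_indices, valid)):
        if not ok:
            continue  # no-op for padded/invalid rows
        state = pool[idx]           # masked read
        output[row] = state + 1.0   # stand-in for the real state update
        pool[idx] = output[row]     # masked write
    return output

pool = [10.0, 20.0]
out = decode_step([1, -1, 0], pool, pool_size=2)
print(out)   # → [21.0, 0.0, 11.0]  (padded row stays zero)
print(pool)  # → [11.0, 21.0]       (no slot is corrupted by padding)
```

The key property is that padded rows neither read nor write any real slot, so slot 0 keeps its state and padded outputs are exactly zero.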
Based on learnings: In tests/mamba/selective_state_update_triton.py, pad_slot_id is always negative (-1), i.e., a padding sentinel rather than a real slot.
else:
    h_slot_indices = initial_state_indices

h_slot_indices = h_slot_indices.clamp(min=0)  # guard -1 padding
Can we use an in-place update? https://docs.pytorch.org/docs/stable/generated/torch.Tensor.clamp_.html
I feel the out-of-place clamp (current state) is safer. After the call, if the user inspects their initial_state_indices tensor (aliased as h_slot_indices here), the padding markers are gone, i.e., silently corrupted. This would break any logic that relies on those -1s after the decode step (e.g., a loop that checks which slots are padding).
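The aliasing hazard described above can be shown with a tiny stand-alone sketch. Plain Python lists stand in for tensors, and `clamp_inplace` is a hypothetical helper mirroring the effect of `torch.Tensor.clamp_`:

```python
def clamp_inplace(xs, lo):
    # Mutates xs in place, like torch.Tensor.clamp_(min=lo).
    for i, x in enumerate(xs):
        if x < lo:
            xs[i] = lo
    return xs

initial_state_indices = [3, -1, 0, -1]   # -1 marks padding
h_slot_indices = initial_state_indices   # alias, not a copy

clamp_inplace(h_slot_indices, 0)

# The caller's tensor has silently lost its padding markers:
print(initial_state_indices)  # → [3, 0, 0, 0]
# An out-of-place clamp would build a new list/tensor instead,
# leaving initial_state_indices as [3, -1, 0, -1].
```

Any caller code that later checks `idx < 0` to find padded rows would now find none, which is exactly the silent corruption the comment warns about.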
Ideally, we should change the inside of the kernel to avoid loading when idx < 0. I don't have time to test that now.
It is actually easier to change. I modified the kernel so that it uses 0 when the index is negative. PTAL.
…ull buffer

Replace the Python-level clamp(min=0) guard with an in-kernel check: if pool_batch_idx < 0, redirect to slot 0, which is a reserved null buffer (zero-initialized, never allocated to real requests). This means:
- State reads from slot 0 return zeros (correct fresh initial state)
- State writes to slot 0 are harmlessly discarded
- No per-call tensor allocation at the Python level

Applied to all 3 kernel variants: seqlen1, seqlen234_unified, seqlen1_lowBS_1chunk.

Fixes sgl-project/sglang#20791 (accuracy degradation from OOB access on negative padding indices). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
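The slot-0 null-buffer pattern described in this commit can be sketched outside the kernel. This is a minimal model, not the kernel itself: a plain Python list of per-slot states stands in for the state pool, and `load_state`/`store_state` are illustrative names.

```python
NULL_SLOT = 0  # reserved, zero-initialized, never handed to a real request

def load_state(pool, idx):
    # In-kernel guard: negative (padding) indices read the null buffer,
    # which holds zeros -- i.e., a fresh initial state.
    return pool[idx if idx >= 0 else NULL_SLOT]

def store_state(pool, idx, new_state):
    # Writes from padded rows land in slot 0. They are harmless because
    # slot 0 is never read back as a real request's state.
    pool[idx if idx >= 0 else NULL_SLOT] = new_state

pool = [0.0, 1.5, 2.5]               # slot 0 is the null buffer
assert load_state(pool, -1) == 0.0   # padding reads zeros
assert load_state(pool, 2) == 2.5    # real slots are unaffected
store_state(pool, -1, 9.9)           # padded write touches only slot 0
assert pool[1:] == [1.5, 2.5]        # real state slots untouched
```

Compared with the Python-level clamp, the guard lives inside the kernel, so no extra tensor is allocated per call and the caller's initial_state_indices keeps its -1 markers.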
Verifies that the bf16 fast-path kernel handles negative (padding) indices correctly via the slot-0 null buffer pattern: - Valid slots produce correct output and state updates (vs. direct-state ref) - Unused real slots are exactly untouched - Slot 0 (null buffer) is excluded from correctness checks Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Remove module-level pytestmark skip (added in flashinfer-ai#2600) and replace with per-function @pytest.mark.skip on the previously-failing tests, so that the new test_decode_kernel_bf16_padding_indices runs in CI while the others remain skipped until their failures are addressed. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Redirect negative pool_batch_idx to slot 0 (null buffer) inside all 3 bf16 decode kernel variants to prevent OOB memory access on padding indices. Also adds a unit test and enables it in CI by replacing the stale module-level skip with per-test skips.
Actionable comments posted: 1
🧹 Nitpick comments (2)
tests/gdn/test_decode_delta_rule.py (2)
205: Prefer xfail (with tracking) over broad unconditional skips. These decorators silence a large part of the suite and can hide regressions. Consider xfail(strict=False) plus an issue link so failures stay visible in CI trends.

♻️ Suggested pattern:

-@pytest.mark.skip(reason="Temporarily skipped due to CI failures.")
+@pytest.mark.xfail(
+    reason="Temporarily unstable in CI; tracked by <issue-url>",
+    strict=False,
+)

Also applies to lines 371, 517, 773, 801, 1168, 1209, 1422, 1454.
881-884: Make padding-path coverage deterministic for batch_size=1. Line 881 currently makes batch_size=1 coverage seed-dependent; the negative-index path may be missed for some seeds.

♻️ Suggested change:

-    if batch_size >= 2:
-        mask[0] = False  # ensure at least one valid
-        mask[-1] = True  # ensure at least one padding
+    if batch_size == 1:
+        mask[0] = True  # always exercise padding path
+    else:
+        mask[0] = False  # ensure at least one valid
+        mask[-1] = True  # ensure at least one padding
📒 Files selected for processing (1)
tests/gdn/test_decode_delta_rule.py
used = indices[valid_mask].to(device)
unused_mask = torch.ones(pool_size, dtype=torch.bool, device=device)
unused_mask[used] = False
unused_mask[0] = False  # slot 0 may be modified (null buffer), don't check it
Cast index tensor to long before advanced indexing.
Line 933 uses used (an int32 tensor) for advanced indexing into unused_mask. PyTorch's advanced indexing operations prefer and consistently support torch.long indices; int32 indices may work but can produce inconsistent or unexpected results across PyTorch versions. Cast to torch.long to ensure reliable behavior.
Proposed fix:

-    used = indices[valid_mask].to(device)
+    used = indices[valid_mask].to(device=device, dtype=torch.long)
     unused_mask = torch.ones(pool_size, dtype=torch.bool, device=device)
     unused_mask[used] = False
/bot run
    "num_q_heads, num_k_heads, num_v_heads",
    [(16, 16, 32)],
)
@pytest.mark.skip(reason="Temporarily skipped due to CI failures.")
|
[SUCCESS] Pipeline #46663351: 14/20 passed
Clamp negative slot indices to 0 before passing to the bf16 fast-path kernel to prevent out-of-bounds memory access when padding indices (-1) are present in initial_state_indices.
📌 Description
🔍 Related Issues
🚀 Pull Request Checklist
Thank you for contributing to FlashInfer! Before we review your pull request, please make sure the following items are complete.
✅ Pre-commit Checks
- Installed pre-commit by running pip install pre-commit (or used your preferred method).
- Installed the hooks with pre-commit install.
- Ran pre-commit run --all-files and fixed any reported issues.

🧪 Tests
- Added or updated tests as needed (unittest, etc.).

Reviewer Notes
Summary by CodeRabbit

Bug Fixes
- Guarded negative (padding) slot indices in the bf16 decode kernels to prevent out-of-bounds memory access.

Tests
- Added a unit test covering negative padding indices and re-enabled it in CI by replacing the stale module-level skip with per-test skips.