
Conversation

@EssamWisam commented Dec 20, 2025

I have identified a number of KernelBench problems that:

  • Can be solved by a kernel returning a constant output (constant fill)
  • Can be solved by a kernel eliminating one of the operations completely

This results in astronomical speedups for some of these problems, which reflects not the agent's ability to perform genuine optimizations but rather its ability to exploit flaws in the given program. Correspondingly, an agent performing genuine optimizations while remaining logically equivalent to the program code would likely underperform an agent that instead focuses on exploiting these flaws. Not to mention that whether the agent exploits the flaw can depend on "luck".

In proposing fixes for each of the problems, I looked for the most minimal change that would resolve the issue. All the fixes to the redundant-operations flaw are non-breaking (correctness-wise).
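
For context, a minimal sketch of the kind of check that can flag the first flaw (constant fill). This is not the exact detection mechanism I used; it assumes the usual problem-module convention of exposing Model, get_inputs, and get_init_inputs:

import importlib.util
import torch

def looks_constant(problem_path, n_trials=5, rel_tol=1e-4):
    # Load the problem module from its file path.
    spec = importlib.util.spec_from_file_location("problem", problem_path)
    mod = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(mod)

    model = mod.Model(*mod.get_init_inputs()).eval()
    with torch.no_grad():
        outs = torch.stack([model(*mod.get_inputs()).float() for _ in range(n_trials)])

    # If the outputs vary far less than their magnitude, a constant-fill
    # kernel would pass the correctness check.
    spread = (outs - outs.mean(dim=0)).abs().max()
    scale = outs.abs().max().clamp_min(1e-12)
    return (spread / scale).item() < rel_tol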

The PR includes:

The changelog includes a list of todos, which are problem renames. I delayed applying these until we approve the fixes.

@PaliC (Collaborator) left a comment

Thanks for all the work!!! (especially the changelog / tests; they made reviewing things much easier). You also found a bunch of bugs in level 3, thanks :) Generally things look good!

A few general points of feedback: Level2/23... is better fixed by adjusting the mean, but this is more of a nit.

For level 3, these are supposed to represent actual models. Let's try to be faithful to the models we are trying to test :) I mentioned some fixes.

Two meta points for discussion (@simonguozirui / whoever has an opinion, please also chime in).

  • For level 2, the point is to evaluate against compositions of PyTorch operations. Therefore, I actually think it's better to augment the problem set with the problems you modified. imo if a model figures out "oh hey, this is redundant, let's get rid of this" that's a really useful insight!!! I'd personally just tag these types of problems so they could be filtered out if needed. imo it's correct to do another release which fixes the problems.

  • For the tests you added: imo doing the constant check as a periodic sweep over the problems is a good idea (in practice you write the test and just have CI run it when you change problems); a rough sketch follows below.
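
Something like the following sketch (the file layout and problem-module convention are assumptions; the check itself mirrors the constant-output idea above):

import glob
import importlib.util

import pytest
import torch

# Assumed layout: problems live under KernelBench/level2/ and each module
# exposes Model, get_inputs, and get_init_inputs.
PROBLEMS = sorted(glob.glob("KernelBench/level2/*.py"))

def _load(path):
    spec = importlib.util.spec_from_file_location("problem", path)
    mod = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(mod)
    return mod

@pytest.mark.parametrize("path", PROBLEMS)
def test_output_is_not_constant(path):
    mod = _load(path)
    model = mod.Model(*mod.get_init_inputs()).eval()
    with torch.no_grad():
        a = model(*mod.get_inputs())
        b = model(*mod.get_inputs())
    # Two independent random inputs should not produce (near-)identical outputs;
    # if they do, a constant-fill kernel could game the problem.
    assert not torch.allclose(a, b, rtol=1e-3, atol=1e-5)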


Fix: Replaced mean with amax (global max pooling).
- x = x.mean(dim=[1, 2, 3, 4])
+ x = x.amax(dim=[1, 2, 3, 4])
Collaborator:

imo a better fix here is to just do x.mean(dim=[2, 3, 4]) instead of x.mean(dim=[1, 2, 3, 4]), as it gets around the normalization issue without changing the ops of the problem. It changes the output shape, but that should be fine.

https://github.com/ScalingIntelligence/KernelBench/blob/main/KernelBench/level2/27_Conv3d_HardSwish_GroupNorm_Mean.py does this
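
In diff form (mirroring the format above), the suggested alternative would be:

- x = x.mean(dim=[1, 2, 3, 4])
+ x = x.mean(dim=[2, 3, 4])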

Author (@EssamWisam):

Thanks for the suggestion, change applied.


--------------------------------------------------------------------------------

5. level3/36_LSTMHn.py
Collaborator:

This smells like a bug in the original code; the fix should just be to return out (rough sketch below).
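
A simplified, hypothetical version of the pattern (not the exact contents of 36_LSTMHn.py):

import torch
import torch.nn as nn

class LSTMSketch(nn.Module):
    # Hypothetical, simplified module illustrating the suggested fix: the original
    # problem computes `out` through the fc layer but then returns part of the
    # LSTM state; the fix is to return `out` itself.
    def __init__(self, input_size, hidden_size, num_layers, output_size):
        super().__init__()
        self.lstm = nn.LSTM(input_size, hidden_size, num_layers, batch_first=True)
        self.fc = nn.Linear(hidden_size, output_size)

    def forward(self, x, h0, c0):
        out, state = self.lstm(x, (h0, c0))  # out: (batch, seq_len, hidden_size)
        out = self.fc(out[:, -1, :])         # out: (batch, output_size)
        return out                           # instead of returning state[0] / state[1]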

@EssamWisam (Author) commented Jan 4, 2026

@PaliC, responding to this and the comments below: the idea when fixing these problems was to remain backward compatible. The merit of that is that all evaluations where LLMs exploited the redundancy (e.g., published research papers) remain legit after the fix (changing the output makes all these problems harder, so comparing evaluations across versions becomes even trickier).

That said, I also agree it's more sensible to return the actual model's output. One more maintainer vote would be great, @simonguozirui.

Collaborator:

I see what you’re saying. However, the changes we’re making for constant outputs aren’t backwards compatible. Similarly, the last version bump of KernelBench did invalidate other LLM solutions (as shapes and distributions changed). If it’s in the spirit of a more useful benchmark, I think it’s correct to break backwards compatibility here (as we’ve done before) with the next version of KernelBench.

In this case we're fixing what looks like a mistake in the initial release and shipping something that's more akin to the tasks we want LLMs to accomplish. Part of the utility of an eval is its practicality; for KernelBench that's in levels 1 and 3, so we should aim to make those problems useful.

Regardless, @simonguozirui, chip in. I'll respect whatever the decision ends up being.

Author (@EssamWisam):

I do side with your view, even though I now remember that one of the reasons I did this was that the last KB release had indeed focused on minimizing breaking changes, as noted in the blog post.

Yes, I think ensuring the practicality of the benchmark is more meaningful. I hope future papers remember to include the version.


--------------------------------------------------------------------------------

6. level3/37_LSTMCn.py
Collaborator:

Same as above; the fix should just be to return out.

out = self.fc(out[:, -1, :]) # out: tensor of shape (batch_size, output_size)

_, state = self.lstm(x, (h0, c0))
return state[1]
Collaborator:

Just return out instead, as that is more faithful to what an LSTM is supposed to do.

out = self.fc(out[:, -1, :]) # out: tensor of shape (batch_size, output_size)

_, state = self.lstm(x, (h0, c0))
return state[0]
Collaborator:

Just return out instead, as that is more faithful to what an LSTM is supposed to do.


--------------------------------------------------------------------------------

7. level3/49_Mamba2ReturnFinalState.py
Collaborator:

This is another case where the model has a bug: https://github.com/state-spaces/mamba/blob/620cd9816997730a652b7c21d1b59c802e35add0/mamba_ssm/modules/ssd_minimal.py#L34 (@simonguozirui lmk if this is correct).

I'd implement lines 71-78 of the snippet.

I forget if KernelBench supports evaluating tuples, but if it doesn't, I'd just flatten and concat the output.
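
If tuples indeed aren't supported, the flatten-and-concat workaround could look roughly like this (pack_outputs is a hypothetical helper, not existing KernelBench code):

import torch

def pack_outputs(outputs):
    # Flatten every tensor in the tuple and concatenate into one 1-D tensor,
    # so the harness compares a single tensor instead of a tuple.
    return torch.cat([t.reshape(-1) for t in outputs])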

@EssamWisam (Author):

Thank you @PaliC for the review. As for what you said about integrating the constant check into CI, that could be future work; there doesn't seem to be a CI testing system in place, and it's nontrivial (though possible) to run CI actions on a GPU.

@PaliC (Collaborator) commented Jan 5, 2026

@EssamWisam Yeah, it was more for discussion, and imo it's worth making an issue for someone to pick up. Also, a GPU shouldn't be needed to check torch numerics (the distributions should be the same to the extent we care).

@EssamWisam (Author):

Yes, I meant it can take way too long on CPU. At least on my MacBook, that was the case.

The constant check could also be generalized to a variance check; problem tolerance could then also depend on the variance of the kernel's outputs over the input distribution (rough sketch below).
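
For instance, something along these lines (the helper name and scaling are purely illustrative, not a concrete proposal; model and get_inputs follow the problem-module convention):

import torch

def variance_scaled_atol(model, get_inputs, n_trials=8, base_atol=1e-4):
    # Estimate how much the reference output varies over the input distribution
    # and scale the absolute tolerance accordingly: outputs with small variance
    # get a proportionally tighter tolerance.
    with torch.no_grad():
        outs = torch.stack([model(*get_inputs()).float() for _ in range(n_trials)])
    spread = outs.std(dim=0).mean().item()  # average per-element std across inputs
    return base_atol * max(spread, 1e-6)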

@PaliC (Collaborator) commented Jan 5, 2026

Oh interesting, do you remember what the bottlenecks were? nw if not.

For a rarely-running CI it's likely fine (if it's bad on the Linux machines GitHub gives us, we can just run the touched files).

@EssamWisam (Author):

> Oh interesting, do you remember what the bottlenecks were? nw if not.
>
> For a rarely-running CI it's likely fine (if it's bad on the Linux machines GitHub gives us, we can just run the touched files).

It would get stuck on some problems for too long or would crash. I didn't debug exactly why, but I presumed it was just the model and input sizes for some problems eating up RAM (e.g., 16 GB of CPU RAM compared to an H100's 80 GB of VRAM).

That said, detecting the problems in the original post was based on a reward-hack detection mechanism that is reasonably different.
