
Conversation

@EssamWisam commented Dec 20, 2025

I have identified a number of KernelBench problems that:

  • Can be solved by a kernel returning a constant output (constant fill)
  • Can be solved by a kernel eliminating one of the operations completely

This results in astronomical speedups for some of these problems, which reflects not the agent's ability to perform genuine optimizations but rather its ability to exploit flaws in the given program. Correspondingly, an agent performing genuine optimizations while remaining logically equivalent to the program code would likely underperform an agent that instead focuses on exploiting these flaws. Not to mention that whether the agent exploits the flaw can depend on "luck".

In proposing fixes for each of the problems, I looked for the most minimal change that would resolve the issue. All the fixes to the redundant-operations flaw are non-breaking (correctness-wise).
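
For context, a minimal sketch of the kind of check that can flag the first flaw (constant fill). This is not the exact detection mechanism I used; it assumes the usual problem-module convention of exposing Model, get_inputs, and get_init_inputs:

import importlib.util
import torch

def looks_constant(problem_path, n_trials=5, rel_tol=1e-4):
    # Load the problem module from its file path.
    spec = importlib.util.spec_from_file_location("problem", problem_path)
    mod = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(mod)

    model = mod.Model(*mod.get_init_inputs()).eval()
    with torch.no_grad():
        outs = torch.stack([model(*mod.get_inputs()).float() for _ in range(n_trials)])

    # If the outputs vary far less than their magnitude, a constant-fill
    # kernel would pass the correctness check.
    spread = (outs - outs.mean(dim=0)).abs().max()
    scale = outs.abs().max().clamp_min(1e-12)
    return (spread / scale).item() < rel_tol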

The PR includes:

The changelog includes a list of todos, which are problem renames. I delayed applying these until we approve the fixes.

@PaliC (Collaborator) left a comment

Thanks for all the work!!! (especially the changelog / tests; they made reviewing things much easier). You also found a bunch of bugs in level 3, thanks :) Generally things look good!

A few general points of feedback: Level2/23... is better fixed by adjusting the mean, but this is more of a nit.

For level 3, these are supposed to represent actual models. Let's try to be faithful to the models we are trying to test :) I mentioned some fixes.

Two meta points for discussion (@simonguozirui / whoever has an opinion, please also chime in).

  • For level 2, the point is to evaluate against compositions of PyTorch operations. Therefore, I actually think it's better to augment the problem set with the problems you modified. imo if a model figures out "oh hey, this is redundant, let's get rid of this" that's a really useful insight!!! I'd personally just tag these types of problems so they could be filtered out if needed. imo it's correct to do another release which fixes the problems.

  • For the tests you added: imo doing the constant check as a periodic sweep over the problems is a good idea (in practice you write the test and just have CI run it when you change problems); a rough sketch follows below.
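
Something like the following sketch (the file layout and problem-module convention are assumptions; the check itself mirrors the constant-output idea above):

import glob
import importlib.util

import pytest
import torch

# Assumed layout: problems live under KernelBench/level2/ and each module
# exposes Model, get_inputs, and get_init_inputs.
PROBLEMS = sorted(glob.glob("KernelBench/level2/*.py"))

def _load(path):
    spec = importlib.util.spec_from_file_location("problem", path)
    mod = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(mod)
    return mod

@pytest.mark.parametrize("path", PROBLEMS)
def test_output_is_not_constant(path):
    mod = _load(path)
    model = mod.Model(*mod.get_init_inputs()).eval()
    with torch.no_grad():
        a = model(*mod.get_inputs())
        b = model(*mod.get_inputs())
    # Two independent random inputs should not produce (near-)identical outputs;
    # if they do, a constant-fill kernel could game the problem.
    assert not torch.allclose(a, b, rtol=1e-3, atol=1e-5)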


Fix: Replaced mean with amax (global max pooling).
- x = x.mean(dim=[1, 2, 3, 4])
+ x = x.amax(dim=[1, 2, 3, 4])
Collaborator:

imo a better fix here is to just do x.mean(dim=[2, 3, 4]) instead of x.mean(dim=[1, 2, 3, 4]), as it gets around the normalization issue without changing the ops of the problem. It changes the output shape, but that should be fine.

https://github.com/ScalingIntelligence/KernelBench/blob/main/KernelBench/level2/27_Conv3d_HardSwish_GroupNorm_Mean.py does this
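
In diff form (mirroring the format above), the suggested alternative would be:

- x = x.mean(dim=[1, 2, 3, 4])
+ x = x.mean(dim=[2, 3, 4])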

Author (@EssamWisam):

Thanks for the suggestion, change applied.


--------------------------------------------------------------------------------

5. level3/36_LSTMHn.py
Collaborator:

This smells like a bug in the original code; the fix should just be to return out (rough sketch below).
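
A simplified, hypothetical version of the pattern (not the exact contents of 36_LSTMHn.py):

import torch
import torch.nn as nn

class LSTMSketch(nn.Module):
    # Hypothetical, simplified module illustrating the suggested fix: the original
    # problem computes `out` through the fc layer but then returns part of the
    # LSTM state; the fix is to return `out` itself.
    def __init__(self, input_size, hidden_size, num_layers, output_size):
        super().__init__()
        self.lstm = nn.LSTM(input_size, hidden_size, num_layers, batch_first=True)
        self.fc = nn.Linear(hidden_size, output_size)

    def forward(self, x, h0, c0):
        out, state = self.lstm(x, (h0, c0))  # out: (batch, seq_len, hidden_size)
        out = self.fc(out[:, -1, :])         # out: (batch, output_size)
        return out                           # instead of returning state[0] / state[1]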

@EssamWisam (Author) commented Jan 4, 2026

@PaliC, responding to this and the comments below: the idea when fixing these problems was to remain backward compatible. The merit of that is that all evaluations where LLMs exploited the redundancy (e.g., published research papers) remain legit after the fix (changing the output makes all these problems harder, so comparing evaluations across versions becomes even trickier).

That said, I also agree it's more sensible to return the actual model's output. One more maintainer vote would be great, @simonguozirui.

Collaborator:

I see what you’re saying. However, the changes we’re making for constant outputs aren’t backwards compatible. Similarly, the last version bump of KernelBench did invalidate other LLM solutions (as shapes and distributions changed). If it’s in the spirit of a more useful benchmark, I think it’s correct to break backwards compatibility here (as we’ve done before) with the next version of KernelBench.

In this case we're fixing what looks like a mistake in the initial release and shipping something that's more akin to the tasks we want LLMs to accomplish. Part of the utility of an eval is its practicality; for KernelBench that's in levels 1 and 3, so we should aim to make those problems useful.

Regardless, @simonguozirui, chip in. I'll respect whatever the decision ends up being.

Author (@EssamWisam):

I do side with your view, even though I now remember that one of the reasons I did this was that the last KB release had indeed focused on minimizing breaking changes, as noted in the blog post.

Yes, I think ensuring the practicality of the benchmark is more meaningful. I hope future papers remember to include the version.


--------------------------------------------------------------------------------

6. level3/37_LSTMCn.py
Collaborator:

Same as above; the fix should just be to return out.

out = self.fc(out[:, -1, :]) # out: tensor of shape (batch_size, output_size)

_, state = self.lstm(x, (h0, c0))
return state[1]
Collaborator:

Just return out instead, as that is more faithful to what an LSTM is supposed to do.

out = self.fc(out[:, -1, :]) # out: tensor of shape (batch_size, output_size)

_, state = self.lstm(x, (h0, c0))
return state[0]
Collaborator:

Just return out instead, as that is more faithful to what an LSTM is supposed to do.


--------------------------------------------------------------------------------

7. level3/49_Mamba2ReturnFinalState.py
Collaborator:

This is another case where the model has a bug: https://github.com/state-spaces/mamba/blob/620cd9816997730a652b7c21d1b59c802e35add0/mamba_ssm/modules/ssd_minimal.py#L34 (@simonguozirui lmk if this is correct).

I'd implement lines 71-78 of the snippet.

I forget if KernelBench supports evaluating tuples, but if it doesn't, I'd just flatten and concat the output.
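
If tuples indeed aren't supported, the flatten-and-concat workaround could look roughly like this (pack_outputs is a hypothetical helper, not existing KernelBench code):

import torch

def pack_outputs(outputs):
    # Flatten every tensor in the tuple and concatenate into one 1-D tensor,
    # so the harness compares a single tensor instead of a tuple.
    return torch.cat([t.reshape(-1) for t in outputs])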

@EssamWisam (Author):

Thank you @PaliC for the review. As for what you said about integrating the constant check into CI, that could be future work; there doesn't seem to be a CI testing system in place, and it's nontrivial (though possible) to run CI actions on a GPU.

@PaliC (Collaborator) commented Jan 5, 2026

@EssamWisam Yeah, it was more for discussion, and imo it's worth making an issue for someone to pick up. Also, a GPU shouldn't be needed to check torch numerics (the distributions should be the same to the extent we care).

@EssamWisam (Author):

Yes, I meant it can take way too long on CPU. At least on my MacBook, that was the case.

The constant check could also be generalized to a variance check; problem tolerance could then also depend on the variance of the kernel's outputs over the input distribution (rough sketch below).
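
For instance, something along these lines (the helper name and scaling are purely illustrative, not a concrete proposal; model and get_inputs follow the problem-module convention):

import torch

def variance_scaled_atol(model, get_inputs, n_trials=8, base_atol=1e-4):
    # Estimate how much the reference output varies over the input distribution
    # and scale the absolute tolerance accordingly: outputs with small variance
    # get a proportionally tighter tolerance.
    with torch.no_grad():
        outs = torch.stack([model(*get_inputs()).float() for _ in range(n_trials)])
    spread = outs.std(dim=0).mean().item()  # average per-element std across inputs
    return base_atol * max(spread, 1e-6)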

@PaliC (Collaborator) commented Jan 5, 2026

Oh interesting, do you remember what the bottlenecks were? nw if not.

For a rarely-running CI it's likely fine (if it's bad on the Linux machines GitHub gives us, we can just run the touched files).

@EssamWisam (Author):

> Oh interesting, do you remember what the bottlenecks were? nw if not.
>
> For a rarely-running CI it's likely fine (if it's bad on the Linux machines GitHub gives us, we can just run the touched files).

It would get stuck on some problems for too long or would crash. I didn't debug exactly why, but I presumed it was just the model and input sizes for some problems eating up RAM (e.g., 16 GB of CPU RAM compared to an H100's 80 GB of VRAM).

That said, detecting the problems in the original post was based on a reward-hack detection mechanism that is reasonably different.
