☑️ Fix problems solvable by a constant fill or eliminating a redundant operation #108
Conversation
…ltiply_GlobalAvgPool_GlobalAvgPool_Mean
…pout (constant output fix)
Thanks for all the work!!! (especially the changelog / tests; they made reviewing things much easier). You also found a bunch of bugs in level 3, thanks :) Generally things look good!
General points of feedback: Level2/23... is better fixed by adjusting the mean, but this is more of a nit.
For level 3, these are supposed to represent actual models. Let's try to be faithful to the models we are trying to test :) I mentioned some fixes.
Two meta points for discussion (@simonguozirui / whoever has an opinion also chime in).
- For level 2, the point is to evaluate against compositions of pytorch operations. Therefore, I actually think it's better to augment the problem set with the problems you modified. imo if a model figures out "oh hey, this is redundant, let's get rid of this" that's a really useful insight!!! I'd personally just tag these types of problems so they could be filtered out if needed. imo it's correct to do another release which fixes the problems.
- For the tests you added: imo doing the constant check as a periodic sweep over the problems is a good idea (in practice you write the test and just have the CI run it when you change problems).
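For reference, a minimal sketch of what such a periodic constant-output check could look like; the helper names (looks_constant, get_inputs) and thresholds are illustrative assumptions, not KernelBench's actual harness API:

```python
import torch

def looks_constant(model, get_inputs, n_trials: int = 8, tol: float = 1e-6) -> bool:
    """Return True if the model's output barely changes across fresh random inputs,
    i.e. the problem could plausibly be solved by a constant fill."""
    outputs = []
    with torch.no_grad():
        for _ in range(n_trials):
            out = model(*get_inputs())                # fresh random inputs each trial
            outputs.append(out.detach().float().flatten())
    spread = torch.stack(outputs).std(dim=0).max()    # per-element spread across trials
    return spread.item() < tol
```

In CI this could run as a test parametrized over the problem files that changed, failing for any problem where looks_constant returns True.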
Fix: Replaced mean with amax (global max pooling).
- x = x.mean(dim=[1, 2, 3, 4])
+ x = x.amax(dim=[1, 2, 3, 4])
imo a better fix here is to just do x.mean(dim=[2, 3, 4]) instead of x.mean(dim=[1, 2, 3, 4]) as it gets around the normalization issue but doesn't change the ops of the problem. It changes the output shape, but that should be fine.
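For concreteness, a quick illustration of the shape difference (the 5D tensor shape here is chosen for illustration, not taken from the problem file):

```python
import torch

x = torch.randn(16, 32, 4, 4, 4)        # (batch, channels, D, H, W); illustrative shape
print(x.mean(dim=[1, 2, 3, 4]).shape)   # torch.Size([16])     - also averages over channels
print(x.mean(dim=[2, 3, 4]).shape)      # torch.Size([16, 32]) - keeps a per-channel mean
```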
Thanks for the suggestion, change applied.
5. level3/36_LSTMHn.py
This smells like a bug in the original code; the fix should just be to return out.
@PaliC responding to this and the comments below: the idea when fixing these problems was to remain backward compatible. The merit of that is that all evaluations where LLMs exploited the redundancy (e.g., in published research papers) will remain legitimate after the fix (changing the output makes all these problems harder, so comparing evaluations across versions becomes even trickier).
That said, I also agree it's more sensible to return the actual model's output. One more maintainer vote would be great @simonguozirui.
I see what you’re saying. However, the changes we’re making for constant outputs aren’t backwards compatible. Similarly, the last version bump of KernelBench did invalidate other LLM solutions (as shapes and distributions changed). If it’s in the spirit of a more useful benchmark, I think it’s correct to break backwards compatibility here (as we’ve done before) with the next version of KernelBench.
In this case we're fixing what looks like a mistake in the initial release and shipping something that's more akin to the tasks we want LLMs to accomplish. Part of the utility of an eval is its practicality; for KernelBench that's in levels 1 and 3, so we should aim to make those problems useful.
Regardless, @simonguozirui, chip in. I'll respect whatever the decision ends up being.
I do side with your view, even though I now remember that one of the reasons I did this was that the last KB release had indeed focused on minimizing breaking changes, as noted in the blog post.
Yes, I think ensuring the practicality of the benchmark is more meaningful. I hope future papers remember to include the version.
6. level3/37_LSTMCn.py
Same as above; the fix should just be to return out.
out = self.fc(out[:, -1, :])  # out: tensor of shape (batch_size, output_size)

_, state = self.lstm(x, (h0, c0))
return state[1]
Just return out instead, as that is more faithful to what an LSTM is supposed to do.
out = self.fc(out[:, -1, :])  # out: tensor of shape (batch_size, output_size)

_, state = self.lstm(x, (h0, c0))
return state[0]
Just return out instead, as that is more faithful to what an LSTM is supposed to do.
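A minimal sketch of what "just return out" would look like, based on the lines quoted above; the constructor arguments and forward signature are illustrative, not the exact problem definitions:

```python
import torch
import torch.nn as nn

class LSTMReturnOut(nn.Module):
    """Sketch of the suggested fix: run the LSTM, project the last timestep with fc,
    and return that projection rather than state[0] / state[1]."""
    def __init__(self, input_size, hidden_size, num_layers, output_size):
        super().__init__()
        self.lstm = nn.LSTM(input_size, hidden_size, num_layers, batch_first=True)
        self.fc = nn.Linear(hidden_size, output_size)

    def forward(self, x, h0, c0):
        out, state = self.lstm(x, (h0, c0))   # out: (batch_size, seq_len, hidden_size)
        out = self.fc(out[:, -1, :])          # out: (batch_size, output_size)
        return out                            # instead of returning state[0] or state[1]
```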
7. level3/49_Mamba2ReturnFinalState.py
This is another case where the model has a bug: https://github.com/state-spaces/mamba/blob/620cd9816997730a652b7c21d1b59c802e35add0/mamba_ssm/modules/ssd_minimal.py#L34 (@simonguozirui lmk if this is correct).
I'd implement lines 71-78 of the snippet.
I forget if KernelBench supports evaluating tuples, but if it doesn't, I'd just flatten and concat the output.
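If tuple outputs turn out not to be supported, the flatten-and-concat workaround could look like this minimal sketch (the Y / final_state names follow the linked ssd_minimal snippet; the helper name and shapes are assumptions):

```python
import torch

def pack_outputs(Y: torch.Tensor, final_state: torch.Tensor) -> torch.Tensor:
    """Flatten each output to (batch, features) and concatenate them, so the harness
    can compare a single tensor instead of a tuple."""
    return torch.cat([Y.flatten(start_dim=1), final_state.flatten(start_dim=1)], dim=1)
```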
Thank you @PaliC for the review. As for what you said about integrating the constant check into CI, that could be future work; there doesn't seem to be a CI testing system in place, and it's nontrivial (though possible) to run the CI actions on a GPU.
@EssamWisam Yeah, it was more for discussion, and imo it's worth making an issue for someone to pick up. Also, a GPU shouldn't be needed to check torch numerics (distributions should be the same to the extent we care).
Yes, I meant it can take way too long on CPU; at least on my MacBook, that was the case. The constant check could also be generalized to a variance check; problem tolerance could then also depend on the variance of the kernel's outputs over the input distribution (see the sketch below).
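One possible shape of that generalization, as a minimal sketch; the function name, thresholds, and the way tolerance is scaled are all assumptions rather than an agreed-upon design:

```python
import torch

def variance_scaled_tol(ref_outputs: torch.Tensor, base_tol: float = 1e-2,
                        floor: float = 1e-12) -> float:
    """ref_outputs: reference outputs stacked over several random-input trials.
    Scale the comparison tolerance by the spread of the outputs over the input
    distribution; a spread near zero flags a (near-)constant problem."""
    spread = ref_outputs.float().std(dim=0).max().item()
    return base_tol * max(spread, floor)
```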
Oh interesting, do you remember what the bottlenecks were? nw if not. For a rarely-running CI it's likely fine (if it's bad on the Linux machines GitHub gives us, we can just run the touched files).
It would get stuck on some problems for too long, or crash. I didn't debug why specifically, but I presumed that model and input size for some problems were eating up RAM (e.g., 16 GB of CPU RAM compared to an H100's 80 GB of VRAM). That said, detecting the problems in the original post was based on a reward-hack detection mechanism that is reasonably different.
I have identified that a number of KernelBench problems:
- produce constant outputs (and are therefore solvable by a constant fill), or
- contain redundant operations that can be eliminated.
This results in astronomical speedups for some of these problems, which does not reflect the agent's ability to perform genuine optimizations but rather its ability to exploit flaws in the given program. Correspondingly, an agent performing genuine optimizations while remaining logically equivalent to the program would likely underperform an agent that instead focuses on exploiting these flaws. Not to mention that whether the agent exploits the flaw at all could come down to luck.
In proposing fixes to each of the problems, I tried to look for the most minimal change that would fix the problem. All the fixes to the redundant-operations flaw are non-breaking (correctness-wise).
The PR includes:
https://github.com/EssamWisam/KernelBench/blob/5763a8a6037d43d12b99e7e6df4125d086a17f86/KernelBench/changelog/redundant_op_fixes.txt
The changelog includes a list of TODOs, which are problem renames. I delayed applying these until the fixes are approved.