Triton Performance Worse on Split Activation in Forward Pass #1186
Unanswered
xanderdunn asked this question in Q&A
Running this benchmark file as-is produces this output:

This is the forward pass in pytorch:
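Roughly, the split-activation forward looks like the sketch below; the gelu(z1) * z2 gating, the fused (K, 2N) weight layout, and the names are illustrative assumptions rather than the literal code from the benchmark file:

```python
import torch
import torch.nn.functional as F

def split_activation_forward(x: torch.Tensor, w: torch.Tensor) -> torch.Tensor:
    # x: (M, K), w: (K, 2 * N). One matmul produces both halves of the
    # activation; the left half is passed through GELU and gates the right half.
    z = x @ w                    # (M, 2 * N)
    z1, z2 = z.chunk(2, dim=-1)  # two (M, N) halves: z1 -> GELU, z2 -> gate
    return F.gelu(z1) * z2       # split (gated) activation
```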
The equivalent triton kernel implementation is the gelu_partial_layer_fused_forward function.
I'm surprised that the triton performance is so much worse. Do you see any issues with the kernel implementation? It's a small modification of the provided matmul tutorial. I wonder if it's perhaps related to the experience in #984, where @jmc128 found that having two accumulators harmed triton kernel performance. That's essentially what I have here, where accumulator_left is z1 and accumulator_right is z2.
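For concreteness, the two-accumulator structure in question looks roughly like the sketch below. This is modeled on the matmul tutorial and is not the actual gelu_partial_layer_fused_forward kernel: the block sizes, the sigmoid-approximation GELU, the assumption that B stores the left and right halves N columns apart, and the omission of boundary masks are all simplifications for illustration.

```python
import torch
import triton
import triton.language as tl

@triton.jit
def gelu_split_matmul_kernel(
    a_ptr, b_ptr, c_ptr,
    M, N, K,
    stride_am, stride_ak,
    stride_bk, stride_bn,
    stride_cm, stride_cn,
    BLOCK_M: tl.constexpr, BLOCK_N: tl.constexpr, BLOCK_K: tl.constexpr,
):
    # Each program computes one BLOCK_M x BLOCK_N tile of
    # C = gelu(A @ B_left) * (A @ B_right), where B is (K, 2*N) with the left
    # half in columns [0, N) and the right half in columns [N, 2*N).
    pid_m = tl.program_id(0)
    pid_n = tl.program_id(1)

    offs_m = pid_m * BLOCK_M + tl.arange(0, BLOCK_M)
    offs_n = pid_n * BLOCK_N + tl.arange(0, BLOCK_N)
    offs_k = tl.arange(0, BLOCK_K)

    a_ptrs = a_ptr + offs_m[:, None] * stride_am + offs_k[None, :] * stride_ak
    b_left_ptrs = b_ptr + offs_k[:, None] * stride_bk + offs_n[None, :] * stride_bn
    b_right_ptrs = b_left_ptrs + N * stride_bn  # right half is N columns over

    # Two accumulators, analogous to accumulator_left (z1) and accumulator_right (z2).
    acc_left = tl.zeros((BLOCK_M, BLOCK_N), dtype=tl.float32)
    acc_right = tl.zeros((BLOCK_M, BLOCK_N), dtype=tl.float32)

    # No boundary masks: M, N, K are assumed to be multiples of the block sizes.
    for _ in range(0, tl.cdiv(K, BLOCK_K)):
        a = tl.load(a_ptrs)
        acc_left += tl.dot(a, tl.load(b_left_ptrs))
        acc_right += tl.dot(a, tl.load(b_right_ptrs))
        a_ptrs += BLOCK_K * stride_ak
        b_left_ptrs += BLOCK_K * stride_bk
        b_right_ptrs += BLOCK_K * stride_bk

    # Sigmoid-approximation GELU on the left accumulator, gated by the right one.
    gelu_left = acc_left * tl.sigmoid(1.702 * acc_left)
    out = gelu_left * acc_right

    c_ptrs = c_ptr + offs_m[:, None] * stride_cm + offs_n[None, :] * stride_cn
    tl.store(c_ptrs, out.to(tl.float16))

def gelu_split_matmul(a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
    # a: (M, K) fp16, b: (K, 2 * N) fp16; dims assumed divisible by the block sizes.
    M, K = a.shape
    N = b.shape[1] // 2
    c = torch.empty((M, N), device=a.device, dtype=torch.float16)
    grid = (triton.cdiv(M, 64), triton.cdiv(N, 64))
    gelu_split_matmul_kernel[grid](
        a, b, c, M, N, K,
        a.stride(0), a.stride(1),
        b.stride(0), b.stride(1),
        c.stride(0), c.stride(1),
        BLOCK_M=64, BLOCK_N=64, BLOCK_K=32,
    )
    return c
```

Keeping both accumulators live across the whole K loop roughly doubles the accumulator register footprint compared with the tutorial kernel, which seems like the same kind of effect described in #984.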
I'm running the latest master commit, 3fa8a5a864c48a490625648387a86be3eb7c2c06, built from source. This is running on a GCP machine with a single A100, Ubuntu 22.04, Python 3.8.