Why are transformer UNets slower than CNNs despite being smaller? #9029
Replies: 2 comments
Hello @RochMollero, AFAIK the Pareto Principle says that roughly 80% of consequences come from 20% of causes. Along the same lines, the Lottery Ticket Hypothesis argues that in some situations as little as 20% of the parameters (or even fewer) can be enough to get comparable results. For quantization, see this blog post: Memory-efficient Diffusion Transformers with Quanto and Diffusers
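The pattern from that blog post looks roughly like this (a minimal sketch, assuming `diffusers` and `optimum-quanto` are installed; the PixArt-Sigma checkpoint is just the example used there, the same approach applies to other DiT-style pipelines):

```python
import torch
from diffusers import PixArtSigmaPipeline
from optimum.quanto import freeze, qfloat8, quantize

# Example checkpoint from the blog post; swap in your own pipeline.
pipe = PixArtSigmaPipeline.from_pretrained(
    "PixArt-alpha/PixArt-Sigma-XL-2-1024-MS",
    torch_dtype=torch.float16,
).to("cuda")

# Quantize only the diffusion transformer's weights to float8 and freeze them,
# shrinking the memory footprint of the largest component of the pipeline.
quantize(pipe.transformer, weights=qfloat8)
freeze(pipe.transformer)

image = pipe("an astronaut riding a horse", num_inference_steps=50).images[0]
```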
Hello
So I'm working on the AudioDiffusion pipeline and trying both the classical UNet and the conditional one based on transformers. Here are the values I measured:
CNN UNet:
- 3217 MiB GPU memory
- PARAM NUMBER: 113668609 params (113.668609M)
- 2 or 3 seconds on a runpod H100 PCIe for 50 diffusion steps

(Conditional) Transformer UNet:
- 4733 MiB GPU memory
- PARAM NUMBER: 69734529 params (69.734529M)
- 10 or 11 seconds on a runpod H100 PCIe for 50 diffusion steps
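(For context, a minimal sketch of how numbers like these are typically gathered in PyTorch; this is not the pipeline's own code, and `model` / `run_fn` are placeholders for either UNet and its 50-step sampling loop.)

```python
import torch

def report(model: torch.nn.Module, run_fn) -> None:
    # Parameter count, printed the same way as above (raw count and millions).
    n_params = sum(p.numel() for p in model.parameters())
    print(f"PARAM NUMBER: {n_params} params ({n_params / 1e6}M)")

    # Peak GPU memory and wall-clock time for one full sampling run.
    torch.cuda.reset_peak_memory_stats()
    torch.cuda.synchronize()
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    run_fn()  # e.g. 50 diffusion steps
    end.record()
    torch.cuda.synchronize()
    print(f"{torch.cuda.max_memory_allocated() / 2**20:.0f} MiB")
    print(f"{start.elapsed_time(end) / 1000:.1f} s for the sampling loop")
```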
So the transformer one is roughly 61% of the size of the CNN one (69.7M vs 113.7M params), but it takes 4 to 5 times longer to execute. And it uses more memory on the GPU!
Why? I guess it's due to the type of operations, but is there more to know? In particular, why do people like transformers so much if they're that slow? Do we only need 20% of the parameters to get the same qualitative results? Should I lower the parameter count of my transformer UNet?
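(To illustrate the "type of operations" guess: self-attention scales quadratically with sequence length regardless of how few weights it has, while a convolution scales roughly linearly with the number of positions, so a smaller transformer can still cost far more compute on long sequences. A rough, hypothetical timing sketch follows; the layer sizes are arbitrary and not taken from either pipeline above.)

```python
import time
import torch
import torch.nn as nn

device = "cuda" if torch.cuda.is_available() else "cpu"

# Two layers of the same width: a 1D convolution and a self-attention block.
dim = 512
conv = nn.Conv1d(dim, dim, kernel_size=3, padding=1).to(device)
attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True).to(device)

def timed(fn, *args):
    # Average wall-clock time over a few runs, after one warm-up call.
    fn(*args)
    if device == "cuda":
        torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(10):
        fn(*args)
    if device == "cuda":
        torch.cuda.synchronize()
    return (time.perf_counter() - start) / 10

with torch.no_grad():
    for seq_len in (256, 1024, 4096):
        x = torch.randn(1, seq_len, dim, device=device)
        t_conv = timed(conv, x.transpose(1, 2))       # conv: cost grows ~linearly in seq_len
        t_attn = timed(lambda q: attn(q, q, q), x)    # attention: cost grows ~quadratically
        print(f"len={seq_len:5d}  conv {t_conv*1e3:6.2f} ms  attn {t_attn*1e3:6.2f} ms")
```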