It looks like kernel fusing leads to a lot of repeated code. Is this easy to maintain? @bb515 What are the computational savings from kernel fusing?
If the savings are significant, is there a smarter way of doing this, for example using code generation and templating the common parts (if that makes sense)?