I noticed that the b2b GEMM example (example 13) has the following usage restrictions:
ThreadblockShape0::kN == problem_size_0.N
ThreadblockShape1::kN == problem_size_1.N
That, of course, means that the problem size N can typically only be 64, 128, or 256, which is a strong limitation.
How can this restriction be lifted?
Can it also be lifted so that N0 doesn't need to equal N1? The b2b example is useful for the typical MLPBlock with N1 = 4 * N0.
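
To make the question concrete, here is a minimal sketch of the two restrictions quoted above as I understand them. `ProblemSize`, `TileShape`, and `b2b_restrictions_hold` are hypothetical stand-ins written for this post, not CUTLASS types or functions, and the sizes are made-up examples.

```cpp
// Hypothetical stand-ins for illustration only; these are NOT CUTLASS types.
struct ProblemSize { int m, n, k; };  // GEMM extents M, N, K
struct TileShape   { int m, n, k; };  // threadblock tile extents kM, kN, kK

// True when both quoted restrictions hold:
//   ThreadblockShape0::kN == problem_size_0.N
//   ThreadblockShape1::kN == problem_size_1.N
constexpr bool b2b_restrictions_hold(ProblemSize p0, TileShape t0,
                                     ProblemSize p1, TileShape t1) {
    return t0.n == p0.n && t1.n == p1.n;
}

// Example sizes (assumed): GEMM0 is M x N0 x K0, GEMM1 is M x N1 x N0.
constexpr ProblemSize gemm0{4096, 64, 256};   // N0 = 64
constexpr ProblemSize gemm1{4096, 256, 64};   // N1 = 4 * N0 = 256
constexpr TileShape   tile0{64, 64, 32};      // kN == N0

// Accepted: the second tile spans all of N1.
constexpr TileShape tile1_full{64, 256, 32};
static_assert(b2b_restrictions_hold(gemm0, tile0, gemm1, tile1_full),
              "tile N matches problem N for both GEMMs");

// Rejected: a 128-wide tile does not cover N1 = 256.
constexpr TileShape tile1_half{64, 128, 32};
static_assert(!b2b_restrictions_hold(gemm0, tile0, gemm1, tile1_half),
              "tile N smaller than problem N violates the restriction");
```

With these checks, any N larger than the widest supported threadblock tile is rejected outright, which is the limitation I would like to lift.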