The CLBlast API converts Hermitian, symmetric, and triangular matrices to regular (general dense) matrices and then calls the regular GEMM. Doesn’t this incur two kinds of overhead that custom kernels could avoid: (1) extra computation and memory bandwidth, and (2) a temporary memory allocation plus a copy?
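To make the two costs concrete, here is a rough pure-Python sketch of the expand-then-GEMM approach as I understand it. It is not CLBlast code: `expand_lower_symmetric`, `gemm`, and `overhead_estimate` are hypothetical names for illustration, and the counts are back-of-the-envelope multiply-add estimates for an n×n structured matrix times an n×m matrix, not measurements.

```python
def expand_lower_symmetric(a):
    """Build a full dense copy of a symmetric matrix from its lower triangle.

    Only entries with j <= i are read; the upper triangle of `a` is ignored.
    This copy is the temporary buffer the question is asking about.
    """
    n = len(a)
    return [[a[i][j] if j <= i else a[j][i] for j in range(n)]
            for i in range(n)]


def gemm(a, b):
    """Plain dense matrix product C = A * B (stand-in for a general GEMM)."""
    inner, cols = len(b), len(b[0])
    return [[sum(row[p] * b[p][j] for p in range(inner)) for j in range(cols)]
            for row in a]


def overhead_estimate(n, m, itemsize=4):
    """Rough cost estimate for (n x n structured matrix) * (n x m matrix).

    Assumptions (not taken from CLBlast): the temporary holds n*n elements,
    a general GEMM does n*n*m multiply-adds, and a dedicated triangular
    kernel would skip the zero half and do n*(n+1)/2*m multiply-adds.
    """
    return {
        "temp_bytes": n * n * itemsize,
        "gemm_multiply_adds": n * n * m,
        "triangular_multiply_adds": n * (n + 1) // 2 * m,
    }


# Lower triangle holds the data; 99 marks upper entries that must be ignored.
packed = [[1, 99, 99],
          [2, 3, 99],
          [4, 5, 6]]
identity = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
full = expand_lower_symmetric(packed)
product = gemm(full, identity)
estimate = overhead_estimate(1024, 1024)
```

Under these assumptions, the triangular case is where the extra computation shows up (roughly 2x the multiply-adds of a dedicated kernel). For the symmetric and Hermitian cases the multiply-add count is the same as a general GEMM, so the overhead there would mainly be the temporary buffer and the copy into it.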