Is there any performance comparison data for fp8_blockscale_gemm compiled with NVCC versus NVRTC?
From DeepGEMM: "Note that there is some perf drop when using NVRTC due to a known bug of NVRTC which leads to extra instructions (but in the m=4096,n=2112,k=7168 case, NVRTC version was faster, which was a bit strange)"
Has this NVRTC bug been fixed?