On line 20, you compute the FLOP count used for the TFLOPS figure as follows:
conv_flops = MN * MN * CK * CK * HW * HW
Shouldn't it actually be 2x this, since each of those points is a mul and an add? I see the "*2" in tflops_sweep.py.
That would bring the computed speed from 4.8 TFLOPS to 9.6 TFLOPS, a lot closer to the 10.4 TFLOPS theoretical max.
(Though since it's 3x3, it might be running as a Winograd conv, which does fewer multiplies than the direct count assumes.)
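
To make the arithmetic concrete, here is a rough sketch of what I mean. The helper `tflops` and its `flops_per_mac` parameter are names I made up for illustration, not anything from the repo; the 4.8 and 10.4 figures are the ones quoted above.

```python
def tflops(mac_count, seconds, flops_per_mac=2):
    """Convert a multiply-accumulate count and a runtime into TFLOPS.

    flops_per_mac=2 counts each MAC as one multiply plus one add;
    flops_per_mac=1 reproduces a count that ignores the add.
    (Hypothetical helper, not part of the benchmark script.)
    """
    return mac_count * flops_per_mac / seconds / 1e12


# Numbers taken from the discussion above.
reported_tflops = 4.8   # computed without the factor of 2
peak_tflops = 10.4      # theoretical max quoted above

# Counting the add as well as the mul simply doubles the figure.
corrected_tflops = reported_tflops * 2

print(f"reported:  {reported_tflops:.1f} TFLOPS "
      f"({reported_tflops / peak_tflops:.0%} of peak)")
print(f"corrected: {corrected_tflops:.1f} TFLOPS "
      f"({corrected_tflops / peak_tflops:.0%} of peak)")
```

This is only the direct-convolution count; if the library picks a Winograd kernel, the real number of multiplies is lower, so the corrected figure would be an "effective" TFLOPS rather than the hardware's actual arithmetic rate.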