I've made a few code improvements to your CPU version that make it about 10 times faster.
The test results in the same environment are as follows:
64x64x64 ---> 5 ms
128x128x128 ---> 7 ms
256x256x256 ---> 19 ms
512x512x512 ---> 121 ms
1024x1024x1024 ---> 728 ms
2048x2048x2048 ---> 5576 ms