Description
Hello, I recently started learning CUDA and would like to ask how the bank conflicts for the two shared-memory arrays s_a and s_b are calculated in sgemm_t_8x8_sliced_k_f32x4_bcf_kernel in sgemm.cu.
s_b store load
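When storing to s_b, each thread stores 4 floats at a time through FLOAT4, which is 16 bytes. The 32 banks are 128 bytes wide in total, so the 32 threads of a warp write in 4 passes of 8 threads each. My understanding is that the banks accessed by the 8 threads of one pass do not conflict. Is that right?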
2. bank layout analysis: s_b[8][128] same as s_a[8][128]
3. bank conflicts analysis: s_b[8][128]
tid 0 -> k 0, n 0 -> all access bank 0-3 (layer_0)
tid 1 -> k 0, n 4 -> all access bank 4-7 (layer_0)
tid 2 -> k 0, n 8 -> all access bank 8-11 (layer_0)
tid 7 -> k 0, n 28 -> all access bank 28-31 (layer_0)
tid 8 -> k 0, n 32 -> all access bank 0-3 (layer_1)
... ... ... ...
tid 15 -> k 0, n 60 -> all access bank 28-31 (layer_1)
tid 16 -> k 0, n 64 -> all access bank 0-3 (layer_2)
... ... ... ...
tid 31 -> k 0, n 124 -> all access bank 28-31 (layer_3)
conclusion: we still have bank conflicts within warp,
0/8/16/24 -> bank 0-3, 1/9/17/25 -> bank 4-7, etc.
thus, we still need 4 memory issues at least per warp.
In each pass, the 8 threads access the banks as follows:
tid 0 -> k 0, n 0 -> all access bank 0-3 (layer_0)
tid 1 -> k 0, n 4 -> all access bank 4-7 (layer_0)
tid 2 -> k 0, n 8 -> all access bank 8-11 (layer_0)
tid 7 -> k 0, n 28 -> all access bank 28-31 (layer_0)
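Likewise, loading from s_b follows the same logic as the store: each thread reads only 4 floats at a time, and 8 threads are needed to read 128 bytes, so there should be no bank conflict either.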
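To make the reasoning above concrete, here is a minimal host-side sketch that recomputes the bank and layer for each thread and counts how many 32-bit words land in the same bank. It assumes the tid -> (k, n) mapping from the quoted comment (k = tid / 32, n = (tid % 32) * 4) and the model described in this question, where a FLOAT4 access by a warp is served in passes of 8 consecutive threads. All function names are hypothetical and only for illustration; this is not code from sgemm.cu.

```cpp
#include <algorithm>
#include <array>
#include <cassert>

constexpr int kNumBanks = 32;  // 32 banks, each 4 bytes wide (128 bytes per layer)

// Bank of a 32-bit word, given its float index in shared memory.
constexpr int bank_of(int word) { return word % kNumBanks; }

// "Layer" as used in the quoted comment: which 128-byte row of banks holds the word.
constexpr int layer_of(int word) { return word / kNumBanks; }

// First float index written by thread `tid` into s_b[8][128].
// Assumed mapping: k = tid / 32, n = (tid % 32) * 4.
constexpr int s_b_store_word(int tid) {
    return (tid / 32) * 128 + (tid % 32) * 4;
}

// Largest number of words that hit one bank when threads
// [first, first + count) each store one float4 (4 consecutive words).
// All addresses are distinct here, so this equals the number of
// serialized accesses that group of threads needs.
int max_words_per_bank(int first, int count) {
    assert(first >= 0 && count > 0);
    std::array<int, kNumBanks> hits{};  // zero-initialized hit counters
    int worst = 0;
    for (int tid = first; tid < first + count; ++tid) {
        const int base = s_b_store_word(tid);
        for (int i = 0; i < 4; ++i) {
            worst = std::max(worst, ++hits[bank_of(base + i)]);
        }
    }
    return worst;
}
```

Under these assumptions, `max_words_per_bank(0, 8)` covers one pass of 8 threads and `max_words_per_bank(0, 32)` covers the whole warp. Comparing the two shows how "no conflict within a pass" and "at least 4 memory issues per warp" can both hold for the same access pattern.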
s_a store load
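I roughly understand why storing to s_a produces a 2-way conflict.
When loading from s_a, each thread loads 4 floats at a time through FLOAT4, so 8 threads cover the full width of the 32 banks. The 16 threads t0-t15 all access banks 0-3. Does that cause an 8-way conflict, or is it served by broadcast so that there is no bank conflict?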
bank conflicts analysis, tx/ty 0-15, 0-7 bank 4*8=32 bytes
tid 0-15 access bank 0-3, tid 16-31 access bank 4-7, etc.
tid 0, tk 0 -> ty 0 -> [0][0+0-3],[0][64+0-3] -> bank 0-3(layer_0/2),
tid 0, tk 7 -> ty 0 -> [7][0+0-3],[7][64+0-3] -> bank 0-3(layer_28/30),
tid 15, tk 0 -> ty 0 -> [0][0+0-3],[0][64+0-3] -> bank 0-3(layer_0/2),
tid 15, tk 7 -> ty 0 -> [7][0+0-3],[7][64+0-3] -> bank 0-3(layer_28/30),
tid 16, tk 0 -> ty 1 -> [0][0+4-7],[0][64+4-7] -> bank 4-7(layer_0/2),
tid 16, tk 7 -> ty 1 -> [7][0+4-7],[7][64+4-7] -> bank 4-7(layer_28/30),
tid 31, tk 0 -> ty 1 -> [0][0+4-7],[0][64+4-7] -> bank 4-7(layer_0/2),
tid 31, tk 7 -> ty 1 -> [7][0+4-7],[7][64+4-7] -> bank 4-7(layer_28/30),
tid 255,tk 0 -> ty 15 -> [0][0+60-63],[0][64+60-63] -> bank 28-31(layer_1/3),
tid 255,tk 7 -> ty 15 -> [7][0+60-63],[7][64+60-63] -> bank 28-31(layer_29/31),
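One way to reason about the broadcast question is to count distinct 32-bit words per bank instead of threads per bank, since threads that read the same word can share one access. The sketch below does that on the host. It assumes ty = tid / 16 and the s_a[tk][ty * 4 + offset] indexing shown in the comment above, with offset 0 or 64. The function names are hypothetical, and the model is a simplification that should be checked against profiler counters on real hardware.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <set>

constexpr int kNumBanks = 32;  // 32 banks, each 4 bytes wide

// First float index read by thread `tid` from s_a[8][128] at step `tk`.
// Assumed mapping: ty = tid / 16, column = ty * 4 + col_offset.
int s_a_load_word(int tid, int tk, int col_offset) {
    assert(tid >= 0 && tk >= 0 && tk < 8);
    assert(col_offset == 0 || col_offset == 64);
    const int ty = tid / 16;
    return tk * 128 + ty * 4 + col_offset;
}

// Conflict degree for threads [first, first + count): the largest number of
// DISTINCT words that map to one bank. Threads reading the same word are
// counted once, which models the broadcast case.
int conflict_degree(int first, int count, int tk, int col_offset) {
    std::array<std::set<int>, kNumBanks> words;  // distinct words seen per bank
    for (int tid = first; tid < first + count; ++tid) {
        const int base = s_a_load_word(tid, tk, col_offset);
        for (int i = 0; i < 4; ++i) {
            words[(base + i) % kNumBanks].insert(base + i);
        }
    }
    int worst = 0;
    for (const auto& w : words) {
        worst = std::max(worst, static_cast<int>(w.size()));
    }
    return worst;
}
```

In this model, `conflict_degree(0, 32, 0, 0)` describes warp 0 at tk = 0: tid 0-15 all read the same four words and tid 16-31 read the next four, so no bank sees more than one distinct word. A result of 1 means the model predicts broadcast rather than serialization for that load.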