How to overlap the shared-memory-to-register loads with the computation? #14

@YijiaZhao

I have another question about MMult_cuda_12.cu.
Honestly, I don't understand how the shared-memory-to-register loads overlap with the computation. Is it the inline asm (PTX) that makes them run in parallel? The instructions are issued sequentially, so how can these two parts of the code hide each other's latency?
Part 1: loading from shared memory into the register panel
    lds128(panelA[pp][0], panelA[pp][1], panelA[pp][2], panelA[pp][3],
           aptr_base + ((subk + 1) % 8) * SMEM_LDA * sizeof(float));
    lds128(panelA[pp][4], panelA[pp][5], panelA[pp][6], panelA[pp][7],
           aptr_base + (((subk + 1) % 8) * SMEM_LDA + 64) * sizeof(float));
    lds128(panelB[pp][0], panelB[pp][1], panelB[pp][2], panelB[pp][3],
           bptr_base + ((subk + 1) % 8) * SMEM_LDB * sizeof(float));
    lds128(panelB[pp][4], panelB[pp][5], panelB[pp][6], panelB[pp][7],
           bptr_base + (((subk + 1) % 8) * SMEM_LDB + 64) * sizeof(float));

Part 2: computing the result from the panel data
    #pragma unroll
    for (int i = 0; i < 8; ++i) {
    #pragma unroll
        for (int j = 0; j < 8; ++j) {
            sum[i][j] += panelA[subk % 2][i] * panelB[subk % 2][j];
        }
    }
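
To make the question concrete, here is a minimal host-side sketch of how I understand the two parts fit into one loop. It is an illustration under assumptions, not the code from MMult_cuda_12.cu: `lds128_emul` is a hypothetical stand-in for the `lds128` PTX wrapper, the panels are shrunk to 4 elements, the `% 8` wrap-around into the next tile is left out, and I am assuming `pp` is `(subk + 1) % 2`, which the excerpt above does not show. The point it shows is that part 1 writes only buffer `nxt` while part 2 reads only buffer `cur`, so there is no data dependency between them inside one iteration. As far as I understand, that independence (rather than the asm itself) is what lets the GPU issue the multiply-adds while the shared-memory loads are still in flight, even though the instruction stream is sequential.

```cpp
#include <cstddef>

constexpr int K = 8;      // sub-steps per shared-memory tile (as in the % 8 above)
constexpr int PANEL = 4;  // panel width, shrunk from 8 for brevity (assumption)

// Hypothetical stand-in for lds128: copies 4 consecutive floats
// from "shared memory" into 4 "registers".
inline void lds128_emul(float &r0, float &r1, float &r2, float &r3,
                        const float *src) {
    r0 = src[0]; r1 = src[1]; r2 = src[2]; r3 = src[3];
}

// smemA and smemB each hold K rows of PANEL floats.
// Accumulates the K outer products A_row(k) * B_row(k)^T into sum.
void tile_outer_products(const float *smemA, const float *smemB,
                         float sum[PANEL][PANEL]) {
    float panelA[2][PANEL];
    float panelB[2][PANEL];

    // Prologue: fill buffer 0 with the data for subk = 0.
    lds128_emul(panelA[0][0], panelA[0][1], panelA[0][2], panelA[0][3], smemA);
    lds128_emul(panelB[0][0], panelB[0][1], panelB[0][2], panelB[0][3], smemB);

    for (int subk = 0; subk < K; ++subk) {
        const int cur = subk % 2;        // buffer consumed in this iteration
        const int nxt = (subk + 1) % 2;  // buffer filled for the next iteration

        // Part 1: prefetch the next row into the *other* buffer.
        // Nothing in part 2 of this iteration reads panelX[nxt].
        if (subk + 1 < K) {
            const std::size_t off = static_cast<std::size_t>(subk + 1) * PANEL;
            lds128_emul(panelA[nxt][0], panelA[nxt][1], panelA[nxt][2],
                        panelA[nxt][3], smemA + off);
            lds128_emul(panelB[nxt][0], panelB[nxt][1], panelB[nxt][2],
                        panelB[nxt][3], smemB + off);
        }

        // Part 2: multiply-accumulate using only the current buffer.
        for (int i = 0; i < PANEL; ++i) {
            for (int j = 0; j < PANEL; ++j) {
                sum[i][j] += panelA[cur][i] * panelB[cur][j];
            }
        }
    }
}
```

On the CPU this runs strictly in order, so it only demonstrates the dependency structure and that the double-buffered loop computes the same sums as a plain loop over all K rows. If part 2 read `panelA[nxt]` instead, every multiply-add would have to wait for the loads issued just before it, and there would be nothing left to overlap.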
