How to overlap the shared-memory-to-register loads with the computation? #14

I have another question about MMult_cuda_12.cu.
Honestly, I don't understand how the shared-memory-to-register loads overlap with the computation. Is it the inline asm (PTX) that makes them run in parallel? The instructions are issued sequentially, so how can these two parts of the code hide each other's latency?

**Part 1: loading from shared memory into the panel registers**

      lds128(panelA[pp][0], panelA[pp][1], panelA[pp][2], panelA[pp][3],
             aptr_base + ((subk + 1) % 8) * SMEM_LDA * sizeof(float));
      lds128(panelA[pp][4], panelA[pp][5], panelA[pp][6], panelA[pp][7],
             aptr_base + (((subk + 1) % 8) * SMEM_LDA + 64) * sizeof(float));
      lds128(panelB[pp][0], panelB[pp][1], panelB[pp][2], panelB[pp][3],
             bptr_base + ((subk + 1) % 8) * SMEM_LDB * sizeof(float));
      lds128(panelB[pp][4], panelB[pp][5], panelB[pp][6], panelB[pp][7],
             bptr_base + (((subk + 1) % 8) * SMEM_LDB + 64) * sizeof(float));

**Part 2: computing the result from the panel data**

      #pragma unroll
      for (int i = 0; i < 8; ++i) {
      #pragma unroll
        for (int j = 0; j < 8; ++j) {
          sum[i][j] += panelA[subk % 2][i] * panelB[subk % 2][j];
        }
      }
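
One way to reason about the overlap is through data dependencies rather than instruction order. The sketch below is plain host-side C++ (so it compiles without a GPU) that mirrors the ping-pong structure of the two parts above. It assumes `pp` is `(subk + 1) % 2`, which the excerpt does not show, and `load_panel` and `tile_product` are hypothetical stand-ins for the `lds128` calls and the kernel's inner loop. The loads write `panelA[pp]` / `panelB[pp]` while the multiply-adds read `panelA[subk % 2]` / `panelB[subk % 2]`, so nothing in Part 2 depends on a load issued in the same iteration. As far as I understand, a GPU warp keeps issuing later independent instructions after a load is issued and only waits when an instruction needs the loaded register, which is presumably how the shared-memory latency hides behind the 64 multiply-adds; the inline PTX by itself would not be what creates the overlap.

```cpp
#include <array>
#include <cstddef>
#include <vector>

constexpr int TILE = 8;  // each panel holds 8 values of A or 8 values of B

using Panel = std::array<float, TILE>;
using Acc = std::array<std::array<float, TILE>, TILE>;

// Hypothetical stand-in for the lds128 calls: copy k-slice `k` into a panel.
inline void load_panel(Panel& dst, const std::vector<float>& src, std::size_t k) {
  for (int i = 0; i < TILE; ++i) dst[i] = src[k * TILE + i];
}

// a and b each hold K slices of TILE floats (slice k starts at k * TILE).
// Returns sum[i][j] = sum over k of a[k][i] * b[k][j].
Acc tile_product(const std::vector<float>& a, const std::vector<float>& b,
                 std::size_t K) {
  Acc sum{};  // zero-initialized accumulators
  if (K == 0) return sum;
  Panel panelA[2], panelB[2];

  // Prologue: fill buffer 0 before the loop starts.
  load_panel(panelA[0], a, 0);
  load_panel(panelB[0], b, 0);

  for (std::size_t subk = 0; subk < K; ++subk) {
    const std::size_t cur = subk % 2;
    const std::size_t pp = (subk + 1) % 2;  // assumption: the "other" buffer

    // Part 1: prefetch slice subk + 1 into the buffer that is not being read.
    if (subk + 1 < K) {
      load_panel(panelA[pp], a, subk + 1);
      load_panel(panelB[pp], b, subk + 1);
    }

    // Part 2: compute on slice subk. Nothing here reads panelA[pp] or
    // panelB[pp], so these multiply-adds do not depend on Part 1's loads.
    for (int i = 0; i < TILE; ++i)
      for (int j = 0; j < TILE; ++j)
        sum[i][j] += panelA[cur][i] * panelB[cur][j];
  }
  return sum;
}
```

If Part 2 read `panelA[pp]` instead of `panelA[subk % 2]`, every multiply-add would have to wait for the load just issued, and the two parts could no longer overlap.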
