Conversation

@chengjunlu
Contributor

Loading one B operand of DPAS per load instruction yields a better instruction scheduling result for the flash attention kernel.

Contributor

@whitneywhtsang whitneywhtsang left a comment


I agree with having an autotune option until IGC improves its instruction scheduling to handle large block IO.

@etiotto etiotto merged commit 16ecc44 into main Nov 28, 2024
6 checks passed
@etiotto etiotto deleted the chengjun/add_2d_load_option branch November 28, 2024 14:59
@alexbaden
Contributor

@chengjunlu can you please provide an example of the before and after kernels, performance of each, and some explanation for why the instruction scheduling is better?

@chengjunlu
Contributor Author

chengjunlu commented Nov 29, 2024

> @chengjunlu can you please provide an example of the before and after kernels, performance of each, and some explanation for why the instruction scheduling is better?

The short conclusion for optimizing the FA kernel is:
AI kernels target high computation throughput, so the most important part is hiding memory latency. There are three ways to hide memory access latency:

  1. Memory accessing parallelism.
  2. Instruction-level parallelism.
  3. Thread-level parallelism.

We have used loop pipelining and prefetching to exploit memory-access parallelism and hide memory latency on PVC.
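A toy cycle-count model can illustrate why pipelining with prefetch helps. All the numbers below (latencies, tile count) are hypothetical and chosen only for illustration; this is not a model of the actual PVC pipeline:

```python
# Toy model: naive loop (load, stall, compute) vs. a software-pipelined loop
# that prefetches the next tile while computing the current one.
# All cycle counts are made up for illustration.

LOAD_LATENCY = 10   # hypothetical cycles until a load's result is ready
COMPUTE_COST = 10   # hypothetical cycles of DPAS work per tile
N_TILES = 8

def naive_cycles():
    # Each iteration issues its load, stalls for the full latency, then computes.
    return N_TILES * (LOAD_LATENCY + COMPUTE_COST)

def pipelined_cycles():
    # Prefetch tile 0 up front; afterwards the load of tile i+1 overlaps with
    # the compute of tile i, so only the first load latency is exposed.
    return LOAD_LATENCY + N_TILES * max(COMPUTE_COST, LOAD_LATENCY)

print(naive_cycles())      # 160
print(pipelined_cycles())  # 90
```

With these assumed numbers the pipelined schedule exposes only one load latency instead of eight, which is the effect the loop pipelining and prefetching aim for.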

The heuristic for improving instruction-level parallelism to hide memory access latency is to insert more independent instructions between a load and its user.

```
%0 = load %ptr
// Independent instructions placed here execute in parallel with the memory access.
%n = dpas %0
```

This kind of scheduling enlarges the live range of the value returned by the memory access, and the larger interference with other values may cause register spilling in codegen.
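The live-range effect can be sketched with a toy register-pressure model. The instruction counts and issue timing below are assumptions for illustration, not actual hardware behavior:

```python
# Toy model: how many load results are simultaneously live under a schedule?
# `dist` is the number of independent instructions placed between each load
# and its DPAS user; larger dist hides more latency but lengthens live ranges.

def max_live_loads(n_loads, dist, issue_gap=1):
    # Load i issues at cycle i*issue_gap; its result stays live until its
    # user executes `dist` cycles later.
    events = [(i * issue_gap, i * issue_gap + dist) for i in range(n_loads)]
    peak = 0
    for t in range(n_loads * issue_gap + dist + 1):
        live = sum(1 for s, e in events if s <= t < e)
        peak = max(peak, live)
    return peak

print(max_live_loads(8, dist=2))   # 2: short live ranges, low pressure
print(max_live_loads(8, dist=10))  # 8: long live ranges, spill risk
```

The peak count is a stand-in for register pressure: pushing the user further from the load raises it, which is the spilling trade-off described above.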

Because we are using large-GRF mode, half of the EU thread parallelism is wasted (only 4 of 8 physical threads are valid). So instruction-level parallelism becomes more important than usual: we need more of it under large-GRF mode.
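A back-of-the-envelope calculation shows why losing hardware threads shifts the burden to ILP. The latency and switch-cost numbers here are hypothetical, and the model (each extra ready thread covers a fixed number of cycles) is a simplification, not the real EU scheduler:

```python
# Toy model: with fewer hardware threads per EU, less load latency is hidden
# by thread switching, so more must be covered by independent instructions
# inside each thread. All numbers are hypothetical.

THREADS_NORMAL = 8   # hardware threads per EU in normal-GRF mode
THREADS_LARGE = 4    # only half remain valid in large-GRF mode
LOAD_LATENCY = 40    # hypothetical cycles to hide

def ilp_cycles_needed(latency, threads, cycles_per_switch=4):
    # Assume each of the other (threads - 1) threads can cover a fixed slice
    # of the latency; the remainder falls to in-thread ILP.
    hidden_by_tlp = (threads - 1) * cycles_per_switch
    return max(0, latency - hidden_by_tlp)

print(ilp_cycles_needed(LOAD_LATENCY, THREADS_NORMAL))  # 12
print(ilp_cycles_needed(LOAD_LATENCY, THREADS_LARGE))   # 28
```

Under these assumptions, halving the thread count more than doubles the latency that in-thread scheduling must cover, which matches the point that ILP matters more in large-GRF mode.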

So better instruction scheduling balances zero register spilling against instruction-level parallelism.

Right now we have only focused on zero register spilling; the instruction-level parallelism may not be the best. I will share offline a table of the ideal number of instructions to insert for instruction-level parallelism.

There are some other methods that could be considered to improve the register spilling issue:

  1. Use SLM as space to hold long-lived tensor values, reducing register pressure.
  2. Carefully manage register allocation and enable A/B buffering in registers for loads.
