Conversation

@chengjunlu
Contributor

Loading one B operand of DPAS per load instruction yields a better instruction scheduling result for the flash attention kernel.

Contributor

@whitneywhtsang whitneywhtsang left a comment


I agree with having an autotune option until IGC improves its instruction scheduling to handle large block IO.

@etiotto etiotto merged commit 16ecc44 into main Nov 28, 2024
6 checks passed
@etiotto etiotto deleted the chengjun/add_2d_load_option branch November 28, 2024 14:59
@alexbaden
Contributor

@chengjunlu can you please provide an example of the before and after kernels, performance of each, and some explanation for why the instruction scheduling is better?

@chengjunlu
Contributor Author

chengjunlu commented Nov 29, 2024

> @chengjunlu can you please provide an example of the before and after kernels, performance of each, and some explanation for why the instruction scheduling is better?

The short conclusion for optimizing the FA kernel is:
AI kernels target high computation throughput, so the most important part is hiding memory latency. There are three ways to hide memory access latency:

  1. Memory accessing parallelism.
  2. Instruction-level parallelism.
  3. Thread-level parallelism.

We have used loop pipelining and prefetching to exploit memory-access parallelism and hide memory latency on PVC.
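A toy cycle-count model can illustrate why pipelining with prefetch helps. All the numbers below (latencies, tile count) are hypothetical and chosen only for illustration; this is not a model of the actual PVC pipeline:

```python
# Toy model: naive loop (load, stall, compute) vs. a software-pipelined loop
# that prefetches the next tile while computing the current one.
# All cycle counts are made up for illustration.

LOAD_LATENCY = 10   # hypothetical cycles until a load's result is ready
COMPUTE_COST = 10   # hypothetical cycles of DPAS work per tile
N_TILES = 8

def naive_cycles():
    # Each iteration issues its load, stalls for the full latency, then computes.
    return N_TILES * (LOAD_LATENCY + COMPUTE_COST)

def pipelined_cycles():
    # Prefetch tile 0 up front; afterwards the load of tile i+1 overlaps with
    # the compute of tile i, so only the first load latency is exposed.
    return LOAD_LATENCY + N_TILES * max(COMPUTE_COST, LOAD_LATENCY)

print(naive_cycles())      # 160
print(pipelined_cycles())  # 90
```

With these assumed numbers the pipelined schedule exposes only one load latency instead of eight, which is the effect the loop pipelining and prefetching aim for.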

The heuristic for improving instruction-level parallelism to hide memory access latency is to insert more independent instructions between a load and its user.

```
%0 = load %ptr
// Independent instructions placed here execute in parallel with the memory access.
%n = dpas %0
```

This kind of scheduling enlarges the live range of the value returned by the memory access, and the larger interference with other values may cause register spilling in codegen.
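The live-range effect can be sketched with a toy register-pressure model. The instruction counts and issue timing below are assumptions for illustration, not actual hardware behavior:

```python
# Toy model: how many load results are simultaneously live under a schedule?
# `dist` is the number of independent instructions placed between each load
# and its DPAS user; larger dist hides more latency but lengthens live ranges.

def max_live_loads(n_loads, dist, issue_gap=1):
    # Load i issues at cycle i*issue_gap; its result stays live until its
    # user executes `dist` cycles later.
    events = [(i * issue_gap, i * issue_gap + dist) for i in range(n_loads)]
    peak = 0
    for t in range(n_loads * issue_gap + dist + 1):
        live = sum(1 for s, e in events if s <= t < e)
        peak = max(peak, live)
    return peak

print(max_live_loads(8, dist=2))   # 2: short live ranges, low pressure
print(max_live_loads(8, dist=10))  # 8: long live ranges, spill risk
```

The peak count is a stand-in for register pressure: pushing the user further from the load raises it, which is the spilling trade-off described above.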

Because we are using large-GRF mode, half of the EU thread parallelism is wasted (only 4 of 8 physical threads are valid). So instruction-level parallelism becomes more important than usual: we need more of it under large-GRF mode.
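A back-of-the-envelope calculation shows why losing hardware threads shifts the burden to ILP. The latency and switch-cost numbers here are hypothetical, and the model (each extra ready thread covers a fixed number of cycles) is a simplification, not the real EU scheduler:

```python
# Toy model: with fewer hardware threads per EU, less load latency is hidden
# by thread switching, so more must be covered by independent instructions
# inside each thread. All numbers are hypothetical.

THREADS_NORMAL = 8   # hardware threads per EU in normal-GRF mode
THREADS_LARGE = 4    # only half remain valid in large-GRF mode
LOAD_LATENCY = 40    # hypothetical cycles to hide

def ilp_cycles_needed(latency, threads, cycles_per_switch=4):
    # Assume each of the other (threads - 1) threads can cover a fixed slice
    # of the latency; the remainder falls to in-thread ILP.
    hidden_by_tlp = (threads - 1) * cycles_per_switch
    return max(0, latency - hidden_by_tlp)

print(ilp_cycles_needed(LOAD_LATENCY, THREADS_NORMAL))  # 12
print(ilp_cycles_needed(LOAD_LATENCY, THREADS_LARGE))   # 28
```

Under these assumptions, halving the thread count more than doubles the latency that in-thread scheduling must cover, which matches the point that ILP matters more in large-GRF mode.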

So better instruction scheduling balances zero register spilling against instruction-level parallelism.

Right now we have only focused on zero register spilling; the instruction-level parallelism may not be the best. I will share offline a table of the ideal number of instructions to insert for instruction-level parallelism.

There are some other methods that could be considered to improve the register spilling issue:

  1. Use SLM as space to hold long-lived tensor values, reducing register pressure.
  2. Carefully manage register allocation and enable A/B buffering in registers for loads.
