Load a single DPAS B matrix instead of two per 2D block IO instruction from transposed memory #2628
Conversation
// CHECK: %[[VAL_173:.*]] = llvm.load [[DEST]] : !llvm.ptr -> vector<16xi32>
// CHECK: %[[VAL_174:.*]] = llvm.shufflevector %[[VAL_173]], %[[VAL_173]] [0, 2, 4, 6, 8, 10, 12, 14] : vector<16xi32>
// CHECK: %[[VAL_176:.*]] = llvm.shufflevector %[[VAL_173]], %[[VAL_173]] [1, 3, 5, 7, 9, 11, 13, 15] : vector<16xi32>
// CHECK-COUNT-8: llvm.call spir_funccc @_Z51intel_sub_group_2d_block_read_transpose_32b_16r8x1cPU3AS1viiiDv2_iPj({{.*}}, [[DEST:%.*]]) {{.*}} : (!llvm.ptr<1>, i32, i32, i32, vector<2xi32>, !llvm.ptr) -> ()
Should we check that there is no llvm.shufflevector?
I think we should
Actually the generated code does still contain a shufflevector instruction, for example:
llvm.call spir_funccc @_Z51intel_sub_group_2d_block_read_transpose_32b_16r8x1cPU3AS1viiiDv2_iPj(%30, %139, %33, %138, %146, %141) {arg_attrs = [{llvm.nonnull, llvm.readonly}, {}, {}, {}, {}, {llvm.nonnull, llvm.writeonly}], function_type = !llvm.func<void (ptr<1>, i32, i32, i32, vector<2xi32>, ptr)>, linkage = #llvm.linkage<external>, no_unwind, sym_name = "_Z51intel_sub_group_2d_block_read_transpose_32b_16r8x1cPU3AS1viiiDv2_iPj", visibility_ = 0 : i64, will_return} : (!llvm.ptr<1>, i32, i32, i32, vector<2xi32>, !llvm.ptr) -> ()
%147 = llvm.load %141 : !llvm.ptr -> vector<8xi32>
%148 = llvm.shufflevector %147, %147 [0, 1, 2, 3, 4, 5, 6, 7] : vector<8xi32>
%149 = llvm.bitcast %148 : vector<8xi32> to vector<16xf16>
Note: the remaining shufflevector instruction is a noop (yields the same input vector) so it could be removed.
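One way the lit test could guard against the interleaving shuffles coming back, while still tolerating the remaining identity shuffle (a sketch only, not the actual test contents; the exact placement of the directives is an assumption):
// CHECK-COUNT-8: llvm.call spir_funccc @_Z51intel_sub_group_2d_block_read_transpose_32b_16r8x1cPU3AS1viiiDv2_iPj(
// CHECK-NOT: llvm.shufflevector {{.*}} [0, 2, 4, 6, 8, 10, 12, 14]
// CHECK-NOT: llvm.shufflevector {{.*}} [1, 3, 5, 7, 9, 11, 13, 15]
A plain CHECK-NOT on llvm.shufflevector would not pass as long as the no-op shuffle is still emitted, so the CHECK-NOT patterns above target only the de-interleaving masks.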
I have updated the lit test.
I am seeing a ~5% regression.
Before:
After:
// now.
numOperandsPer2DLoadM =
    (threadsPerWarp <= tileHeight) ? repCluster[rank - 1] : 1;
// Only load 1 operand per inst on row.
Should add the reason (an explanation) for doing so.
The reason as I understand it is that by loading smaller 2D blocks we do not have to shuffle the results of the loads before "feeding" them to the DPAS instruction. @chengjunlu can you elaborate please?
The main reason is to reduce register spilling.
A large 2D load requires a large contiguous register space.
Unlike the non-transposed matrix, where a sub-region of the loaded values can be fed to the DPAS engine directly, the transposed matrix operands have to be shuffled first.
The shuffle requires additional temporary space, which makes register allocation more complicated.
For now the backend cannot generate a zero-spill binary if we use the large 2D load for transposed matrix operands (even with the latest internal IGC quick build).
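To illustrate with the shapes from the snippet quoted earlier (an illustrative fragment only; %dest and %dest2 are placeholder names, not taken from the actual lowering): the large transposed read returns a vector carrying two interleaved B operands that must be de-interleaved before DPAS, while each smaller read returns exactly one operand that can be bitcast and used directly.
// Large transposed read: one destination vector holds two interleaved DPAS B
// operands, so two shufflevectors (and extra temporary registers) are needed.
%big = llvm.load %dest : !llvm.ptr -> vector<16xi32>
%b0 = llvm.shufflevector %big, %big [0, 2, 4, 6, 8, 10, 12, 14] : vector<16xi32>
%b1 = llvm.shufflevector %big, %big [1, 3, 5, 7, 9, 11, 13, 15] : vector<16xi32>
// Small transposed read: the destination vector already is a single B operand,
// so it is bitcast and fed to the DPAS instruction with no shuffle stage.
%small = llvm.load %dest2 : !llvm.ptr -> vector<8xi32>
%b = llvm.bitcast %small : vector<8xi32> to vector<16xf16>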
// CHECK: %[[VAL_173:.*]] = llvm.load [[DEST]] : !llvm.ptr -> vector<16xi32>
// CHECK: %[[VAL_174:.*]] = llvm.shufflevector %[[VAL_173]], %[[VAL_173]] [0, 2, 4, 6, 8, 10, 12, 14] : vector<16xi32>
// CHECK: %[[VAL_176:.*]] = llvm.shufflevector %[[VAL_173]], %[[VAL_173]] [1, 3, 5, 7, 9, 11, 13, 15] : vector<16xi32>
// CHECK-COUNT-8: llvm.call spir_funccc @_Z51intel_sub_group_2d_block_read_transpose_32b_16r8x1cPU3AS1viiiDv2_iPj({{.*}}, [[DEST:%.*]]) {{.*}} : (!llvm.ptr<1>, i32, i32, i32, vector<2xi32>, !llvm.ptr) -> ()
I think we should
I did a micro benchmark run, results: http://benchmarks.glados.intel.com/d/1pXX4hUSz/microbenchmarks?orgId=1&var-tag=ci%7Creorder_dpas_gen&var-bench=All&var-device=Intel%28R%29%20Data%20Center%20GPU%20Max%201550&var-compiler=triton&var-backend=All&var-baseline_backend=triton-ci-XPU%201550&var-target_backend=triton-ci-XPU%201550. I was expecting to see an improvement for GEMM with operand B transposed, but performance seems about the same as CI (without this PR). We should grab the latest IGC and test against that.
The degradation appears to be due to the number of loads doubling for the B matrix. Previously we made four 2D block read transpose loads; now we make eight: because the loads are smaller (half the number of rows), we need to do twice as many. Does something change with a newer IGC version that mitigates this?
// CHECK: %[[VAL_173:.*]] = llvm.load [[DEST]] : !llvm.ptr -> vector<16xi32>
// CHECK: %[[VAL_174:.*]] = llvm.shufflevector %[[VAL_173]], %[[VAL_173]] [0, 2, 4, 6, 8, 10, 12, 14] : vector<16xi32>
// CHECK: %[[VAL_176:.*]] = llvm.shufflevector %[[VAL_173]], %[[VAL_173]] [1, 3, 5, 7, 9, 11, 13, 15] : vector<16xi32>
// CHECK-COUNT-8: llvm.call spir_funccc @_Z51intel_sub_group_2d_block_read_transpose_32b_16r8x1cPU3AS1viiiDv2_iPj({{.*}}, [[DEST:%.*]]) {{.*}} : (!llvm.ptr<1>, i32, i32, i32, vector<2xi32>, !llvm.ptr) -> ()
Note: these are the changes that matter. The rest of the changes are white space differences.
// now.
numOperandsPer2DLoadM =
    (threadsPerWarp <= tileHeight) ? repCluster[rank - 1] : 1;
// Only load 1 operand per inst on row.
The reason as I understand it is that by loading smaller 2D blocks we do not have to shuffle the results of the loads before "feeding" them to the DPAS instruction. @chengjunlu can you elaborate please?
Yes, the rationale for this change is discussed here: #2628 (comment)
Force-pushed 013c838 to ae0c986
… transposed DPAS operand B. It can get better performance for flash attention because it works around the register spilling performance issue.
Remove the env. variable. This feature should be enabled by default once it has been proven to give good performance for attention and for GEMM with B transpose on the latest dev IGC.
@chengjunlu please provide performance results for this PR:
1. run with the dev IGC driver (please annotate the exact build) without your code (baseline)
2. rerun on the same machine with the same IGC dev driver as in (1) and with your feature (new)
Do so for attention and for GEMM with B transpose. Then record the results in this PR or in the issue associated with this PR please.
| "TRITON_INTEL_ADVANCED_PATH", | ||
| "TRITON_INTEL_AGGRESSIVE_DPAS_REUSE", | ||
| "TRITON_INTEL_DO_NOT_SINK_INSTR_ACROSS_RGN", | ||
| "TRITON_INTEL_DISABLE_LARGE_BLOCK_SIZE_IO_FOR_TRANS_DOT_B", |
Remove
I used two IGC driver versions to run the benchmark because:
For the GEMM benchmark, the large block IO size is 1.09x better than the small block IO size in geomean. The main reason the small block IO size gets better performance in flash attention is that there is no register spilling, compared to the large block IO size. Here are the detailed performance results for the small/large block IO size comparison:
Flash Attention:
So in summary, the reason for not using large 2D reads is to make it easier for IGC to allocate registers without spills. But we do not have a good heuristic for register pressure in Triton. We need to discuss offline how to proceed on this.
Discussed offline; decided to merge this PR with the env. variable so that we can more easily run performance experiments.

Loading a single DPAS B matrix per 2D block IO instruction from the column-major matrix in memory gets better performance for flash attention.
Unlike with the row-major matrix, values returned by a single 2D transposed block IO that contain more than one DPAS B operand cannot be used as DPAS operands directly.
We have to shuffle the values in registers before passing them to the DPAS instruction, and this is not optimized by IGC for now.