[Warp Specialization] Inefficient codegen for loop carried dependencies #7628

When a loop-carried argument and its next value are both used in another partition, the generated code is inefficient:

```mlir
scf.for ... (%iv = %init) {
  %next = compute_next %iv {partition = 0}
  use_value(%iv, %next) {partition = 1}
  scf.yield %next
}
```

Previously, with the multiplicity optimization, `RewritePartitionDependencies` would generate:

```mlir
%iv_buf = local_alloc <2xshapexdtype>
local_store %init, %iv_buf[0]
scf.for ... (%iv = %init) {
  %next = compute_next %iv {partition = 0}
  wait_barrier {partition = 0}
  local_store %next, %iv_buf[(%i+1)%2] {partition = 0}
  arrive_barrier {partition = 0}

  wait_barrier {partition = 1}
  %iv_0 = local_load %iv_buf[%i%2] {partition = 1}
  arrive_barrier {partition = 1}
  wait_barrier {partition = 1}
  %iv_next = local_load %iv_buf[(%i+1)%2] {partition = 1}
  arrive_barrier {partition = 1}
  use_value(%iv_0, %iv_next) {partition = 1}

  scf.yield %next
}
```

Note that this is already not optimal because we could reverse the order of the local_loads in partition 1 and make the second `wait_barrier` a no-op (we know the previous value is ready if the current value is ready).
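As a rough sketch of that reordering (same pseudocode notation as above, not actual pass output), partition 1 could load the newer slot first. The barrier placement here is an assumption for illustration: the only claim taken from the note above is that the second wait becomes redundant.

```mlir
scf.for ... (%iv = %init) {
  // partition 0 is unchanged: compute %next and store it to slot (%i+1)%2

  // partition 1: wait for the newer value first
  wait_barrier {partition = 1}
  %iv_next = local_load %iv_buf[(%i+1)%2] {partition = 1}
  arrive_barrier {partition = 1}
  // slot %i%2 was written before slot (%i+1)%2, so it is already ready;
  // a wait_barrier here would return immediately and can be dropped
  %iv_0 = local_load %iv_buf[%i%2] {partition = 1}
  arrive_barrier {partition = 1}
  use_value(%iv_0, %iv_next) {partition = 1}

  scf.yield %next
}
```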

After removing the multiplicity optimization, we generate two arefs, one for each value that crosses from partition 0 to 1, without recognizing that it is a loop-carried value:

```mlir
%iv_aref = aref_create
%next_aref = aref_create
scf.for ... (%iv = %init) {
  aref_put %iv, %iv_aref {partition = 0}

  %next = compute_next %iv {partition = 0}
  aref_put %next, %next_aref {partition = 0}
  
  %iv_0 = aref_get %iv_aref
  %next_0 = aref_get %next_aref
  use_value(%iv_0, %next_0)

  scf.yield %next
}
```

This results in 2 aref writes and 2 aref reads per loop iteration.

Ideally, we would optimize this down to a single aref:

```mlir
%iv_aref = aref_create
aref_put %init, %iv_aref {partition = 0}

%init_0 = aref_get %iv_aref {partition = 1}
scf.for ... (%iv = %init, %iv_0 = %init_0) {
  %next = compute_next %iv {partition = 0}
  aref_put %next, %iv_aref {partition = 0}

  %next_0 = aref_get %iv_aref {partition = 1}
  use_value(%iv_0, %next_0)

  scf.yield %next, %next_0
}
```

This results in 1 aref write and 1 aref read in the inner loop. We can also choose to multibuffer `%iv_aref` with 2 buffers to improve overlap between the partitions. This would require cloning the whole loop-carried value across both partitions.
