Insights into DMA programming generated from Object FIFOs #2748

AndraBisca · 2025-12-02T17:12:25Z


#2678 raised some interesting questions about the low-level DMA programming generated from objectfifo split and join patterns, specifically about their limitations when applying data layout transformations on the fly. This discussion aims to give more insight into the decisions that drive the lowering today and into how the lowering would have to change to support more complex data layout transformations.

Context and Motivation

The objectfifo data movement primitive allocates its own memory buffers and locks at the different sources and destinations specified by the user or chosen by a placement algorithm. These resources are then used in the low-level DMA programming with the aim of providing deadlock-free, race-free execution.

The split and join objectfifo data movement patterns are supported on Memory tiles, which do not have a compute core. As such, data is forwarded directly from the input channel(s) of the Memtile DMA to the output channel(s).

As an example, the code below describes a join of three input objectfifos of 8-element wide objects into one output objectfifo of 24-element wide objects. For simplicity, all objectfifos have a depth of one.

depth = 1
n_workers = 3
of_offsets = [8 * worker for worker in range(n_workers)]

of_out = ObjectFifo(tile24_ty, depth=depth, name="out")
of_outs = of_out.prod().join(
    of_offsets,
    obj_types=[tile8_ty] * n_workers,
    depths=[depth] * n_workers,
    names=[f"out{worker}" for worker in range(n_workers)],
)

The above code used to generate the following buffer descriptor (BD) level programming for the Memory tile DMA, where each input objectfifo uses a different S2MM channel to write 8-element wide data tensors into the same %out_buff_0 buffer at different offsets. The 24-element wide buffer is then written to the AXI stream on the MM2S channel.

%memtile_dma_0_1 = aie.memtile_dma(%mem_tile_0_1) {
      %0 = aie.dma_start(MM2S, 0, ^bb1, ^bb6)
    ^bb1:  // 2 preds: ^bb0, ^bb1
      aie.use_lock(%out_cons_lock, AcquireGreaterEqual, 3)
      aie.dma_bd(%out_buff_0 : memref<24xi32>, 0, 24)
      aie.use_lock(%out_prod_lock, Release, 3)
      aie.next_bd ^bb1
    ^bb6:  // pred: ^bb0
      %1 = aie.dma_start(S2MM, 0, ^bb7, ^bb8)
    ^bb7:  // 2 preds: ^bb6, ^bb7
      aie.use_lock(%out_prod_lock, AcquireGreaterEqual, 1)
      aie.dma_bd(%out_buff_0 : memref<24xi32>, 0, 8)
      aie.use_lock(%out_cons_lock, Release, 1)
      aie.next_bd ^bb7
    ^bb8:  // pred: ^bb6
      %2 = aie.dma_start(S2MM, 1, ^bb9, ^bb10)
    ^bb9:  // 2 preds: ^bb8, ^bb9
      aie.use_lock(%out_prod_lock, AcquireGreaterEqual, 1)
      aie.dma_bd(%out_buff_0 : memref<24xi32>, 8, 8)
      aie.use_lock(%out_cons_lock, Release, 1)
      aie.next_bd ^bb9
    ^bb10:  // pred: ^bb8
      %3 = aie.dma_start(S2MM, 2, ^bb11, ^bb12)
    ^bb11:  // 2 preds: ^bb10, ^bb11
      aie.use_lock(%out_prod_lock, AcquireGreaterEqual, 1)
      aie.dma_bd(%out_buff_0 : memref<24xi32>, 16, 8)
      aie.use_lock(%out_cons_lock, Release, 1)
      aie.next_bd ^bb11
    ^bb12:  // pred: ^bb10
      aie.end
    }

At this point, it is important to note that a DMA BD can only have a single lock acquire operation and a single lock release operation, because of hardware limitations.

The DMA program above uses a single producer and consumer semaphore lock pair for the join pattern (i.e. out_prod_lock and out_cons_lock). Each input channel adds one token to the consumer lock. The output channel acquires all three consumer tokens, writes the data to the AXI stream, then adds three tokens to the producer lock, with the intention that each input channel consumes one token in the next iteration of the full DMA program. This program, however, contains a race. If one input channel receives data much faster than the other two, nothing stops it from consuming several or even all of the producer lock tokens and incrementing the consumer lock until the output channel has enough tokens to execute, even though not every input has written its part of the buffer.
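The race can be replayed with a small token-counting model. This is only a sketch under stated assumptions: the function name `run_shared_lock_join` is made up for illustration, the producer lock is assumed to start with three tokens, and the output channel is assumed not to run during the replay. It is not a model of the actual hardware scheduler.

```python
def run_shared_lock_join(arrivals):
    """Replay S2MM transfers against ONE shared producer/consumer lock pair.

    arrivals: channel indices (0..2) in the order their data shows up.
    Returns (writes_per_slice, mm2s_can_fire).
    """
    prod_lock = 3          # assumed initial value: one token intended per input
    cons_lock = 0
    writes = [0, 0, 0]     # how often each 8-element slice was written
    for ch in arrivals:
        if prod_lock >= 1:         # AcquireGreaterEqual(out_prod_lock, 1)
            prod_lock -= 1
            writes[ch] += 1        # dma_bd writes slice `ch` of out_buff_0
            cons_lock += 1         # Release(out_cons_lock, 1)
    # MM2S needs AcquireGreaterEqual(out_cons_lock, 3)
    return writes, cons_lock >= 3


balanced = run_shared_lock_join([0, 1, 2])
skewed = run_shared_lock_join([0, 0, 0])
print(balanced)  # ([1, 1, 1], True)
print(skewed)    # ([3, 0, 0], True)
```

In the skewed replay the output channel is allowed to fire although slices 1 and 2 were never written, which is the race described above.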

To ensure a race-free execution, the program below is generated instead:

%memtile_dma_0_1 = aie.memtile_dma(%mem_tile_0_1) {
      %0 = aie.dma_start(MM2S, 0, ^bb1, ^bb6)
    ^bb1:  // 2 preds: ^bb0, ^bb3
      aie.use_lock(%out_cons_lock_0, AcquireGreaterEqual, 1)
      aie.dma_bd(%out_buff_0 : memref<24xi32>, 0, 8)
      aie.use_lock(%out_prod_lock_0, Release, 1)
      aie.next_bd ^bb2
    ^bb2:  // pred: ^bb1
      aie.use_lock(%out_cons_lock_1, AcquireGreaterEqual, 1)
      aie.dma_bd(%out_buff_0 : memref<24xi32>, 8, 8)
      aie.use_lock(%out_prod_lock_1, Release, 1)
      aie.next_bd ^bb3
    ^bb3:  // pred: ^bb2
      aie.use_lock(%out_cons_lock_2, AcquireGreaterEqual, 1)
      aie.dma_bd(%out_buff_0 : memref<24xi32>, 16, 8)
      aie.use_lock(%out_prod_lock_2, Release, 1)
      aie.next_bd ^bb1
    ^bb6:  // pred: ^bb0
      %1 = aie.dma_start(S2MM, 0, ^bb7, ^bb8)
    ^bb7:  // 2 preds: ^bb6, ^bb7
      aie.use_lock(%out_prod_lock_0, AcquireGreaterEqual, 1)
      aie.dma_bd(%out_buff_0 : memref<24xi32>, 0, 8)
      aie.use_lock(%out_cons_lock_0, Release, 1)
      aie.next_bd ^bb7
    ^bb8:  // pred: ^bb6
      %2 = aie.dma_start(S2MM, 1, ^bb9, ^bb10)
    ^bb9:  // 2 preds: ^bb8, ^bb9
      aie.use_lock(%out_prod_lock_1, AcquireGreaterEqual, 1)
      aie.dma_bd(%out_buff_0 : memref<24xi32>, 8, 8)
      aie.use_lock(%out_cons_lock_1, Release, 1)
      aie.next_bd ^bb9
    ^bb10:  // pred: ^bb8
      %3 = aie.dma_start(S2MM, 2, ^bb11, ^bb12)
    ^bb11:  // 2 preds: ^bb10, ^bb11
      aie.use_lock(%out_prod_lock_2, AcquireGreaterEqual, 1)
      aie.dma_bd(%out_buff_0 : memref<24xi32>, 16, 8)
      aie.use_lock(%out_cons_lock_2, Release, 1)
      aie.next_bd ^bb11
    ^bb12:  // pred: ^bb10
      aie.end
    }

In the lowering above, each S2MM channel has its own pair of producer and consumer semaphore locks. As a DMA BD can have at most one acquire and one release lock operation, the output channel now has a chain of three BDs, where each BD sends the smaller part of the data corresponding to one of the join inputs.
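The effect of the per-channel locks can be checked with the same kind of token-counting model. Again, this is a sketch with assumptions: `run_per_channel_lock_join` is a hypothetical helper, each producer lock is assumed to start with one token, and only the input side is replayed before the output BD chain has run.

```python
def run_per_channel_lock_join(arrivals, n_channels=3):
    """Replay S2MM transfers when every channel owns its own lock pair.

    Returns (writes_per_slice, stalled_per_channel, chain_can_complete).
    """
    prod = [1] * n_channels    # assumed initial value: one token per channel
    cons = [0] * n_channels
    writes = [0] * n_channels
    stalled = [0] * n_channels
    for ch in arrivals:
        if prod[ch] >= 1:          # AcquireGreaterEqual(out_prod_lock_<ch>, 1)
            prod[ch] -= 1
            writes[ch] += 1
            cons[ch] += 1          # Release(out_cons_lock_<ch>, 1)
        else:
            stalled[ch] += 1       # the channel has to wait for the MM2S chain
    # The MM2S chain needs one consumer token from every channel in turn.
    return writes, stalled, all(c >= 1 for c in cons)


fast_only = run_per_channel_lock_join([0, 0, 0])
all_inputs = run_per_channel_lock_join([0, 0, 0, 1, 2])
print(fast_only)   # ([1, 0, 0], [2, 0, 0], False)
print(all_inputs)  # ([1, 1, 1], [2, 0, 0], True)
```

A fast channel can now fill its own slice only once, and the output chain cannot complete until every input has delivered its part.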

Please note that the objectfifo link uses a list of offsets to determine where data should be written to or read from during a join or distribute. It currently also uses these offsets, and not the sizes of the objectfifos in the link, to determine the length of each transfer. This is because the objects of the input objectfifos of a join could be smaller and simply require multiple DMA transfers to accumulate on the Memory tile. The same holds for the output objectfifos in a distribute.
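As a rough illustration of how lengths can follow from offsets alone, the sketch below takes each transfer to run from its offset up to the next offset, with the last one running to the end of the joined object. That rule and the helper name `transfer_lengths` are assumptions chosen to match the BDs shown in this post, not a description of the actual lowering code.

```python
def transfer_lengths(offsets, total_len):
    """Derive one transfer length per offset (assumed rule, see lead-in)."""
    bounds = list(offsets) + [total_len]
    return [hi - lo for lo, hi in zip(bounds, bounds[1:])]


n_workers = 3
of_offsets = [8 * worker for worker in range(n_workers)]
lengths = transfer_lengths(of_offsets, total_len=24)
print(list(zip(of_offsets, lengths)))  # [(0, 8), (8, 8), (16, 8)]
```

The resulting (offset, length) pairs match the `aie.dma_bd` operands `0, 8`, `8, 8` and `16, 8` in the programs above.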

Limitations

While the lowering above ensures correctness, it limits the complexity of data layout transformations that can be applied on the join output. These transformations are applied on the fly by the aie.dma_bd operations. These operations specify the memory buffer which is being accessed, at what offset, and what length. Since the BDs in the program above only operate on 8-element wide tensors, a transformation on the full 24-element wide output of the join cannot be applied.

One solution that was tested for the example above is to keep the BD chain on the output channel as is, but to have only one of the aie.dma_bd operations actually write the full 24-element wide data to the stream. The code below shows the lowering if we wanted to apply a [(8, 1), (3, 8)] dimensions_to_stream transformation on the output objectfifo.

%memtile_dma_0_1 = aie.memtile_dma(%mem_tile_0_1) {
      %0 = aie.dma_start(MM2S, 0, ^bb1, ^bb6)
    ^bb1:  // 2 preds: ^bb0, ^bb3
      aie.use_lock(%out_cons_lock_0, AcquireGreaterEqual, 1)
      aie.dma_bd(%out_buff_0 : memref<24xi32>, 0, 0, [<size = 8, stride = 1>, <size = 3, stride = 8>])
      aie.use_lock(%out_prod_lock_0, Release, 1)
      aie.next_bd ^bb2
    ^bb2:  // pred: ^bb1
      aie.use_lock(%out_cons_lock_1, AcquireGreaterEqual, 1)
      aie.dma_bd(%out_buff_0 : memref<24xi32>, 0, 0, [<size = 8, stride = 1>, <size = 3, stride = 8>])
      aie.use_lock(%out_prod_lock_1, Release, 1)
      aie.next_bd ^bb3
    ^bb3:  // pred: ^bb2
      aie.use_lock(%out_cons_lock_2, AcquireGreaterEqual, 1)
      aie.dma_bd(%out_buff_0 : memref<24xi32>, 0, 24, [<size = 8, stride = 1>, <size = 3, stride = 8>])
      aie.use_lock(%out_prod_lock_2, Release, 1)
      aie.next_bd ^bb1
    ^bb6:  // pred: ^bb0
      %1 = aie.dma_start(S2MM, 0, ^bb7, ^bb8)
    ^bb7:  // 2 preds: ^bb6, ^bb7
      aie.use_lock(%out_prod_lock_0, AcquireGreaterEqual, 1)
      aie.dma_bd(%out_buff_0 : memref<24xi32>, 0, 8)
      aie.use_lock(%out_cons_lock_0, Release, 1)
      aie.next_bd ^bb7
    ^bb8:  // pred: ^bb6
      %2 = aie.dma_start(S2MM, 1, ^bb9, ^bb10)
    ^bb9:  // 2 preds: ^bb8, ^bb9
      aie.use_lock(%out_prod_lock_1, AcquireGreaterEqual, 1)
      aie.dma_bd(%out_buff_0 : memref<24xi32>, 8, 8)
      aie.use_lock(%out_cons_lock_1, Release, 1)
      aie.next_bd ^bb9
    ^bb10:  // pred: ^bb8
      %3 = aie.dma_start(S2MM, 2, ^bb11, ^bb12)
    ^bb11:  // 2 preds: ^bb10, ^bb11
      aie.use_lock(%out_prod_lock_2, AcquireGreaterEqual, 1)
      aie.dma_bd(%out_buff_0 : memref<24xi32>, 16, 8)
      aie.use_lock(%out_cons_lock_2, Release, 1)
      aie.next_bd ^bb11
    ^bb12:  // pred: ^bb10
      aie.end
    }
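To see why this transformation needs the whole 24-element buffer, the [(8, 1), (3, 8)] pattern can be unrolled into the order in which buffer elements reach the stream. The sketch assumes the (size, stride) pairs are listed outermost dimension first; `stream_order` is a hypothetical helper, not part of the toolchain.

```python
def stream_order(dims, base=0):
    """Unroll (size, stride) pairs, outermost first, into element offsets."""
    addrs = [base]
    for size, stride in dims:
        # Each new dimension is nested inside the ones already expanded.
        addrs = [a + i * stride for a in addrs for i in range(size)]
    return addrs


order = stream_order([(8, 1), (3, 8)])
slices = [a // 8 for a in order]   # which join input each element came from
print(order[:6])   # [0, 8, 16, 1, 9, 17]
print(slices[:6])  # [0, 1, 2, 0, 1, 2]
```

Under that assumption, every group of three consecutive streamed elements draws from all three input slices, so the transfer cannot be split into one BD per 8-element slice.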

While the lowering above supports a data layout transformation on the full output of the join, the DMA program is no longer race-free: once the producer locks of each join input are released in the output DMA BD-chain, data in the buffer can be overwritten by a new data transfer.

Possible Solutions

Following the last point in the previous section, it would be possible to generate additional "dummy" locks to acquire and release after the aie.dma_bd operations with the data transfer in the BD chain. This would ensure that the input channels do not overwrite the data before it is written to the AXI stream.

Another option is to increase the depth of the output objectfifo in the join such that data is buffered into a different object while the other is written to the stream.

Future Integration

This is an area that would greatly benefit from a rework and is already part of the roadmap for data movement development. Specifically, improvements should allow more flexibility in how data layout transformations are communicated through the data movement primitives and in how objectfifos of different object sizes and depths are coupled through Memory tiles.

AndraBisca · 2025-12-03T16:17:20Z


Note that as of #2751, toStream data layout transformations on the output objectfifo of a join and fromStream data layout transformations on the input objectfifo of a distribute are not allowed and produce an error message.
