Insights into DMA programming generated from Object FIFOs #2748
AndraBisca
started this conversation in
General
Replies: 1 comment
Note that as of #2751, |
#2678 has raised some interesting questions on the topic of the low-level DMA programming generated from objectfifo split and join patterns, specifically related to their limitations when it comes to using data layout transformations on the fly. This discussion aims to provide more insights into the decisions which drive the lowering today and how the lowering would have to change to support more complex data layout transformations.
Context and Motivation
The objectfifo data movement primitive allocates its own memory buffers and locks at the sources and destinations specified by the user or chosen by a placement algorithm. These resources are then used in the low-level DMA programming with the aim of providing a deadlock-free, race-free execution.
The split and join objectfifo data movement patterns are supported on Memory tiles, which do not have a compute core. As such, data is forwarded directly from the input channel(s) of the Memory tile DMA to the output channel(s).
As an example, the code below describes a join of three input objectfifos of 8-element wide objects into one output objectfifo of 24-element wide objects. For simplicity, all objectfifos have a depth of one.
The above code used to generate the following buffer descriptor (BD) level programming for the Memory tile DMA, where each input objectfifo uses a different S2MM channel to write 8-element wide data tensors in the same %out_buff_0 buffer at different offsets. The 24-element wide buffer is then written to the AXI stream on the MM2S channel.
At this point, it is important to note that a DMA BD can only have a single lock acquire operation and a single lock release operation, because of hardware limitations.
The DMA program above uses a single producer and consumer semaphore lock pair for the join pattern (i.e. out_prod_lock and out_cons_lock). Each input channel adds one token to the consumer lock. The output channel acquires all three consumer tokens, writes the data to the AXI stream, then adds three tokens to the producer lock, with the intention to have one token per input channel to consume for a new iteration of the full DMA program. This program, however, contains a race. In the case where one input channel receives data much faster than the other two input channels, nothing would stop it from consuming multiple or even all of the producer lock tokens and incrementing the consumer lock until the output channel would have enough tokens to execute.
To ensure a race-free execution, the program below is generated instead:
In the lowering above each S2MM channel has its own pair of producer and consumer semaphore locks. As a DMA BD can only have a maximum of one acquire and one release lock operation, the output channel now has a chain of three BDs, where each BD sends a smaller part of the data corresponding to each of the join inputs.
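The difference between the two locking schemes can be sketched with a small Python model. This is not the generated DMA program: the function names, the schedule format, and the rule that a blocked channel attempt is simply skipped are assumptions made for illustration. The lock counts follow the description above (three producer tokens shared by all inputs versus one producer/consumer pair per input).

```python
def run_shared(schedule):
    """Shared lock pair: one producer lock (3 tokens) and one consumer lock.

    `schedule` lists which input channel (0, 1 or 2) has data ready at each step.
    Returns the number of writes per channel at the moment the output fires,
    or None if the output never collects three consumer tokens.
    """
    prod, cons = 3, 0
    writes = [0, 0, 0]
    for ch in schedule:
        if prod >= 1:        # any channel may take any producer token
            prod -= 1
            writes[ch] += 1
            cons += 1        # release one consumer token
        if cons >= 3:        # output BD acquires all three tokens
            return writes
    return None


def run_per_channel(schedule):
    """One producer/consumer lock pair per input channel (hypothetical model)."""
    prod = [1, 1, 1]
    cons = [0, 0, 0]
    writes = [0, 0, 0]
    for ch in schedule:
        if prod[ch] >= 1:    # a channel can only take its own token
            prod[ch] -= 1
            writes[ch] += 1
            cons[ch] += 1
        if all(c >= 1 for c in cons):  # the output BD chain can complete
            return writes
    return None


# Channel 0 receives data three times before channels 1 and 2 receive any.
print(run_shared([0, 0, 0]))             # output fires with no data from 1 and 2
print(run_per_channel([0, 0, 0]))        # output never fires
print(run_per_channel([0, 0, 0, 1, 2]))  # output fires once every input wrote
```

With the shared pair, the fast channel alone satisfies the output's acquire. With per-channel pairs, its second and third attempts cannot proceed until the output has released its producer lock again.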
Please note that the objectfifo link uses a list of offsets to determine where data should be written to or read from during a join or distribute. It currently also uses these offsets, and not the sizes of the objectfifos in the link, to determine the length of each transfer. This is because the objects of the input objectfifos of a join could be smaller than the region they fill and simply require multiple DMA transfers to accumulate on the Memory tile. The same applies to the output objectfifos in a distribute.
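As a rough illustration of deriving transfer lengths from offsets, the sketch below assumes that each length is the distance to the next offset, with the last transfer running to the end of the linked buffer. That rule, and the helper name transfer_lengths, are assumptions for this example; the post only states that the lengths come from the offsets.

```python
def transfer_lengths(offsets, buffer_len):
    """Derive one transfer length per offset (assumed rule, see lead-in)."""
    bounds = list(offsets) + [buffer_len]
    return [bounds[i + 1] - bounds[i] for i in range(len(offsets))]


# The join from the example: three inputs at offsets 0, 8 and 16
# in a 24-element buffer.
print(transfer_lengths([0, 8, 16], 24))
```

Under this rule the lengths do not depend on the object size of any individual input objectfifo, only on where its region starts and where the next one begins.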
Limitations
While the lowering above ensures correctness, it limits the complexity of data layout transformations that can be applied on the join output. These transformations are applied on the fly by the aie.dma_bd operations. These operations specify the memory buffer which is being accessed, at what offset, and with what length. Since the BDs in the program above only operate on 8-element wide tensors, a transformation on the full 24-element wide output of the join cannot be applied.
One solution that was tested for the example above is to keep the BD chain on the output channel as is, but only one of the aie.dma_bd operations actually writes the full 24-element wide data to the stream. The code below shows the lowering if we wanted to apply a [(8, 1), (3, 8)] dimensions_to_stream transformation on the output objectfifo.
While the lowering above supports a data layout transformation on the full output of the join, the DMA program is no longer race-free: once the producer locks of each join input are released in the output DMA BD-chain, data in the buffer can be overwritten by a new data transfer.
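To see why the [(8, 1), (3, 8)] transformation needs a BD that spans all 24 elements, the sketch below lists the buffer indices in the order they would be streamed. It assumes each pair is (size, stride) and that the pairs are listed from the outermost dimension to the innermost; the helper name stream_order is hypothetical.

```python
from itertools import product


def stream_order(dims, offset=0):
    """Return buffer indices in streaming order for (size, stride) pairs.

    Assumes the first pair is the outermost (slowest varying) dimension.
    """
    sizes = [size for size, _ in dims]
    strides = [stride for _, stride in dims]
    return [
        offset + sum(i * s for i, s in zip(idx, strides))
        for idx in product(*(range(n) for n in sizes))
    ]


order = stream_order([(8, 1), (3, 8)])
print(order[:6])  # first indices touch all three 8-element input regions
```

Under these assumptions, consecutive streamed elements come from offsets 0, 8 and 16 in turn, so the access pattern interleaves the three join inputs. A BD limited to one 8-element region cannot express that.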
Possible Solutions
Following the last point in the previous section, it would be possible to generate additional "dummy" locks to acquire and release after the aie.dma_bd operations with the data transfer in the BD chain. This would ensure that the input channels do not overwrite the data before it is written to the AXI stream.
Another option is to increase the depth of the output objectfifo in the join such that data is buffered into a different object while the other is written to the stream.
Future Integration
This is an area that would greatly benefit from a rework and is already part of the roadmap for data movement development. Specifically, improvements should allow more flexibility in how data layout transformations are communicated through the data movement primitives and in how objectfifos of different object sizes and depths are coupled through Memory tiles.