High performance SPI (50 MHz) with DMA on LPSPI2 on NXP MIMXRT1061 #70580
-
I need to be able to transfer data from an SPI-connected peripheral at close to 50 Mbps. The custom board is configured for LPSPI2 operation on pins SD_B1_07-SD_B1_09. I have managed to configure the clocks to achieve a ~50 MHz clock.
-
The error I'm getting is logged from zephyr/drivers/spi/spi_context.h: it indicates the ISR for the DMA completion did call spi_mcux_dma_callback.
-
I have made some progress on this issue. I found that my eDMA setup in the devicetree did not have the correct DMAMUX channels; LPSPI2 uses channels 77 and 78 in the eDMA. After fixing this I'm able to get SPI operations working with DMA, but I'm still not able to reach the full data rate. The issue now seems to be that Zephyr's SPI+DMA on the MIMXRT106x maxes out at 1 byte every 186 ns. Even if I increase the SPI clock rate or decrease the transfer-delay, the transfers still occur at the same interval. This suggests to me that the bottleneck is actually some configuration of the DMA engine or the data transfer over the buses. It also seems to transfer only 1 byte at a time, even though I'm transferring a total of 5760 bytes. Any recommendations on how I can configure the DMA to transfer data at higher rates?
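For anyone hitting the same DMAMUX problem, here is a hypothetical overlay sketch routing LPSPI2 through the eDMA using sources 77 (RX) and 78 (TX) as described above. The node labels, the eDMA channel numbers (0/1 here), and the exact `dmas` cell layout are assumptions that depend on the `nxp,mcux-edma` binding in your Zephyr tree, so check them against your board's dtsi before use.

```dts
/* Sketch only: assumed node labels and cell layout, verify against
 * the nxp,mcux-edma binding. DMAMUX sources 77/78 are the LPSPI2
 * RX/TX requests per the RT1060 mapping discussed above. */
&lpspi2 {
	status = "okay";
	dmas = <&edma0 0 77>, <&edma0 1 78>;
	dma-names = "rx", "tx";
};
```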
-
Hi @asteriskSF ,
The eDMA will have higher throughput when the transfers are multiple bytes, which reduces the per-transfer overhead in the eDMA. I also see that on the RT1060 the LPSPI has 16-word FIFOs on both RX and TX; leveraging those FIFOs with the DMA transfer should improve the throughput.

Also, is the DMA accessing the SPI buffers in OCRAM? Using on-chip SRAM reduces latency compared to external RAM, and due to the bus interconnects the DMA typically has the highest performance when accessing the OCRAM. For more details, see AN12437, i.MX RT Series Performance Optimization.
Best regards
-
Hi @asteriskSF ,
-
I made some attempts at multiple-byte DMA transfers yesterday. I was able to get the TX DMA to read 16 bytes from memory and load the TX FIFO 1 byte at a time in a single DMA operation, which keeps the TX DMA out of the way for most RX operations.

For the LPSPI to accept more than 1 byte per FIFO entry would require modifying the frame size, which is written into the command FIFO at the beginning of the SPI operation, or more likely writing the command FIFO between DMA operations. The existing driver code only writes the TCR (command FIFO) at the start of the operation, in the fsl_lpspi.c function LPSPI_MasterInit. I would need to add a function to fsl_lpspi that updates the TCR during operation and call it from spi_mcux_dma_rxtx_load between operations. I haven't tried this yet since I'm concerned it will have other ramifications that would take a lot of time to sort out; I've been pursuing simple fixes so far.

Another speed-up I found was to operate the LPSPI at 206 MHz, which then divides down to 53 MHz inside the LPSPI module. Unfortunately this appears to exceed the LPSPI_CLK_ROOT maximum of 133 MHz. Is there anything like a higher speed grade for these parts that might support higher IPG and LPSPI clock rates?
-
We're also seeing 32-byte blocks of data that are not updated in the middle of the data read from SPI. I am zeroing the memory region before the LPSPI DMA starts, and zero is not an expected value since these are samples of analog signals that always have some bias. After transfer over USB I find regions of data that haven't been updated, always in blocks of 32 bytes or multiples of 32 bytes. I read in other support requests that there is a cache with 32-byte lines, though I'm having trouble finding that in the reference manual. Is there some configuration of the OCRAM that we are missing to get reliable memory updates from the DMA?
-
Thanks @DerekSnell, your explanation makes it all clear. I found AN12042 yesterday, which also has relevant information, though not explained quite so succinctly. Instead of completely disabling the cache for OCRAM, I have implemented a cache flush and invalidate before starting the DMA operations. This works in this design because the CPU does not access the memory region until the SPI/DMA operation is complete. I used the Zephyr cache API in lieu of the CMSIS or NXP-specific APIs. Since there is a future plan to perform some signal processing, this seems like the best way to maintain CPU performance.

I've also managed (barely) to reach the performance requirements for this application, but at the moment it requires violating the maximum SPI peripheral clock. When I run the SPI peripheral clock below the maximum frequency, I get an extended delay of around 40 ns between every byte on the SPI peripheral bus, which significantly reduces the throughput. The maximum rated clock is 133 MHz, so I would need to run the peripheral clock at 100-120 MHz to achieve a 50-60 MHz clock at the external peripheral. With an LPSPI peripheral clock of 216 MHz, I am getting a bus clock of 54 MHz with only a ~20 ns gap between each byte, which is fast enough for the application. I have also tried adjusting the SPI transfer-delay property in devicetree, but this did not change the delay I'm seeing on the bus.

I am not entirely clear where this issue originates. I can rule out the DMA in the TX path: I now have it configured to send 16 bytes per request, and I can clearly see slightly extended delays every 16 bytes when this occurs. I also don't think the DMA in the RX path would cause an issue, since the DMA should be keeping the FIFO empty by reading a byte on every receive. This suggests the problem is localized to the LPSPI logic, perhaps related to the timing of logic signals from the LPSPI peripheral clock domain relative to the bus clock.
Any ideas what causes the extended delay between LPSPI bytes?
-
Never mind about the FlexSPI vs. LPSPI question; I just now realize FlexSPI is specifically designed for flash-like devices and would not work for the current design.
-
The latest update from NXP technical support is that the LPSPI doesn't support speeds above 30 MHz and might fail in mass production. We are evaluating switching to another vendor or trying to switch to the FlexSPI. The FlexSPI seems challenging since the documentation is entirely focused on operation with flash peripherals.
Hi @asteriskSF ,
Yes, it sounds like the cache is likely the cause of this. In the RT1xxx architecture, the caches are only used on the buses of the Cortex-M7 CPU, and Zephyr enables these by default. Other masters in the SoC, like the DMA, do not access the cache. Therefore, any shared RAM locations like your SPI buffers need to be in non-cached memory; otherwise, the CPU and DMA may not be using the same data.
Starting with Zephyr v3.5, you can use the devicetree attribute DT_MEM_ARM(ATTR_MPU_RAM_NOCACHE) to disable the cache for a specific MPU region, like the OCRAM. Some examples of using this attribute include the mem_attr test, or this RT1170 overlay. Let us know if that resolves …
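As a rough illustration of that attribute, an overlay along these lines marks a RAM region non-cacheable via the `zephyr,memory-attr` property. The `&ocram` node label is an assumption from the RT1060 dtsi; check your board's devicetree for the actual label and region layout.

```dts
/* Sketch, assuming Zephyr >= v3.5 and an &ocram node label as on the
 * RT1060 dtsi; verify both against your tree before use. */
#include <zephyr/dt-bindings/memory-attr/memory-attr-arm.h>

&ocram {
	zephyr,memory-attr = <( DT_MEM_ARM(ATTR_MPU_RAM_NOCACHE) )>;
};
```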