High performance SPI (50 MHz) with DMA on LPSPI2 on NXP MIMXRT1061 #70580
-
I need to be able to transfer data from an SPI-connected peripheral at close to 50 Mbps. The custom board is configured for LPSPI2 operation on pins SD_B1_07-SD_B1_09. I have managed to configure the clocks to achieve a ~50 MHz clock.
-
The error I'm getting is logged from zephyr/drivers/spi/spi_context.h: it indicates the ISR for the DMA completion did call spi_mcux_dma_callback.
-
I have made some progress on this issue. I found that my eDMA setup in the devicetree did not have the correct DMAMUX channels; LPSPI2 uses channels 77 and 78 in the eDMA. After fixing this I'm able to get SPI operations working with DMA, but I'm still not able to reach the full data rate. The issue now seems to be that Zephyr's SPI+DMA on the MIMXRT106x maxes out at 1 byte every 186 ns. Even if I increase the SPI clock rate or decrease the transfer-delay, the transfers still occur at the same interval. This suggests to me that the bottleneck is actually some configuration of the DMA engine or the data transfer over the buses. It also seems to transfer only 1 byte at a time, even though I'm transferring a total of 5760 bytes. Any recommendations on how I can configure the DMA to transfer data at higher rates?
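For anyone hitting the same DMAMUX problem, here is a hypothetical overlay sketch routing LPSPI2 through the eDMA using sources 77 (RX) and 78 (TX) as described above. The node labels, the eDMA channel numbers (0/1 here), and the exact `dmas` cell layout are assumptions that depend on the `nxp,mcux-edma` binding in your Zephyr tree, so check them against your board's dtsi before use.

```dts
/* Sketch only: assumed node labels and cell layout, verify against
 * the nxp,mcux-edma binding. DMAMUX sources 77/78 are the LPSPI2
 * RX/TX requests per the RT1060 mapping discussed above. */
&lpspi2 {
	status = "okay";
	dmas = <&edma0 0 77>, <&edma0 1 78>;
	dma-names = "rx", "tx";
};
```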
-
Hi @asteriskSF ,
The eDMA will have higher throughput when the transfers are multiple bytes, which reduces the per-transfer overhead in the eDMA. I also see that on the RT1060 the LPSPI has 16-word FIFOs on both RX and TX; leveraging those FIFOs with the DMA transfer should improve the throughput.

Also, is the DMA accessing the SPI buffers in OCRAM? Using on-chip SRAM reduces latency compared to external RAM, and due to the bus interconnects the DMA typically has the highest performance when accessing the OCRAM. For more details, see AN12437, i.MX RT Series Performance Optimization.
Best regards
-
Hi @asteriskSF ,
-
I made some attempts at multiple-byte DMA transfers yesterday. I was able to get the TX DMA to read 16 bytes from memory and load the TX FIFO 1 byte at a time in a single DMA operation, which keeps the TX DMA out of the way for most RX operations.

For the LPSPI to accept more than 1 byte per FIFO entry would require modifying the frame size, which is written into the command FIFO at the beginning of the SPI operation, or more likely writing the command FIFO between DMA operations. The existing driver code only writes the TCR (command FIFO) at the start of the operation, in the fsl_lpspi.c function LPSPI_MasterInit. I would need to add a function to fsl_lpspi that updates the TCR during operation and call it from spi_mcux_dma_rxtx_load between operations. I haven't tried this yet since I'm concerned it will have other ramifications that would take a lot of time to sort out; I've been pursuing simple fixes so far.

Another speed-up I found was to operate the LPSPI at 206 MHz, which then divides down to 53 MHz inside the LPSPI module. Unfortunately this appears to exceed the LPSPI_CLK_ROOT maximum of 133 MHz. Is there anything like a higher speed grade for these parts that might support higher IPG and LPSPI clock rates?
-
We're also seeing 32-byte blocks of data that are not updated in the middle of the data read from SPI. I am zeroing the memory region before the LPSPI DMA starts, and zero is not an expected value since these are samples of analog signals that always have some bias. After transfer over USB I find regions of data that haven't been updated, always in blocks of 32 bytes or multiples of 32 bytes. I read in other support requests that there is a cache with 32-byte lines, though I'm having trouble finding that in the reference manual. Is there some configuration of the OCRAM that we are missing to get reliable memory updates from the DMA?
-
Thanks @DerekSnell, your explanation makes it all clear. I found AN12042 yesterday, which also has relevant information, though not explained quite so succinctly. Instead of completely disabling the cache for OCRAM, I have implemented a cache flush and invalidate before starting the DMA operations. This works in this design because the CPU does not access the memory region until the SPI/DMA operation is complete. I used the Zephyr cache API in lieu of the CMSIS or NXP-specific APIs. Since there is a future plan to perform some signal processing, this seems like the best way to maintain CPU performance.

I've also managed (barely) to reach the performance requirements for this application, but at the moment it requires violating the maximum SPI peripheral clock. When I run the SPI peripheral clock below the maximum frequency, I get an extended delay of around 40 ns between every byte on the SPI peripheral bus, which significantly reduces the throughput. The maximum rated clock is 133 MHz, so I would need to run the peripheral clock at 100-120 MHz to achieve a 50-60 MHz clock at the external peripheral. With an LPSPI peripheral clock of 216 MHz, I am getting a bus clock of 54 MHz with only a ~20 ns gap between each byte, which is fast enough for the application. I have also tried adjusting the SPI transfer-delay property in devicetree, but this did not change the delay I'm seeing on the bus.

I am not entirely clear where this issue originates. I can rule out the DMA in the TX path: I now have it configured to send 16 bytes per request, and I can clearly see slightly extended delays every 16 bytes when this occurs. I also don't think the DMA in the RX path would cause an issue, since the DMA should be keeping the FIFO empty by reading a byte on every receive. This suggests the problem is localized to the LPSPI logic, perhaps related to the timing of logic signals from the LPSPI peripheral clock domain relative to the bus clock.
Any ideas what causes the extended delay between LPSPI bytes?
-
Never mind about the FlexSPI vs. LPSPI question; I just now realize FlexSPI is specifically designed for flash-like devices and would not work for the current design.
-
The latest update from NXP technical support is that the LPSPI doesn't support speeds above 30 MHz and might fail in mass production. We are evaluating switching to another vendor or trying to switch to the FlexSPI. The FlexSPI seems challenging since the documentation is entirely focused on operation with flash peripherals.
Hi @asteriskSF ,
Yes, it sounds like the cache is likely the cause of this. In the RT1xxx architecture, the caches are only used on the buses of the Cortex-M7 CPU, and Zephyr enables these by default. Other masters in the SoC, like the DMA, do not access the cache. Therefore, any shared RAM locations like your SPI buffers need to be in non-cached memory; otherwise, the CPU and DMA may not be using the same data.
Starting with Zephyr v3.5, you can use the devicetree attribute DT_MEM_ARM(ATTR_MPU_RAM_NOCACHE) to disable the cache for a specific MPU region, like the OCRAM. Some examples of using this attribute include the mem_attr test, or this RT1170 overlay. Let us know if that resolves …
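As a rough illustration of that attribute, an overlay along these lines marks a RAM region non-cacheable via the `zephyr,memory-attr` property. The `&ocram` node label is an assumption from the RT1060 dtsi; check your board's devicetree for the actual label and region layout.

```dts
/* Sketch, assuming Zephyr >= v3.5 and an &ocram node label as on the
 * RT1060 dtsi; verify both against your tree before use. */
#include <zephyr/dt-bindings/memory-attr/memory-attr-arm.h>

&ocram {
	zephyr,memory-attr = <( DT_MEM_ARM(ATTR_MPU_RAM_NOCACHE) )>;
};
```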