
Conversation

@devprodest
Contributor

@devprodest devprodest commented Mar 12, 2025

Description

There are platforms where copying data with the CPU is far from optimal; the easiest and fastest way is to use DMA.
To support this, the standard memcpy used for stream buffers can be replaced with a DMA-based memcpy.

Test Steps

No additional actions are required. This functionality improves the flexibility of the code.
To use the alternative function, you need to define pvPortMemCpyStreamBuffer in the FreeRTOSIPConfig.h file.

#include "utils\memcpy_with_dma.h" ///< my headeris for example only

#define pvPortMemCpyStreamBuffer(dst, src, count) memcpy_with_dma(dst, src, count)

If not specified, memcpy from the standard library will be used.

Checklist:

  • I have tested my changes. No regression in existing tests.
  • I have modified and/or added unit-tests to cover the code changes in this Pull Request.

Related Issue

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

…am buffers. Useful for optimization in systems that allow DMA to be used only in some memory areas.
@devprodest devprodest requested a review from a team as a code owner March 12, 2025 10:47
@tony-josi-aws
Member

@devprodest

Thanks for contributing to FreeRTOS+TCP.

Since the memory allocator for TCP stream buffers can be configured if needed by defining pvPortMallocLarge it makes sense to have something similar for memcpy that uses those buffers if it adds value.

I'm curious to know about the usage of the DMA-based memcpy (memcpy_with_dma in your case) here. Are you yielding the RTOS task (inside memcpy_with_dma) after the DMA has been set up and waiting for the DMA completion interrupt to make the task ready again? If that's the case, are you gaining significant performance benefits once you take into account the extra compute time spent on the context switches?
Or is it polled inside memcpy_with_dma? I believe there is not much performance benefit if the DMA is set up for the transfer and then polled for completion.
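
(For illustration of the interrupt-driven approach referred to in the question above, here is a minimal sketch in which the calling task blocks on a direct-to-task notification until a DMA completion ISR wakes it. DMA_Start(), DMA_ClearIrq() and the ISR name are hypothetical placeholders, not a real driver API; only the FreeRTOS notification calls are real.)

#include <stddef.h>
#include "FreeRTOS.h"
#include "task.h"

/* Placeholder hardware hooks - assumptions for this sketch only. */
extern void DMA_Start( void * pvDest, const void * pvSource, size_t uxBytes );
extern void DMA_ClearIrq( void );

static TaskHandle_t xWaitingTask = NULL;

void * memcpy_with_dma( void * pvDest, const void * pvSource, size_t uxBytes )
{
    xWaitingTask = xTaskGetCurrentTaskHandle();

    DMA_Start( pvDest, pvSource, uxBytes );      /* program and start the channel */

    /* Block until the completion ISR gives a notification. */
    ulTaskNotifyTake( pdTRUE, portMAX_DELAY );

    return pvDest;
}

void DMA_CompletionISR( void )                   /* hypothetical ISR name */
{
    BaseType_t xHigherPriorityTaskWoken = pdFALSE;

    DMA_ClearIrq();                              /* acknowledge the interrupt */

    if( xWaitingTask != NULL )
    {
        vTaskNotifyGiveFromISR( xWaitingTask, &xHigherPriorityTaskWoken );
        xWaitingTask = NULL;
    }

    portYIELD_FROM_ISR( xHigherPriorityTaskWoken );
}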

@devprodest
Contributor Author

devprodest commented Mar 12, 2025

@tony-josi-aws
Hi, on my platform the DMA can copy data 128 bits at a time, unlike the CPU, which copies byte by byte for unaligned accesses (or 32 bits at a time when aligned).
The DMA takes care of the alignment itself.
Due to this change, the stack speeds up by up to 10x.
I'm already using custom malloc functions to implement zero-copy, since the Ethernet peripheral in my chip can only use certain memory addresses.

@devprodest
Contributor Author

Due to the features of the platform, and to save the resources spent on context switching, polling of the readiness flag is used, but this does not prevent the copy from being very fast.
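
(A minimal sketch of the polled variant described above. The status register and busy bit are hypothetical placeholders; the author's actual memcpy_with_dma is not part of this PR.)

#include <stddef.h>
#include <stdint.h>

/* Placeholder DMA interface - names are assumptions, not a real driver API. */
extern volatile uint32_t DMA_STATUS;             /* hypothetical status register */
#define DMA_STATUS_BUSY    ( 1UL << 0 )          /* hypothetical busy bit */
extern void DMA_Start( void * pvDest, const void * pvSource, size_t uxBytes );

void * memcpy_with_dma( void * pvDest, const void * pvSource, size_t uxBytes )
{
    DMA_Start( pvDest, pvSource, uxBytes );      /* program and start the transfer */

    /* Spin on the readiness flag; no context switch is performed. */
    while( ( DMA_STATUS & DMA_STATUS_BUSY ) != 0UL )
    {
        /* busy-wait */
    }

    return pvDest;
}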

@tony-josi-aws
Member

@devprodest

Thanks for the update.

So memcpy_with_dma sets up the DMA for copying (which on your platform is faster thanks to the 128-bit-wide accesses) and then polls the readiness flag to see whether the copy has completed, which is faster than a normal memcpy because the DMA itself is faster, is that right?

Due to this change, the stack speeds up by up to 10x.

That's a good improvement; was it measured using IPERF? Also wondering which hardware platform you are using.

I'm already using custom malloc functions to implement zero-copy, since the Ethernet peripheral in my chip can only use certain memory addresses.

You can take a look at this page: TCP/IP Stack Network Buffers Allocation Schemes and their implication on simplicity, CPU load, and throughput performance, if you haven't already, to see if BufferAllocation_1.c fits your use case better, as it lets you statically allocate network buffers in a specific section of your memory map. [example]
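
(For reference, a sketch of the vNetworkInterfaceAllocateRAMToBuffers() hook that BufferAllocation_1.c relies on, based on the example on the linked page. The ".sram_buffers" section name and the GCC section attribute are assumptions showing how the buffers could be pinned to a DMA-capable memory region via the linker script.)

#include "FreeRTOS.h"
#include "FreeRTOS_IP.h"
#include "NetworkBufferManagement.h"

#define BUFFER_SIZE               ( ipTOTAL_ETHERNET_FRAME_SIZE + ipBUFFER_PADDING )
#define BUFFER_SIZE_ROUNDED_UP    ( ( BUFFER_SIZE + 7 ) & ~0x07UL )

/* Statically allocated Ethernet buffers; the section name is an assumption. */
static uint8_t ucBuffers[ ipconfigNUM_NETWORK_BUFFER_DESCRIPTORS ][ BUFFER_SIZE_ROUNDED_UP ]
    __attribute__( ( section( ".sram_buffers" ), aligned( 8 ) ) );

void vNetworkInterfaceAllocateRAMToBuffers( NetworkBufferDescriptor_t pxNetworkBuffers[ ipconfigNUM_NETWORK_BUFFER_DESCRIPTORS ] )
{
    BaseType_t x;

    for( x = 0; x < ipconfigNUM_NETWORK_BUFFER_DESCRIPTORS; x++ )
    {
        /* Point the descriptor ipBUFFER_PADDING bytes into the static buffer. */
        pxNetworkBuffers[ x ].pucEthernetBuffer = &( ucBuffers[ x ][ ipBUFFER_PADDING ] );

        /* Store a pointer back to the descriptor at the start of the buffer,
         * as required by the allocation scheme (32-bit target assumed). */
        *( ( uint32_t * ) &ucBuffers[ x ][ 0 ] ) = ( uint32_t ) &( pxNetworkBuffers[ x ] );
    }
}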

@devprodest
Contributor Author

devprodest commented Mar 13, 2025

So memcpy_with_dma sets up the DMA for copying (which on your platform is faster thanks to the 128-bit-wide accesses) and then polls the readiness flag to see whether the copy has completed, which is faster than a normal memcpy because the DMA itself is faster, is that right?

Yes, that's right. This increases the copying speed.
I wrote more details below.

That's a good improvement; was it measured using IPERF? Also wondering which hardware platform you are using.

No, the check was carried out using an algorithm similar to the actual application:
256 megabytes of data were uploaded to and downloaded from the device.

I can't say which platform yet; it's a trade secret. But I can describe some of its features. It is a video processing chip, similar to a GoPro or other such cameras, but with some interesting effects. The CPU is a 32-bit RISC-V.
My platform includes several different memories: TCM, SRAM, and DDR (a separate IC). The DMA can only work with SRAM and DDR.

TCM is used for running the firmware. SRAM stores the stack's buffers and the other buffers that the DMA needs to work with.

The main mode of operation is uploading data over the network into DDR, processing it, and downloading it back to the PC.

This is where the bottleneck is: DDR is very slow memory compared to SRAM, and byte-by-byte copying is a very long operation. The DMA does this very quickly and in large transactions.

You can take a look at this page: TCP/IP Stack Network Buffers Allocation Schemes and their implication on simplicity, CPU load, and throughput performance, if you haven't already, to see if BufferAllocation_1.c fits your use case better, as it lets you statically allocate network buffers in a specific section of your memory map. [example]

Thank you for this suggestion. I tried it a few days ago and it didn't have the desired effect. That allocator is not currently in use, and in any case it doesn't solve the whole problem.

aggarg
aggarg previously approved these changes Mar 13, 2025
@tony-josi-aws
Member

/bot run formatting

@tony-josi-aws tony-josi-aws merged commit 1e32b23 into FreeRTOS:main Mar 13, 2025
10 checks passed