Comparison of latency in memory-mapped operations via XBUS using different approaches. #1493
Replies: 4 comments 1 reply
---
Your functions may not be inlined by the compiler (depending on the optimization level used), and your functions perform an addition before the pointer access. So I would expect your versions to be slower.
---
Thanks for sharing your findings! Your functions do not appear to be inlined. They are therefore called as normal functions, which incurs additional call/return overhead. As @vhdlnerd already mentioned, there is also at least one ADD operation in each function. Depending on how the compiler distributes the variables across the registers, the function may also have to save/restore registers to/from the stack, which adds further overhead. It would be best to look at the generated assembly code for those functions.

The macros from the library are ultimately just a single inline assembly instruction: no variable passing, no arithmetic, no stack operations. For block transfers you would then of course need a loop around it, which increases the overhead again. The good old Duff's Device could help here, or you could just use the DMA controller directly.
---
To see the assembly generated from C, I often use the awesome Compiler Explorer web page. For example: https://godbolt.org/z/eex4GMaW4 This link shows @Unike267's version of the write_RAM function compiled. You get one load-immediate instruction for the RAM_BASE_ADDR constant, one add instruction, and the store instruction (plus the return if the function does not get inlined).
---
Thank you very much, @vhdlnerd and @stnolting! Googling it, I saw that there is an […]. I modified the code and tested these four scenarios:

Here are the measured timings (in clock cycles) for a complete RAM read (1024 elements) under each scenario, highlighting how the use of […]:

The use of the RAM/ROM read and RAM write functions inside the project suggests that they should be public for consistent use. I wonder if there's a way to achieve the performance of […].

Cheers!

P.S.: Sorry for the delay; I was at a conference in Mexico last week, and the previous weeks were quite busy 😅😅.
---
Hi there!
I’ve tested two approaches to perform read/write operations through the default memory-mapped interface (XBUS) and would like to share the latency results of both implementations with you.
These two ways are as follows:
1. Using Stephan’s custom functions: `neorv32_cpu_load_unsigned_word()` / `neorv32_cpu_store_unsigned_word()`
2. Using pointers in C directly
To do that, I’ve generated two additional files to complement the `main.c`.

First, I’ve created a header `xbus.h` to declare the functions and to specify the selected addresses within the mappable address range:

Tip
To clarify the addresses that can be mapped to XBUS, I recommend reading:
#1247 (reply in thread)
Second, I’ve created a source file `xbus.c` to define these functions.

Finally, I’ve used these functions in the `main.c`.
Warning
These extracts are for guidance only. See the actual implementation below: `xbus.h`, `xbus.c` and `main.c`.
Setup
The following block diagram defines the simulation setup where the measurements were performed:
The test reads 1024 random values of a given width from a ROM (preloaded from the HDL) and writes them to RAM; it then reads all of this data back from RAM. Both approaches are exercised for each memory-mapped read/write operation.
Note
This test is performed three times per implementation with different width data: 8, 16, and 32 bits.
To perform the measurements, the CSR `mcycle` register is used.

Note
See #897 to learn how this measurement is performed.
Results
The results are summarized in the following table:
These results can be checked in the repo’s CI:
Conclusions
The principal conclusion is that Stephan’s implementation performs better than using C pointers directly.
I think that is because these functions are optimized at the assembly level; see:
- neorv32/sw/lib/include/neorv32_cpu.h (lines 50 to 55 in f902371)
- neorv32/sw/lib/include/neorv32_cpu.h (lines 96 to 102 in f902371)
Also, it has been verified that the latency is constant across different data widths.
Feel free to add your conclusions/doubts/opinions.
Cheers!
/cc @stnolting