Comparison of latency in memory-mapped operations via XBUS using different approaches. #1493
Replies: 4 comments 1 reply
---
Your functions may not be inlined by the compiler (depending on the optimization level used), and your functions perform an addition before the pointer access. So I would expect your versions to be slower.
---
Thanks for sharing your findings! Your functions do not appear to be inlined. They are therefore called as normal functions, which incurs additional call/return overhead. As @vhdlnerd already mentioned, there is also at least one ADD operation in each function. Depending on how the compiler distributes the variables across the registers, the function may also have to save/restore registers to/from the stack, which adds further overhead. It would be best to look at the generated assembly code for those functions.

The macros from the library are ultimately just a single inline assembly instruction: no variable passing, no arithmetic, no stack operations. For block transfers you would then of course need a loop around it, which increases the overhead again. The good old Duff's Device could help here, or you could just use the DMA controller directly.
---
To see the assembly generated from C, I often use the awesome Compiler Explorer web page. For example: https://godbolt.org/z/eex4GMaW4 This link shows @Unike267's version of the write_RAM function compiled. You get one load-immediate instruction for the RAM_BASE_ADDR constant, one add instruction, and the store instruction (plus the return if the function does not get inlined).
---
Thank you very much, @vhdlnerd and @stnolting! Googling it, I saw that there is an […]. I modified the code and tested these four scenarios:

Here are the measured timings (in clock cycles) for a complete RAM read (1024 elements) under each scenario, highlighting how the use of […]:

The use of the RAM/ROM read and RAM write functions inside the project suggests that they should be public for consistent use. I wonder if there's a way to achieve the performance of […].

Cheers!

P.S.: Sorry for the delay; I was at a conference in Mexico last week, and the previous weeks were quite busy 😅😅.
---
Hi there!
I’ve tested two approaches to perform read/write operations through the default memory-mapped interface (XBUS) and would like to share the latency results of both implementations with you.
These two ways are as follows:
1. Using Stephan’s custom functions: `neorv32_cpu_load_unsigned_word()` / `neorv32_cpu_store_unsigned_word()`
2. Using pointers in C directly
To do that, I’ve generated two additional files to complement the `main.c`.

First, I’ve created a header `xbus.h` to declare the functions and to specify the selected addresses within the mappable address range:

Tip
To clarify the addresses that can be mapped to XBUS, I recommend reading:
#1247 (reply in thread)
Second, I’ve created a source file `xbus.c` to define these functions.

Finally, I’ve used these functions in the `main.c`.
Warning
These extracts are for guidance only. See the actual implementation below: `xbus.h`, `xbus.c` and `main.c`.
Setup
The following block diagram defines the simulation setup where the measurements were performed:
The test reads 1024 random values of a given width from a ROM (preloaded from the HDL) and writes them to RAM; it then reads all of this data back from RAM. Both approaches are exercised for each memory-mapped read/write operation.
Note
This test is performed three times per implementation with different width data: 8, 16, and 32 bits.
To perform the measurements, the CSR `mcycle` register is used.

Note
See #897 to learn how this measurement is performed.
Results
The results are summarized in the following table:
These results can be checked in the repo’s CI:
Conclusions
The principal conclusion is that Stephan’s implementation performs better than using C pointers directly.
I think that is because these functions are optimized at the assembly level; see:
- neorv32/sw/lib/include/neorv32_cpu.h (lines 50 to 55 in f902371)
- neorv32/sw/lib/include/neorv32_cpu.h (lines 96 to 102 in f902371)
Also, it has been verified that the latency is constant across different data widths.
Feel free to add your conclusions/doubts/opinions.
Cheers!
/cc @stnolting