Surprisingly large function call overhead cost #14007

george-hawkins · 2024-03-02T15:45:57Z

I have a piece of code where introducing a function call incurs a surprising overhead. Essentially, this first bit of code is far slower than the second version.

Slow code:

def copy_to_uart():
    for _, _ in poller.ipoll(0):
        uart_input.readinto(byte_buffer)
        uart_output.write(byte_buffer)

while True:
    copy_to_uart()

Fast code:

def copy_to_uart():
    while True:
        for _, _ in poller.ipoll(0):
            uart_input.readinto(byte_buffer)
            uart_output.write(byte_buffer)

copy_to_uart()

I.e. allowing the code to regularly drop out of the function copy_to_uart (only to be immediately re-invoked by the while-loop) has a noticeable effect on how quickly I can echo data through the UART.

Of course, I understand that invoking a function incurs some cost but I'm surprised at the effect - the bytes-per-second (when I try to max things out) drops by about 14%.

Given that there are only three other calls involved (ipoll, readinto and write), this might not seem too bad. But these get called many times because of the for loop (when one is trying to stream data through the UART as fast as possible), and in my real code I make many more calls in the copy_to_uart logic (and check all the return values). All of those calls, however, are into the standard MicroPython libraries.

Is the function call overhead really so much greater for my own code compared to the standard libraries?
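One way to separate the cost of the call itself from the UART work would be a small micro-benchmark along these lines. This is only a sketch: work() is a hypothetical stand-in for the body of copy_to_uart(), and the timing helpers fall back to CPython's clock so it can also be tried on a desktop. The absolute numbers will differ a lot between ports and boards.

```python
import time


def _now_us():
    # MicroPython provides time.ticks_us(); CPython does not, so fall back.
    if hasattr(time, "ticks_us"):
        return time.ticks_us()
    return time.perf_counter_ns() // 1000


def _elapsed_us(start, end):
    # ticks_diff() handles wrap-around of the tick counter on MicroPython.
    if hasattr(time, "ticks_diff"):
        return time.ticks_diff(end, start)
    return end - start


def work(buf):
    # Hypothetical stand-in for the real loop body.
    buf[0] = (buf[0] + 1) & 0xFF


def call_per_iteration(n):
    buf = bytearray(1)
    fn = work  # cache the global in a local before the hot loop
    start = _now_us()
    for _ in range(n):
        fn(buf)
    return _elapsed_us(start, _now_us())


def loop_inside_function(n):
    buf = bytearray(1)
    start = _now_us()
    for _ in range(n):
        buf[0] = (buf[0] + 1) & 0xFF  # same work, no call
    return _elapsed_us(start, _now_us())


def compare(n):
    return call_per_iteration(n), loop_inside_function(n)


per_call_us, inline_us = compare(10000)
print("with call:", per_call_us, "us; inlined:", inline_us, "us")
```

The difference between the two figures, divided by n, gives a rough per-call cost for that particular board and build.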

Note: I'm using a C3 ESP32 and I see that while there's a native emitter for the Xtensa based ESP32s (see py/asmxtensa.c), there's no native emitter for the RISC-V based C3 ESP32s.

I've tried all kinds of things suggested in the MicroPython docs on optimizations and maximizing speed. But none of them, e.g. caching object references, really had a noticeable impact (I'd already discovered that accessing global variables is expensive and so avoid that in my code).
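For anyone unfamiliar with the "caching object references" technique mentioned above, a minimal sketch looks like this. The names copy_cached, src and dst are made up for illustration, and io.BytesIO is only a stand-in for the UART streams so the snippet can run anywhere.

```python
import io


def copy_cached(src, dst, count):
    # Look up the bound methods once, outside the hot loop, so each
    # iteration uses fast local variables instead of attribute lookups.
    readinto = src.readinto
    write = dst.write
    buf = bytearray(1)
    copied = 0
    for _ in range(count):
        if readinto(buf):  # returns the number of bytes read (0 at end of data)
            write(buf)
            copied += 1
    return copied


src = io.BytesIO(b"hello")
dst = io.BytesIO()
copy_cached(src, dst, 5)
print(dst.getvalue())
```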

If you want to experiment, you can find a tiny but complete example that includes the above snippet here. To go with it, there's serial_tester.py - you can run it on your laptop/PC like so:

$ python serial_tester.py --port /dev/ttyACM0 --baud-rate 230400

It depends on pyserial, which you'll already have installed if you're using mpremote.

As this code involves tying up UART0, I'd also suggest adding this block before the run call if you want to be able to iterate on uploading code (just press the board's RESET button and the code waits 3 seconds before taking control of UART0, during which time you can upload a new program or connect to the REPL and press ctrl-C):

import time
# Give myself a chance to bail and upload a new program.
print('Press ctrl-C now to exit')
time.sleep(3)
print("Taking control of the USB UART")
time.sleep_ms(200)

dpgeorge · 2024-03-03T12:48:26Z

Maintainer

The slow-down could be due to the loading of the global variable copy_to_uart() in the while-true loop.

Try this modification to the slow code:

def copy_to_uart():
    ...  # same body as before

def main():
    copy = copy_to_uart  # pre-load global variable into local variable
    while True:
        copy()

main()


george-hawkins · 2024-03-03T14:16:33Z

Author

Thanks for the follow-up, @dpgeorge.

The code shown in my question is just a snippet from the real code (to which I provided this link).

As noted, I'd already found that globals have a very high cost, so had already eliminated them (I believe) by wrapping everything in an apparently redundant run function rather than creating everything in the global scope.

PS thank you for the amazing tool that is MicroPython. I think, when showered with questions like this one, where it's always "why this?" / "why that?", it may sound like we're all always complaining. If so, apologies. I really appreciate MicroPython and am trying to get the most out of it 👍


dpgeorge Mar 5, 2024
Maintainer

"I'd already found that globals have a very high cost, so had already eliminated them (I believe) by wrapping everything in an apparently redundant run function rather than creating everything in the global scope."

In that linked example you've now made copy_to_uart() a closure, because it has to close over the variables poller, uart_input and uart_output that are shared between run() and copy_to_uart(). That will slow things down as well.

It's a tricky problem because if you make copy_to_uart() a function in the global scope (not a closure) then it needs to reference poller etc as global variables, which again slows things down.
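To make the closure point concrete, here is a small sketch contrasting an inner function that captures a variable from its enclosing function with one that receives the same object as an argument. The function names are made up for illustration. The __closure__ check is a CPython feature and may not be available on MicroPython, so getattr() is used defensively. Note that this only shows which version captures variables; it does not measure the speed difference.

```python
def run_with_closure():
    counter = [0]

    def step():
        # 'counter' is a free variable captured from run_with_closure(),
        # so step() is a closure.
        counter[0] += 1
        return counter[0]

    return step


def run_with_arguments():
    def step(counter):
        # 'counter' is an ordinary local here; nothing is captured.
        counter[0] += 1
        return counter[0]

    return step


closure_step = run_with_closure()
plain_step = run_with_arguments()

# On CPython, captured variables show up as cells in __closure__.
print(getattr(closure_step, "__closure__", None) is not None)
print(getattr(plain_step, "__closure__", None) is not None)
```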

You could try a generator... kind of like a mini asyncio-like scheduler that you run yourself:

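import sys
import select
import micropython
from machine import UART

# BAUD_RATE is defined elsewhere in the linked example.

def run():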
    UART(0, baudrate=BAUD_RATE)
    uart_input = sys.stdin.buffer
    uart_output = sys.stdout.buffer

    # Disable keyboard interrupt on receiving a 0x03 (ctrl-C) byte.
    micropython.kbd_intr(-1)

    poller = select.poll()
    poller.register(uart_input, select.POLLIN)

    def copy_to_uart(poll, uart_in, uart_out):
        byte_buffer = bytearray(1)
        yield

        while True:
            for _, _ in poll.ipoll(0):
                uart_in.readinto(byte_buffer)
                uart_out.write(byte_buffer)
            yield

    gen = copy_to_uart(poller, uart_input, uart_output)
    next(gen)  # run the first bit of copy_to_uart()
    
    local_next = next
    while True:
        local_next(gen)

That will have very little function call overhead in the local_next(gen) call, and copy_to_uart() won't access any globals.
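The "mini scheduler" idea extends naturally to more than one generator. The following is a rough sketch of that, not code from the linked example: copy_task and run_rounds are hypothetical names, io.BytesIO stands in for the UART streams, and the loop is bounded so it terminates when tried on a desktop.

```python
import io


def copy_task(src, dst):
    # Everything the task needs arrives as arguments, so the generator
    # body only touches local variables.
    buf = bytearray(1)
    while True:
        if src.readinto(buf):
            dst.write(buf)
        yield  # hand control back to the scheduler


def run_rounds(tasks, rounds):
    local_next = next  # cache the builtin in a local
    for _ in range(rounds):
        for task in tasks:
            local_next(task)  # resume each task in turn


a_in, a_out = io.BytesIO(b"abc"), io.BytesIO()
b_in, b_out = io.BytesIO(b"xy"), io.BytesIO()
run_rounds([copy_task(a_in, a_out), copy_task(b_in, b_out)], 3)
print(a_out.getvalue(), b_out.getvalue())
```

On a board, the bounded for loop in run_rounds() would presumably become a while True loop, as in the snippet above.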

george-hawkins Mar 10, 2024
Author

Thanks for suggesting the generator approach. I had already been thinking of wrapping up some elements of my logic in a generator (one that e.g. yielded the bytes as they're read).
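A generator that yields the bytes as they are read could look roughly like the sketch below. The names read_bytes and echo are made up for illustration, io.BytesIO stands in for the UART input, and yielding None when nothing is available is just one possible convention.

```python
import io


def read_bytes(src):
    buf = bytearray(1)
    readinto = src.readinto  # cache the bound method
    while True:
        if readinto(buf):
            yield buf[0]  # the byte value, as an int
        else:
            yield None  # nothing available on this pass


def echo(src, dst, steps):
    out = bytearray(1)
    reader = read_bytes(src)
    for _ in range(steps):
        value = next(reader)
        if value is not None:  # a byte value of 0 is still valid data
            out[0] = value
            dst.write(out)


echo_out = io.BytesIO()
echo(io.BytesIO(b"hi"), echo_out, 4)
print(echo_out.getvalue())
```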

I'll give your approach a try.

I'd never really thought of inner functions as closures, but I guess they have to be. Otherwise things would get very confusing if you returned an inner function from its outer function and it hadn't closed over the state visible to it (with that state instead disappearing when the outer function returned).
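The classic illustration of this is a counter factory, where the returned inner function keeps its own copy of the outer function's state alive. make_counter is just an illustrative name.

```python
def make_counter():
    count = 0

    def increment():
        nonlocal count  # refers to the variable captured from make_counter()
        count += 1
        return count

    return increment


counter = make_counter()
counter()
print(counter())  # state survives even though make_counter() has returned
```

Each call to make_counter() produces an independent counter with its own captured count.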

In my real code, everything is wrapped up in classes. I just broke things out into free-floating functions in my example to keep things simple.

My program probably could benefit from being rewritten in an asyncio fashion - initially, I thought working with a hard-loop and non-blocking streams would be simpler.
