perf: 6 performance optimizations for STM32WB55 (Cortex-M4 @ 64MHz) #977

Draft
WonderMr wants to merge 3 commits into DarkFlippers:dev from WonderMr:feat/opus-optimised

@WonderMr WonderMr commented Mar 16, 2026

What's new

  1. Compiler: -Og → -Os (release builds)
     Replace the debug-oriented -Og level with size-optimized -Os, which enables most -O2 passes: function inlining, dead code elimination, loop optimization, aggressive register allocation, tail call optimization, and common subexpression elimination.
     File: site_scons/firmwareopts.scons

  2. Disable heap memset on free() in release
     configHEAP_CLEAR_MEMORY_ON_FREE was always 1, so FreeRTOS Heap_4 zeroed every freed block with memset(). That is useful for catching use-after-free in debug builds but wasted work in release. The setting is now conditional on FURI_DEBUG, which saves an estimated 500+ memset calls per second during active GUI/protocol work.
     File: targets/f7/inc/FreeRTOSConfig.h

  3. SPI TX via DMA instead of busy-wait polling
     furi_hal_spi_bus_tx() polled the TXE flag byte by byte, keeping the CPU in a tight loop for the entire SPI transfer. It now delegates to furi_hal_spi_bus_trx_dma(), which uses DMA2_Channel7 and sleeps on a FreeRTOS semaphore, freeing the CPU during display updates (~1 KB/frame @ 20 fps) and radio TX operations.
     File: targets/f7/furi_hal/furi_hal_spi.c

  4. Fix realloc() to copy min(old_size, new_size)
     The original realloc() copied size (the new size) bytes from the old block via memcpy, which reads past the old allocation when growing. Added memmgr_heap_get_block_size(), which reads the usable size from the Heap_4 BlockLink_t header. realloc() now copies min(old_size, new_size) bytes, fixing potential UB and avoiding unnecessary copying.
     Files: furi/core/memmgr.c, furi/core/memmgr_heap.c/.h, targets/f7/api_symbols.csv

  5. Fix calloc() to explicitly zero memory
     The original calloc() just called pvPortMalloc() without a memset, relying on configHEAP_CLEAR_MEMORY_ON_FREE=1 for zero-initialized returns. With optimization #2 disabling that in release, calloc() would return uninitialized memory. Added an explicit memset(0).
     File: furi/core/memmgr.c

  6. Branch prediction hints on furi_check/assert/break
     Added __builtin_expect(!(__e), 0) to all assertion macros. This tells GCC that the error path is cold: the crash code moves to the end of the function and the hot path becomes fall-through (no pipeline penalty on the Cortex-M4's 3-stage pipeline). Affects ~2300+ call sites across the firmware.
     File: furi/core/check.h
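
The branch hint from item 6 can be sketched roughly as below. This is a minimal illustration that assumes GCC or Clang. The names `SKETCH_UNLIKELY`, `sketch_check`, `sketch_on_failure`, and `sketch_sum` are hypothetical and are not the macros in furi/core/check.h; the failure handler only counts failures so the example can run on a host instead of crashing.

```c
#include <stddef.h>

/* Hypothetical failure counter; real firmware would halt or reboot instead. */
unsigned sketch_failures = 0;

/* cold + noinline ask the compiler to keep this code away from the hot path. */
__attribute__((cold, noinline)) void sketch_on_failure(const char* expr) {
    (void)expr;
    sketch_failures++;
}

/* __builtin_expect(x, 0): the condition is expected to be false almost always. */
#define SKETCH_UNLIKELY(x) __builtin_expect(!!(x), 0)

/* Hypothetical check macro: the failing branch is marked as the rare one. */
#define sketch_check(e)                \
    do {                               \
        if(SKETCH_UNLIKELY(!(e))) {    \
            sketch_on_failure(#e);     \
        }                              \
    } while(0)

/* Example hot path guarded by the check. */
int sketch_sum(const int* data, size_t len) {
    sketch_check(data != NULL);
    if(data == NULL) return 0; /* only reachable because this handler returns */
    int sum = 0;
    for(size_t i = 0; i < len; i++) sum += data[i];
    return sum;
}
```

The hint changes code layout only; the check still runs on every call, so behavior is the same with or without it.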

Also: strncpy → strlcpy in subghz_scene_save_name.c (-Os exposed a -Wstringop-truncation warning, which the build treats as an error).
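
Items 2 and 5 above interact: once freed blocks are no longer cleared, calloc() has to zero memory itself. A minimal sketch of that idea follows, under stated assumptions. `arena_malloc` is a hypothetical stand-in for pvPortMalloc() that hands out deliberately dirty memory, and `zeroing_calloc` is a hypothetical name. The overflow check on `count * size` is an extra precaution added here for illustration; the PR description does not mention one.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical toy heap: a bump allocator over an arena filled with 0xA5,
 * mimicking a heap whose freed blocks are not cleared. */
static _Alignas(8) uint8_t arena[256];
static size_t arena_used = 0;

void arena_reset(void) {
    memset(arena, 0xA5, sizeof(arena));
    arena_used = 0;
}

/* Stand-in for pvPortMalloc(): returns uninitialized (dirty) memory. */
void* arena_malloc(size_t size) {
    size_t aligned = (size + 7u) & ~(size_t)7u;
    if(aligned < size) return NULL; /* rounding up overflowed */
    if(aligned > sizeof(arena) - arena_used) return NULL; /* out of memory */
    void* p = &arena[arena_used];
    arena_used += aligned;
    return p;
}

/* calloc that does not depend on the allocator returning zeroed memory. */
void* zeroing_calloc(size_t count, size_t size) {
    if(size != 0 && count > SIZE_MAX / size) return NULL; /* count * size overflows */
    size_t total = count * size;
    void* p = arena_malloc(total);
    if(p != NULL) memset(p, 0, total);
    return p;
}
```

In this sketch a block from `arena_malloc` still holds the 0xA5 fill pattern, while a block from `zeroing_calloc` reads back as all zero bytes.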

Verification

  • Build: ./fbt COMPACT=1 DEBUG=0
  • Boot device, navigate Settings → About — FW version shown
  • Open SubGHz, NFC, IR apps — UI responsive, no hangs

Checklist (For Reviewer)

  • PR has description of feature/bug
  • Description contains actions to verify feature/bugfix
  • I've built this code, uploaded it to the device and verified feature/bugfix

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@WonderMr WonderMr requested a review from xMasterX as a code owner March 16, 2026 05:57
@WonderMr WonderMr force-pushed the feat/opus-optimised branch from b8150e9 to 63e81d4 Compare March 16, 2026 14:38
AlZh-Mex and others added 2 commits March 16, 2026 20:25
realloc: add NULL check on pvPortMalloc result to prevent crash and
preserve original allocation on OOM per C standard. Use
memmgr_heap_get_block_size() to copy min(old, new) bytes instead of
reading past the old allocation boundary.
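
The realloc behavior described above can be sketched on a host with a toy allocator. This is only an illustration under assumptions: `HdrBlock`, `hdr_malloc`, `hdr_free`, `hdr_get_block_size`, and `hdr_realloc` are hypothetical names, the header layout is not Heap_4's BlockLink_t, and freeing on `new_size == 0` is a choice made for this sketch.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical header stored in front of each payload. The union keeps the
 * payload aligned for any type. */
typedef union {
    size_t usable_size;
    max_align_t align_;
} HdrBlock;

void* hdr_malloc(size_t size) {
    if(size > SIZE_MAX - sizeof(HdrBlock)) return NULL; /* would overflow */
    HdrBlock* h = malloc(sizeof(HdrBlock) + size);
    if(h == NULL) return NULL;
    h->usable_size = size;
    return h + 1; /* payload starts right after the header */
}

void hdr_free(void* p) {
    if(p != NULL) free((HdrBlock*)p - 1);
}

/* Reads the usable size back from the header in front of the payload. */
size_t hdr_get_block_size(const void* p) {
    return ((const HdrBlock*)p - 1)->usable_size;
}

void* hdr_realloc(void* old, size_t new_size) {
    if(old == NULL) return hdr_malloc(new_size);
    if(new_size == 0) { /* sketch choice: treat as free */
        hdr_free(old);
        return NULL;
    }
    void* fresh = hdr_malloc(new_size);
    if(fresh == NULL) return NULL; /* OOM: the old block stays valid */
    size_t old_size = hdr_get_block_size(old);
    /* Copy only what both blocks can hold, never past the old allocation. */
    memcpy(fresh, old, old_size < new_size ? old_size : new_size);
    hdr_free(old);
    return fresh;
}
```

Callers should assign the result to a temporary first, so that a NULL return on OOM does not overwrite the only pointer to the original block.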

SPI DMA: TX-only path now sets up RX DMA channel draining into a dummy
byte to prevent OVR accumulation on transfers >4 bytes. Pre-scheduler
fallback correctly routes to furi_hal_spi_bus_tx() for TX-only ops.
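
The reason a TX-only DMA transfer still needs an RX drain can be shown with a toy software model. This is not the HAL code: `ToySpi` and the `toy_*` functions are hypothetical, and the 4-byte RX FIFO depth is an assumption taken from the ">4 bytes" note above. In full-duplex SPI every byte clocked out also clocks a byte in, so an unread RX FIFO eventually overruns.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_RX_FIFO_DEPTH 4 /* assumption: 4-byte RX FIFO */

typedef struct {
    size_t rx_level; /* bytes currently waiting in the RX FIFO */
    bool ovr;        /* overrun flag, sticky once set */
} ToySpi;

/* Shifting one byte out also shifts one byte in. */
void toy_spi_shift_byte(ToySpi* spi, uint8_t tx) {
    (void)tx;
    if(spi->rx_level == TOY_RX_FIFO_DEPTH) {
        spi->ovr = true; /* the incoming byte has nowhere to go */
    } else {
        spi->rx_level++;
    }
}

/* Models an RX DMA channel that writes every received byte to one dummy
 * location instead of a real buffer. */
void toy_rx_dma_drain(ToySpi* spi, uint8_t* dummy) {
    if(spi->rx_level > 0) {
        *dummy = 0;
        spi->rx_level--;
    }
}

/* Returns true if the transfer finished without an overrun. */
bool toy_spi_tx_only(ToySpi* spi, const uint8_t* data, size_t len, bool drain_rx) {
    uint8_t dummy = 0;
    for(size_t i = 0; i < len; i++) {
        toy_spi_shift_byte(spi, data[i]);
        if(drain_rx) toy_rx_dma_drain(spi, &dummy);
    }
    return !spi->ovr;
}
```

In this model a 4-byte transfer without draining just fills the FIFO, a 5-byte transfer sets the overrun flag, and any length is safe once the drain is enabled.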

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@WonderMr WonderMr force-pushed the feat/opus-optimised branch from 733b117 to ce8a953 Compare March 17, 2026 05:10
@xMasterX xMasterX marked this pull request as draft March 20, 2026 02:11
@xMasterX
Member

The "Checklist (For Reviewer)" heading clearly says who it's for. Please stop editing it; that's for repo maintainers, not the PR author.
