Cpu optimizations #5206

Draft
bkerler wants to merge 3 commits into prusa3d:master from bkerler:cpu_optimizations

@bkerler bkerler commented Mar 26, 2026

This PR proposes three CPU optimizations:

1. precise_stepping: drop redundant previous_step_time double field

previous_step_time (double) was stored alongside previous_step_time_ticks
(uint64_t), and the two were always set in sync. The double was only ever
read in a comparison against 0. to detect the first step event. On Cortex-M4F, the
FPU handles only single precision, so a double comparison goes through the
software floating-point library (~5-8 cycles). Checking
previous_step_time_ticks == 0 instead uses a native two-register integer compare
(1-2 cycles) and removes 8 bytes from step_generator_state_t.
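
A minimal sketch of the integer-only first-step check described above. StepStateSketch, is_first_step and record_step are hypothetical names for illustration; the real step_generator_state_t has many more fields, and this is not the actual diff.

```cpp
#include <cstdint>

// Hypothetical, reduced stand-in for step_generator_state_t. Only the field
// that matters for first-step detection is shown.
struct StepStateSketch {
    uint64_t previous_step_time_ticks = 0; // 0 is the "no step emitted yet" sentinel
};

// First-step detection using only the tick counter. On a 32-bit core this is
// a test of two 32-bit halves against zero, with no floating-point call.
inline bool is_first_step(const StepStateSketch &state) {
    return state.previous_step_time_ticks == 0;
}

// Records the time of the step that was just emitted.
inline void record_step(StepStateSketch &state, uint64_t ticks) {
    state.previous_step_time_ticks = ticks;
}
```

As with the old == 0. check, a step recorded at exactly tick 0 would be indistinguishable from "no step yet", so the sentinel behaviour is unchanged in this sketch.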

2. precise_stepping: unroll step event index insertion sort for 4 axes

step_generator_state_update_nearest_idx is called on every step event
generated in the move ISR. The previous implementation used
std::lower_bound with a lambda comparator followed by a separate shift
loop that required asm volatile("") to prevent the compiler from emitting
a memmove call for the 3-element shift.

With PS_AXIS_COUNT fixed at 4, the re-insertion into positions [1..3]
is replaced by an explicit cascade of at most 3 comparisons and
in-place shifts. This eliminates the lower_bound iterator overhead, the
lambda dispatch, and the asm volatile workaround, while producing the
same result.
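
The cascade can be sketched roughly as follows. This is an illustration under assumptions, not the code in the commit: kAxisCount stands in for PS_AXIS_COUNT, reinsert_front is a hypothetical name, and times are modelled as uint64_t ticks in a separate array rather than through the real step event structures.

```cpp
#include <array>
#include <cstdint>

constexpr int kAxisCount = 4; // stands in for PS_AXIS_COUNT (assumed fixed at 4)

// idx holds axis indices ordered by ascending times[idx[i]]. Only idx[0] may
// be out of order (its axis just received a new step time). Positions [1..3]
// are assumed to be sorted already.
inline void reinsert_front(std::array<uint8_t, kAxisCount> &idx,
                           const std::array<uint64_t, kAxisCount> &times) {
    const uint8_t moved = idx[0];
    const uint64_t t = times[moved];

    // Same stopping rule as std::lower_bound: stop at the first entry whose
    // time is not less than t, so the moved axis lands before equal times.
    if (!(times[idx[1]] < t)) {
        return; // already in the right place
    }
    idx[0] = idx[1];
    if (!(times[idx[2]] < t)) {
        idx[1] = moved;
        return;
    }
    idx[1] = idx[2];
    if (!(times[idx[3]] < t)) {
        idx[2] = moved;
        return;
    }
    idx[2] = idx[3];
    idx[3] = moved;
}
```

Each branch shifts a single element with a plain assignment, so there is no loop for the compiler to turn into a memmove call.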

3. precise_stepping: maintain current_flags incrementally per axis

update_step_generator_state_current_flags() rebuilt current_flags by
iterating all 4 axes through pointer indirection on every step event.
Only the axis whose generator was just called can have changed its
step_flags, so a full recompute is wasteful.

generate_next_step_event now updates current_flags with a single
mask-and-OR immediately after calling the per-axis step generator:

    axis_mask = (X_DIR | X_ACTIVE) << axis
    current_flags = (current_flags & ~axis_mask) | step_flags[axis]

The update_step_generator_state_current_flags function is removed.
The invariant (current_flags equals the OR of all generators'
step_flags) still holds after every call, because only one axis changes
per call and that change is applied inline.
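
A small sketch of the incremental update next to a full recompute, to make the invariant concrete. The bit positions of kXDir and kXActive, the StepFlags type, and both function names are assumptions for illustration only; the real flag layout lives in the firmware headers.

```cpp
#include <cstdint>

using StepFlags = uint16_t;

// Assumed layout (hypothetical): one direction bit and one active bit per
// axis, with the X bits lowest so that shifting by the axis index selects
// the matching bits of the other axes.
constexpr StepFlags kXDir = 1u << 4;
constexpr StepFlags kXActive = 1u << 8;

// Incremental update: clear this axis' bits, then OR in its new flags.
// Assumes axis_step_flags only ever contains bits inside axis_mask.
constexpr StepFlags update_axis_flags(StepFlags current_flags,
                                      StepFlags axis_step_flags,
                                      unsigned axis) {
    const StepFlags axis_mask = static_cast<StepFlags>((kXDir | kXActive) << axis);
    return static_cast<StepFlags>((current_flags & ~axis_mask) | axis_step_flags);
}

// Reference implementation: rebuild the flags from all four axes, which is
// what the removed function did on every step event.
constexpr StepFlags recompute_flags(const StepFlags (&per_axis)[4]) {
    StepFlags all = 0;
    for (int axis = 0; axis < 4; ++axis) {
        all = static_cast<StepFlags>(all | per_axis[axis]);
    }
    return all;
}
```

If a generator could ever set bits outside its own axis_mask, the incremental form would leave stale bits behind, so the two functions agree only under that assumption.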

bkerler added 3 commits March 26, 2026 08:59
@bkerler bkerler marked this pull request as draft March 26, 2026 08:05