This PR proposes three CPU optimizations:
1. precise_stepping: drop redundant previous_step_time double field
previous_step_time (double) was stored alongside previous_step_time_ticks
(uint64_t), and the two were always set in sync. The double was only ever
read in a == 0.0 comparison to detect the first step event. On the
Cortex-M4F, the FPU handles only single precision, so the double
comparison goes through the software floating-point library (~5-8
cycles). Switching to previous_step_time_ticks == 0 uses a native
two-register integer compare (1-2 cycles) and removes 8 bytes from
step_generator_state_t.
2. precise_stepping: unroll step event index insertion sort for 4 axes
step_generator_state_update_nearest_idx is called on every step event
generated in the move ISR. The previous implementation used
std::lower_bound with a lambda comparator, followed by a separate shift
loop that required an asm volatile("") barrier to keep the compiler from
emitting a memmove call for the 3-element shift.
With PS_AXIS_COUNT fixed at 4, the re-insertion into positions [1..3]
is replaced by an explicit cascade of at most 3 comparisons and
in-place shifts. This eliminates the lower_bound iterator overhead, the
lambda dispatch, and the asm volatile workaround, while producing the
same result.
3. precise_stepping: maintain current_flags incrementally per axis
update_step_generator_state_current_flags() rebuilt current_flags by
iterating all 4 axes through pointer indirection on every step event.
Only the axis whose generator was just called can have changed its
step_flags, so a full recompute is wasteful.
generate_next_step_event now updates current_flags with a single
mask-and-OR immediately after calling the per-axis step generator:
axis_mask = (X_DIR | X_ACTIVE) << axis
current_flags = (current_flags & ~axis_mask) | step_flags[axis]
The update_step_generator_state_current_flags function is removed.
The invariant that current_flags equals the OR of all generators'
step_flags is maintained throughout, because only one axis changes
per call and that change is applied inline.