
Conversation

@hakehuang commented Sep 1, 2025

This is just to test #85248 with the latest code base.

This was just a pedantic setting.  I mean, of course it makes no sense
to have thread FPU state management features built when you aren't
including the scheduler in the build.

...unless you want to unit-test the context switch code without
tripping over itself on the way into the test code.  In fact lots of
unit testing of low level primitives can be done with
MULTITHREADING=n.

Remove the dependency.  It isn't actually doing anything useful.

Signed-off-by: Andy Ross <[email protected]>
This test seemed confused.  Per the description, it's trying to create
a priority boost situation where a low priority thread (the main ztest
thread running test_mutex_timeout_race_during_priority_inversion())
holds a mutex and is boosted by a high priority thread (running
tThread_mutex_lock_should_fail()) trying to lock the mutex.

But that's not what the code was doing.  It was creating the thread at
a LOWER priority, so no priority inheritance was happening.  The test
would pass simply because the thread ordering via the natural
scheduler caused the spawned thread to run first.  Fix the priorities
to be explicit.  And add some comments explaining what's happening.

Also clean up the timeout whiteboxing, which was apparently an attempt
to evade userspace protections on the stack.  Just use a shared pointer
in ZTEST_DMEM.  And remove a voodoo "align to tick boundary" sleep;
that's needless here, as the timeout is explicitly constructed to be
simultaneous: we don't care when we go to sleep, only that we wake up
out of the same ISR.
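
For illustration, a minimal sketch of the corrected shape; all names,
sizes and priorities here are illustrative, not the test's own (lower
numeric priority is higher priority, and k_mutex does priority
inheritance):

    #include <errno.h>
    #include <zephyr/kernel.h>
    #include <zephyr/ztest.h>

    /* Shared pointer in ZTEST_DMEM, as described above, instead of
     * whiteboxing the other thread's stack. */
    ZTEST_DMEM struct k_mutex *shared_mutex;

    static struct k_mutex mutex;
    static struct k_thread hi_thread;
    K_THREAD_STACK_DEFINE(hi_stack, 1024);

    static void hi_prio_locker(void *a, void *b, void *c)
    {
            /* Blocks on the held mutex, boosting the low-priority
             * holder via priority inheritance; timeout is expected. */
            zassert_equal(k_mutex_lock(shared_mutex, K_MSEC(10)),
                          -EAGAIN, "lock should time out");
    }

    ZTEST(mutex_timeout, test_priority_inversion_sketch)
    {
            k_mutex_init(&mutex);
            shared_mutex = &mutex;

            /* Main ztest thread is explicitly LOW priority... */
            k_thread_priority_set(k_current_get(), K_PRIO_PREEMPT(10));
            k_mutex_lock(&mutex, K_FOREVER);

            /* ...and the spawned locker is explicitly HIGHER. */
            k_thread_create(&hi_thread, hi_stack,
                            K_THREAD_STACK_SIZEOF(hi_stack),
                            hi_prio_locker, NULL, NULL, NULL,
                            K_PRIO_PREEMPT(5), 0, K_NO_WAIT);

            k_sleep(K_MSEC(20));    /* let the timeout fire */
            k_mutex_unlock(&mutex);
    }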

Signed-off-by: Andy Ross <[email protected]>
This is a low-level/whitebox test of the older context layer and can't
pass with the new code (which has its own unit test).

Signed-off-by: Andy Ross <[email protected]>
1. Mostly complete.  Supports MPU, userspace, PSPLIM-based stack
guards, and FPU/DSP features.  ARMv8-M secure mode "should" work but I
don't know how to test it.

2. Designed with an eye to uncompromising/best-in-industry cooperative
context switch performance.  No PendSV exception nor hardware
stacking/unstacking, just a traditional "musical chairs" switch.
Context gets saved on process stacks only instead of split between
there and the thread struct.  No branches in the core integer switch
code (and just one in the FPU bits that can't be avoided).

3. Minimal assembly use; arch_switch() itself is ALWAYS_INLINE, there
is an assembly stub for exception exit, and that's it beyond one- or
two-instruction inlines elsewhere.  (A sketch of the arch_switch()
contract follows this list.)

4. Selectable at build time, interoperable with existing code.  Just
use the pre-existing CONFIG_USE_SWITCH=y flag to enable it.  Or turn
it off to evade regressions as this stabilizes.

5. Exception/interrupt returns in the common case need only a single C
function to be called at the tail, and then return naturally.
Effectively "all interrupts are direct now".  This isn't a benefit
currently because the existing stubs haven't been removed (see #4),
but in the long term we can look at exploiting this.  The boilerplate
previously required is now (mostly) empty.

6. No support for ARMv6 (Cortex M0 et al.) thumb code.  The expanded
instruction encodings in ARMv7 are a big (big) win, so the older cores
really need a separate port to avoid impacting newer hardware.
Thankfully there isn't that much code to port (see #3), so this should
be doable.
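
For orientation, this is the pre-existing switch API mentioned in
point 4; the body comment paraphrases points 2 and 3.  A descriptive
stub only, not the port's actual inline assembly:

    /* Zephyr's arch_switch() contract: publish the outgoing thread's
     * switch handle (here: its saved stack pointer) through
     * switched_from, then resume the thread named by switch_to. */
    static ALWAYS_INLINE void arch_switch(void *switch_to,
                                          void **switched_from)
    {
            /* Musical-chairs switch: spill callee-saved context onto
             * the outgoing process stack (no PendSV, no hardware
             * stacking), store the resulting SP as *switched_from,
             * then reload context from the incoming stack given by
             * switch_to. */
    }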

Signed-off-by: Andy Ross <[email protected]>
This is a standalone unit test of the new context code, which builds
in isolation as a MULTITHREADING=n binary.  No integration with the
core arch layer is required.

Signed-off-by: Andy Ross <[email protected]>
Integrate the new context layer, allowing it to be selected via the
pre-existing CONFIG_USE_SWITCH.  Not a lot of changes, but notable
ones:

+ There was code in the MPU layer to adjust PSP on exception exit at a
  stack overflow so that it remained inside the defined stack bounds.
  With the new context layer though, exception exit will rewrite the
  stack frame in a larger format, and needs PSP to be adjusted to make
  room.

+ There was no such treatment in the PSPLIM case (the hardware prevents
  the SP from going that low), so I had to add similar code to
  validate PSP at exit from fault handling.

+ The various return paths for fault/svc assembly handlers need to
  call out to the switch code to do the needed scheduler work.  Really
  almost all of these can be replaced with C now, only userspace
  syscall entry (which has to "return" into the privileged stack)
  needs special treatment.

+ There is a gcc bug that prevents the arch_switch() inline assembly
  from building when frame pointers are enabled (which they almost
  never are on ARM): it disallows you from touching r7 (the thumb
  frame pointer) entirely.  But it's a context switch, we need to!
  Worked around by enforcing -fomit-frame-pointer even in the two
  scheduler files that can swap when NO_OPTIMIZATIONS=y.
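
A minimal illustration of the GCC limitation in the last bullet
(nothing here is Zephyr code):

    /* Builds fine with -fomit-frame-pointer; with frame pointers
     * enabled on thumb, GCC rejects this because r7 is reserved as
     * the frame pointer and may not appear in a clobber list. */
    static inline void clobbers_r7(void)
    {
            __asm__ volatile("" ::: "r7");
    }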

Signed-off-by: Andy Ross <[email protected]>
ARM Cortex M has what amounts to a design bug.  The architecture
inherits several unpipelined/microcoded "ICI/IT" instruction forms
that take many cycles to complete (LDM/STM and the Thumb "IT"
conditional frame are the big ones).  But out of a desire to minimize
interrupt latency, the CPU is allowed to halt and resume these
instructions mid-flight while they are partially completed.  The
relevant bits of state are stored in the EPSR fields of the xPSR
register (see ARMv7-M manual B1.4.2).  But (and this is the design
bug) those bits CANNOT BE WRITTEN BY SOFTWARE.  They can only be
modified by exception return.

This means that if a Zephyr thread takes an interrupt
mid-ICI/IT-instruction, then switches to another thread on exit, and
then that thread is resumed by a cooperative switch and not an
interrupt, the instruction will lose the state and restart from
scratch.  For LDM/STM that's generally idempotent for memory (but not
MMIO!), but for IT that means that the restart will re-execute
arbitrary instructions that may not be idempotent (e.g. "addeq r0,
r0, #1").

The fix is to check for this condition (which is very rare) on
interrupt exit when we are switching, and if we discover we've
interrupted such an instruction we swap the return address with a
trampoline that uses a UDF instruction to immediately trap to the
undefined instruction handler, which then recognizes the fixup address
as special and immediately returns back into the thread with the
correct EPSR value and resume PC (which have been stashed in the
thread struct).  The overhead for the normal case is just a few cycles
for the test.
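
A hedged sketch of that exit-path test; the bit positions follow the
ARMv7-M manual (B1.4.2), but the names are illustrative, not the
port's:

    #include <stdbool.h>
    #include <stdint.h>

    /* ICI/IT state lives in EPSR at xPSR[26:25] and xPSR[15:10]; any
     * nonzero value means we interrupted a resumable LDM/STM or IT
     * block and need the UDF-trampoline fixup described above. */
    #define XPSR_ICI_IT_MASK ((0x3u << 25) | (0x3fu << 10))

    static inline bool interrupted_ici_it(uint32_t stacked_xpsr)
    {
            return (stacked_xpsr & XPSR_ICI_IT_MASK) != 0u;
    }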

Signed-off-by: Andy Ross <[email protected]>
Pick some low hanging fruit on non-SMP code paths:

+ The scheduler spinlock is always taken, but as we're already in an
  irqlocked state that's a noop.  But the optimizer can't tell, because
  arch_irq_lock() involves an asm block it can't see inside.  Elide
  the call when possible (see the sketch after this list).

+ The z_swap_next_thread() function evaluates to just a single load of
  _kernel.ready_q.cache when !SMP, but wasn't being inlined because of
  function location.  Move that test up into do_swap() so it's always
  inlined.
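
A hedged sketch of the first item's elision, using approximate kernel
names; only the shape is the point:

    #include <zephyr/kernel.h>

    static ALWAYS_INLINE k_spinlock_key_t sched_lock(struct k_spinlock *l)
    {
            if (!IS_ENABLED(CONFIG_SMP)) {
                    /* Caller is already irqlocked; make the noop
                     * visible to the optimizer instead of hiding it
                     * behind the arch_irq_lock() asm block. */
                    return (k_spinlock_key_t){ .key = 0 };
            }
            return k_spin_lock(l);
    }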

Signed-off-by: Andy Ross <[email protected]>
Benchmarking context switch in the absence of other polluting factors
is actually really hard.  I wanted a microbenchmark of just z_swap()
for tuning the ARM arch_switch() code, and nothing we have is very
useful.  This one works well, measuring ONLY the time taken for a
thread to enter z_swap(), switch to another thread already suspended
in z_swap(), and return to a timestamp on the other side.  It runs a
few thousand iterations and reports the average and standard deviation.
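
The real test whiteboxes z_swap() directly; this hedged approximation
shows only the measurement shape, bouncing between two same-priority
cooperative threads with k_yield() (two swaps per round trip; sqrt()
assumes the math library is linked):

    #include <math.h>
    #include <zephyr/kernel.h>

    #define ITERS 4096
    static uint32_t samples[ITERS];

    static void partner_fn(void *a, void *b, void *c)
    {
            for (;;) {
                    k_yield();      /* bounce straight back */
            }
    }

    K_THREAD_DEFINE(partner, 1024, partner_fn, NULL, NULL, NULL,
                    K_PRIO_COOP(5), 0, 0);

    void bench_swap(void)   /* call from a K_PRIO_COOP(5) thread */
    {
            uint64_t sum = 0, sumsq = 0;

            for (int i = 0; i < ITERS; i++) {
                    uint32_t t0 = k_cycle_get_32();

                    k_yield();      /* out and back: two switches */
                    samples[i] = (k_cycle_get_32() - t0) / 2;
                    sum += samples[i];
                    sumsq += (uint64_t)samples[i] * samples[i];
            }

            uint32_t avg = (uint32_t)(sum / ITERS);
            double var = (double)(sumsq / ITERS) - (double)avg * avg;

            printk("swap: avg %u cycles, stddev ~%d\n",
                   avg, (int)sqrt(var));
    }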

Signed-off-by: Andy Ross <[email protected]>
GCC/gas has a code generation bugglet on thumb.  The R7 register is
the ABI-defined frame pointer, though it's usually unused in Zephyr
due to -fomit-frame-pointer (and the fact that DWARF on ARM doesn't
really need it).  But when it IS enabled, which sometimes seems to
happen due to toolchain internals, GCC is unable to allow its use in
the clobber list of an asm() block (I guess it can't generate
spill/fill code without using the frame?).

There is existing protection for this problem that sets
-fomit-frame-pointer unconditionally on the two files (sched.c and
init.c) that require it.  But even with that, gcc sometimes gets
kicked back into "framed mode" due to internal state.  Provide a
kconfig workaround that does an explicit spill/fill on the one
test/platform where we have trouble.

(I checked, btw: an ARM clang build appears not to have this
misfeature.)
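
A hedged sketch of what the kconfig-gated workaround amounts to: spill
and fill r7 inside the asm body so it never appears in a clobber list
(illustrative, not the actual switch code):

    static inline void uses_r7_safely(void)
    {
            __asm__ volatile(
                    "push {r7}\n\t"
                    /* ...body that is now free to use r7... */
                    "pop {r7}\n\t"
                    ::: "memory");
    }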

Signed-off-by: Andy Ross <[email protected]>
Running this with USE_SWITCH=y takes more space as expected, but it
oddly seems variable.  Even at 2kb, I still see fairly regular
failures.  It wants 3kb to be stable.  Maybe the non-determinism of
the two-qemu setup is involved?  Requests batch up and are handled
recursively?  Too much code in the third party stack to make it easy
to tell.

Signed-off-by: Andy Ross <[email protected]>
This gets turned on at the platform layer by a handful of boards,
despite not being used for anything in the test.  It has performance
impact, so it really shouldn't be here.

Note that ARM_MPU also needs an =n, as that will select MPU, seemingly
incorrectly.

Signed-off-by: Andy Ross <[email protected]>
This microbenchmark has been super-useful.  Add similar cases
targeting interrupt latency (a sketch of the first case follows the
list):

+ "IRQ" tests the cycles taken from the instant of an interrupt
  trigger (as implemented, it's an arch_irq_lock() being released with
  a timer interrupt pending) through a noop ISR (a k_timer callback),
  and back to the interrupting thread.

+ "IRQ_P" is a similar preemption test: it measures the time taken
   from the same trigger, into a higher-priority thread that
   timestamps the instant of its wakeup (in this case as a return from
   k_sem_take()).
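
A hedged sketch of the "IRQ" case's shape; the timer, the busy-wait
length and all names are illustrative, only the lock/pend/unlock/
timestamp pattern is the point:

    #include <zephyr/kernel.h>

    static void noop_cb(struct k_timer *t)
    {
            /* empty ISR-context callback */
    }

    K_TIMER_DEFINE(bench_timer, noop_cb, NULL);

    uint32_t bench_irq_once(void)
    {
            unsigned int key = irq_lock();
            uint32_t t0, t1;

            k_timer_start(&bench_timer, K_TICKS(1), K_NO_WAIT);
            k_busy_wait(1000);      /* long enough for the tick to pend */

            t0 = k_cycle_get_32();
            irq_unlock(key);        /* interrupt delivered right here */
            t1 = k_cycle_get_32();  /* after ISR + return to this thread */

            return t1 - t0;
    }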

Signed-off-by: Andy Ross <[email protected]>
This is a mistake.  IRQ_OFFLOAD_NESTED is not intended to be an app
tunable.  It reflects whether or not the platform-layer irq_offload()
implementation supports use from within an ISR (i.e. to invoke a
nested interrupt for testing purposes).  Not all architectures can do
that (and I know of at least one Xtensa SoC that didn't give its
software interrupt the right priority and can't either).  Basically
this should have been an "ARCH_HAS_" kind of kconfig, it's just named
funny.

And in any case the thread_metric tests don't do nested interrupts,
which as I understand it aren't even possible in the FreeRTOS
environment it was written on.

Signed-off-by: Andy Ross <[email protected]>
z_get_next_switch_handle() is a clean API, but implementing it as a
(comparatively large) callable function requires significant
entry/exit boilerplate and hides the very common "no switch needed"
early exit condition from the enclosing C code that calls it.  (Most
architectures call this from assembly though and don't notice).

Provide an unwrapped version for the specific needs of non-SMP builds.
It's compatible in all other ways.

Slightly ugly, but the gains are significant (like a dozen cycles or
so).
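
A hedged sketch of the unwrapped shape for !SMP; _kernel.ready_q.cache
is the kernel's (kernel-private headers assumed), the function name
and elided bookkeeping are illustrative:

    static ALWAYS_INLINE struct k_thread *
    next_thread_or_null(struct k_thread *cur)
    {
            struct k_thread *next = _kernel.ready_q.cache;

            if (next == cur) {
                    /* The very common early exit, now visible to the
                     * enclosing C instead of buried behind a call. */
                    return NULL;
            }
            /* ...switch-handle bookkeeping as in the wrapped form... */
            return next;
    }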

Signed-off-by: Andy Ross <[email protected]>
Micro-optimization: We don't need a full arch_irq_lock(), which is a
~6-instruction sequence on Cortex M.  The lock will be dropped
unconditionally on interrupt exit, so take it unconditionally.
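
A hedged sketch of the micro-optimization ("cpsid i" shown for
simplicity; the real Cortex-M lock sequence masks via BASEPRI, and the
key save/restore is what gets dropped):

    static ALWAYS_INLINE void irq_mask_unconditional(void)
    {
            /* One instruction instead of the save-key-and-mask
             * sequence; the interrupt exit path drops the lock
             * unconditionally, so no key is needed. */
            __asm__ volatile("cpsid i" ::: "memory");
    }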

Signed-off-by: Andy Ross <[email protected]>
The new switch code no longer needs PendSV, but still calls the SVC
vector.  Split them into separate files for hygiene and a few
microseconds of build time.

Signed-off-by: Andy Ross <[email protected]>
When USE_SWITCH=y, the thread struct is now mostly degenerate.  Only
the two words for ICI/IT state tracking are required.  Eliminate all
the extra fields when not needed and save a bunch of SRAM.

Note that a handful of spots in coredump/debug that need a location
for the new stack pointer (now stored as the switch handle) are also
updated.
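
A hedged sketch of the resulting shape (field names illustrative; the
commit only says two words of ICI/IT state remain):

    struct _thread_arch {
    #ifdef CONFIG_USE_SWITCH
            /* Only ICI/IT fixup state survives; the saved SP now
             * lives in the generic switch handle, not here. */
            uint32_t ici_pc;        /* resume PC of interrupted insn */
            uint32_t ici_epsr;      /* stashed EPSR bits to restore */
    #else
            /* ...legacy fields (basepri, mode, swap return, ...)... */
    #endif
    };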

Signed-off-by: Andy Ross <[email protected]>
z_reschedule() is the basic kernel entry point for context switch,
wrapping z_swap(), and thence arch_switch().  It's currently defined
as a first class function for entry from other files in the kernel and
elsewhere (e.g. IPC library code).

But in practice it's actually a very thin wrapper without a lot of
logic of its own, and the context switch layers of some of the more
obnoxiously clever architectures are designed to interoperate with the
compiler's own spill/fill logic to avoid double saving.  And with a
small z_reschedule() there's not a lot to work with.

Make z_reschedule() an inlinable static, so the compiler has more
options.
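
A hedged sketch of the new shape; z_swap() and the signature are the
kernel's, the early-out predicate is illustrative:

    static inline void z_reschedule(struct k_spinlock *lock,
                                    k_spinlock_key_t key)
    {
            if (!resched_needed()) {        /* illustrative predicate */
                    k_spin_unlock(lock, key);
                    return;
            }
            /* Inlined at the call site, so the compiler can
             * interleave its own spill/fill with the switch entry. */
            (void)z_swap(lock, key);
    }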

Signed-off-by: Andy Ross <[email protected]>
This function was a little clumsy, taking the scheduler lock,
releasing it, and then calling z_reschedule_unlocked() instead of the
normal locked variant of reschedule.  Don't take the lock twice.

Mostly this is a code size and hygiene win.  Obviously the sched lock
is not normally a performance path, but I happened to have picked this
API for my own microbenchmark in tests/benchmarks/swap and so noticed
the double-lock while staring at disassembly.
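
A hedged before/after of the shape (the commit doesn't name the
function; everything here is illustrative):

    static void fixed_path(void)
    {
            static struct k_spinlock lock;
            k_spinlock_key_t key = k_spin_lock(&lock);

            /* ...work under the lock... */

            /* Previously: k_spin_unlock(&lock, key); followed by
             * z_reschedule_unlocked(), which locks again internally.
             * Now the held lock is handed straight over: */
            z_reschedule(&lock, key);
    }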

Signed-off-by: Andy Ross <[email protected]>
Some nitpicky hand-optimizations, no logic changes:

+ Shrink the assembly entry to put more of the logic into
  compiler-optimizable C.

+ Split arm_m_must_switch() into two functions so that the first
  doesn't look so big to the compiler.  That allows it to spill (many)
  fewer registers on entry and speeds the (very) common early-exit case
  where an interrupt returns without a context switch.

Signed-off-by: Andy Ross <[email protected]>
Late-arriving clang-format-demanded changes that are too hard to split
and squash into the original patches.  No behavior changes.

Signed-off-by: Andy Ross <[email protected]>
The z_arm_pendsv vector doesn't exist on USE_SWITCH=y builds.

Signed-off-by: Andy Ross <[email protected]>
Some toolchains don't support an __asm__(...) block at the top level
of a file and require that they live within function scope.  That's
not a hardship as these two blocks were defining callable functions
anyway.  Exploit the "naked" attribute to avoid wasted bytes in unused
entry/exit code.
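
A hedged sketch of the pattern (name and body illustrative): the
former file-scope __asm__() moves into a naked function, so no
compiler prologue/epilogue bytes wrap the hand-written entry/exit:

    __attribute__((naked)) static void exc_exit_stub(void)
    {
            __asm__ volatile(
                    /* ...hand-written sequence, exactly as it was in
                     * the old top-level asm block... */
                    "bx lr\n\t");
    }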

Signed-off-by: Andy Ross <[email protected]>
I'm at a loss here.  The feature this test case wants to see (on
ARMv7M, not ARMv6M) is the ability to take an irq_lock() inside a
system call and then see that future system calls from the same thread
continue to hold the lock.

That's not documented AFAICT.  It's also just a terrible idea because
either:

1. The obvious denial of service implications if user code is allowed
   to run in an unpreemptible mode, or:

2. The broken locking promise if this is implemented to release the
   lock and reacquire it in an attempt to avoid #1.

(FWIW: my read of the code is that #1 is the current implementation.
But hilariously the test isn't able to tell the difference!)

And in any case it's not how any of our other platforms work (or can
work, in some cases), making this a non-portable system call
API/feature at best.

Leave it in place for now out of conservatism, but disable with the
new arch_switch() code, whose behavior matches that of other Zephyr
userspaces.

Signed-off-by: Andy Ross <[email protected]>
The exit from the SVC exception used for syscalls back into the
calling thread is done without locking.  This means that the
intermediate states can be interrupted while the kernel-mode code is
still managing thread state like the mode bit, leading to mismatches.
This seems mostly robust when used with PendSV (though I'm a little
dubious), but the new arch_switch() code needs to be able to suspend
such an interrupted thread and restore it without going through a full
interrupt entry/exit again, so it needs locking for sure.

Take the lock unconditionally before exiting the call, and release it
in the thread once the magic is finished, just before calling the
handler.  Then take it again before swapping stacks and dropping
privilege.

Even then there is a one-cycle race where the interrupted thread has
dropped the lock but still has privilege (the nPRIV bit is clear in
CONTROL).  This thread will be resumed later WITHOUT privilege, which
means that trying to set CONTROL will fail.  So there's detection of
this 1-instruction race that will skip over it.
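
A hedged sketch of that race detection (the bit position is the
architectural CONTROL.nPRIV; the surrounding resume logic is elided):

    #include <stdbool.h>
    #include <stdint.h>

    static inline bool still_privileged(void)
    {
            uint32_t control;

            __asm__ volatile("mrs %0, CONTROL" : "=r"(control));

            /* CONTROL bit 0 is nPRIV: clear means the thread was
             * caught in the one-instruction window after dropping
             * the lock but before dropping privilege, so the later
             * CONTROL write must be skipped. */
            return (control & 1u) == 0u;
    }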

Signed-off-by: Andy Ross <[email protected]>
The ARM Ltd. FVP emulator (at least the variants run in Zephyr CI)
appears to have a bug with the stack alignment bit in xPSR.  It's
common (it fails in the first 4-6 timer interrupts in
tests.syscalls.timeslicing) that we'll take an interrupt from a
seemingly aligned (!) stack with the bit set.  If we then switch and
resume the thread from a different context later, popping the stack
goes wrong (more so than just a misalignment of four bytes: I usually
see it too low by 20 bytes) in a way that it doesn't if we return
synchronously.  Presumably legacy PendSV didn't see this because it
used the unmodified exception frame.

Work around this by simply assuming all interrupted stacks were
aligned and clearing the bit.  That is NOT correct in the general
case, but in practice it's enough to get tests to pass.
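
A hedged sketch of the workaround (xPSR bit 9 is the architectural
"frame was realigned" flag; the frame plumbing is elided):

    #include <stdint.h>

    static inline uint32_t fvp_xpsr_fixup(uint32_t stacked_xpsr)
    {
            /* Assume the interrupted stack was aligned: clear the
             * realignment flag so a later asynchronous resume pops
             * the frame at the right offset.  NOT correct in the
             * general case, but enough to get tests passing. */
            return stacked_xpsr & ~(1u << 9);
    }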

Signed-off-by: Andy Ross <[email protected]>
@hakehuang marked this pull request as draft September 1, 2025 07:15
@zephyrbot requested review from andyross and bbolen September 1, 2025 07:16
@JarmouniA changed the title from "Arch cortex m test" to "[DNR] Arch cortex m test" Sep 1, 2025
@hakehuang closed this Sep 2, 2025