[DNR] Arch cortex m test #95252
Closed: hakehuang wants to merge 27 commits into zephyrproject-rtos:main from nxp-upstream:arch_cortex_m_test
Conversation
This was just a pedantic setting. I mean, of course it makes no sense to have thread FPU management state features built when you aren't including the scheduler in the build. ...unless you want to unit-test the context switch code without tripping over itself on the way into the test code. In fact lots of unit testing of low level primitives can be done with MULTITHREADING=n. Remove the dependency. It isn't actually doing anything useful. Signed-off-by: Andy Ross <[email protected]>
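A minimal sketch of the kind of unit-test configuration this unblocks; CONFIG_MULTITHREADING is the real Zephyr symbol, the surrounding comment describes intent only:

```
# prj.conf for a low-level unit test: build without the scheduler
# while still compiling the context/FPU management code under test.
CONFIG_MULTITHREADING=n
```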
This test seemed confused. Per the description, it's trying to create a priority boost situation where a low priority thread (the main ztest thread running test_mutex_timeout_race_during_priority_inversion()) holds a mutex and is boosted by a high priority thread (running tThread_mutex_lock_should_fail()) trying to lock the mutex.

But that's not what the code was doing. It was creating the thread at a LOWER priority, so no priority inheritance was happening. The test would pass simply because the thread ordering via the natural scheduler caused the spawned thread to run first. Fix the priorities to be explicit, and add some comments explaining what's happening.

Also clean up the timeout whiteboxing, which was apparently an attempt to evade userspace protections on the stack. Just use a shared pointer to ZTEST_DMEM. And remove a voodoo "align to tick boundary" sleep; that's needless here as the timeout is explicitly constructed to be simultaneous. We don't care when we go to sleep, only that we wake up out of the same ISR.

Signed-off-by: Andy Ross <[email protected]>
This is a low-level/whitebox test of the older context layer and can't pass with the new code (which has its own unit test). Signed-off-by: Andy Ross <[email protected]>
1. Mostly complete. Supports MPU, userspace, PSPLIM-based stack guards, and FPU/DSP features. ARMv8-M secure mode "should" work but I don't know how to test it.
2. Designed with an eye to uncompromising/best-in-industry cooperative context switch performance. No PendSV exception nor hardware stacking/unstacking, just a traditional "musical chairs" switch. Context gets saved on process stacks only instead of split between there and the thread struct. No branches in the core integer switch code (and just one in the FPU bits that can't be avoided).
3. Minimal assembly use; arch_switch() itself is ALWAYS_INLINE, there is an assembly stub for exception exit, and that's it beyond one/two instruction inlines elsewhere.
4. Selectable at build time, interoperable with existing code. Just use the pre-existing CONFIG_USE_SWITCH=y flag to enable it. Or turn it off to evade regressions as this stabilizes.
5. Exception/interrupt returns in the common case need only a single C function to be called at the tail, and then return naturally. Effectively "all interrupts are direct now". This isn't a benefit currently because the existing stubs haven't been removed (see #4), but in the long term we can look at exploiting this. The boilerplate previously required is now (mostly) empty.
6. No support for ARMv6 (Cortex M0 et al.) thumb code. The expanded instruction encodings in ARMv7 are a big (big) win, so the older cores really need a separate port to avoid impacting newer hardware. Thankfully there isn't that much code to port (see #3), so this should be doable.

Signed-off-by: Andy Ross <[email protected]>
This is a standalone unit test of the new context code, which builds in isolation as a MULTITHREADING=n binary. No integration with the core arch layer is required. Signed-off-by: Andy Ross <[email protected]>
Integrate the new context layer, allowing it to be selected via the pre-existing CONFIG_USE_SWITCH. Not a lot of changes, but notable ones:

+ There was code in the MPU layer to adjust PSP on exception exit at a stack overflow so that it remained inside the defined stack bounds. With the new context layer though, exception exit will rewrite the stack frame in a larger format, and needs PSP to be adjusted to make room.
+ There was no such treatment in the PSPLIM case (the hardware prevents the SP from going that low), so I had to add similar code to validate PSP at exit from fault handling.
+ The various return paths for fault/svc assembly handlers need to call out to the switch code to do the needed scheduler work. Really almost all of these can be replaced with C now; only userspace syscall entry (which has to "return" into the privileged stack) needs special treatment.
+ There is a gcc bug that prevents the arch_switch() inline assembly from building when frame pointers are enabled (which they almost never are on ARM): it disallows you from touching r7 (the thumb frame pointer) entirely. But it's a context switch, we need to! Worked around by enforcing -fomit-frame-pointer even in the two scheduler files that can swap when NO_OPTIMIZATIONS=y.

Signed-off-by: Andy Ross <[email protected]>
ARM Cortex M has what amounts to a design bug. The architecture inherits several unpipelined/microcoded "ICI/IT" instruction forms that take many cycles to complete (LDM/STM and the Thumb "IT" conditional frame are the big ones). But out of a desire to minimize interrupt latency, the CPU is allowed to halt and resume these instructions mid-flight while they are partially completed. The relevant bits of state are stored in the EPSR fields of the xPSR register (see ARMv7-M manual B1.4.2). But (and this is the design bug) those bits CANNOT BE WRITTEN BY SOFTWARE. They can only be modified by exception return.

This means that if a Zephyr thread takes an interrupt mid-ICI/IT-instruction, then switches to another thread on exit, and then that thread is resumed by a cooperative switch and not an interrupt, the instruction will lose the state and restart from scratch. For LDM/STM that's generally idempotent for memory (but not MMIO!), but for IT that means that the restart will re-execute arbitrary instructions that may not be idempotent (e.g. "addeq r0, r0, ...").

The fix is to check for this condition (which is very rare) on interrupt exit when we are switching, and if we discover we've interrupted such an instruction we swap the return address with a trampoline that uses a UDF instruction to immediately trap to the undefined instruction handler, which then recognizes the fixup address as special and immediately returns back into the thread with the correct EPSR value and resume PC (which have been stashed in the thread struct). The overhead for the normal case is just a few cycles for the test.

Signed-off-by: Andy Ross <[email protected]>
Pick some low hanging fruit on non-SMP code paths:

+ The scheduler spinlock is always taken, but as we're already in an irqlocked state that's a noop. But the optimizer can't tell, because arch_irq_lock() involves an asm block it can't see inside. Elide the call when possible.
+ The z_swap_next_thread() function evaluates to just a single load of _kernel.ready_q.cache when !SMP, but wasn't being inlined because of function location. Move that test up into do_swap() so it's always done correctly.

Signed-off-by: Andy Ross <[email protected]>

Benchmarking context switch in the absence of other polluting factors is actually really hard. I wanted a microbenchmark of just z_swap() for tuning the ARM arch_switch() code, and nothing we have is very useful. This one works well, measuring ONLY the time taken for a thread to enter z_swap(), switch to another thread already suspended in z_swap(), and return to a timestamp on the other side. It runs a few thousand iterations and reports the average and standard deviation. Signed-off-by: Andy Ross <[email protected]>
GCC/gas has a code generation bugglet on thumb. The R7 register is the ABI-defined frame pointer, though it's usually unused in zephyr due to -fomit-frame-pointer (and the fact that DWARF on ARM doesn't really need it). But when it IS enabled, which sometimes seems to happen due to toolchain internals, GCC is unable to allow its use in the clobber list of an asm() block (I guess it can't generate spill/fill code without using the frame?). There is existing protection for this problem that sets -fomit-frame-pointer unconditionally on the two files (sched.c and init.c) that require it. But even with that, gcc sometimes gets kicked back into "framed mode" due to internal state. Provide a kconfig workaround that does an explicit spill/fill on the one test/platform where we have trouble. (I checked, btw: an ARM clang build appears not to have this misfeature) Signed-off-by: Andy Ross <[email protected]>
Running this with USE_SWITCH=y takes more space as expected, but it oddly seems variable. Even at 2kb, I still see fairly regular failures. It wants 3kb to be stable. Maybe the non-determinism of the two-qemu setup is involved? Requests batch up and are handled recursively? Too much code in the third party stack to make it easy to tell. Signed-off-by: Andy Ross <[email protected]>
This gets turned on at the platform layer by a handful of boards, despite not being used for anything in the test. It has performance impact, so really shouldn't be here. Note that ARM_MPU also needs an =n, as that will select MPU, seemingly incorrectly. Signed-off-by: Andy Ross <[email protected]>
This microbenchmark has been super-useful. Add similar cases targeting interrupt latency:

+ "IRQ" tests the cycles taken from the instant of an interrupt trigger (as implemented, it's an arch_irq_lock() being released with a timer interrupt pending) through a noop ISR (a k_timer callback), and back to the interrupting thread.
+ "IRQ_P" is a similar preemption test: it measures the time taken from the same trigger, into a higher-priority thread that timestamps the instant of its wakeup (in this case as a return from k_sem_take()).

Signed-off-by: Andy Ross <[email protected]>
This is a mistake. IRQ_OFFLOAD_NESTED is not intended to be an app tunable. It reflects whether or not the platform layer irq_offload() implementation supports its use from within an ISR (i.e. to invoke a nested interrupt for testing purposes). Not all architectures can do that (and I know of at least one Xtensa soc that didn't give a software interrupt the right priority and can't either). Basically this should have been an "ARCH_HAS_" kind of kconfig; it's just named funny. And in any case the thread_metric tests don't do nested interrupts, which as I understand it aren't even possible in the FreeRTOS environment it was written on. Signed-off-by: Andy Ross <[email protected]>
z_get_next_switch_handle() is a clean API, but implementing it as a (comparatively large) callable function requires significant entry/exit boilerplate and hides the very common "no switch needed" early exit condition from the enclosing C code that calls it. (Most architectures call this from assembly though and don't notice). Provide an unwrapped version for the specific needs of non-SMP builds. It's compatible in all other ways. Slightly ugly, but the gains are significant (like a dozen cycles or so). Signed-off-by: Andy Ross <[email protected]>
Micro-optimization: We don't need a full arch_irq_lock(), which is a ~6-instruction sequence on Cortex M. The lock will be dropped unconditionally on interrupt exit, so take it unconditionally. Signed-off-by: Andy Ross <[email protected]>
The new switch code no longer needs PendSV, but still calls the SVC vector. Split them into separate files for hygiene and a few microseconds of build time. Signed-off-by: Andy Ross <[email protected]>
When USE_SWITCH=y, the thread struct is now mostly degenerate. Only the two words for ICI/IT state tracking are required. Eliminate all the extra fields when not needed and save a bunch of SRAM. Note a handful of spots in coredump/debug that need a location for the new stack pointer (stored as the switch handle now) are also updated. Signed-off-by: Andy Ross <[email protected]>
z_reschedule() is the basic kernel entry point for context switch, wrapping z_swap(), and thence arch_switch(). It's currently defined as a first class function for entry from other files in the kernel and elsewhere (e.g. IPC library code). But in practice it's actually a very thin wrapper without a lot of logic of its own, and the context switch layers of some of the more obnoxiously clever architectures are designed to interoperate with the compiler's own spill/fill logic to avoid double saving. And with a small z_reschedule() there's not a lot to work with. Make reschedule() an inlinable static, so the compiler has more options. Signed-off-by: Andy Ross <[email protected]>
This function was a little clumsy, taking the scheduler lock, releasing it, and then calling z_reschedule_unlocked() instead of the normal locked variant of reschedule. Don't take the lock twice. Mostly this is a code size and hygiene win. Obviously the sched lock is not normally a performance path, but I happened to have picked this API for my own microbenchmark in tests/benchmarks/swap and so noticed the double-lock while staring at disassembly. Signed-off-by: Andy Ross <[email protected]>
Some nitpicky hand-optimizations, no logic changes:

+ Shrink the assembly entry to put more of the logic into compiler-optimizable C.
+ Split arm_m_must_switch() into two functions so that the first doesn't look so big to the compiler. That allows it to spill (many) fewer registers on entry and speeds the (very) common early-exit case where an interrupt returns without context switch.

Signed-off-by: Andy Ross <[email protected]>
Late-arriving clang-format-demanded changes that are too hard to split and squash into the original patches. No behavior changes. Signed-off-by: Andy Ross <[email protected]>
The z_arm_pendsv vector doesn't exist on USE_SWITCH=y builds. Signed-off-by: Andy Ross <[email protected]>
Some toolchains don't support an __asm__(...) block at the top level of a file and require that they live within function scope. That's not a hardship as these two blocks were defining callable functions anyway. Exploit the "naked" attribute to avoid wasted bytes in unused entry/exit code. Signed-off-by: Andy Ross <[email protected]>
I'm at a loss here. The feature this test case wants to see (on ARMv7M, not ARMv6M) is the ability to take an irq_lock() inside a system call and then see that future system calls from the same thread continue to hold the lock. That's not documented AFAICT. It's also just a terrible idea because of either:

1. The obvious denial of service implications if user code is allowed to run in an unpreemptible mode, or:
2. The broken locking promise if this is implemented to release the lock and reacquire it in an attempt to avoid #1.

(FWIW: my read of the code is that #1 is the current implementation. But hilariously the test isn't able to tell the difference!)

And in any case it's not how any of our other platforms work (or can work, in some cases), making this a non-portable system call API/feature at best. Leave it in place for now out of conservatism, but disable with the new arch_switch() code, whose behavior matches that of other Zephyr userspaces.

Signed-off-by: Andy Ross <[email protected]>
The exit from the SVC exception used for syscalls back into the calling thread is done without locking. This means that the intermediate states can be interrupted while the kernel-mode code is still managing thread state like the mode bit, leading to mismatches. This seems mostly robust when used with PendSV (though I'm a little dubious), but the new arch_switch() code needs to be able to suspend such an interrupted thread and restore it without going through a full interrupt entry/exit again, so it needs locking for sure.

Take the lock unconditionally before exiting the call, and release it in the thread once the magic is finished, just before calling the handler. Then take it again before swapping stacks and dropping privilege.

Even then there is a one-cycle race where the interrupted thread has dropped the lock but still has privilege (the nPRIV bit is clear in CONTROL). This thread will be resumed later WITHOUT privilege, which means that trying to set CONTROL will fail. So there's detection of this 1-instruction race that will skip over it.

Signed-off-by: Andy Ross <[email protected]>
The ARM Ltd. FVP emulator (at least the variants run in Zephyr CI) appears to have a bug with the stack alignment bit in xPSR. It's common (it fails in the first 4-6 timer interrupts in tests.syscalls.timeslicing) that we'll take an interrupt from a seemingly aligned (!) stack with the bit set. If we then switch and resume the thread from a different context later, popping the stack goes wrong (more so than just a misalignment of four bytes: I usually see it too low by 20 bytes) in a way that it doesn't if we return synchronously. Presumably legacy PendSV didn't see this because it used the unmodified exception frame. Work around this by simply assuming all interrupted stacks were aligned and clearing the bit. That is NOT correct in the general case, but in practice it's enough to get tests to pass. Signed-off-by: Andy Ross <[email protected]>
This is just to test #85248 with the latest code base.