Skip to content

Conversation

tomchy
Copy link
Contributor

@tomchy tomchy commented Sep 12, 2025

If the area, dedicated for the interrupt stack is not available, allow to specify a memory region that will be used as the stack for the S2RAM resume logic.

@JarmouniA
Copy link
Contributor

If the area, dedicated for the interrupt stack is not available,

Could you provide more details?

@tomchy tomchy force-pushed the bugfix/reset/NCSDK-35410_Dedicated_s2ram_stack branch from 1dbc200 to 4e4d6f2 Compare September 15, 2025 07:33
@tomchy
Copy link
Contributor Author

tomchy commented Sep 15, 2025

If the area, dedicated for the interrupt stack is not available,

Could you provide more details?

The current solution assumes that the value of z_interrupt_stacks is known, which might not be the case if:

  • The project uses bootloader (i.e. MCUboot).
  • The MCUboot chainloads the application.
  • The application enters S2RAM.
  • The SoC resumes from S2RAM from reset vector (due to the boot order - inside MCUboot).
  • The MCUboot must detect and jump to the application code to resume from S2RAM.
  • The MCUboot code will execute the problematic ldr r0 as well as msr msp, r0 instructions, possibly corrupting application data, before checking the S2RAM flags.

@mathieuchopstm
Copy link
Contributor

mathieuchopstm commented Sep 15, 2025

  • The MCUboot code will execute the problematic ldr r0 as well as msr msp, r0 instructions, possibly corrupting application data, before checking the S2RAM flags.

Do you mean that what MCUboot sees as z_interrupt_stack is actually in a different place than what the Zephyr image sees as z_interrupt_stack (i.e., MCUboot sees it where Zephyr places data) and as such it breaks everything?

Copy link
Contributor

@mathieuchopstm mathieuchopstm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There needs to be documentation somewhere about these special regions or it will get out of hand. While at it, please document pm_s2ram_context too - I think it was introduced without docs either.

@tomchy
Copy link
Contributor Author

tomchy commented Sep 15, 2025

  • The MCUboot code will execute the problematic ldr r0 as well as msr msp, r0 instructions, possibly corrupting application data, before checking the S2RAM flags.

Do you mean that what MCUboot sees as z_interrupt_stack is actually in a different place than what the Zephyr image sees as z_interrupt_stack (i.e., MCUboot sees it where Zephyr places data) and as such it breaks everything?

Correct. The placement of z_interrupt_stack cannot be guessed or easily configured, thus may lead to issues.

@tomchy
Copy link
Contributor Author

tomchy commented Sep 15, 2025

There needs to be documentation somewhere about these special regions or it will get out of hand. While at it, please document pm_s2ram_context too - I think it was introduced without docs either.

Do you have any specific area in the Zephyr documentation that you'd find suitable for such page?

@tomchy
Copy link
Contributor Author

tomchy commented Sep 15, 2025

There needs to be documentation somewhere about these special regions or it will get out of hand. While at it, please document pm_s2ram_context too - I think it was introduced without docs either.

Do you have any specific area in the Zephyr documentation that you'd find suitable for such page?

Extended the PM_S2RAM Kconfig symbol help message (this is the same place that mentions the pm_s2ram section).

JarmouniA
JarmouniA previously approved these changes Sep 18, 2025
@wearyzen
Copy link
Contributor

wearyzen commented Sep 19, 2025

There was a suggestion made here for a different issue to use a new stack for initialization flow. The issue is still open and creating a new stack could be an option but if it does then with this PR we would be stuck with 2 different options for init stack.
I would suggest to change this PR slightly to use the user provided stack regardless of the pm_s2ram feature, this way when we get to fix the other issue we won't making the current change redundant.

So something like below:

#if DT_NODE_EXISTS(DT_NODELABEL(custom_init_stack)) &&\
	DT_NODE_HAS_COMPAT(DT_NODELABEL(custom_init_stack), zephyr_memory_region)
    ldr r0, =DT_REG_ADDR(DT_NODELABEL(custom_init_stack)) + DT_REG_SIZE(DT_NODELABEL(custom_init_stack))
    msr msp, r0
#elif defined(CONFIG_PM_S2RAM)
    /*
     * Temporarily set MSP to interrupt stack so that arch_pm_s2ram_resume can
     * use stack for calling pm_s2ram_mark_check_and_clear.
     * This is safe because suspend never being called from an ISR ensures that
     * interrupt stack was not live during suspend.
     * Note:
     * if resuming from suspend, MSP is restored from cpu context
     * if not resuming, MSP will be set back to z_main_stack for proper init
     * And, apart from arch_pm_s2ram_resume being able to use the stack for
     * a short while, there is no change in behavior in either of the paths.
     */
    ldr r0, =z_interrupt_stacks + CONFIG_ISR_STACK_SIZE + MPU_GUARD_ALIGN_AND_SIZE
    msr msp, r0
#endif /* CONFIG_PM_S2RAM */
#endif

#if defined(CONFIG_PM_S2RAM)
    bl arch_pm_s2ram_resume
#endif /* CONFIG_PM_S2RAM */

and you might also want to add zephyr,memory-attr to protect the new stack unless its already protected by some other MPU region.

The only thing to consider here is that with my suggestion you would need a slightly bigger memory carved out for the init stack compared to the 16B allocate right now so let me know if you all would still like to go with the current solution.

@wearyzen
Copy link
Contributor

@tomchy btw, could you provide some steps to reproduce the issue this is trying to fix? I just wanted to try the build steps to see the mapfile and if the issue could be solved in any different way without changing this file.

@tomchy
Copy link
Contributor Author

tomchy commented Sep 23, 2025

There was a suggestion made here for a different issue to use a new stack for initialization flow. The issue is still open and creating a new stack could be an option but if it does then with this PR we would be stuck with 2 different options for init stack. I would suggest to change this PR slightly to use the user provided stack regardless of the pm_s2ram feature, this way when we get to fix the other issue we won't making the current change redundant.

So something like below:

#if DT_NODE_EXISTS(DT_NODELABEL(custom_init_stack)) &&\
	DT_NODE_HAS_COMPAT(DT_NODELABEL(custom_init_stack), zephyr_memory_region)
    ldr r0, =DT_REG_ADDR(DT_NODELABEL(custom_init_stack)) + DT_REG_SIZE(DT_NODELABEL(custom_init_stack))
    msr msp, r0
#elif defined(CONFIG_PM_S2RAM)
    /*
     * Temporarily set MSP to interrupt stack so that arch_pm_s2ram_resume can
     * use stack for calling pm_s2ram_mark_check_and_clear.
     * This is safe because suspend never being called from an ISR ensures that
     * interrupt stack was not live during suspend.
     * Note:
     * if resuming from suspend, MSP is restored from cpu context
     * if not resuming, MSP will be set back to z_main_stack for proper init
     * And, apart from arch_pm_s2ram_resume being able to use the stack for
     * a short while, there is no change in behavior in either of the paths.
     */
    ldr r0, =z_interrupt_stacks + CONFIG_ISR_STACK_SIZE + MPU_GUARD_ALIGN_AND_SIZE
    msr msp, r0
#endif /* CONFIG_PM_S2RAM */
#endif

#if defined(CONFIG_PM_S2RAM)
    bl arch_pm_s2ram_resume
#endif /* CONFIG_PM_S2RAM */

and you might also want to add zephyr,memory-attr to protect the new stack unless its already protected by some other MPU region.

The only thing to consider here is that with my suggestion you would need a slightly bigger memory carved out for the init stack compared to the 16B allocate right now so let me know if you all would still like to go with the current solution.

I do not think I fully understand your suggestion. The other issue is about the value of the psp set to z_interrupt_stack and your suggestion modifies only the logic concerning msp value. Additionally, the new psp value is directly overwritten by the following lines:

    ldr r0, =z_main_stack + CONFIG_MAIN_STACK_SIZE
    msr msp, r0

That said, if the S2RAM is disabled, the msp holds the new value only for a single assembly instruction.

I am also new to the zephyr,memory-attr-based MPU protection, so today I am unable to propose something that I'd understand that it is not going to break things.

It is fairly easy to reproduce the different values of the z_interrupt_stack, simply:

pushd zephyr/samples/subsys/mgmt/mcumgr/smp_svr
west build -p -b nrf52840dk/nrf52840 -T ./sample.mcumgr.smp_svr.bt
grep -r "z_interrupt_stacks" ./build/*/zephyr/zephyr.map

This should bring you two, different values, i.e.:

./build/mcuboot/zephyr/zephyr.map:                0x0000000020002740                z_interrupt_stacks
./build/smp_svr/zephyr/zephyr.map:                0x0000000020007040                z_interrupt_stacks

It is however hard to provide a recipe to see the faults, because:

  • The code shall use the corrupted address for something critical, like pointers to callbacks.
  • You must use platform that supports S2RAM.

In the builds that I had, even changing things like optimization level made the issue disappear, so I think it is better to understand the root cause than try to consistently reproduce the exact failure.

@wearyzen
Copy link
Contributor

Hi @tomchy, without going into details of which SP the other issue is talking about, what I wanted to highlight was that it talks about using a custom stack for initialization and so my suggestion was, instead of having the custom stack only for the CONFIG_PM_S2RAM path, have it in general.
Yes, a few more changes would have followed but I assumed we can go through them once we agree on this decision.

And thanks for the steps, IIUC as per the description, the image that wakes up from resume is mcuboot and loads z_interrupt_stacks into MSP. MSP of mcuboot overlaps with one of smp_svr/zephyr's memory regions causing a corruption.

I don't have any experience with mcuboot so please help me to understand this, once mcuboot resumes its context in arch_pm_s2ram_resume it goes back to using z_interrupt_stacks as MSP and z_main_stack as PSP so, what is stopping it from causing any more corruption in the next zephyr image?
Please correct me if I missed something here but I feel the issue is just not reproduced for now with this approach but not completely solved and the correct fix would be to adjust the linker script to make sure mcuboot's stack (z_interrupt_stack and z_main_stack which are part of .noinit.*) doesn't overlap with any of the next zephyr images memory regions.

@mathieuchopstm
Copy link
Contributor

Hi @tomchy, without going into details of which SP the other issue is talking about, what I wanted to highlight was that it talks about using a custom stack for initialization and so my suggestion was, instead of having the custom stack only for the CONFIG_PM_S2RAM path, have it in general. Yes, a few more changes would have followed but I assumed we can go through them once we agree on this decision.

And thanks for the steps, IIUC as per the description, the image that wakes up from resume is mcuboot and loads z_interrupt_stacks into MSP. MSP of mcuboot overlaps with one of smp_svr/zephyr's memory regions causing a corruption.

I don't have any experience with mcuboot so please help me to understand this, once mcuboot resumes its context in arch_pm_s2ram_resume it goes back to using z_interrupt_stacks as MSP and z_main_stack as PSP so, what is stopping it from causing any more corruption in the next zephyr image? Please correct me if I missed something here but I feel the issue is just not reproduced for now with this approach but not completely solved and the correct fix would be to adjust the linker script to make sure mcuboot's stack (z_interrupt_stack and z_main_stack which are part of .noinit.*) doesn't overlap with any of the next zephyr images memory regions.

I think @tomchy's scenario works because it also makes use of the "pinned context" feature, though I'll let @tomchy confirm.

If my understanding is correct, the following happens:

  • coldboot: MCUboot runs, loads the Zephyr image and jumps to it
  • Zephyr image runs until PM determines entry into S2RAM should be performed
  • arch_pm_s2ram_suspend pushes GPRs on stack and saves CPU context into pinned memory region
    • This CPU context contains msp and psp values that correspond to stacks of the Zephyr image
    • Pinned region is seen at same address by MCUboot and Zephyr image (since DT is shared)
  • SoC enters S2RAM
  • ...time elapses...
  • SoC exits S2RAM and MCUboot starts running
  • arch_pm_s2ram_resume is called and determines that this is S2RAM exit
    • CPU context is restored from pinned memory region (which MCUboot can locate!)
    • msp/psp are now valid stacks from the Zephyr image
    • GPRs are popped from the (Zephyr image) stack on which they were pushed
  • execution resumes in Zephyr image (after call to arch_pm_s2ram_suspend)

TL;DR: MCUboot is able to pivot to a stack of the Zephyr image because psp/msp were saved in a specific memory region visible by both MCUboot and the Zephyr image; however, a tiny amount of memory needs to be used as stack to perform this pivot (or rather, to run pm_s2ram_mark_check_and_clear!)

Another solution would be to increase the size of struct __cpu_context with scratch space and use that as stack for arch_pm_s2ram_resume (i.e., "hacky way to shoving a stack and the CPU context together"):

struct __cpu_context {
/* GPRs are saved onto the stack */
uint32_t msp;
uint32_t psp;
uint32_t primask;
uint32_t control;
#if defined(CONFIG_ARMV7_M_ARMV8_M_MAINLINE)
/* Registers present only on ARMv7-M and ARMv8-M Mainline */
uint32_t faultmask;
uint32_t basepri;
#endif /* CONFIG_ARMV7_M_ARMV8_M_MAINLINE */
#if defined(CONFIG_CPU_CORTEX_M_HAS_SPLIM)
/* Registers present only on certain ARMv8-M implementations */
uint32_t msplim;
uint32_t psplim;
#endif /* CONFIG_CPU_CORTEX_M_HAS_SPLIM */
};

 struct __cpu_context { 
 	/* GPRs are saved onto the stack */ 
 	uint32_t msp; 
 	uint32_t psp; 
 	uint32_t primask; 
 	uint32_t control; 
  
 #if defined(CONFIG_ARMV7_M_ARMV8_M_MAINLINE) 
 	/* Registers present only on ARMv7-M and ARMv8-M Mainline */ 
 	uint32_t faultmask; 
 	uint32_t basepri; 
 #endif /* CONFIG_ARMV7_M_ARMV8_M_MAINLINE */ 
  
 #if defined(CONFIG_CPU_CORTEX_M_HAS_SPLIM) 
 	/* Registers present only on certain ARMv8-M implementations */ 
 	uint32_t msplim; 
 	uint32_t psplim; 
 #endif /* CONFIG_CPU_CORTEX_M_HAS_SPLIM */
+       uint8_t scratch_stack_space[CONFIG_ARM_PM_S2RAM_RESUME_STACK_SIZE];
 };

Yet another other solution would be to bring back the constraints about pm_s2ram_mark_check_and_clear (must not use stack and leave a specific register untouched) but that feels overkill...

@wearyzen
Copy link
Contributor

The flow @mathieuchopstm mentioned looks similar to what was mentioned in #96290 (comment). If that's the case then maybe a new soc_resume_hook api would be better solution:

#if defined(CONFIG_PM_S2RAM)
#if defined(CONFIG_SOC_RESUME_HOOK)
    bl soc_resume_hook
#else
    /*
     * Temporarily set MSP to interrupt stack so that arch_pm_s2ram_resume can
     * use stack for calling pm_s2ram_mark_check_and_clear.
	@@ -113,6 +124,7 @@ SECTION_SUBSEC_FUNC(TEXT,_reset_section,__start)
     * a short while, there is no change in behavior in either of the paths.
     */
    ldr r0, =z_interrupt_stacks + CONFIG_ISR_STACK_SIZE + MPU_GUARD_ALIGN_AND_SIZE
#endif
#endif

and soc_resume_hook does what they expect to be done in arch_pm_s2ram_resume() similar to https://github.com/zephyrproject-rtos/zephyr/pull/96290/files#diff-15c65cbae35c06ba0b60c2a5129e9da316484d76b96dc60b689af15d54996601R19.

@mathieuchopstm
Copy link
Contributor

mathieuchopstm commented Sep 24, 2025

The flow @mathieuchopstm mentioned looks similar to what was mentioned in #96290 (comment). If that's the case then maybe a new soc_resume_hook api would be better solution:

#if defined(CONFIG_PM_S2RAM)
#if defined(CONFIG_SOC_RESUME_HOOK)
    bl soc_resume_hook
#else
    /*
     * Temporarily set MSP to interrupt stack so that arch_pm_s2ram_resume can
     * use stack for calling pm_s2ram_mark_check_and_clear.
	@@ -113,6 +124,7 @@ SECTION_SUBSEC_FUNC(TEXT,_reset_section,__start)
     * a short while, there is no change in behavior in either of the paths.
     */
    ldr r0, =z_interrupt_stacks + CONFIG_ISR_STACK_SIZE + MPU_GUARD_ALIGN_AND_SIZE
#endif
#endif

and soc_resume_hook does what they expect to be done in arch_pm_s2ram_resume() similar to https://github.com/zephyrproject-rtos/zephyr/pull/96290/files#diff-15c65cbae35c06ba0b60c2a5129e9da316484d76b96dc60b689af15d54996601R19.

#96290's approach has the advantage that it runs through the bootloader again which may be required in certain use-cases (e.g., XIP from external flash/RAM? are there even S2RAM platforms where this exists?); the disadvantage is that it runs through the bootloader again which may be inappropriate due to shared resource overwriting... but it is gated behind a Kconfig.

(but I guess the idea would be that platform_resume_hook would be tasked with calling the bootloader code to re-init storage devices/etc)

@tomchy
Copy link
Contributor Author

tomchy commented Sep 25, 2025

@wearyzen, @mathieuchopstm You're correct. This change alone does not solve the problem of resuming from S2RAM if bootloader is enabled - it must be combined with one of the other changes:

  1. Optional context memeory section section for suspend to RAM CPU context. #93818: to share the S2RAM context between the application and bootloader. That way the S2RAM routine executed from the bootloader code resumes the state created by the application. This solution however creates a restriction, that the resume procedure must be exactly the same in both the bootloader and all application builds, including all possible updates. That's why a new method is being proposed (see pt 2).
  2. arch/arm/cortex_m: support for bridge to the next image S2RAM routines #96290: to execute a small, PM-independent piece of logic, just so the bootloader can detect which variant of the application (Direct XIP) is the active one and jump to a reset vector of that application. That way the S2RAM resume routine from the application is handling the context and restores the CPU state.

@wearyzen: regarding the custom stack, I agree that we may name this "small" stack a custom_init_stack. Do you think adding this definition in this PR is enough or do you envision some other functional changes to be done here?

--edit--
@mathieuchopstm the proposal with scratch_stack_space has the same disadvantage as the option 1) - it is based on the assumption that this piece of code will be identical for both the bootloader and the application for the entire lifetime of the device.

@wearyzen
Copy link
Contributor

wearyzen commented Sep 25, 2025

@wearyzen: regarding the custom stack, I agree that we may name this "small" stack a custom_init_stack. Do you think adding this definition in this PR is enough or do you envision some other functional changes to be done here?

Considering both the PRs, I think changing MSP to use to small stack can just be done inside the soc_resume_hook() and there is no change needed here at arch level.

@tomchy
Copy link
Contributor Author

tomchy commented Sep 25, 2025

@wearyzen: regarding the custom stack, I agree that we may name this "small" stack a custom_init_stack. Do you think adding this definition in this PR is enough or do you envision some other functional changes to be done here?

Considering both the PRs, I think changing MSP to use to small stack can just be done inside the soc_resume_hook() and there is no change needed here at arch level.

Do you propose to create a new SOC_RESUME_HOOK and call it instead of the ldr and msr (and possibly bl arch_pm_s2ram_resume) instructions of arch-level __start implementation?

I cannot find a reference for the existing soc_resume_hook() function.

@wearyzen
Copy link
Contributor

Do you propose to create a new SOC_RESUME_HOOK and call it instead of the ldr and msr (and possibly bl arch_pm_s2ram_resume) instructions of arch-level __start implementation?

Yes a new hook suggested to be added in the other PR (#96290 (comment))

@tomchy
Copy link
Contributor Author

tomchy commented Sep 25, 2025

Do you propose to create a new SOC_RESUME_HOOK and call it instead of the ldr and msr (and possibly bl arch_pm_s2ram_resume) instructions of arch-level __start implementation?

Yes a new hook suggested to be added in the other PR (#96290 (comment))

Regarding the new hook - note that the soc_resume_hook cannot push anything to the stack before it sets the msp.
With some additional care to the lr it should be fixable though.

What I'm more concerned about - the issue is not specific to the SoC, but rather to the combination of having both bootloader and a sleepy application on the same core, started through chain-loading. That said, are you sure that a fix for it should be hidden inside nRF-specific files?

This kind of issues are not easy to debug, so before we remove it from the generic scope, I would like to confirm that the solution on other platforms is not desired.

@wearyzen
Copy link
Contributor

Regarding the new hook - note that the soc_resume_hook cannot push anything to the stack before it sets the msp. With some additional care to the lr it should be fixable though.

What I'm more concerned about - the issue is not specific to the SoC, but rather to the combination of having both bootloader and a sleepy application on the same core, started through chain-loading. That said, are you sure that a fix for it should be hidden inside nRF-specific files?

This kind of issues are not easy to debug, so before we remove it from the generic scope, I would like to confirm that the solution on other platforms is not desired.

I agree that there would be an issue is such a scenario and might not be easy to debug but for such a design where mcuboot resumes just to load the next images context won't they still need the soc_resume_hook? and if they do then with the right documentation they may chose to either change the msp to something like this PR does or not use the the stack at all and avoid the issue in soc_resume_hook 🤷.
But I agree that with soc_resume_hook added there should be good documentation highlighting this issue.

@mathieuchopstm
Copy link
Contributor

mathieuchopstm commented Sep 25, 2025

Considering that S2RAM is currently gated behind CONFIG_PM_S2RAM, maybe it would make sense to replace arch_pm_s2ram_resume with pm_s2ram_resume_hook, for which a default (weak) implementation would merely proxy to (what we call today) arch_pm_s2ram_resume?

(additionally, arch_pm_s2ram_resume - which today is an ARM implementation detail - could become a first-class primitive, consuming a struct _cpu_context from which it restores context à la longjmp)

@nordic-segl
Copy link
Contributor

Please rebase and resolve conflicts.
In this PR https://github.com/zephyrproject-rtos/zephyr/pull/95474/files
size of cpuapp_ram0 was decreased by 224 B (256-32) while only 192 B were assigned to pm_ramfunc. You should be able to fit in that gap.

If the area, dedicated for the interrupt stack is not available, allow
to specify a memory region that will be used as the stack for the S2RAM
resume logic.

Signed-off-by: Tomasz Chyrowicz <[email protected]>
Add the definition of pm_s2ram_stack memory region for nRF54H20.

Signed-off-by: Tomasz Chyrowicz <[email protected]>
@tomchy tomchy force-pushed the bugfix/reset/NCSDK-35410_Dedicated_s2ram_stack branch from 0cb2901 to 74e6b66 Compare September 30, 2025 13:52
@tomchy
Copy link
Contributor Author

tomchy commented Sep 30, 2025

Please rebase and resolve conflicts. In this PR https://github.com/zephyrproject-rtos/zephyr/pull/95474/files size of cpuapp_ram0 was decreased by 224 B (256-32) while only 192 B were assigned to pm_ramfunc. You should be able to fit in that gap.

PR rebased, docs removed due to: 4626f6b

To my understanding - if possible, the solution should be move into the board-specific files and there is an agreement on adding a platform-specific hook.

Copy link

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

8 participants