Support booting the kernel in EL1 with a PSCI proxy #8

Open

m4tx wants to merge 6 commits into el2-support from el1-support

Conversation

@m4tx (Collaborator) commented Nov 26, 2025

No description provided.

m4tx force-pushed the el1-support branch 2 times, most recently from 66d20eb to 4ede863 on November 26, 2025 at 15:24
m4tx force-pushed the el1-support branch 3 times, most recently from 0dc830b to bbacedb on November 28, 2025 at 14:29
m4tx requested review from qperret2 and qwandor on November 28, 2025 at 14:30
/// A pointer to a stack allocated for a secondary CPU.
///
/// This must not be dropped as long as the secondary CPU is running.
struct SecondaryStack {
Collaborator

What's the advantage of using this wrapper type over just using Box<Stack<...>> directly?

Collaborator Author

Not much; I thought it would be a bit more readable to have a separate named type, but I'm also fine with removing it if you think otherwise.
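
For illustration, a minimal sketch of the two options being compared (the type and constant names below are placeholders, not the exact ones in this PR; Box comes from the alloc crate in this no_std context):

use alloc::boxed::Box;

const PAGE_SIZE: usize = 4096;
const SECONDARY_STACK_PAGE_COUNT: usize = 4;

/// Hypothetical page-aligned stack buffer, standing in for the crate's `Stack` type.
#[repr(C, align(4096))]
struct Stack([u8; SECONDARY_STACK_PAGE_COUNT * PAGE_SIZE]);

// Option 1: a named wrapper whose type name documents the ownership rule.
/// A pointer to a stack allocated for a secondary CPU.
/// This must not be dropped as long as the secondary CPU is running.
struct SecondaryStack(Box<Stack>);

// Option 2: use `Box<Stack>` directly and keep the same rule as a comment at the
// use site; the data layout and behaviour are identical either way.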

src/main.rs Outdated

const HEAP_SIZE: usize = 40 * PAGE_SIZE;
#[unsafe(no_mangle)]
#[unsafe(link_section = ".heap")]
Collaborator

Why does this need to be in a separate section, and does it still get properly initialised?

Collaborator Author

When I thought I needed to prepare a stack for the EL1 guest, it made sense that the heap space should be accessible to it, and hence it needed to have a well-known address (because otherwise it would be reserved in the Device Tree/stage-2 pagetables). Now it's not needed.
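
For reference, a minimal sketch of what the quoted diff was doing (the static's name and initializer here are illustrative): putting the heap in a dedicated linker section gives it a well-known, linker-script-controlled address that could then be described to the EL1 guest.

const PAGE_SIZE: usize = 4096;
const HEAP_SIZE: usize = 40 * PAGE_SIZE;

#[unsafe(no_mangle)]
#[unsafe(link_section = ".heap")]
static mut HEAP: [u8; HEAP_SIZE] = [0; HEAP_SIZE];

// Note: once the static lives in a custom section rather than .bss, the linker
// script / startup code has to ensure it is still zeroed, which is what the
// reviewer's initialisation question is about.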

.build(),
);

let new_dtb = dt.to_dtb().leak();
Collaborator

So you just put the new device tree somewhere on the heap of RITM and return that to the EL1 payload? Presumably that means that you'll have to map a page-aligned region containing it into the stage 2 pagetable so that EL1 can read it; is there likely to be any sensitive data in the heap that EL1 shouldn't be able to read?

Collaborator Author

This is what I'm doing in #16. In this PR, I just don't enable stage-2 page tables at all, so the guest should be able to read everything in ritm's memory.

> is there likely to be any sensitive data in the heap that EL1 shouldn't be able to read?

It doesn't at the moment, but given that RITM is meant to be modified by vendors, it might contain sensitive data in the future. Hence in #16 I only expose the device tree to the guest, so the guest cannot read anything it doesn't need.
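
For illustration, a rough sketch of the alignment concern (the names below are hypothetical; this is not code from the PR): the leaked DTB buffer is an ordinary heap allocation, so a stage-2 mapping exposing it has to cover whole pages, and everything else in those pages becomes readable to EL1 too.

const PAGE_SIZE: usize = 4096;

/// Returns the page-aligned [start, end) span that a stage-2 mapping of `dtb` would cover.
fn dtb_stage2_span(dtb: &[u8]) -> (usize, usize) {
    let start = dtb.as_ptr() as usize & !(PAGE_SIZE - 1); // round down to a page boundary
    let end = (dtb.as_ptr() as usize + dtb.len() + PAGE_SIZE - 1) & !(PAGE_SIZE - 1); // round up
    // Everything in [start, end) is visible to EL1 once mapped, not just the DTB bytes.
    (start, end)
}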


// Reset the timer offset.
// SAFETY: Resetting the CNTVOFF_EL2 is needed as part of
// preparing an environment to run Linux.
Collaborator

This is true, but doesn't explain why it is safe. I guess it's actually safe to write any value to this system register, as it doesn't affect anything to do with memory safety, right?

Collaborator Author (@m4tx, Feb 5, 2026)

Correct. Also, I see that arm_sysregs marks this register as write-safe, so we don't even need the unsafe block anymore.
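
For reference, a minimal sketch of the register reset under discussion (raw inline asm is shown only for illustration; with arm_sysregs treating CNTVOFF_EL2 as safe to write, this becomes an ordinary safe register write):

use core::arch::asm;

/// Reset the virtual counter offset seen by the EL1 guest.
fn reset_cntvoff_el2() {
    // Writing any value to CNTVOFF_EL2 only shifts the virtual counter observed at
    // EL1/EL0; it cannot affect memory safety, which is why a safe wrapper is fine.
    unsafe { asm!("msr cntvoff_el2, xzr", options(nomem, nostack, preserves_flags)) };
}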

///
/// This function is unsafe because it may call into the secure monitor via SMC or perform other
/// privileged operations.
unsafe fn handle_psci(fn_id: u64, arg0: u64, arg1: u64, arg2: u64) -> Result<u64, arm_psci::Error> {
Collaborator

I'm not entirely sure this needs to be unsafe, but if it does then so does try_handle_psci by the same argument.

Collaborator Author

Well, with the current state of the code, since we wrap the potentially unsafe PSCI calls and handle all the others (possibly by panicking), it's most likely fine to drop the unsafe attribute from this function.
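
A minimal sketch of the shape being discussed (the function IDs and the `smc_call` helper below are hypothetical): the dispatcher itself is safe, and only the path that actually forwards to the secure monitor carries an unsafe block with its own safety argument.

const PSCI_VERSION: u64 = 0x8400_0000;
const PSCI_FEATURES: u64 = 0x8400_000A;

/// Hypothetical helper that performs the actual SMC (inline asm or the smccc crate).
unsafe fn smc_call(_fn_id: u64, _arg0: u64, _arg1: u64, _arg2: u64) -> u64 {
    unimplemented!()
}

fn handle_psci(fn_id: u64, arg0: u64, arg1: u64, arg2: u64) -> Result<u64, arm_psci::Error> {
    match fn_id {
        // Calls that are simply proxied to EL3.
        PSCI_VERSION | PSCI_FEATURES => {
            // SAFETY: the arguments are forwarded to the secure monitor unchanged;
            // the SMC itself cannot break our memory safety invariants.
            Ok(unsafe { smc_call(fn_id, arg0, arg1, arg2) })
        }
        // Everything else is either handled in Rust or panics, so no unsafe is
        // needed at this level and the function itself can be safe.
        _ => unimplemented!("unhandled PSCI function {fn_id:#x}"),
    }
}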

let context = SUSPEND_CONTEXTS
.lock()
.remove(&context.data.mpidr)
.expect("context not found for resuming CPU");
Collaborator

The fact that restore_from_suspend is being passed a mutable reference to the context from SUSPEND_CONTEXTS while the lock isn't held means that something dodgy is going on here, I guess with the aarch64-rt suspend API that you've added. It also seems superfluous, as the only thing you are getting from it here is the MPIDR, which you could just read from the system register anyway. I'm not sure what exactly we should do instead though.

Collaborator Author (@m4tx, Feb 5, 2026)

Well, indeed it's unsound; presumably the aarch64-rt code should then pass a pointer to the callback rather than a reference (sadly making the API even more low-level), so that the callback has to ensure it's okay to dereference it.

In this case, I think it's reasonable to read the mpidr from the system register and then access the data properly by locking the SpinMutex.

Okay, after consideration, I don't really think it's unsound as it is. Yes, the data is behind a mutex, but after the CPU-specific data is stored by a CPU, it never gets removed or otherwise modified by any other CPU. Hence, I believe it should be safe to read this after resuming, even without acquiring the lock. I'll update the safety comments with this info.

A completely safe alternative would be to store the data on the heap, but I believe that having this in a static global is better, even if unsafe.

Collaborator

I don't think there's any undefined behaviour as it stands, but I think it's still unsound in that adding safe code could introduce undefined behaviour.

I think the best solution is to make the _context passed to restore_from_suspend a pointer rather than a reference. That shouldn't be a problem here, as you're not using it anyway. You could even remove the parameter entirely.

Collaborator Author

I've used your latest changes in my aarch64-rt PR - the way we have it now seems to be perfectly sound.
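
A minimal sketch of the MPIDR-based lookup mentioned earlier in this thread (it assumes the `SUSPEND_CONTEXTS` map and context type from this PR; the register read is shown with raw asm, and the function name is hypothetical):

use core::arch::asm;

/// Take the saved suspend context for the resuming CPU, looked up under the lock.
fn take_suspend_context() -> SuspendContext {
    let mpidr: u64;
    // SAFETY: reading MPIDR_EL1 has no side effects.
    unsafe { asm!("mrs {}, mpidr_el1", out(reg) mpidr, options(nomem, nostack)) };
    // Look the context up under the lock instead of dereferencing a reference that
    // was captured before the suspend call. (Depending on how the map is keyed,
    // the non-affinity bits of MPIDR_EL1 may need masking off first.)
    SUSPEND_CONTEXTS
        .lock()
        .remove(&mpidr)
        .expect("context not found for resuming CPU")
}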

"Restoring from suspend: entry={:#x}, ctx={:#x}",
context.data.entry_point, context.data.context_id
);

Collaborator

I think you might need to reset a bunch of EL1 system registers first.

Collaborator Author

Are you sure? My understanding is that the guest should expect essentially the same state as after CPU_ON, so using the same entry point function (entry_point_el1) should suffice.

Collaborator

Not totally sure. Certainly in both cases EL1 will expect the EL1 system registers to have their reset values, but it's not clear to me whether EL2, EL3, or hardware is responsible for ensuring this. For example, is it possible that in some suspend states hardware will preserve system register values, so that EL2 is responsible for resetting the EL1 system registers? But maybe that doesn't happen, because EL3 will return from the suspend call in that case rather than going through the resume path.

Let's leave this as is for now, but talk to someone who understands PSCI better than I do.

&raw mut **stack
} else {
&raw mut **stack_map.insert(mpidr, Box::default())
}
Collaborator

What's the advantage of allocating secondary stacks dynamically on the heap, rather than reserving SECONDARY_STACK_PAGE_COUNT * (MAX_CORES - 1) pages statically? The disadvantage is that you'll only find out at runtime when starting one of the secondary cores that your heap wasn't big enough for all the stacks, whereas if you do it statically you'll find out at compile time.

Collaborator Author (@m4tx, Feb 5, 2026)

Well, Box gives some confidence that writes from multiple CPUs are safe, but as long as we never remove the items, it should be sound to have this stored in a single array as well.
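
A minimal sketch of the static alternative being suggested (the constants and names below are hypothetical): reserving every secondary stack up front turns "heap too small for the stacks" from a runtime failure into a size that is fixed at build time.

const PAGE_SIZE: usize = 4096;
const SECONDARY_STACK_PAGE_COUNT: usize = 4;
const MAX_CORES: usize = 8;

#[repr(C, align(4096))]
struct SecondaryStack([u8; SECONDARY_STACK_PAGE_COUNT * PAGE_SIZE]);

// One statically reserved stack per possible secondary core. Each core only ever
// touches its own entry, so handing out disjoint raw pointers (as the quoted code
// does for the boxed version) stays sound.
static mut SECONDARY_STACKS: [SecondaryStack; MAX_CORES - 1] =
    [const { SecondaryStack([0; SECONDARY_STACK_PAGE_COUNT * PAGE_SIZE]) }; MAX_CORES - 1];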

"mov x19, x0",
"bl {disable_mmu_and_caches}",
"mov x0, x19",
"eret",
Collaborator

Is it possible that some of the other general-purpose registers here contain sensitive EL2 data? If so, you should zero them all before the eret. The same applies to any other registers shared between EL2 and EL1.

Collaborator Author (@m4tx, Feb 5, 2026)

Good call; and if I'm not mistaken, zeroing out the general-purpose registers should suffice, as the rest is not shared with EL1 when EL1h is set.
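
A rough sketch of the scrubbing being discussed (only the asm fragment is shown; the surrounding function and operands are as in the quoted code): x0 keeps the value handed to EL1, and every other general-purpose register is zeroed before the eret so no EL2 contents leak across the exception return.

"mov x19, x0",
"bl {disable_mmu_and_caches}",
"mov x0, x19",
// Scrub x1-x30 so the EL1 payload cannot observe stale EL2 register values.
"mov x1, xzr", "mov x2, xzr", "mov x3, xzr", "mov x4, xzr", "mov x5, xzr",
"mov x6, xzr", "mov x7, xzr", "mov x8, xzr", "mov x9, xzr", "mov x10, xzr",
"mov x11, xzr", "mov x12, xzr", "mov x13, xzr", "mov x14, xzr", "mov x15, xzr",
"mov x16, xzr", "mov x17, xzr", "mov x18, xzr", "mov x19, xzr", "mov x20, xzr",
"mov x21, xzr", "mov x22, xzr", "mov x23, xzr", "mov x24, xzr", "mov x25, xzr",
"mov x26, xzr", "mov x27, xzr", "mov x28, xzr", "mov x29, xzr", "mov x30, xzr",
"eret",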

m4tx requested a review from qwandor on February 5, 2026 at 16:47

m4tx requested a review from qwandor on February 20, 2026 at 13:20