Add `read_panic_message` kipc #2313
Conversation
Force-pushed from c4ca702 to 2c80a58
Currently, there is no way to programmatically access the panic message of a task which has faulted due to a Rust panic from within Hubris userspace. This branch adds a new `read_panic_message` kipc that copies the contents of a panicked task's panic message buffer into the caller. If the requested task has not panicked, this kipc returns an error indicating this. This is intended for use by supervisor implementations or other tasks which wish to report panic messages from userspace.

I've also added a test case that exercises this functionality.

Fixes #2311
Force-pushed from 2c80a58 to 5ba535a
sys/kern/src/kipc.rs
Outdated
```rust
if index >= tasks.len() {
    return Err(UserError::Unrecoverable(FaultInfo::SyscallUsage(
        UsageError::TaskOutOfRange,
    )));
}
```
Similarly, something like
```diff
-if index >= tasks.len() {
-    return Err(UserError::Unrecoverable(FaultInfo::SyscallUsage(
-        UsageError::TaskOutOfRange,
-    )));
-}
+let Some(task) = tasks.get(index) else {
+    return Err(UserError::Unrecoverable(FaultInfo::SyscallUsage(
+        UsageError::TaskOutOfRange,
+    )));
+};
```
(then use `task` below instead of indexing repeatedly)
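As a stand-alone illustration of the suggested pattern (not the actual kernel code; `lookup` and the string error are stand-ins for this sketch), `slice::get` plus `let … else` does the bounds check and the borrow in one step:

```rust
// Stand-in demonstrating the suggested `tasks.get(index)` + `let … else`
// pattern; `lookup` and the string error are hypothetical, not kernel code.
fn lookup<'a>(tasks: &'a [&'a str], index: usize) -> Result<&'a str, &'static str> {
    // One bounds check; `task` is then usable directly below instead of
    // re-indexing `tasks[index]` repeatedly.
    let Some(task) = tasks.get(index) else {
        return Err("TaskOutOfRange");
    };
    Ok(*task)
}

fn main() {
    let tasks = ["jefe", "packrat"];
    assert_eq!(lookup(&tasks, 1), Ok("packrat"));
    assert_eq!(lookup(&tasks, 5), Err("TaskOutOfRange"));
    println!("ok");
}
```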
Co-authored-by: Matt Keeter <matt@oxide.computer>
```rust
        ..
    } = task.state()
    else {
        return Err(UserError::Recoverable(
```
Do you have any code that uses this API yet? I suspect this error is going to be unwrap()'d, in which case we might want to eliminate it and have the kernel fault the caller if they got the task state wrong. (Which is to say, switch this to Unrecoverable.)
FWIW all the other errors in this implementation look right to me.
Hmm, I was imagining we might see this error in the event that a task responsible for collecting panic messages (i.e., packrat) gets pinged about a panic, but doesn't actually get to run before a timer in jefe elapses and the task gets restarted. In that case, the task would no longer have a panic message to share, and packrat (or whomever wants the panic message) has just missed its chance to read the panic. I would definitely not want to panic packrat in that case, so I felt like this error ought to be handle-able without panicking.
If we don't end up implementing that behavior in the supervisor, and instead make it wait to be informed that a panic message has been collected before restarting the faulted task, it might make more sense to make this case a fault for the caller. But, the approach we discussed in #2309 (comment) made me feel inclined to go down the path of treating the restart cooldown in jefe as a timeout for collecting panic messages, so I was planning to do that (so that packrat or similar can't block other tasks from restarting indefinitely). And, even if we did decide to make jefe wait for panic messages to be recorded before restarting a task, making this a handleable error allows more flexibility for other (theoretical) supervisor implementations to do other things.
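A minimal sketch of the non-panicking handling described above. `ReadPanicError` and this `read_panic_message` signature are hypothetical stand-ins for illustration, not the real userlib API:

```rust
// Hypothetical stand-ins (not the real Hubris userlib API): the point is only
// that the "task is no longer panicked" case is matched, not unwrapped.
#[derive(Debug, PartialEq)]
enum ReadPanicError {
    /// The target task is not (or is no longer) in the panicked state, e.g.
    /// because jefe's restart timer elapsed before we got to run.
    NotPanicked,
}

fn read_panic_message(panicked: bool) -> Result<&'static str, ReadPanicError> {
    if panicked {
        Ok("panicked at 'oh no', main.rs:1:1")
    } else {
        Err(ReadPanicError::NotPanicked)
    }
}

fn main() {
    // A packrat-like collector tolerates losing the race with the restart:
    let report = match read_panic_message(false) {
        Ok(msg) => format!("panic: {msg}"),
        Err(ReadPanicError::NotPanicked) => {
            "panic message lost; task already restarted".to_string()
        }
    };
    assert_eq!(report, "panic message lost; task already restarted");
    println!("{report}");
}
```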
> Hmm, I was imagining we might see this error in the event that a task responsible for collecting panic messages (i.e., packrat) gets pinged about a panic, but doesn't actually get to run before a timer in `jefe` elapses and the task gets restarted. In that case, the task would no longer have a panic message to share, and packrat (or whomever wants the panic message) has just missed its chance to read the panic. I would definitely not want to panic packrat in that case, so I felt like this error ought to be handle-able without panicking.
I find this convincing, I hadn't considered which task was likely to be calling this API.
sys/userlib/src/kipc.rs
Outdated
```rust
///
/// Note that Hubris only preserves the first [`PANIC_MESSAGE_MAX_LEN`] bytes of
/// a task's panic message, and panic messages greater than that length are
/// truncated. Thus, this function accepts a buffer of that length.
```
Just wanted to note that we could most likely loosen this later, if required, by changing this to a `&mut [u8]` and not breaking callers.
Yup, that was my thinking as well.
sys/userlib/src/kipc.rs
Outdated
```rust
///
/// - [`Ok`]`(&[u8])` if the task is panicked. The returned slice is borrowed
///   from `buf`, and contains the task's panic message as a sequence of
///   UTF-8 bytes. Note that the slice may be empty, if the task has panicked
```
Technically, nothing ensures that the contents of that slice are UTF-8, so I wouldn't advertise that. The panic message is normally expected to be UTF-8 but it may be truncated in the middle of a multi-byte sequence, and the task is panicking so strictly speaking it could have stomped all over its RAM, producing garbage.
This is another case where seeing a caller might clarify the API design -- it might make sense to call `utf8_chunks` on the buffer and return the iterator that results, so that the caller can easily step over valid and invalid sections if desired. Or, callers may just byte-copy the result into CBOR and not parse it at all (in which case it's quite important that it not hit CBOR as UTF-8!).
Yeah. So, I was planning to probably just encode the valid regions as a CBOR string in the caller, which I haven't written yet. But, I had a vague sense that the KIPC API ought not make too many decisions about what the caller wants; I had kind of been considering having a higher-level `read_panic_message_utf8` or something that just gives you the valid UTF-8 regions. The sense I got from the other KIPC APIs is that they don't tend to do much beyond what's necessary to actually send and receive the IPC from the kernel, so I wanted to be consistent with that.

On the other hand, I suspect basically all callers are going to want to get a string out and not want anything that isn't UTF-8, so maybe this API should just do the higher-level behavior. I think you're starting to sell me on that.
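A small, self-contained sketch of the `utf8_chunks`-based approach being discussed, using a made-up buffer with a truncated multi-byte sequence in it:

```rust
fn main() {
    // A made-up panic-message buffer whose trailing multi-byte sequence was
    // truncated mid-character (0xE2 0x80 starts a 3-byte sequence but is
    // incomplete).
    let buf: &[u8] = b"panicked at 'oops\xE2\x80";

    // utf8_chunks (stable since Rust 1.79) splits the buffer into valid
    // UTF-8 runs and the invalid bytes between them, without allocating.
    let mut out = String::new();
    for chunk in buf.utf8_chunks() {
        out.push_str(chunk.valid());
        if !chunk.invalid().is_empty() {
            out.push('\u{FFFD}'); // replacement character for garbage bytes
        }
    }
    assert_eq!(out, "panicked at 'oops\u{FFFD}");
    println!("{out}");
}
```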
test/test-suite/src/main.rs
Outdated
```rust
    .unwrap();
// it should look kinda like a panic message (but since the line number may
// change, don't make assertions about the entire contents of the string...
assert!(core::str::from_utf8(msg)
```
I'm not sure the expect here is right -- assuming the assistant panics with an ASCII-only message (which it likely does) you could still get this test to fail by running it in a deep path containing non-ASCII characters.
Since in the end you're only checking the prefix of the string, you can get the valid UTF-8 prefix using
```rust
msg.utf8_chunks().next().unwrap().valid()
```
(the unwrap() there is an unfortunate detail of the utf8_chunks implementation, which is defined as returning an iterator yielding at least one chunk... but the type system can't see that.)
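For example, with a message truncated in the middle of a multi-byte character (the buffer below is made up for the sketch), the first chunk's `valid()` is exactly the leading valid-UTF-8 prefix:

```rust
fn main() {
    // Made-up buffer: a panic message cut off midway through a multi-byte
    // character (0xC3 alone is an incomplete two-byte sequence).
    let msg: &[u8] = b"panicked at task/assistant\xC3";

    // The first chunk's valid() is the leading valid-UTF-8 run; the unwrap()
    // is fine here because the buffer is non-empty, matching the reviewer's
    // note that the iterator yields at least one chunk.
    let prefix = msg.utf8_chunks().next().unwrap().valid();

    assert_eq!(prefix, "panicked at task/assistant");
    assert!(prefix.starts_with("panicked at"));
    println!("{prefix}");
}
```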
Ah, that's a good point, thanks!
Co-authored-by: Cliff L. Biffle <cliff@oxide.computer>
In order for Packrat to produce an ereport when another task faults (see #2341), we must have a way to notify it when such a fault has occurred. Furthermore, our scheme for reporting such faults requires that some data be recorded *before* the task has been restarted. In particular, if a task faults due to a panic, the panic message must be read out of its address space before it is restarted, via the `read_panic_message` KIPC I added in #2313. Similarly, the `read_task_status` KIPC can only provide details on the fault while the task remains in the `Faulted` state; when it is restarted, that data is cleared. Thus, we must delay the task's restart for some period of time so that Packrat can gather this information and stuff it into an ereport.

This commit solves both of these problems. First, I've added a new config field to Jefe to allow providing a list of other tasks to which a notification should be posted when a fault occurs (similarly to the existing state-change notification). Second, I've changed the timeout behavior a bit so that, if Jefe is notifying other tasks on a fault, it will _always_ wait at least 5 ms before restarting a faulted task, even if the task has been running for longer than `MIN_RUN_TIME` since its last fault. This way, we always allow some time for a fault report to be produced if someone cares about the task having faulted. The 5 ms minimum delay is elided if there are no tasks to notify on faults.
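A hedged sketch of the restart-delay rule just described; the function, the names, and the `MIN_RUN_TIME` value are mine for illustration, mirroring the prose rather than the actual jefe source:

```rust
// Illustrative only: these names and this structure are not the real jefe
// code, just the rule from the commit message expressed as arithmetic.
const MIN_RUN_TIME_MS: u64 = 1000; // assumed value for the example
const FAULT_NOTIFICATION_DELAY_MS: u64 = 5;

/// When may a faulted task be restarted, given when it faulted, when it last
/// started, and whether any tasks are registered for fault notifications?
fn restart_deadline_ms(
    fault_time_ms: u64,
    last_start_ms: u64,
    notify_on_fault: bool,
) -> u64 {
    // Existing cooldown: a task that faults quickly after starting waits out
    // the remainder of MIN_RUN_TIME.
    let cooldown_end = last_start_ms + MIN_RUN_TIME_MS;
    let base = cooldown_end.max(fault_time_ms);
    if notify_on_fault {
        // Always leave at least 5 ms for fault observers (e.g. packrat) to
        // read the panic message / task status before the restart.
        base.max(fault_time_ms + FAULT_NOTIFICATION_DELAY_MS)
    } else {
        base
    }
}

fn main() {
    // Task ran well past MIN_RUN_TIME, then faulted at t = 10_000 ms:
    assert_eq!(restart_deadline_ms(10_000, 0, false), 10_000);
    // With fault observers registered, the 5 ms floor still applies:
    assert_eq!(restart_deadline_ms(10_000, 0, true), 10_005);
    println!("ok");
}
```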
Depends on #2313, #2350, #2358. Fixes #2309.

It's currently somewhat difficult to become aware of Hubris task panics and other task faults in a production environment. While MGS can ask the SP to list task dumps as part of the API for reading dumps, this requires that the control plane (or a faux-mgs user) proactively ask the SP whether it has any record of panicked tasks, rather than recording panics as they occur. Therefore, we should have a proactive notification from the SP indicating that task faults have occurred.

This commit adds code to `packrat` for producing an ereport when a task has faulted. This could eventually be used by the control plane to trigger dump collection and produce a service bundle. In addition, it will provide a more permanent record that a task faulted at a particular time, even if the SP that contains the faulted task is later reset or replaced with an entirely different SP. This works using an approach similar to the one described by @cbiffle in [this comment][1]. There's a detailed description of how this works [in the module-level RustDoc for `ereport.rs` in Packrat][2].

The ereports that come out of this thing look like this:

```console
eliza@hekate ~/Code/oxide/hubris $ faux-mgs --interface eno1np0 --discovery-addr '[fe80::0c1d:deff:fef0:d922]:11111' ereports
Jan 15 10:27:41.370 INFO creating SP handle on interface eno1np0, component: faux-mgs
Jan 15 10:27:41.372 INFO initial discovery complete, addr: [fe80::c1d:deff:fef0:d922%2]:11111, interface: eno1np0, socket: control-plane-agent, component: faux-mgs
restart ID: 4e54b7f1-e13a-d9bb-709a-c7e863d64a64
restart IDs did not match (requested 00000000-0000-0000-0000-000000000000)
count: 4
ereports:
 0x1: {
    "ereport_message_version": Number(0),
    "hubris_task_gen": Number(0),
    "hubris_task_name": String("packrat"),
    "hubris_uptime_ms": Number(0),
    "lost": Null,
}
 0x2: {
    "ereport_message_version": Number(0),
    "hubris_task_gen": Number(0),
    "hubris_task_name": String("ereportulator"),
    "hubris_uptime_ms": Number(378010),
    "k": String("hubris.fault.panic"),
    "msg": String("panicked at task/ereportulator/src/main.rs:158:9:\nim dead lol"),
    "v": Number(0),
}
 0x3: {
    "by": Object {
        "gen": Number(0),
        "task": String("jefe"),
    },
    "ereport_message_version": Number(0),
    "hubris_task_gen": Number(0),
    "hubris_task_name": String("user_leds"),
    "hubris_uptime_ms": Number(382914),
    "k": String("hubris.fault.injected"),
    "v": Number(0),
}
 0x4: {
    "by": Object {
        "gen": Number(0),
        "task": String("jefe"),
    },
    "ereport_message_version": Number(0),
    "hubris_task_gen": Number(1),
    "hubris_task_name": String("ereportulator"),
    "hubris_uptime_ms": Number(388215),
    "k": String("hubris.fault.injected"),
    "v": Number(0),
}
```

[1]: #2309 (comment)
[2]: https://github.com/oxidecomputer/hubris/blob/1e2b121fcd165dcdbcbdfdc69c36d35520dfa954/task/packrat/src/ereport.rs#L15-L95