Add `read_panic_message` kipc #2313
Conversation
Force-pushed from c4ca702 to 2c80a58
Currently, there is no way to programmatically access the panic message of a task which has faulted due to a Rust panic from within Hubris userspace. This branch adds a new `read_panic_message` kipc that copies the contents of a panicked task's panic message buffer into the caller. If the requested task has not panicked, this kipc returns an error indicating this. This is intended for use by supervisor implementations or other tasks which wish to report panic messages from userspace.

I've also added a test case that exercises this functionality.

Fixes #2311
Force-pushed from 2c80a58 to 5ba535a
sys/kern/src/kipc.rs
Outdated
```rust
if index >= tasks.len() {
    return Err(UserError::Unrecoverable(FaultInfo::SyscallUsage(
        UsageError::TaskOutOfRange,
    )));
}
```
Similarly, something like
```diff
-if index >= tasks.len() {
-    return Err(UserError::Unrecoverable(FaultInfo::SyscallUsage(
-        UsageError::TaskOutOfRange,
-    )));
-}
+let Some(task) = tasks.get(index) else {
+    return Err(UserError::Unrecoverable(FaultInfo::SyscallUsage(
+        UsageError::TaskOutOfRange,
+    )));
+};
```
(then use `task` below instead of indexing repeatedly)
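As a stand-alone illustration of the suggested pattern (not the actual kernel code; `lookup` and the string error are stand-ins for this sketch), `slice::get` plus `let … else` does the bounds check and the borrow in one step:

```rust
// Stand-in demonstrating the suggested `tasks.get(index)` + `let … else`
// pattern; `lookup` and the string error are hypothetical, not kernel code.
fn lookup<'a>(tasks: &'a [&'a str], index: usize) -> Result<&'a str, &'static str> {
    // One bounds check; `task` is then usable directly below instead of
    // re-indexing `tasks[index]` repeatedly.
    let Some(task) = tasks.get(index) else {
        return Err("TaskOutOfRange");
    };
    Ok(*task)
}

fn main() {
    let tasks = ["jefe", "packrat"];
    assert_eq!(lookup(&tasks, 1), Ok("packrat"));
    assert_eq!(lookup(&tasks, 5), Err("TaskOutOfRange"));
    println!("ok");
}
```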
Co-authored-by: Matt Keeter <matt@oxide.computer>
```rust
        ..
    } = task.state()
    else {
        return Err(UserError::Recoverable(
```
Do you have any code that uses this API yet? I suspect this error is going to be unwrap()'d, in which case we might want to eliminate it and have the kernel fault the caller if they got the task state wrong. (Which is to say, switch this to Unrecoverable.)
FWIW all the other errors in this implementation look right to me.
Hmm, I was imagining we might see this error in the event that a task responsible for collecting panic messages (i.e., packrat) gets pinged about a panic, but doesn't actually get to run before a timer in jefe elapses and the task gets restarted. In that case, the task would no longer have a panic message to share, and packrat (or whomever wants the panic message) has just missed its chance to read the panic. I would definitely not want to panic packrat in that case, so I felt like this error ought to be handle-able without panicking.
If we don't end up implementing that behavior in the supervisor, and instead make it wait to be informed that a panic message has been collected before restarting the faulted task, it might make more sense to make this case a fault for the caller. But, the approach we discussed in #2309 (comment) made me feel inclined to go down the path of treating the restart cooldown in jefe as a timeout for collecting panic messages, so I was planning to do that (so that packrat or similar can't block other tasks from restarting indefinitely). And, even if we did decide to make jefe wait for panic messages to be recorded before restarting a task, making this a handleable error allows more flexibility for other (theoretical) supervisor implementations to do other things.
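A minimal sketch of the non-panicking handling described above. `ReadPanicError` and this `read_panic_message` signature are hypothetical stand-ins for illustration, not the real userlib API:

```rust
// Hypothetical stand-ins (not the real Hubris userlib API): the point is only
// that the "task is no longer panicked" case is matched, not unwrapped.
#[derive(Debug, PartialEq)]
enum ReadPanicError {
    /// The target task is not (or is no longer) in the panicked state, e.g.
    /// because jefe's restart timer elapsed before we got to run.
    NotPanicked,
}

fn read_panic_message(panicked: bool) -> Result<&'static str, ReadPanicError> {
    if panicked {
        Ok("panicked at 'oh no', main.rs:1:1")
    } else {
        Err(ReadPanicError::NotPanicked)
    }
}

fn main() {
    // A packrat-like collector tolerates losing the race with the restart:
    let report = match read_panic_message(false) {
        Ok(msg) => format!("panic: {msg}"),
        Err(ReadPanicError::NotPanicked) => {
            "panic message lost; task already restarted".to_string()
        }
    };
    assert_eq!(report, "panic message lost; task already restarted");
    println!("{report}");
}
```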
> Hmm, I was imagining we might see this error in the event that a task responsible for collecting panic messages (i.e., packrat) gets pinged about a panic, but doesn't actually get to run before a timer in `jefe` elapses and the task gets restarted. In that case, the task would no longer have a panic message to share, and packrat (or whomever wants the panic message) has just missed its chance to read the panic. I would definitely not want to panic packrat in that case, so I felt like this error ought to be handle-able without panicking.
I find this convincing, I hadn't considered which task was likely to be calling this API.
sys/userlib/src/kipc.rs
Outdated
```rust
///
/// Note that Hubris only preserves the first [`PANIC_MESSAGE_MAX_LEN`] bytes of
/// a task's panic message, and panic messages greater than that length are
/// truncated. Thus, this function accepts a buffer of that length.
```
Just wanted to note that we could most likely loosen this later, if required, by changing this to a `&mut [u8]` and not breaking callers.
Yup, that was my thinking as well.
sys/userlib/src/kipc.rs
Outdated
```rust
///
/// - [`Ok`]`(&[u8])` if the task is panicked. The returned slice is borrowed
///   from `buf`, and contains the task's panic message as a sequence of
///   UTF-8 bytes. Note that the slice may be empty, if the task has panicked
```
Technically, nothing ensures that the contents of that slice are UTF-8, so I wouldn't advertise that. The panic message is normally expected to be UTF-8 but it may be truncated in the middle of a multi-byte sequence, and the task is panicking so strictly speaking it could have stomped all over its RAM, producing garbage.
This is another case where seeing a caller might clarify the API design -- it might make sense to call `utf8_chunks` on the buffer and return the iterator that results, so that the caller can easily step over valid and invalid sections if desired. Or, callers may just byte-copy the result into CBOR and not parse it at all (in which case it's quite important that it not hit CBOR as UTF-8!).
Yeah. So, I was planning to probably just encode the valid regions as a CBOR string in the caller, which I haven't written yet. But, I had a vague sense that the KIPC API ought not make too many decisions about what the caller wants; I had kind of been considering having a higher-level `read_panic_message_utf8` or something that just gives you the valid UTF-8 regions. The sense I got from the other KIPC APIs is that they don't tend to do much beyond what's necessary to actually send and receive the IPC from the kernel, so I wanted to be consistent with that.

On the other hand, I suspect basically all callers are going to want to get a string out and not want anything that isn't UTF-8, so maybe this API should just do the higher-level behavior. I think you're starting to sell me on that.
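A small, self-contained sketch of the `utf8_chunks`-based approach being discussed, using a made-up buffer with a truncated multi-byte sequence in it:

```rust
fn main() {
    // A made-up panic-message buffer whose trailing multi-byte sequence was
    // truncated mid-character (0xE2 0x80 starts a 3-byte sequence but is
    // incomplete).
    let buf: &[u8] = b"panicked at 'oops\xE2\x80";

    // utf8_chunks (stable since Rust 1.79) splits the buffer into valid
    // UTF-8 runs and the invalid bytes between them, without allocating.
    let mut out = String::new();
    for chunk in buf.utf8_chunks() {
        out.push_str(chunk.valid());
        if !chunk.invalid().is_empty() {
            out.push('\u{FFFD}'); // replacement character for garbage bytes
        }
    }
    assert_eq!(out, "panicked at 'oops\u{FFFD}");
    println!("{out}");
}
```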
test/test-suite/src/main.rs
Outdated
```rust
    .unwrap();
// it should look kinda like a panic message (but since the line number may
// change, don't make assertions about the entire contents of the string...
assert!(core::str::from_utf8(msg)
```
I'm not sure the expect here is right -- assuming the assistant panics with an ASCII-only message (which it likely does) you could still get this test to fail by running it in a deep path containing non-ASCII characters.
Since in the end you're only checking the prefix of the string, you can get the valid UTF-8 prefix using
```rust
msg.utf8_chunks().next().unwrap().valid()
```
(the unwrap() there is an unfortunate detail of the utf8_chunks implementation, which is defined as returning an iterator yielding at least one chunk... but the type system can't see that.)
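For example, with a message truncated in the middle of a multi-byte character (the buffer below is made up for the sketch), the first chunk's `valid()` is exactly the leading valid-UTF-8 prefix:

```rust
fn main() {
    // Made-up buffer: a panic message cut off midway through a multi-byte
    // character (0xC3 alone is an incomplete two-byte sequence).
    let msg: &[u8] = b"panicked at task/assistant\xC3";

    // The first chunk's valid() is the leading valid-UTF-8 run; the unwrap()
    // is fine here because the buffer is non-empty, matching the reviewer's
    // note that the iterator yields at least one chunk.
    let prefix = msg.utf8_chunks().next().unwrap().valid();

    assert_eq!(prefix, "panicked at task/assistant");
    assert!(prefix.starts_with("panicked at"));
    println!("{prefix}");
}
```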
Ah, that's a good point, thanks!
Co-authored-by: Cliff L. Biffle <cliff@oxide.computer>
In order for Packrat to produce an ereport when another task faults (see #2341), we must have a way to notify it when such a fault has occurred. Furthermore, our scheme for reporting such faults requires that some data be recorded *before* the task has been restarted. In particular, if a task faults due to a panic, the panic message must be read out of its address space before it is restarted, via the `read_panic_message` KIPC I added in #2313. Similarly, the `read_task_status` KIPC can only provide details on the fault while the task remains in the `Faulted` state; when it is restarted, that data is cleared. Thus, we must delay the task's restart for some period of time so that Packrat can gather this information and stuff it into an ereport.

This commit solves both of these problems. First, I've added a new config field to Jefe to allow providing a list of other tasks to which a notification should be posted when a fault occurs (similarly to the existing state-change notification). Second, I've changed the timeout behavior a bit so that, if Jefe is notifying other tasks on a fault, it will _always_ wait at least 5 ms before restarting a faulted task, even if the task has been running for longer than `MIN_RUN_TIME` since its last fault. This way, we always allow some time for a fault report to be produced if someone cares about the task having faulted. The 5 ms minimum delay is elided if there are no tasks to notify on faults.
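A hedged sketch of the restart-delay rule just described; the function, the names, and the `MIN_RUN_TIME` value are mine for illustration, mirroring the prose rather than the actual jefe source:

```rust
// Illustrative only: these names and this structure are not the real jefe
// code, just the rule from the commit message expressed as arithmetic.
const MIN_RUN_TIME_MS: u64 = 1000; // assumed value for the example
const FAULT_NOTIFICATION_DELAY_MS: u64 = 5;

/// When may a faulted task be restarted, given when it faulted, when it last
/// started, and whether any tasks are registered for fault notifications?
fn restart_deadline_ms(
    fault_time_ms: u64,
    last_start_ms: u64,
    notify_on_fault: bool,
) -> u64 {
    // Existing cooldown: a task that faults quickly after starting waits out
    // the remainder of MIN_RUN_TIME.
    let cooldown_end = last_start_ms + MIN_RUN_TIME_MS;
    let base = cooldown_end.max(fault_time_ms);
    if notify_on_fault {
        // Always leave at least 5 ms for fault observers (e.g. packrat) to
        // read the panic message / task status before the restart.
        base.max(fault_time_ms + FAULT_NOTIFICATION_DELAY_MS)
    } else {
        base
    }
}

fn main() {
    // Task ran well past MIN_RUN_TIME, then faulted at t = 10_000 ms:
    assert_eq!(restart_deadline_ms(10_000, 0, false), 10_000);
    // With fault observers registered, the 5 ms floor still applies:
    assert_eq!(restart_deadline_ms(10_000, 0, true), 10_005);
    println!("ok");
}
```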
Depends on #2313, #2350, #2358. Fixes #2309.

It's currently somewhat difficult to become aware of Hubris task panics and other task faults in a production environment. While MGS can ask the SP to list task dumps as part of the API for reading dumps, this requires that the control plane (or a faux-mgs user) proactively ask the SP whether it has any record of panicked tasks, rather than recording panics as they occur. Therefore, we should have a proactive notification from the SP indicating that task faults have occurred.

This commit adds code to `packrat` for producing an ereport when a task has faulted. This could eventually be used by the control plane to trigger dump collection and produce a service bundle. In addition, it will provide a more permanent record that a task faulted at a particular time, even if the SP that contains the faulted task is later reset or replaced with an entirely different SP. This works using an approach similar to the one described by @cbiffle in [this comment][1]. There's a detailed description of how this works [in the module-level RustDoc for `ereport.rs` in Packrat][2].

The ereports that come out of this thing look like this:

```console
eliza@hekate ~/Code/oxide/hubris $ faux-mgs --interface eno1np0 --discovery-addr '[fe80::0c1d:deff:fef0:d922]:11111' ereports
Jan 15 10:27:41.370 INFO creating SP handle on interface eno1np0, component: faux-mgs
Jan 15 10:27:41.372 INFO initial discovery complete, addr: [fe80::c1d:deff:fef0:d922%2]:11111, interface: eno1np0, socket: control-plane-agent, component: faux-mgs
restart ID: 4e54b7f1-e13a-d9bb-709a-c7e863d64a64
restart IDs did not match (requested 00000000-0000-0000-0000-000000000000)
count: 4
ereports:
 0x1: {
    "ereport_message_version": Number(0),
    "hubris_task_gen": Number(0),
    "hubris_task_name": String("packrat"),
    "hubris_uptime_ms": Number(0),
    "lost": Null,
}
 0x2: {
    "ereport_message_version": Number(0),
    "hubris_task_gen": Number(0),
    "hubris_task_name": String("ereportulator"),
    "hubris_uptime_ms": Number(378010),
    "k": String("hubris.fault.panic"),
    "msg": String("panicked at task/ereportulator/src/main.rs:158:9:\nim dead lol"),
    "v": Number(0),
}
 0x3: {
    "by": Object {
        "gen": Number(0),
        "task": String("jefe"),
    },
    "ereport_message_version": Number(0),
    "hubris_task_gen": Number(0),
    "hubris_task_name": String("user_leds"),
    "hubris_uptime_ms": Number(382914),
    "k": String("hubris.fault.injected"),
    "v": Number(0),
}
 0x4: {
    "by": Object {
        "gen": Number(0),
        "task": String("jefe"),
    },
    "ereport_message_version": Number(0),
    "hubris_task_gen": Number(1),
    "hubris_task_name": String("ereportulator"),
    "hubris_uptime_ms": Number(388215),
    "k": String("hubris.fault.injected"),
    "v": Number(0),
}
```

[1]: #2309 (comment)
[2]: https://github.com/oxidecomputer/hubris/blob/1e2b121fcd165dcdbcbdfdc69c36d35520dfa954/task/packrat/src/ereport.rs#L15-L95