Make vats handle syscall failures better (or at all) #333

@FUDCo

Requires: #281

While implementing #487, we discovered that syscall failures are not being handled properly. Syscalls originate in the vat (specifically the supervisor) and are handled by the kernel. By design they should never fail, but because we are fallible, they could still fail due to programmer error.

Because syscalls have to be synchronous at the call site, we were already "lying" to the vat supervisor by synchronously returning an ok result for every syscall. The idea, however, was that we would still handle any errors on the kernel side. This will probably include "rolling back" the current crank, which will in turn involve some messages to the vat, perhaps merely telling the vat to shut down so the kernel can restart it from the previous crank.
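One way to picture the kernel-side rollback is to stage a crank's state changes and only commit them if no syscall failed. The following is a minimal sketch under that assumption; `CrankState`, `stage`, `noteSyscallFailure`, and `endCrank` are hypothetical names for illustration, not existing kernel APIs.

```typescript
// Hypothetical sketch: every name here is illustrative, not taken from the kernel.
type CrankOutcome =
  | { status: 'committed' }
  | { status: 'rolled-back'; errors: string[] };

class CrankState {
  // State as of the end of the last successful crank.
  private committed = new Map<string, string>();
  // Changes made during the current crank, not yet visible to readers.
  private staged = new Map<string, string>();
  // Syscall failures observed during the current crank.
  private errors: string[] = [];

  // Record a state change requested by a syscall in the current crank.
  stage(key: string, value: string): void {
    this.staged.set(key, value);
  }

  // Record that the kernel failed to handle a syscall in the current crank.
  noteSyscallFailure(message: string): void {
    this.errors.push(message);
  }

  // Reads only ever see committed state.
  read(key: string): string | undefined {
    return this.committed.get(key);
  }

  // Commit the staged changes, or discard all of them if any syscall failed.
  endCrank(): CrankOutcome {
    const errors = this.errors;
    const staged = this.staged;
    this.errors = [];
    this.staged = new Map<string, string>();
    if (errors.length > 0) {
      // The caller would now tell the vat to shut down and restart it
      // from the committed (pre-crank) state.
      return { status: 'rolled-back', errors };
    }
    staged.forEach((value, key) => this.committed.set(key, value));
    return { status: 'committed' };
  }
}
```

With this shape, a single failed syscall causes every change from the crank to be dropped, so the vat can be restarted against the state it had when the crank began.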

Details

The liveslots syscall API (which vats use to request services from the kernel) is synchronous, whereas the vat<->kernel communications pathway is necessarily asynchronous; this is a significant mismatch. The current syscall implementation papers over this via the crude expedient of optimistically assuming that syscalls always succeed. Fortunately for us, this is generally a safe assumption: there's nothing in the nature of these calls that normally even merits a response, the operations they perform are deterministic, and for the most part there's no reason for them ever to fail. This allows us to synchronously return a success result to the syscall invoker while the actual result arrives asynchronously at a later time.

This raises the question, though, of what we should do if a syscall actually fails. While syscall failures should never be possible as the result of user code misbehavior, they unfortunately can happen as the result of unrelated system problems in the enclosing environment (or of kernel bugs, which can't be discounted completely but which we can, at least in principle, be free of). A syscall failure represents a failure of invariants that a vat relies on, so it should be treated as a fatal condition.

A fatal failure inside a vat should abort the entire computational step ("crank") that the vat is currently executing, causing the state of the vat to revert to what it was at the start of the crank (i.e., immediately prior to whichever message delivery initiated the vat's current line of computation). Because the entire crank is being aborted, it's not important that the abort happen synchronously, as long as it happens before the end of the crank. At the end of a crank, it is possible to await the results of any outstanding syscalls without breaking the illusion presented to liveslots that they are synchronous. This needs to be implemented.
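The end-of-crank check described above could be sketched roughly as follows. This is an illustration only: `SyscallTracker`, `SendToKernel`, `settleCrank`, and `runCrank` are hypothetical names, and the real supervisor and kernel interfaces may look quite different. The sketch assumes that sending a syscall to the kernel yields a promise that rejects if the kernel fails to handle it.

```typescript
// Hypothetical types; none of these names come from the actual codebase.
type SyscallResult = ['ok', null];
type SendToKernel = (syscall: unknown) => Promise<void>;

class SyscallTracker {
  private sendToKernel: SendToKernel;
  // One entry per syscall issued in the current crank; each resolves to
  // null on success or to the Error describing the failure.
  private pending: Promise<Error | null>[] = [];

  constructor(sendToKernel: SendToKernel) {
    this.sendToKernel = sendToKernel;
  }

  // Called synchronously by liveslots: dispatch the syscall and
  // optimistically report success without waiting for the kernel.
  syscall(request: unknown): SyscallResult {
    const outcome = this.sendToKernel(request).then(
      () => null,
      (reason: unknown) =>
        reason instanceof Error ? reason : new Error(String(reason)),
    );
    this.pending.push(outcome);
    return ['ok', null];
  }

  // Called at the end of the crank: wait for every outstanding syscall
  // and return whatever failures occurred.
  async settleCrank(): Promise<Error[]> {
    const batch = this.pending;
    this.pending = [];
    const outcomes = await Promise.all(batch);
    return outcomes.filter((e): e is Error => e !== null);
  }
}

// Run one delivery, then decide whether the crank may be committed.
async function runCrank(
  tracker: SyscallTracker,
  deliver: () => Promise<void>,
): Promise<'commit' | 'abort'> {
  await deliver();
  const failures = await tracker.settleCrank();
  return failures.length === 0 ? 'commit' : 'abort';
}
```

Because failures are collected only when the crank ends, liveslots never observes anything but the synchronous ok result, while the caller of `runCrank` still learns in time that the crank must be aborted.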
