Is this the main and only use case for the thread-local panic logic of this PR? I'm trying to better understand whether the refactor is worth it at the moment. I thought we printed the error in the VM panic handler. Therefore:
@vicsn I extended the PR description
///
/// It is more efficient to set the panic hook once and directly use `errors::try_vm_runtime`.
#[macro_export]
macro_rules! try_vm_runtime {
Don't all the finalize instances still call the macro try_vm_runtime! instead of errors::try_vm_runtime()?
This would call set_panic_hook multiple times rather than the one time you claim "Should be called exactly once."
I clarified the documentation that it is safe to call set_panic_hook multiple times. It will just not have any effect after the first invocation. The most recent commit also adds a boolean flag that is set once the handler is installed, to make sure there is no performance impact of calling set_panic_hook multiple times.
It would be great if we could replace that macro eventually, but that would require users of snarkVM to call set_panic_hook themselves.
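For readers following the thread, here is a minimal sketch of how a hook installer can be made safe to call repeatedly by guarding it with a boolean flag. It is not the snarkVM implementation: the names `HOOK_INSTALLED` and `set_panic_hook_once` are hypothetical, and the hook body is a placeholder that only forwards to the previously installed hook.

```rust
use std::panic;
use std::sync::atomic::{AtomicBool, Ordering};

/// Hypothetical flag for this sketch; set once the hook has been installed.
static HOOK_INSTALLED: AtomicBool = AtomicBool::new(false);

/// Installs a panic hook on the first call and returns `true`.
/// Every later call returns `false` and leaves the hook untouched.
pub fn set_panic_hook_once() -> bool {
    // `swap` returns the previous value, so only the first caller observes `false`.
    if HOOK_INSTALLED.swap(true, Ordering::AcqRel) {
        return false;
    }

    // Keep the previous hook so the usual panic output is preserved.
    let previous = panic::take_hook();
    panic::set_hook(Box::new(move |info| {
        // Placeholder: a real hook would record the message and backtrace here.
        previous(info);
    }));
    true
}
```

After the first call, the cost of a repeated call is a single atomic swap. Note that in this sketch a second thread calling concurrently may return before the first thread has finished installing the hook; `std::sync::Once` would close that gap if it matters.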
Logic looks reasonable, would be nice to see more unit tests to ensure the top level (non- |
This PR is a follow-up on #2813.
Context: Handling Fatal Errors in snarkVM
Generally, snarkVM and snarkOS aim never to panic where it can be avoided. However, there is always the possibility of bugs that were overlooked, or of ending up in a state from which the code cannot recover safely.
If the code panics, there are two different scenarios:
`try_vm_runtime`) we can safely log it and continue operating normally. snarkVM will simply treat this as a failed execution.
We had several instances where a thread or tokio task panicked, and, while information about the panic was printed to `stdout`, there was no mention of it in the logs.
While the former is already implemented, the latter was partially introduced by #2813, and this PR adds some important improvements and fixes. The behavior of `try_vm_runtime` remains unchanged by this, although its implementation changes slightly.
Improved Panic Handling Mechanisms
Panic handling as implemented in #2813 (and also before, but to a limited extent) has a major shortcoming: it does not work well with multi-threading. The panic handler is set per process, so while a snarkVM program is executing, another thread unrelated to the execution might panic, and snarkVM may falsely log it as a VM error.
The proposed mechanism in this PR is to store backtraces and error messages in a thread-local variable. `try_vm_runtime` is changed to retrieve the error message from this thread-local variable.
Additionally, there is now a `catch_unwind` function that wraps around the function of the same name from the standard library and returns the error message and backtrace on error. The latter is useful when printing error messages about panics, for example, in snarkOS.
Utilities for Task Management
`tokio` allows propagating panics through `JoinHandle`s. This PR also provides wrappers around tokio's `spawn` and `spawn_blocking` functions that contain the boilerplate for propagating such panics.
This replaces logic that previously resided in `snarkos-node-bft` but will be useful to other crates.
Usage
In snarkVM
snarkVM has one notable instance of spawning a background task, the sequential ops thread (see the changes in #2975). 80c81d8 adds `catch_unwind` to that thread to ensure future panics in the storage thread are logged properly.
In snarkOS
For snarkOS, the plan is to use `catch_unwind` for important background tasks such as block synchronization. Again, this is to ensure that a panic in block synchronization does not happen silently, leaving block synchronization stopped for no apparent reason.
ProvableHQ/snarkOS#3874 contains the full set of proposed changes for snarkOS.
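As a rough illustration of the thread-local mechanism described under "Improved Panic Handling Mechanisms", the sketch below records the panic message and backtrace in a thread-local slot from inside the panic hook, and a wrapper around `std::panic::catch_unwind` reads that slot on failure. This is a sketch under assumptions, not the code from this PR: `LAST_PANIC`, `install_hook`, and `catch_unwind_with_details` are hypothetical names, and the real `catch_unwind` in snarkVM may have a different signature and return type.

```rust
use std::backtrace::Backtrace;
use std::cell::RefCell;
use std::panic::{self, UnwindSafe};
use std::sync::Once;

thread_local! {
    /// (message, backtrace) of the most recent panic on *this* thread.
    /// Hypothetical name; panics on other threads never write to this slot.
    static LAST_PANIC: RefCell<Option<(String, String)>> = RefCell::new(None);
}

static INSTALL_HOOK: Once = Once::new();

/// Installs the process-wide hook exactly once. The hook runs on the
/// panicking thread, so it writes into that thread's own slot.
fn install_hook() {
    INSTALL_HOOK.call_once(|| {
        panic::set_hook(Box::new(|info| {
            let message = info.to_string();
            let backtrace = Backtrace::force_capture().to_string();
            LAST_PANIC.with(|slot| *slot.borrow_mut() = Some((message, backtrace)));
        }));
    });
}

/// Runs `f`, returning its result or the (message, backtrace) of the panic.
pub fn catch_unwind_with_details<R>(
    f: impl FnOnce() -> R + UnwindSafe,
) -> Result<R, (String, String)> {
    install_hook();
    // Drop any stale entry left by an earlier panic on this thread.
    LAST_PANIC.with(|slot| slot.borrow_mut().take());

    panic::catch_unwind(f).map_err(|_| {
        LAST_PANIC
            .with(|slot| slot.borrow_mut().take())
            .unwrap_or_else(|| ("unknown panic".to_string(), String::new()))
    })
}
```

A background task such as block synchronization could wrap its body in a call like `catch_unwind_with_details(|| run_sync())` (with `run_sync` standing in for the task body) and log the returned message and backtrace instead of failing silently. Because the slot is thread-local, a panic on an unrelated thread cannot be mistaken for a failure of the wrapped closure.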
Notes
JoinHandles) does not work well (yet). Future improvements to tokio will change this, but we also do not use detached tasks much.