Overview
The inability to execute a properly prepared artifact (and, as a result, raising a dispute over a valid candidate) has a year-long history that started after the Polkadot incident in March 2023 (paritytech/polkadot#6862). Efforts to mitigate it started with a node version check (paritytech/polkadot#6861), evolved over time into more checks and safety measures (#1918 and #2895, to name a few), and are still ongoing (#2742, #661).
Looking at the broader picture of the problem, it increasingly looks like plugging holes in a spaghetti strainer. In this issue, I try to summarize the many discussions carried out on that topic and propose a possible solution, for which I can't take credit; it was born by the hivemind somewhere in those discussions.
The problem
`execute_artifact` is the point where the actual parachain runtime construction, instantiation, and execution take place. It is an `unsafe` function that comes with the following security contract:

The caller must ensure that the compiled artifact passed here was:
- produced by `prepare`,
- was not modified,
There are two problems here: first, this contract is not always upheld (and, more broadly, it cannot be upheld every single time in the real world), and second, it is somewhat incomplete.
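For reference, here is a rough sketch of what such an entry point looks like; the signature and the stringly-typed error are simplified placeholders, not the exact polkadot-sdk API:

```rust
/// Executes a previously prepared artifact against the given input parameters.
///
/// # Safety
///
/// The caller must ensure that the compiled artifact passed here was produced
/// by `prepare` and was not modified afterwards.
pub unsafe fn execute_artifact(
    compiled_artifact_blob: &[u8],
    params: &[u8],
) -> Result<Vec<u8>, String> {
    // 1. Construct the runtime from the pre-compiled artifact blob.
    // 2. Instantiate it.
    // 3. Call the PVF entry point with `params`.
    //
    // The point of this issue: failures in all three steps reach the caller in
    // a form that does not let it react to each step differently.
    let _ = (compiled_artifact_blob, params);
    todo!("construct, instantiate and execute the runtime")
}
```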
Let's consider some known incidents to understand what the obstacles to upholding that contract are.
Dirty node upgrade
The node now consists of three binaries: the node itself, the preparation worker, and the execution worker. That leaves node operators plenty of room to screw up a node upgrade:
- Upgrade all the binaries but forget to restart the node;
- Upgrade only the node binary, leaving workers from the previous version;
- Upgrade only one worker and forget to upgrade the other one;
- All the possible combinations of the aforementioned scenarios.
At some point, we found a good way of handling that: letting the node version and worker versions cross-check each other. "Version" was initially meant to be the commit hash, but that resulted in an awful developer experience because every change to the code required manually rebuilding the workers. So that was relaxed, and now we check the node's semantic version. That is still problematic: versions are not bumped every hour, and upgrading from the latest stable version to master can still lead to an undetected version mismatch.
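As a toy illustration of why the semver-based cross-check is leaky (hypothetical helper, not the actual implementation):

```rust
/// Toy version cross-check between the node and one of its worker binaries.
/// In the real node the versions are embedded at build time; here they are
/// simply passed in as strings.
fn versions_compatible(node_version: &str, worker_version: &str) -> bool {
    // Only the semantic version is compared, so two different `master` builds
    // that both report the same version pass the check even if their
    // preparation or execution code differs.
    node_version == worker_version
}

fn main() {
    // A release-to-release mismatch is caught...
    assert!(!versions_compatible("1.9.0", "1.8.0"));
    // ...but a stale worker built from an older `master` commit with the same
    // semantic version slips through undetected.
    assert!(versions_compatible("1.9.0", "1.9.0"));
}
```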
Node hardware downgrade
Sometimes, node operators decide to save a bit of money by moving their VMs from an expensive VPS plan, where they were running on Intel Xeon CPUs, to a cheaper plan with consumer-grade Intel Core CPUs. Pre-compiled artifacts that survived the move couldn't be executed on the Core hardware, as they had been compiled for Xeon hardware and used its extended feature set.
That was "fixed" in #2895, but that's a regressive approach. We wanted artifact persistence for optimization, and we still want it.
Wasmtime version change
This is closely related to the "dirty node upgrade" scenario. Wasmtime versions are not backward-compatible in the general case: a later Wasmtime version may refuse to execute an artifact produced by an earlier one. That is the second reason artifact persistence was removed, and that's also why #2398 was raised.
The solution
The aforementioned `execute_artifact` must distinguish between error types, returning a concrete error to the caller, so the caller knows whether the problem is with runtime construction, instantiation, or the execution itself.
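A sketch of what such an error type could look like; the variant names are made up for illustration and do not mirror any actual enum in the codebase:

```rust
/// Hypothetical error type returned by `execute_artifact`, letting the caller
/// tell the three failure classes apart.
pub enum ExecuteError {
    /// The artifact could not be turned into a runtime at all: it is corrupted
    /// or was produced by an incompatible preparation worker / Wasmtime version.
    RuntimeConstruction(String),
    /// The runtime was constructed but could not be instantiated.
    RuntimeInstantiation(String),
    /// The PVF itself failed (trap, resource exhaustion, ...) during execution.
    Execution(String),
}
```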
Given that runtime construction is checked during the PVF pre-checking process, it shouldn't fail during the actual execution, so a runtime construction error means the artifact is either corrupted or there's some kind of artifact version mismatch. In that case, the caller must discard the artifact, remove it from the filesystem, and re-submit the PVF in question to the preparation queue.
After the PVF is re-prepared, execution must be retried.
If the runtime construction fails for the second time in a row, that means that some external conditions have changed, and the PVF cannot be executed anymore. That is a deterministic error, and raising a dispute for the candidate is okay.
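Putting the above together, the caller-side handling could look roughly like this; `remove_artifact` and `re_prepare_and_execute` are hypothetical stand-ins for the real artifact cache and preparation queue interfaces:

```rust
use std::path::Path;

// `ExecuteError` as sketched above (abbreviated).
pub enum ExecuteError {
    RuntimeConstruction(String),
    RuntimeInstantiation(String),
    Execution(String),
}

/// Hypothetical stand-in: drop the on-disk artifact.
fn remove_artifact(path: &Path) {
    let _ = std::fs::remove_file(path);
}

/// Hypothetical stand-in: re-submit the PVF to the preparation queue, wait for
/// the fresh artifact, and execute the candidate against it.
fn re_prepare_and_execute() -> Result<Vec<u8>, ExecuteError> {
    todo!()
}

/// Execute a candidate, transparently re-preparing the artifact once if the
/// runtime construction fails.
fn execute_with_recovery(
    artifact_path: &Path,
    execute: impl FnOnce() -> Result<Vec<u8>, ExecuteError>,
) -> Result<Vec<u8>, ExecuteError> {
    match execute() {
        // Construction failed: the artifact is corrupted or version-mismatched.
        // Discard it, re-prepare the PVF, and retry the execution once.
        Err(ExecuteError::RuntimeConstruction(_)) => {
            remove_artifact(artifact_path);
            // If construction fails again after a fresh preparation, external
            // conditions have changed: the error is deterministic, and raising
            // a dispute for the candidate is acceptable.
            re_prepare_and_execute()
        }
        // Successes and instantiation/execution errors pass through unchanged.
        other => other,
    }
}
```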
Outcomes
- We wouldn't need #2398 and #2742 (PVF: Incorporate wasmtime version in worker version checks) anymore. In case of a Wasmtime version mismatch, we would just re-prepare the artifact (which we need to do in that case anyway) and execute successfully on the retry.
- We could enable artifact persistence again, even without extensive checks for node upgrade events. If the node has been upgraded, or the hardware has been downgraded, that's no concern of ours anymore: everything just gets re-prepared and executed.
- In the "dirty node upgrade" scenario, as far as I can see, only one concern remains: one worker is upgraded and the other is not. All the other scenarios are covered by re-preparation. And that is only a concern in the "latest-to-master" upgrade scenario, as release-to-release upgrades are still guarded by the version-checking mechanism.
This issue is open for external contributions