Branch Prediction

**Prototype**: https://github.com/powdr-labs/powdr/pull/3453

**Motivation**: In the spirit of *optimistic precompiles* (e.g. https://github.com/powdr-labs/powdr/pull/3443), we could also do a form of *branch prediction*. For example, consider [this `memcpy` block](https://georgwiese.github.io/autoprecompile-analyzer/?data=https%3A%2F%2Fgist.githubusercontent.com%2Fgeorgwiese%2Fe1beccbe79e48c7199757bb6f0e16e80%2Fraw%2F3bbde192322b9e66ea7f58d73fbc908fbab08d98%2Fapc_candidates_keccak.json&block=0x201ecc). In about 90% of cases (see the prototype above), it jumps back to itself. We could automatically build a precompile that runs the basic block twice. This would be more efficient than having a precompile for a single execution of the block and running it twice (which costs $132 \times 2 = 264$ trace cells) because:
- The 8 distinct registers accessed in this block only have to be accessed once. Assuming a register access costs at least 6 main columns (4 to commit to the previous value, 2 to commit to the timestamp diff limbs), that saves $6 \times 8 = 48$ columns (roughly 18% of 264).
- The additions (updating the source pointer, the destination pointer, and the number of bytes left to copy) could, at least in theory, be done only once (I doubt that our current optimizations would optimize them away, though).
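
As a quick sanity check of the arithmetic above, here is a minimal sketch that only models the register-access saving. The function name and the generalization to more than two copies (each extra copy saves one access per register) are assumptions for illustration, not something measured in the prototype.

```python
def estimate_merged_cells(single_cells, copies, registers, cols_per_access):
    """Estimate trace cells for `copies` back-to-back runs of one block.

    Only the saving from deduplicated register accesses is modelled
    (assumption: each additional copy saves one access per register).
    Returns (cells when run separately, cells saved, cells when merged).
    """
    separate = single_cells * copies
    saved = cols_per_access * registers * (copies - 1)
    return separate, saved, separate - saved


# Numbers from the memcpy example: 132 cells per run, 8 registers,
# 6 columns per register access (4 previous value + 2 timestamp limbs).
separate, saved, merged = estimate_merged_cells(132, 2, 8, 6)
saving_percent = round(100 * saved / separate)
```

With the numbers from the issue, this gives 264 cells when run separately and 48 cells saved, so an estimated 216 cells for the merged precompile, before any further optimization of the additions.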

**Basic algorithm**: Given a list of basic blocks that we want to merge, the basic algorithm is very simple:
1. Concatenate the basic blocks' assembly code before passing it to the autoprecompiles builder.
2. Because this leaves the assembly code with dynamic jumps in the middle, pass additional constraints that hard-code the program counter of the next instruction. For example, this will turn conditional jumps into assertions.
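
The two steps above can be sketched as follows. This is an illustration only: `BasicBlock`, `NextPcConstraint`, and `merge_blocks` are hypothetical names, not the powdr API, and the instruction strings are placeholders rather than the real `memcpy` block.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BasicBlock:
    start_pc: int
    instructions: tuple  # assembly lines, kept as opaque strings here


@dataclass(frozen=True)
class NextPcConstraint:
    instruction_index: int  # index into the merged instruction list
    next_pc: int            # PC that must follow this instruction


def merge_blocks(blocks):
    """Concatenate blocks and pin the PC at every internal block boundary."""
    instructions, constraints = [], []
    for i, block in enumerate(blocks):
        if not block.instructions:
            raise ValueError("basic blocks must be non-empty")
        instructions.extend(block.instructions)
        if i + 1 < len(blocks):
            # The last instruction of this block must continue at the start
            # of the next block, so a conditional jump becomes an assertion.
            constraints.append(
                NextPcConstraint(len(instructions) - 1, blocks[i + 1].start_pc)
            )
    return instructions, constraints


# Placeholder block standing in for the memcpy loop at 0x201ecc.
memcpy = BasicBlock(0x201ECC, ("load", "store", "branch_if_more"))
merged_instructions, pc_constraints = merge_blocks([memcpy, memcpy])
```

No constraint is emitted after the final block, so the merged precompile can still leave through either branch target at its end.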

**Selection**: Deciding *which* basic blocks to merge is complicated, as the search space is huge. I think as a first step, we could consider only *pairs* of basic blocks. Using an analysis like the one in the prototype, we will have data on how often each pair is executed in the sample runs. Similar to how we now compute an APC for each basic block, we could also create one for each *pair* of basic blocks (with nonzero frequency), and rank them together with the single-block APCs. For example, we could have the following scenario:
- After ranking all basic block APCs and all pairs-of-basic-blocks APCs by the number of trace cells saved, it could turn out that accelerating *two* runs of the `memcpy` block (and reverting to software when it is not applicable) is better than accelerating a *single* run.
- After the two-block APC has been selected, the ranking of the single-block APC goes down, because 90% of its executions are already covered by the two-block APC.
- Depending on the total number of APCs, the single-block APC might *also* be accelerated or not.
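
The scenario above could be modelled with a greedy ranking that re-scores the remaining candidates after each pick. This is a rough sketch under simplifying assumptions: `Candidate` and `select_apcs` are hypothetical names, the savings and run counts are made-up numbers, and "coverage" is approximated by subtracting the block executions an already selected APC consumes.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    name: str
    blocks: tuple            # basic blocks covered by one APC call, in order
    cells_saved_per_run: int
    runs: int                # how often the sequence occurred in sample runs


def effective_runs(candidate, remaining):
    """Runs still available once earlier picks have consumed executions."""
    need = Counter(candidate.blocks)
    return min(candidate.runs, *(remaining[b] // k for b, k in need.items()))


def select_apcs(candidates, block_executions, max_apcs):
    remaining = dict(block_executions)
    pool, chosen = list(candidates), []
    while pool and len(chosen) < max_apcs:
        best = max(
            pool,
            key=lambda c: c.cells_saved_per_run * effective_runs(c, remaining),
        )
        runs = effective_runs(best, remaining)
        if runs == 0:
            break  # nothing left worth accelerating
        chosen.append((best.name, best.cells_saved_per_run * runs))
        for block, count in Counter(best.blocks).items():
            remaining[block] -= count * runs
        pool.remove(best)
    return chosen


# Made-up numbers: 1000 executions of the block, 450 non-overlapping pairs.
candidates = [
    Candidate("memcpy x1", ("memcpy",), 100, 1000),
    Candidate("memcpy x2", ("memcpy", "memcpy"), 248, 450),
]
ranking = select_apcs(candidates, {"memcpy": 1000}, max_apcs=2)
```

With these numbers the two-block APC is picked first, after which only 100 uncovered executions remain for the single-block APC, so its score drops by 90%. Whether it is still selected depends on `max_apcs`.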

**Execution**: The above could lead to the prover having *several* choices at runtime: running software, running the single-block APC, or running the multi-block APC (which has a satisfiable witness iff the block is *actually* run at least twice). At least for app proofs, it is always better to pick the larger APC if possible. Whether it is possible should be clear from the PC list (mapping each cycle to a PC), which could be collected during the preflight execution.
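
A minimal sketch of that runtime choice, assuming the PC list has already been reduced to the sequence of basic-block start PCs (that reduction, the `plan_execution` name, and the second PC value are assumptions for illustration):

```python
def plan_execution(block_trace, apcs):
    """Greedily cover the trace, preferring the longest matching APC.

    block_trace: start PCs of the executed basic blocks, in order.
    apcs: non-empty tuples of start PCs, one tuple per available APC.
    """
    by_length = sorted(apcs, key=len, reverse=True)
    plan, i = [], 0
    while i < len(block_trace):
        for pattern in by_length:
            if tuple(block_trace[i:i + len(pattern)]) == pattern:
                plan.append(("apc", pattern))
                i += len(pattern)
                break
        else:
            # No APC starts here: fall back to software for this block.
            plan.append(("software", (block_trace[i],)))
            i += 1
    return plan


MEMCPY = 0x201ECC   # block from the example above
OTHER = 0x300000    # hypothetical block without an APC
plan = plan_execution(
    [MEMCPY, MEMCPY, MEMCPY, OTHER],
    apcs=[(MEMCPY,), (MEMCPY, MEMCPY)],
)
```

Here the first two runs of the block go to the two-block APC, the third to the single-block APC, and the last block runs in software. A greedy left-to-right scan is only one possible policy; it is used here because the text argues the larger APC is always preferable when applicable.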


Branch Prediction #3455
