-
Notifications
You must be signed in to change notification settings - Fork 121
Description
Prototype: #3453
Motivation: In the thread of optimistic precompiles (e.g. #3443), we could also do a form of branch prediction. For example, consider this memcpy block. In about 90% of the cases (see prototype above), it jumps back to itself. We could automatically build a precompile that runs the basic block twice. This would be more efficient than only having a precompile for a single execution of the block and running it twice (which has a cost of
- The 8 distinct registers accessed in this block only have to be accessed once. Assuming a registers access costs at least 6 main columns (4 to commit to the previous value, 2 to commit to the timestamp diff limbs), that saves
$6 \times 8 = 48$ (roughly 18%) columns. - The additions (updating source pointer, destination pointer, and bytes to copy) could at least in theory be only done once (I doubt that our current optimizations would optimize it away though...).
Basic algorithm: Given a list of basic blocks that we want to merge, the basic algorithm is very simple:
- Concatenate the basic block's assembly codes before passing it to the autoprecompiles builder.
- Because this will lead to the assembly code having dynamic jumps in the middle, pass additional constraint that hard-code the program counter in the next instruction. For example, this will turn conditional jumps into assertions.
Selection: Deciding which basic blocks to merge is complicated, as the search space is huge. I think as a first step, we can only consider pairs of basic blocks. Using an analysis as done in the prototype, we will have data on how often each pair is executed on the sample runs. Similar to how we now compute the APCs for each basic block, we could also create it for each pair of basic blocks (with nonzero frequency), and rank it together with the single-block APCs. For example, we could have the following scenario:
- After ranking all basic block APCs and all pairs-of-basic-blocks APCs by the number of trace cells saved, it could turn out that accelerating two runs of the
memcpyblock (and reverting to software when it is not applicable) is better than accelerating a single run. - After the two-block APC has been selected, the ranking of the single-block APC goes down, because 90% of its executions are already covered by the two-block APC.
- Depending on the total number of APCs, the single-block APC might also be accelerated or not.
Execution: The above could lead to the prover having several choices at runtime: Running software, running the single-block APC, or running the multi-block APC (which has a satisfiable witness iff. the block is actually run at least twice). At least for app proofs, it is always better to pick the larger APC if possible. Whether it is possible should be clear by looking at the PC list (mapping each cycle to a PC), which could be collected during the preflight execution.