Bug Fix: Make batch production non-optional during recovery. Add sequencer proof processing tests#2605
Bug Fix: Make batch production non-optional during recovery. Add sequencer proof processing tests#2605preston-evans98 wants to merge 6 commits intodevfrom
Conversation
Expose zk proof progress through a shared manager status and a controlled prover/blueprint path so sequencer tests can hold and release aggregate proof publication deterministically. Refactor proof progress tracking into a helper so status updates stay coupled to state changes.
| // Update the next height to receive | ||
| self.stf_info_receiver | ||
| .inc_next_height_to_receive_by(num_proofs_to_create as u64); | ||
| self.proof_progress.finish_aggregate_proof_creation(); |
There was a problem hiding this comment.
begin_aggregate_proof_creation sets aggregate_proof_in_flight = true, and finish_aggregate_proof_creation clears it. However, if create_aggregate_proof_with_retries or publish_proof_blob_with_metadata fails in this code block, the flag is never reset.
There was a problem hiding this comment.
I think this is fine as the ZkProofManager just exits if this happens
| /// background to test node APIs. | ||
| #[derive(Clone)] | ||
| pub struct RollupBuilder<R: FullNodeBlueprint<Native>> { | ||
| blueprint: Arc<R>, |
There was a problem hiding this comment.
This will make RollupBuilder stateful, which means some state can persist across restarts.
builder.with_blueprint(..);
let rollup = builder.start()
let builder = rollup.shutdown() / rollup.restart()
rollup_after_restart = builder.star()
rollup_after_restart <- inherited r`eady_proof_count`
In this flow, rollup_after_restart may inherit state from the previous run. For example, in ManualProofPostingRtAgnosticBlueprint, we keep shared state such as ready_proof_count, available_releases, and is_open. That shared state would also be inherited by rollup_after_restart.
Would it be possible to add an Option<R> in start_test_rollup. So we would provide the bluerint only when TestRollup is created?
There was a problem hiding this comment.
Or we could jut add another method start_test_rollup_with_manual_proof_posting and call ManualProofPostingRtAgnosticBlueprint::<TestSpec, RT>::new_with_control() in it.
| .trigger_batch_production_if_convenient_msg( | ||
| "recover_and_catch_up:dump_catchup_batches", | ||
| ) | ||
| .trigger_batch_production_msg("recover_and_catch_up:dump_catchup_batches") |
There was a problem hiding this comment.
One of the checks that is skipped now is:
if !self.seq_config.automatic_batch_production {
warn!("Skipping batch production due to settings");
return;
}
This means that the sequencer will produce batches during recovery, although it is configured to never do it. This only matters in tests, but it can still be confusing. Let's log an error if we hit recovery and automatic_batch_production = true?
There was a problem hiding this comment.
Another option: we could panic when this happens, and cfg!(debug_assertions) = true
Description
This PR rewrites and expands the sequencer proof processing test suite to include resync and recovery, as well as handling of proofs that arrive while an open batch is in progress. This should fix the flaky test as well as improving coverage.
The new tests uncovered a pre-existing bug in recovery mode. The sequencer used
produce_batch_if_convenient, but assumed that the requested batches were always produced. Trigger a failure was easier thanks to proofs (the blob sender is much more likely to be busy) which allowed the new tests to detect the issue. This PR fixes the issue by removing the "is_convenient" checks from batch production during recovery mode only.CHANGELOG.mdwith a new entry if my PR makes any breaking changes or fixes a bug. If my PR removes a feature or changes its behavior, I provide help for users on how to migrate to the new behavior.Cargo.tomlchanges before opening the PRs. (Are all new dependencies necessary? Is any module dependency leaked into the full-node (hint: it shouldn't)?)Linked Issues