fix: deflake //rs/state_manager:state_manager_integration (#9173)

basvandijk · web-flow · commit d90d6afb8269 · 2026-03-04T14:10:45.000Z
## Root Cause

All flaky failures across 8 recent CI runs share the same root cause: the `wait_for_checkpoint` helper in `rs/state_manager/tests/common/mod.rs` has a 100-second timeout that is too tight when many tests (~158) run concurrently on CI. Under parallel execution, the background hash-computation threads get starved for CPU/IO, causing `wait_for_checkpoint` to time out with:

```
Checkpoint @n didn't complete in 100s
```

This affects many different tests non-deterministically (some runs had 2 failures, others had up to 21) because any test calling `wait_for_checkpoint` can be affected depending on system load.

## Fix

Increase the timeout in `wait_for_checkpoint` from 100s to 300s. The Bazel test target already has `timeout = "long"` (900s), so 300s is well within the overall test timeout while providing much more headroom for slow CI environments.

---

This PR was created following the steps in `.claude/skills/fix-flaky-tests/SKILL.md`.
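
For readers unfamiliar with the pattern, the following is a minimal, self-contained sketch of a poll-until-deadline loop of the kind `wait_for_checkpoint` uses. It is not the code in `rs/state_manager/tests/common/mod.rs`: the names `poll_until` and `probe`, the polling interval, and the choice to return `None` on timeout (rather than fail the test with a message like the one quoted above) are all assumptions made for illustration.

```rust
use std::time::{Duration, Instant};

/// Hypothetical helper: calls `probe` repeatedly until it yields
/// `Some(value)` or until `timeout` has elapsed since the first call.
/// Returns `None` if the deadline passes without a result.
pub fn poll_until<T>(
    timeout: Duration,
    interval: Duration,
    mut probe: impl FnMut() -> Option<T>,
) -> Option<T> {
    let started = Instant::now();
    loop {
        // Check first, so a result that is already available is
        // returned without sleeping.
        if let Some(value) = probe() {
            return Some(value);
        }
        // Give up once the budget is spent. A larger `timeout` only
        // costs wall-clock time in the failure case.
        if started.elapsed() >= timeout {
            return None;
        }
        // Yield the CPU between polls so background work can progress.
        std::thread::sleep(interval);
    }
}
```

In a loop of this shape, raising the timeout does not slow down runs where the result arrives quickly, since the loop returns as soon as the probe succeeds. The larger value only matters when the awaited work is delayed, for example by CPU/IO contention.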
diff --git a/rs/state_manager/tests/common/mod.rs b/rs/state_manager/tests/common/mod.rs
@@ -329,7 +329,7 @@ pub fn modify_encoded_stream_helper<F: FnOnce(StreamSlice) -> Stream>(
 pub fn wait_for_checkpoint(state_manager: &impl StateManager, h: Height) -> CryptoHashOfState {
     use std::time::{Duration, Instant};
 
-    let timeout = Duration::from_secs(100);
+    let timeout = Duration::from_secs(300);
     let started = Instant::now();
     while started.elapsed() < timeout {
         match state_manager.get_state_hash_at(h) {