Commit 2c7451b
fix: deflake //rs/tests/consensus/tecdsa:tecdsa_key_rotation_test (#9177)
## Root Cause Analysis
In `await_pre_signature_stash_size`
(`rs/tests/consensus/tecdsa/utils/src/lib.rs`), line 1141 had:
```rust
assert_eq!(sizes.len(), subnet.nodes().count());
```
This runs inside a `retry_with_msg!` loop. When
`MetricsFetcher::fetch()` returns metrics, not all nodes may have
started reporting the `execution_pre_signature_stash_size` metric yet.
For example, only 1, 2, or 3 out of 4 nodes might be reporting. The
`assert_eq!` **panics** in this case, killing the test immediately
instead of allowing the retry loop to continue.
This caused 5 out of 7 flaky failures in the last week with errors like:
- `assertion left == right failed: left: 1, right: 4`
- `assertion left == right failed: left: 2, right: 4`
- `assertion left == right failed: left: 3, right: 4`
The remaining 2 failures were timeouts (600s), likely caused by the same
timing sensitivity.
## Fix
Replaced `assert_eq!` with `bail!` so the retry loop continues when not
all nodes have reported the metric yet, giving them time to catch up.
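
The difference between panicking and returning an error inside a retry loop can be sketched with the standard library alone. This is a minimal illustration, not the code from this commit: `retry`, `check_all_nodes_reported`, and `simulate` are hypothetical names standing in for `retry_with_msg!`, the changed check, and the test's metrics polling, and a plain `Result<_, String>` stands in for the error type that `bail!` would produce.

```rust
use std::thread::sleep;
use std::time::Duration;

/// Hypothetical stand-in for the retry loop: calls `check` until it returns
/// `Ok`, sleeping `backoff` between attempts, for at most `max_attempts` tries.
fn retry<T>(
    max_attempts: usize,
    backoff: Duration,
    mut check: impl FnMut() -> Result<T, String>,
) -> Result<T, String> {
    let mut last_err = String::from("no attempts made");
    for attempt in 1..=max_attempts {
        match check() {
            Ok(value) => return Ok(value),
            Err(err) => {
                // An `Err` is recoverable: remember it and try again.
                // A panic here would instead unwind straight out of the loop.
                last_err = format!("attempt {attempt}: {err}");
                if attempt < max_attempts {
                    sleep(backoff);
                }
            }
        }
    }
    Err(last_err)
}

/// Hypothetical version of the fixed check: returns an error (rather than
/// panicking via `assert_eq!`) while some nodes have not reported the metric.
fn check_all_nodes_reported(sizes: &[u64], node_count: usize) -> Result<(), String> {
    if sizes.len() != node_count {
        return Err(format!(
            "only {} of {} nodes report the metric",
            sizes.len(),
            node_count
        ));
    }
    Ok(())
}

/// Simulates a subnet where one more node starts reporting on every poll.
/// Returns the number of polls needed until all nodes report.
fn simulate(node_count: usize, max_attempts: usize) -> Result<usize, String> {
    let mut reporting = 0usize;
    let mut polls = 0usize;
    retry(max_attempts, Duration::from_millis(1), || {
        polls += 1;
        reporting = (reporting + 1).min(node_count);
        let sizes = vec![0u64; reporting];
        check_all_nodes_reported(&sizes, node_count)?;
        Ok(polls)
    })
}
```

In this sketch, `simulate(4, 10)` succeeds on the fourth poll because the first three attempts return `Err` and are retried. With `assert_eq!` in place of the early `return Err(...)`, the first poll would panic and no retry would happen.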
## Verification
All 3 runs passed with `--runs_per_test=3 --jobs=3` (avg 271.5s).
---
This PR was created following the steps in
`.claude/skills/fix-flaky-tests/SKILL.md`.

1 parent 160725c, commit 2c7451b
1 file changed: +7 -1 lines (original line 1141 replaced by new lines 1141-1147)