fix: deflake //rs/rosetta-api/icrc1:icrc_multitoken_rosetta_system_tests/multitoken_system_tests #9187

Merged
basvandijk merged 1 commit into master from
ai/deflake-multitoken_system_tests-2026-03-04
Mar 4, 2026

@basvandijk
Collaborator

Root Cause

The multitoken Rosetta system tests run 21 tests with RUST_TEST_THREADS=4, meaning 4 tests execute in parallel. Each test spawns its own Rosetta server and PocketIC instance. Under resource contention, block synchronization and transaction confirmation take longer than the tight timeouts allowed.

Three failure modes were observed in the last week (7 flaky runs):

  1. test_continuous_block_sync (2/7 runs): wait_for_rosetta_block had MAX_ATTEMPTS=20 with 1-second sleeps, giving only ~20 seconds for Rosetta to sync. Under load, Rosetta couldn't sync in time (e.g., reached block 4 instead of expected block 6).

  2. test_construction_submit (2/7 runs): make_submit_and_wait_for_transaction used the default 60-second timeout (since the RosettaClient was created without an explicit timeout). Under load, the transaction search couldn't find the submitted transaction within 60 seconds.

  3. Overall test timeouts (4/7 runs): Slow syncs cascaded across sequential test steps until the Bazel test timeout was hit.
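To make the first failure mode concrete, here is a minimal sketch of a bounded polling loop of the kind described above. It is not the actual wait_for_rosetta_block implementation: the names wait_for_block and polling_budget, the closure-based block source, and the return type are all hypothetical and chosen only for illustration.

```rust
use std::thread::sleep;
use std::time::Duration;

/// Hypothetical helper: polls `current_block` until it reaches `target`,
/// giving up after `max_attempts` polls.
/// Returns Ok(attempts used) on success, or Err(last block seen) on timeout.
fn wait_for_block<F>(
    mut current_block: F,
    target: u64,
    max_attempts: u32,
    pause: Duration,
) -> Result<u32, u64>
where
    F: FnMut() -> u64,
{
    let mut last_seen = 0;
    for attempt in 1..=max_attempts {
        last_seen = current_block();
        if last_seen >= target {
            return Ok(attempt);
        }
        // Sleep between polls; the total wait is bounded by max_attempts * pause.
        sleep(pause);
    }
    Err(last_seen)
}

/// Upper bound on the time spent sleeping across all attempts.
fn polling_budget(max_attempts: u32, pause: Duration) -> Duration {
    pause * max_attempts
}
```

With a 1-second pause, polling_budget(20, ...) is 20 seconds and polling_budget(60, ...) is 60 seconds. If the block source advances slowly, a small attempt budget returns Err with the last block seen, which is the shape of the "reached block 4 instead of expected block 6" failure.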

Fix

  1. Increased MAX_ATTEMPTS in wait_for_rosetta_block from 20 to 60 (20s → 60s) to give Rosetta more time to sync blocks.

  2. Created the RosettaClient with an explicit 120-second timeout (via from_str_url_and_timeout) instead of from_str_url (which sets no timeout, so the client falls back to its internal 60-second default). This gives make_submit_and_wait_for_transaction sufficient time.
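The second fix relies on an optional timeout that falls back to a default when unset. The sketch below illustrates that pattern only. SketchClient, DEFAULT_TIMEOUT, effective_timeout, and has_expired are hypothetical names; the two constructor names mirror the ones mentioned above, but their bodies here are assumptions and do not reproduce the real RosettaClient.

```rust
use std::time::Duration;

/// Assumed default, matching the 60 seconds described in this PR.
const DEFAULT_TIMEOUT: Duration = Duration::from_secs(60);

/// Hypothetical stand-in for a client with an optional timeout.
struct SketchClient {
    url: String,
    timeout: Option<Duration>,
}

impl SketchClient {
    /// No explicit timeout: the default applies later.
    fn from_str_url(url: &str) -> Self {
        Self { url: url.to_string(), timeout: None }
    }

    /// Explicit timeout supplied by the caller.
    fn from_str_url_and_timeout(url: &str, timeout: Duration) -> Self {
        Self { url: url.to_string(), timeout: Some(timeout) }
    }

    fn url(&self) -> &str {
        &self.url
    }

    /// The timeout actually used when waiting for a transaction.
    fn effective_timeout(&self) -> Duration {
        self.timeout.unwrap_or(DEFAULT_TIMEOUT)
    }

    /// True once `elapsed` has reached the effective timeout.
    fn has_expired(&self, elapsed: Duration) -> bool {
        elapsed >= self.effective_timeout()
    }
}
```

Under these assumptions, a transaction search that needs 90 seconds under load would expire with the default client but still be within budget for a client built with a 120-second timeout.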

Verification

All 3 parallel test runs passed consistently with low variance:

Stats over 3 runs: max = 67.9s, min = 65.9s, avg = 66.8s, dev = 0.8s
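
For reference, summary statistics like these can be computed as in the sketch below. This is an illustration only: RunStats and summarize are hypothetical names, and the sketch assumes "dev" means the population standard deviation, which the output above does not state.

```rust
/// Summary statistics for a set of run times, in seconds.
#[derive(Debug)]
struct RunStats {
    min: f64,
    max: f64,
    avg: f64,
    dev: f64,
}

/// Computes min, max, mean, and population standard deviation.
/// Returns None for an empty slice.
fn summarize(durations: &[f64]) -> Option<RunStats> {
    if durations.is_empty() {
        return None;
    }
    let n = durations.len() as f64;
    let min = durations.iter().copied().fold(f64::INFINITY, f64::min);
    let max = durations.iter().copied().fold(f64::NEG_INFINITY, f64::max);
    let avg = durations.iter().sum::<f64>() / n;
    // Population variance: divide by n (an assumption; a sample variance would divide by n - 1).
    let variance = durations.iter().map(|d| (d - avg).powi(2)).sum::<f64>() / n;
    Some(RunStats { min, max, avg, dev: variance.sqrt() })
}
```

With only 3 runs, the choice between population and sample deviation changes the result noticeably, so the reported dev is best read as a rough indicator of spread.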

This PR was created following the steps in .claude/skills/fix-flaky-tests/SKILL.md.

…sts/multitoken_system_tests

Increase timeouts to handle resource contention when 4 tests run in parallel:

1. Increase MAX_ATTEMPTS in wait_for_rosetta_block from 20 to 60 (20s -> 60s)
   to give Rosetta more time to sync blocks under load.

2. Create RosettaClient with an explicit 120s timeout (instead of None which
   defaults to 60s) so make_submit_and_wait_for_transaction has sufficient
   time to find submitted transactions.

Root cause: With RUST_TEST_THREADS=4, multiple Rosetta + PocketIC instances
compete for resources, causing block sync and transaction confirmation to
take longer than the previous tight timeouts allowed.
@github-actions github-actions bot added the fix label Mar 4, 2026
@basvandijk basvandijk marked this pull request as ready for review March 4, 2026 19:31
@basvandijk basvandijk requested a review from a team as a code owner March 4, 2026 19:31
@github-actions github-actions bot added the @defi label Mar 4, 2026
Contributor

@mbjorkqvist mbjorkqvist left a comment

Thanks @basvandijk !

@basvandijk basvandijk added this pull request to the merge queue Mar 4, 2026
Merged via the queue into master with commit 142182f Mar 4, 2026
42 checks passed
@basvandijk basvandijk deleted the ai/deflake-multitoken_system_tests-2026-03-04 branch March 4, 2026 23:41