test(shared-log): instrument and stabilize part-4 parallel flakes #610
Closed
Faolain wants to merge 2 commits into dao-xyz:master from
Summary
This PR expands shared-log flake diagnostics and stabilizes CI part-4-equivalent tests that were failing under repeated parallel runs.
The work was driven by repeated execution of:
node ./node_modules/aegir/src/index.js run test --roots ./packages/programs/data/shared-log ./packages/programs/data/shared-log/proxy -- -t node --no-build
and 3-worker fail-fast parallel stress runs with diagnostics env flags.
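As a rough illustration of how diagnostics env flags can gate test tracing, here is a minimal sketch. It is not the code from this PR: `makeTracer`, `isFlagOn`, the `PEERBIT_TRACE_EXAMPLE` flag name, and the rule that `"0"`, `"false"`, and the empty string mean "off" are all assumptions made for the example. Only the `PEERBIT_TRACE_ALL_TEST_FAILURES` name comes from the description, and treating it as a global switch is likewise an assumption.

```typescript
// Hypothetical helpers; flag semantics are assumed, not taken from the PR.
type Env = Record<string, string | undefined>;

// Treat unset, "", "0" and "false" as off; anything else as on (assumed convention).
function isFlagOn(env: Env, name: string): boolean {
  const raw = env[name];
  return (
    raw !== undefined &&
    raw !== "" &&
    raw !== "0" &&
    raw.toLowerCase() !== "false"
  );
}

// Returns a trace function that is a no-op unless the suite flag or the
// global flag is set. `details` is a thunk so that nothing is serialized
// when tracing is disabled.
function makeTracer(
  env: Env,
  suiteFlag: string,
  sink: (line: string) => void = console.log,
): (label: string, details: () => unknown) => void {
  const enabled =
    isFlagOn(env, suiteFlag) ||
    isFlagOn(env, "PEERBIT_TRACE_ALL_TEST_FAILURES");
  return (label, details) => {
    if (!enabled) return;
    sink(`[trace] ${label} ${JSON.stringify(details())}`);
  };
}
```

In a Node test file this would typically be created once per suite, for example `const trace = makeTracer(process.env, "PEERBIT_TRACE_EXAMPLE")`, and called from a failure hook so that normal runs stay quiet.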
What Changed
1) Cross-suite diagnostics plumbing
test/utils.ts:
- waitForConverged supports jitter and emits timeout snapshots.
- checkReplicas emits detailed per-node invariant diagnostics.
Diagnostics env flags (PEERBIT_TRACE_*, PEERBIT_TRACE_ALL_TEST_FAILURES) across flaky suites:
- replication.spec.ts
- sharding.spec.ts
- statistics.spec.ts
- observer.spec.ts
- load.spec.ts
- delivery.spec.ts
afterEach failure diagnostics in suites where flakes were observed.
2) Stabilizations for observed flakes
replication.spec.ts
- does not lose entries when ranges rotate with delayed replication updates (prune delay 0)
sharding.spec.ts
- distribution > objectives > memory > inserting half limited
delivery.spec.ts
- does not fall back to rpc on target=all when a fanout member drops
- db1.add(..., { target: "all" }).
3) Additional robustness/observability updates
- observer.spec.ts: longer wait budget + membership/readiness diagnostics.
- load.spec.ts: longer convergence/prune windows and failure diagnostics.
- statistics.spec.ts: larger convergence windows + diagnostics for partial approximations.
Failures Captured and Root-Cause Direction
During parallel stress, we repeatedly captured and instrumented failures including:
The recurring pattern was assertion brittleness under high concurrency/timer/GC pressure, not a single deterministic protocol break. This PR keeps behavioral intent while removing assumptions that were stricter than the runtime contract.
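One common way to relax an overly strict timing assertion is to poll for the expected state within a wait budget, add jitter between polls, and report the last observed states on timeout. The sketch below illustrates that idea only. It is not the `waitForConverged` helper from test/utils.ts: the function name `waitForConvergedSketch`, the `ConvergeOptions` fields, and the choice to keep the last three snapshots are all assumptions.

```typescript
// Hypothetical sketch of a polling wait with jitter and timeout snapshots.
interface ConvergeOptions {
  timeoutMs: number; // total wait budget
  intervalMs: number; // base delay between polls
  jitterMs: number; // upper bound of the random extra delay per poll
}

async function waitForConvergedSketch<T>(
  sample: () => T | Promise<T>,
  isConverged: (value: T) => boolean,
  options: ConvergeOptions,
): Promise<T> {
  const deadline = Date.now() + options.timeoutMs;
  const snapshots: T[] = []; // most recent observations, for the error message

  for (;;) {
    const value = await sample();
    snapshots.push(value);
    if (snapshots.length > 3) snapshots.shift();

    if (isConverged(value)) return value;

    if (Date.now() >= deadline) {
      throw new Error(
        `not converged after ${options.timeoutMs}ms; ` +
          `last snapshots: ${JSON.stringify(snapshots)}`,
      );
    }

    // Randomize the delay so parallel workers do not poll in lockstep.
    const delay = options.intervalMs + Math.random() * options.jitterMs;
    await new Promise<void>((resolve) => setTimeout(resolve, delay));
  }
}
```

A test would then assert on the eventual state, for example by awaiting `waitForConvergedSketch(() => readState(), (s) => s.ready, { timeoutMs: 30000, intervalMs: 100, jitterMs: 50 })`, where `readState` is a placeholder for whatever the suite samples, instead of asserting right after a fixed sleep.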
Validation Performed
Full suite runs (part-4 equivalent)
Targeted stress
Files Changed
- packages/programs/data/shared-log/test/utils.ts
- packages/programs/data/shared-log/test/replication.spec.ts
- packages/programs/data/shared-log/test/sharding.spec.ts
- packages/programs/data/shared-log/test/statistics.spec.ts
- packages/programs/data/shared-log/test/observer.spec.ts
- packages/programs/data/shared-log/test/load.spec.ts
- packages/programs/data/shared-log/test/delivery.spec.ts