IT: Fix and reenable flaky tests #1284

Open
kaikulimu wants to merge 6 commits into bloomberg:main from kaikulimu:reenable-its

Collaborator

@kaikulimu kaikulimu commented Apr 9, 2026

Summary

  • Re-enabled three previously disabled flaky integration tests: test_open_queue_while_cluster_blips_quorum,
    test_open_queue_while_cluster_blips_quorum_and_kill_all_non_leader, and test_force_leader_primary_divergence.
  • Fixed test_force_leader_primary_divergence by waiting for the restarted node to be ready before killing tproxies to trigger the divergence.
  • Split the blips-quorum test into two variants: one where the original leader wins re-election (quorum pinned via config), and one where leadership changes (strong consistency only), each with targeted recovery logic.
  • Fixed flakiness caused by cascading leader-primary divergence: when the old primary aborts after a leadership change, the new leader can lose quorum (voter count drops below threshold), triggering yet another election and a second divergence crash, which leaves too few nodes for quorum.
  • In mqbc::ClusterUtil, removed the explicit loading of uncommitted advisories into the merged state.
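
The "wait for the restarted node to be ready" fix described above is a poll-until-ready pattern. The following is a minimal sketch of such a helper, not the code from this PR: `wait_until`, its parameters, and the names in the usage comment are all hypothetical and do not come from the BlazingMQ IT framework.

```python
import time
from typing import Callable


def wait_until(
    predicate: Callable[[], bool],
    timeout: float = 10.0,
    interval: float = 0.1,
) -> bool:
    """Poll `predicate` until it returns True or `timeout` seconds elapse.

    Returns True if the predicate succeeded, False on timeout.
    Hypothetical helper for illustration only.
    """
    deadline = time.monotonic() + timeout
    while True:
        if predicate():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)


# Hypothetical usage inside a test (placeholder names, not real APIs):
#   assert wait_until(lambda: node_is_ready(restarted_node), timeout=30)
#   kill_proxies()  # only trigger the divergence once the node is ready
```

Checking the predicate before checking the deadline means a node that is already ready returns immediately, and a node that becomes ready right at the deadline is still observed once.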

Future Work

  • Discovered that leader-primary divergence can cause a cascading leader switch. When the old leader shuts down, it could cause the new leader to lose quorum, thus a second re-election can occur. If we are unlucky, the second leader can also abort due to leader-primary divergence. Worth discussing with @chrisbeard, @dorjesinpo, and @678098 for a proper solution.
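
The cascade can be illustrated with simple quorum arithmetic. This sketch rests on assumptions not stated in the PR: a 4-node cluster, a majority quorum of `N // 2 + 1`, and one node that is not yet voting when the leadership change happens. The function names are hypothetical.

```python
def majority_quorum(cluster_size: int) -> int:
    """Assumed majority rule: more than half of the configured nodes."""
    return cluster_size // 2 + 1


def has_quorum(voters: int, quorum: int) -> bool:
    """A leader keeps quorum only while enough voters remain."""
    return voters >= quorum


cluster_size = 4                        # assumption for illustration
quorum = majority_quorum(cluster_size)  # 3 under the assumed majority rule

# Assumed scenario: one node is still restarting, so only 3 nodes vote.
voters = 3
trace = [("new leader elected", voters, has_quorum(voters, quorum))]

voters -= 1  # old primary aborts on leader-primary divergence
trace.append(("old primary aborts", voters, has_quorum(voters, quorum)))

voters -= 1  # second leader also aborts after the re-election
trace.append(("second divergence abort", voters, has_quorum(voters, quorum)))

for event, count, ok in trace:
    print(f"{event}: voters={count}, quorum={quorum}, has_quorum={ok}")
```

Under these assumptions a single abort is already enough to drop below the threshold, which is why pinning the quorum via config (as one of the test variants does) changes the outcome.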

@kaikulimu kaikulimu requested a review from a team as a code owner April 9, 2026 16:45
@kaikulimu kaikulimu force-pushed the reenable-its branch 3 times, most recently from 740e3c5 to bcd11d1 Compare April 16, 2026 21:27
@kaikulimu kaikulimu force-pushed the reenable-its branch 4 times, most recently from 1290181 to b2f5507 Compare April 17, 2026 23:38
@kaikulimu kaikulimu changed the title from "IT: reenable previously flaky tests" to "(WIP) IT: reenable previously flaky tests" Apr 17, 2026
@kaikulimu kaikulimu force-pushed the reenable-its branch 4 times, most recently from 1b99541 to e8e04fb Compare April 20, 2026 18:44
@kaikulimu kaikulimu changed the title from "(WIP) IT: reenable previously flaky tests" to "IT: reenable previously flaky tests" Apr 20, 2026
@kaikulimu kaikulimu force-pushed the reenable-its branch 2 times, most recently from 0346101 to 350fb8f Compare April 20, 2026 18:47
@kaikulimu kaikulimu changed the title from "IT: reenable previously flaky tests" to "IT: Fix and reenable previously flaky tests" Apr 20, 2026
@kaikulimu kaikulimu changed the title from "IT: Fix and reenable previously flaky tests" to "IT: Fix and reenable flaky tests" Apr 20, 2026
Signed-off-by: Yuan Jing Vincent Yan <yyan82@bloomberg.net>
Signed-off-by: Yuan Jing Vincent Yan <yyan82@bloomberg.net>
@kaikulimu kaikulimu force-pushed the reenable-its branch 2 times, most recently from 8044751 to 877ff6f Compare April 21, 2026 19:26
Signed-off-by: Yuan Jing Vincent Yan <yyan82@bloomberg.net>
Signed-off-by: Yuan Jing Vincent Yan <yyan82@bloomberg.net>
Signed-off-by: Yuan Jing Vincent Yan <yyan82@bloomberg.net>
@kaikulimu kaikulimu requested a review from 678098 April 23, 2026 18:52
@kaikulimu kaikulimu self-assigned this Apr 23, 2026
@kaikulimu kaikulimu assigned 678098 and unassigned kaikulimu Apr 23, 2026
    data["myClusters"][0]["elector"]["quorum"] = quorum
    f.seek(0)
    json.dump(data, f, indent=4)
    f.truncate()
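
The excerpt above looks like the tail of an in-place JSON rewrite. A minimal self-contained sketch of that pattern follows, assuming the file is opened in `r+` mode and parsed with `json.load` (neither step is shown in the excerpt). The function name `set_elector_quorum` and the `config_path` parameter are hypothetical.

```python
import json
from pathlib import Path


def set_elector_quorum(config_path: Path, quorum: int) -> None:
    """Rewrite the elector quorum of the first cluster in a JSON config file.

    Hypothetical helper; the key path mirrors the excerpt from the diff.
    """
    with open(config_path, "r+") as f:
        data = json.load(f)
        data["myClusters"][0]["elector"]["quorum"] = quorum
        f.seek(0)                      # rewind before overwriting
        json.dump(data, f, indent=4)
        f.truncate()                   # drop leftover bytes if output shrank
```

The `f.truncate()` call matters when the new serialization is shorter than the old one: without it, trailing bytes from the previous content would remain and the file would no longer be valid JSON.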
Collaborator

I think this should be done through the Configurator.
For example, this is how the domain configuration is updated in ITs:

    domains = {domain.name: domain for domain in cluster.configurator.domains}
    domains[
        domain_fanout
    ].definition.parameters.mode.fanout.publish_app_id_metrics = True
    cluster.deploy_domains()

Collaborator Author

deploy_domains is a poor name. Will split the function into deploy_clusters and deploy_domains.

@678098 678098 assigned kaikulimu and unassigned 678098 Apr 23, 2026
Signed-off-by: Yuan Jing Vincent Yan <yyan82@bloomberg.net>