
Commit bc1faf2

storage_api: fix TestDecommissionSelf flake
TestDecommissionSelf was flaking with timeouts waiting for decommissioned nodes to observe their own DECOMMISSIONED status. The test would fail when node 4 (one of the decommissioned nodes) never saw its liveness record update to DECOMMISSIONED within the 5-second timeout.

The likely root cause is a race condition in how decommission status propagates: when a node is marked as decommissioned, the updated liveness record is gossiped to all nodes. However, if other nodes receive this gossip update before the decommissioned node does, they will write a tombstone to local storage and subsequently reject all RPCs from the decommissioned node, including gossip messages. This can prevent the decommissioned node from ever learning about its own status change.

This commit fixes the test by only verifying the cluster state from the perspective of non-decommissioned nodes. We now assert that:

1. Active nodes see themselves as ACTIVE.
2. Active nodes see the decommissioned nodes as DECOMMISSIONED.

We no longer attempt to verify that decommissioned nodes observe their own status, since this is not guaranteed due to the gossip/tombstone race.

Fixes cockroachdb#156402.
Fixes cockroachdb#156104.
Fixes cockroachdb#154474.

Release note: None
Epic: None
1 parent e00f221 commit bc1faf2
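
To make the failure mode concrete, below is a minimal, self-contained toy model of the race described in the commit message. None of these types or helpers exist in CockroachDB (node, learn, and pull are made up for illustration); the sketch only captures the ordering problem: once a peer tombstones the decommissioned node, that node can no longer receive the very update that would tell it about its own status.

// raceillustration.go: a toy, hypothetical model of the gossip/tombstone race.
// Nothing here is CockroachDB code; it only demonstrates the ordering problem.
package main

import "fmt"

type membership string

const (
	active         membership = "ACTIVE"
	decommissioned membership = "DECOMMISSIONED"
)

// node keeps a local view of each node's membership plus a tombstone set of
// peers whose RPCs it refuses once it knows they are decommissioned.
type node struct {
	id         int
	view       map[int]membership
	tombstoned map[int]bool
}

func newNode(id int, peers ...int) *node {
	n := &node{id: id, view: map[int]membership{id: active}, tombstoned: map[int]bool{}}
	for _, p := range peers {
		n.view[p] = active
	}
	return n
}

// learn records a gossiped liveness update and tombstones decommissioned peers.
func (n *node) learn(target int, status membership) {
	n.view[target] = status
	if status == decommissioned && target != n.id {
		n.tombstoned[target] = true
	}
}

// pull models n asking peer for peer's view of target. The peer refuses the
// RPC outright if it has already tombstoned n.
func (n *node) pull(peer *node, target int) bool {
	if peer.tombstoned[n.id] {
		return false
	}
	n.view[target] = peer.view[target]
	return true
}

func main() {
	n1 := newNode(1, 4) // stays active
	n4 := newNode(4, 1) // gets decommissioned

	// n1 hears via gossip that n4 was decommissioned before n4 itself does,
	// and immediately writes a tombstone for n4.
	n1.learn(4, decommissioned)

	// n4 now tries to refresh its liveness view from n1, but n1 rejects the
	// RPC, so n4 never observes its own DECOMMISSIONED status.
	if ok := n4.pull(n1, 4); !ok {
		fmt.Printf("n1 rejected n4's request; n4 still sees itself as %s\n", n4.view[4])
	}
	fmt.Printf("n1 sees n4 as %s\n", n1.view[4])
}

Running this prints n1's view of n4 as DECOMMISSIONED while n4 still believes it is ACTIVE, which is the situation the reworked test no longer asserts against.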


pkg/server/storage_api/decommission_test.go

Lines changed: 27 additions & 8 deletions
@@ -702,20 +702,39 @@ func TestDecommissionSelf(t *testing.T) {
 	require.NoError(t, err)
 	require.Empty(t, resp.Status)
 
-	// The nodes should now have been (or soon become) decommissioned.
+	// The nodes should now have been (or soon become) decommissioned. In a cruel
+	// twist of fate, the decommissioned nodes may not find out about that,
+	// however. This is because the other nodes may learn that (say) n4 is
+	// decommissioned before n4 does, and will block all communication with it,
+	// which includes receiving an updated liveness record. So we only verify that
+	// the non-decommissioned nodes see the decommissioned nodes as such, but
+	// don't verify the decommissioned nodes' own view of their (or anyone's, really)
+	// liveness.
 	for i := 0; i < tc.NumServers(); i++ {
 		srv := tc.Server(i)
-		expect := livenesspb.MembershipStatus_ACTIVE
+		var omit bool
 		for _, nodeID := range decomNodeIDs {
 			if srv.NodeID() == nodeID {
-				expect = livenesspb.MembershipStatus_DECOMMISSIONED
-				break
+				omit = true
 			}
 		}
-		require.Eventually(t, func() bool {
-			liveness, ok := srv.NodeLiveness().(*liveness.NodeLiveness).GetLiveness(srv.NodeID())
-			return ok && liveness.Membership == expect
-		}, 5*time.Second, 100*time.Millisecond, "timed out waiting for node %v status %v", i, expect)
+		if omit {
+			continue
+		}
+		nl := srv.NodeLiveness().(*liveness.NodeLiveness)
+		testutils.SucceedsSoon(t, func() error {
+			entry, ok := nl.GetLiveness(srv.NodeID())
+			if !ok || entry.Membership != livenesspb.MembershipStatus_ACTIVE {
+				return errors.Errorf("n%d not ACTIVE: %v", srv.NodeID(), entry.Membership)
+			}
+			for _, nodeID := range decomNodeIDs {
+				entry, ok := nl.GetLiveness(nodeID)
+				if !ok || entry.Membership != livenesspb.MembershipStatus_DECOMMISSIONED {
+					return errors.Errorf("n%d not DECOMMISSIONED: %v", srv.NodeID(), entry.Membership)
+				}
+			}
+			return nil
+		})
 	}
 }
 
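A note on the design choice in the new assertion: the old check used require.Eventually with a hard 5-second deadline and a boolean condition, while the new one uses testutils.SucceedsSoon, which retries a func() error until it returns nil and surfaces the last error if it gives up, with a longer default deadline. Below is a rough, standalone sketch of that retry-until-nil idiom, assuming made-up timeout and polling values; it is not CockroachDB's actual helper.

// retry_sketch.go: a hypothetical sketch of the retry-until-nil idiom that
// testutils.SucceedsSoon provides. The real helper lives in pkg/testutils and
// has its own backoff and default timeout; the values below are made up.
package main

import (
	"fmt"
	"time"
)

// succeedsSoon retries fn until it returns nil or the deadline expires, and
// returns the last error if the condition was never met.
func succeedsSoon(deadline, tick time.Duration, fn func() error) error {
	var lastErr error
	for end := time.Now().Add(deadline); time.Now().Before(end); time.Sleep(tick) {
		if lastErr = fn(); lastErr == nil {
			return nil
		}
	}
	return fmt.Errorf("condition never met within %s: %w", deadline, lastErr)
}

func main() {
	start := time.Now()
	// Simulate a condition (e.g. "all active nodes see n4 as DECOMMISSIONED")
	// that only becomes true after half a second.
	err := succeedsSoon(2*time.Second, 100*time.Millisecond, func() error {
		if time.Since(start) < 500*time.Millisecond {
			return fmt.Errorf("cluster state not converged yet")
		}
		return nil
	})
	fmt.Println("result:", err) // result: <nil>
}

Returning an error rather than a bool also means each failed attempt can say which node's view is lagging, which is what the errors.Errorf messages in the diff provide.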