Commit 582e531

asim: reproduce loss of quorum during zone cfg change with suspect nodes
I believe this newly added datadriven test reproduces #152604. The test sets up five nodes with 5x replication, marks n4 and n5 as non-live, and drops the replication factor to 3. We see that the allocator merrily removes replicas from n1-n3 and loses quorum in the process. If it removed any replicas in this scenario, it really ought to be removing them from n4 and n5.

> next replica action: remove voter
> removing voting replica n2,s2 due to over-replication: [1*:2, 2:2, 3:2, 4:2, 5:2]

Then:

> unable to take action - live voters [(n1,s1):1 (n3,s3):3] don't meet quorum of 3

Informs #152604.

Epic: none
1 parent b97d3b3 commit 582e531
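
To make the quorum arithmetic in the quoted logs concrete: the range starts with five voters (quorum 3), three of which are live, so it is healthy. Removing the live voter n2 leaves four voters (quorum still 3) with only two of them live, which is why the next action fails. Below is a minimal standalone sketch of that arithmetic in Go; the helper names are illustrative and this is not the actual allocator code.

package main

import "fmt"

// quorum returns the majority size for a voter set of the given size.
func quorum(numVoters int) int {
	return numVoters/2 + 1
}

// numLive counts how many of the given voters sit on live nodes.
func numLive(voters []int, live map[int]bool) int {
	n := 0
	for _, nodeID := range voters {
		if live[nodeID] {
			n++
		}
	}
	return n
}

func main() {
	live := map[int]bool{1: true, 2: true, 3: true, 4: false, 5: false}

	before := []int{1, 2, 3, 4, 5} // five voters, quorum 3, three live -> healthy
	after := []int{1, 3, 4, 5}     // after removing the live voter n2

	fmt.Println(numLive(before, live), quorum(len(before))) // 3 3 -> quorum met
	fmt.Println(numLive(after, live), quorum(len(after)))   // 2 3 -> quorum lost
}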

Lines changed: 48 additions & 0 deletions
@@ -0,0 +1,48 @@
# This test reproduces the issue observed in #152604, where
# a zone config change prefers removing live replicas despite
# the presence of recently down replicas. This leads to loss
# of quorum.
#
# See:
# - https://github.com/cockroachdb/cockroach/issues/152604
# - https://github.com/cockroachdb/cockroach/issues/155734

gen_cluster nodes=5
----

# Place ranges, replicated across all five nodes.
gen_ranges ranges=100 repl_factor=5 min_key=1 max_key=10000
----

# Mark n4 and n5 as NodeLivenessStatus_UNAVAILABLE, which is the status
# stores have when down but not down for long enough to be marked as dead.
# The range doesn't lose quorum as a result of this, since three replicas
# are still around.
set_liveness node=4 liveness=unavailable
----

set_liveness node=5 liveness=unavailable
----

# Trigger down-replication to three replicas.

set_span_config
[0,10000): num_replicas=3 num_voters=3
----

# Note how s4 and s5 retain their replicas, while replicas are being
# removed from live nodes s1-s3. This leads to a loss of quorum that
# isn't immediately obvious since this is an asim test, but the logs
# show that the allocator itself realizes it (when trying to make the
# next change), but by then it is too late.
#
# In the real world, as of #156464, these dangerous replication changes
# would be blocked, but it is far from ideal that they are attempted
# in the first place.
eval duration=10m cfgs=(sma-count) metrics=(replicas)
----
replicas#1: first: [s1=101, s2=101, s3=101, s4=101, s5=101] (stddev=0.00, mean=101.00, sum=505)
replicas#1: last: [s1=68, s2=67, s3=68, s4=101, s5=101] (stddev=16.33, mean=81.00, sum=405)
replicas#1: thrash_pct: [s1=0%, s2=0%, s3=0%, s4=0%, s5=0%] (sum=0%)
artifacts[sma-count]: ff4c6613afd4b749
==========================
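
The behavior the test above exercises, sketched as a standalone example: when shrinking from five voters to three with n4 and n5 non-live, a removal policy that prefers non-live replicas drops n4 and n5 and keeps quorum, whereas one that ignores liveness can pick a live replica such as n2, as the quoted log shows. The Go sketch below is purely illustrative; it does not reflect the actual CockroachDB allocator code or its APIs.

package main

import "fmt"

// pickRemoval returns the node whose replica should be removed when a range
// is over-replicated. Illustrative policy only: when preferNonLive is set,
// replicas on non-live nodes are removed before any live replica is touched.
func pickRemoval(voters []int, live map[int]bool, preferNonLive bool) int {
	if preferNonLive {
		for _, nodeID := range voters {
			if !live[nodeID] {
				return nodeID
			}
		}
	}
	// Stand-in for whatever ranking the real allocator applies
	// (e.g. balancing replica counts across stores).
	return voters[0]
}

func main() {
	// Assume n2 happens to rank first for removal by the load-based ordering.
	voters := []int{2, 1, 3, 4, 5}
	live := map[int]bool{1: true, 2: true, 3: true, 4: false, 5: false}

	fmt.Println(pickRemoval(voters, live, false)) // 2: removes a live voter, quorum is lost
	fmt.Println(pickRemoval(voters, live, true))  // 4: removes a non-live voter, quorum survives
}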
