Commit 9f54ab3

Merge pull request ceph#64122 from kamoltat/wip-ksirivad-fix-msg-v2
msg/async/ProtocolV2: Speed up connection logic when peer restarts & Added Netsplit Grace Period
2 parents 70417d2 + 5ee0bbf commit 9f54ab3

29 files changed: +533 -91 lines changed

doc/rados/operations/health-checks.rst

Lines changed: 18 additions & 0 deletions
@@ -167,6 +167,12 @@ which are frequently updated. This warning only appears when
 the cluster is provisioned with at least three Ceph Monitors and are using the
 ``connectivity`` election strategy.
 
+To reduce false alarms from transient network issues, detected netsplits are
+not immediately reported as health warnings. Instead, they must persist for at
+least ``mon_netsplit_grace_period`` seconds (default: 9 seconds) before being
+reported. If the network partition resolves within this grace period, no health
+warning is emitted.
+
 Network partitions are reported in two ways:
 
 - As location-level netsplits (e.g., "Netsplit detected between dc1 and dc2") when
@@ -177,6 +183,18 @@ Network partitions are reported in two ways:
 The system prioritizes reporting at the highest topology level (``datacenter``, ``rack``, etc.)
 when possible, to better help operators identify infrastructure-level network issues.
 
+To adjust the grace period threshold, run the following command:
+
+.. prompt:: bash $
+
+   ceph config set mon mon_netsplit_grace_period <seconds>
+
+To disable the grace period entirely (immediate reporting), set the value to 0:
+
+.. prompt:: bash $
+
+   ceph config set mon mon_netsplit_grace_period 0
+
 AUTH_INSECURE_GLOBAL_ID_RECLAIM
 _______________________________
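
The grace-period behavior described in the documentation change above can be illustrated with a small sketch. This is not the monitor's actual implementation: the class name ``NetsplitGraceTracker``, its ``update`` method, and the pair-of-locations representation are hypothetical and exist only for illustration. The sketch assumes that a netsplit is reported once it has been continuously detected for at least the grace period, and that a resolved netsplit resets its timer.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set, Tuple

# A netsplit is modeled here as a pair of location names (hypothetical representation).
Pair = Tuple[str, str]


@dataclass
class NetsplitGraceTracker:
    """Illustrative debounce of netsplit reports; not Ceph's real code."""

    # Mirrors the documented default of mon_netsplit_grace_period (9 seconds).
    grace_period: float = 9.0
    # Time at which each currently detected netsplit was first seen.
    first_seen: Dict[Pair, float] = field(default_factory=dict)

    def update(self, detected: Set[Pair], now: float) -> List[Pair]:
        """Record the currently detected netsplits and return those to report."""
        # A netsplit that is no longer detected has resolved: drop its timer.
        for pair in list(self.first_seen):
            if pair not in detected:
                del self.first_seen[pair]
        # Start a timer for netsplits seen for the first time.
        for pair in detected:
            self.first_seen.setdefault(pair, now)
        # Report only netsplits that have persisted for at least the grace period.
        return sorted(
            pair
            for pair, start in self.first_seen.items()
            if now - start >= self.grace_period
        )


if __name__ == "__main__":
    tracker = NetsplitGraceTracker(grace_period=9.0)
    split = {("dc1", "dc2")}
    print(tracker.update(split, now=0.0))  # [] (just detected)
    print(tracker.update(split, now=5.0))  # [] (still inside the grace period)
    print(tracker.update(split, now=9.0))  # [('dc1', 'dc2')]
```

With ``grace_period=0.0`` the same sketch reports a netsplit on the first call that detects it, which corresponds to the "immediate reporting" case described for a value of 0.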

qa/config/rados.yaml

Lines changed: 1 addition & 0 deletions
@@ -11,3 +11,4 @@ overrides:
 osd mclock profile: high_recovery_ops
 mon:
 mon scrub interval: 300
+debug mon: 30

qa/suites/netsplit/ceph.yaml

Lines changed: 1 addition & 0 deletions
@@ -31,6 +31,7 @@ overrides:
 - \(PG_AVAILABILITY\)
 - \(SLOW_OPS\)
 - \[WRN\]
+- \(MON_NETSPLIT\)
 tasks:
 - install:
 - ceph:

qa/suites/rados/monthrash/ceph.yaml

Lines changed: 1 addition & 0 deletions
@@ -13,6 +13,7 @@ overrides:
 mon osdmap full prune txsize: 2
 mon scrub inject crc mismatch: 0.01
 mon scrub inject missing keys: 0.05
+debug ms: 20
 # thrashing monitors may make mgr have trouble w/ its keepalive
 log-ignorelist:
 - ScrubResult

qa/suites/rados/monthrash/msgr-failures/few.yaml

Lines changed: 1 addition & 0 deletions
@@ -6,3 +6,4 @@ overrides:
 mon client directed command retry: 5
 log-ignorelist:
 - \(OSD_SLOW_PING_TIME
+- \(MON_NETSPLIT)

qa/suites/rados/monthrash/msgr-failures/mon-delay.yaml

Lines changed: 1 addition & 0 deletions
@@ -12,3 +12,4 @@ overrides:
 debug monc: 10
 log-ignorelist:
 - \(OSD_SLOW_PING_TIME
+- \(MON_NETSPLIT)
Lines changed: 6 additions & 0 deletions
@@ -0,0 +1,6 @@
+overrides:
+  ceph:
+    conf:
+      global:
+        ms inject socket failures: 0
+

qa/suites/rados/multimon/ceph.yaml

Lines changed: 6 additions & 0 deletions
@@ -0,0 +1,6 @@
+overrides:
+  ceph:
+    conf:
+      mon:
+        debug ms: 20
+

qa/suites/rados/multimon/msgr-failures/few.yaml

Lines changed: 1 addition & 0 deletions
@@ -6,3 +6,4 @@ overrides:
 mon client directed command retry: 5
 log-ignorelist:
 - \(OSD_SLOW_PING_TIME
+- \(MON_NETSPLIT)

qa/suites/rados/multimon/msgr-failures/many.yaml

Lines changed: 1 addition & 0 deletions
@@ -7,3 +7,4 @@ overrides:
 mon mgr beacon grace: 90
 log-ignorelist:
 - \(OSD_SLOW_PING_TIME
+- \(MON_NETSPLIT)
