Commit e87b2e0

Merge pull request ceph#51322 from zdover23/wip-doc-2023-05-03-rados-operations-stretch-mode-stretch-mode-issues
doc/rados: stretch-mode: stretch cluster issues Reviewed-by: Greg Farnum <[email protected]>
2 parents ecfeb18 + 6c1baff commit e87b2e0

File tree

1 file changed: +33 -20 lines changed

doc/rados/operations/stretch-mode.rst

Lines changed: 33 additions & 20 deletions
@@ -48,26 +48,39 @@ place that will cause the cluster to re-replicate the data until the
 
 Stretch Cluster Issues
 ======================
-No matter what happens, Ceph will not compromise on data integrity
-and consistency. If there's a failure in your network or a loss of nodes and
-you can restore service, Ceph will return to normal functionality on its own.
-
-But there are scenarios where you lose data availability despite having
-enough servers available to satisfy Ceph's consistency and sizing constraints, or
-where you may be surprised to not satisfy Ceph's constraints.
-The first important category of these failures resolve around inconsistent
-networks -- if there's a netsplit, Ceph may be unable to mark OSDs down and kick
-them out of the acting PG sets despite the primary being unable to replicate data.
-If this happens, IO will not be permitted, because Ceph can't satisfy its durability
-guarantees.
-
-The second important category of failures is when you think you have data replicated
-across data centers, but the constraints aren't sufficient to guarantee this.
-For instance, you might have data centers A and B, and your CRUSH rule targets 3 copies
-and places a copy in each data center with a ``min_size`` of 2. The PG may go active with
-2 copies in site A and no copies in site B, which means that if you then lose site A you
-have lost data and Ceph can't operate on it. This situation is surprisingly difficult
-to avoid with standard CRUSH rules.
+Ceph does not permit the compromise of data integrity and data consistency
+under any circumstances. When service is restored after a network failure or a
+loss of Ceph nodes, Ceph will restore itself to a state of normal functioning
+without operator intervention.
+
+Ceph does not permit the compromise of data integrity or data consistency, but
+there are situations in which *data availability* is compromised. These
+situations can occur even though there are enough servers available to satisfy
+Ceph's consistency and sizing constraints. In some situations, you might
+discover that your cluster does not satisfy those constraints.
+
+The first category of these failures that we will discuss involves inconsistent
+networks -- if there is a netsplit (a disconnection between two servers that
+splits the network into two pieces), Ceph might be unable to mark OSDs ``down``
+and remove them from the acting PG sets. This failure to mark OSDs ``down``
+will occur, despite the fact that the primary PG is unable to replicate data (a
+situation that, under normal non-netsplit circumstances, would result in the
+marking of affected OSDs as ``down`` and their removal from the PG). If this
+happens, Ceph will be unable to satisfy its durability guarantees and
+consequently IO will not be permitted.
+
+The second category of failures that we will discuss involves the situation in
+which the constraints are not sufficient to guarantee the replication of data
+across data centers, though it might seem that the data is correctly replicated
+across data centers. For example, in a scenario in which there are two data
+centers named Data Center A and Data Center B, and the CRUSH rule targets three
+replicas and places a replica in each data center with a ``min_size`` of ``2``,
+the PG might go active with two replicas in Data Center A and zero replicas in
+Data Center B. In a situation of this kind, the loss of Data Center A means
+that the data is lost and Ceph will not be able to operate on it. This
+situation is surprisingly difficult to avoid using only standard CRUSH rules.
 
 Stretch Mode
 ============
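The netsplit scenario in the added text can be sketched in a few lines of plain Python. This is an illustration only, not Ceph code: the function name and OSD labels are invented for the sketch. The key point it models is that a replicated write must reach every member of the PG's acting set, and a netsplit leaves unreachable OSDs in the acting set because the monitors cannot mark them ``down``.

```python
# Illustrative sketch only (not Ceph code). A replicated write completes
# only when the primary can replicate to every OSD in the PG's acting set.
# During a netsplit, peers are unreachable but have NOT been marked down,
# so they remain in the acting set and IO blocks rather than compromise
# Ceph's durability guarantees.

def write_permitted(acting_set: set, reachable: set) -> bool:
    """Return True if every OSD in the acting set is reachable."""
    return acting_set <= reachable

acting = {"osd.0", "osd.1", "osd.2"}

# Normal operation: all acting OSDs reachable, so IO proceeds.
assert write_permitted(acting, {"osd.0", "osd.1", "osd.2"})

# Netsplit: osd.2 is unreachable but still in the acting set (the monitors
# could not mark it down and remove it), so the write cannot complete.
assert not write_permitted(acting, {"osd.0", "osd.1"})
```

In a healthy cluster the unreachable OSD would be marked ``down`` and removed from the acting set, after which writes could proceed with the remaining replicas; the netsplit prevents exactly that step.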

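The ``min_size`` pitfall in the second paragraph can be illustrated the same way. Again this is a sketch, not Ceph code: with ``size=3`` and ``min_size=2``, activeness depends only on the total number of up replicas, and nothing in a standard CRUSH rule forces the surviving replicas to span both data centers.

```python
# Illustrative sketch only (not Ceph code). A PG can go active (serve IO)
# whenever at least min_size replicas are up, regardless of how those
# replicas are distributed across data centers.

MIN_SIZE = 2  # pool has size=3, min_size=2, as in the example above

def pg_active(replicas_per_site: dict) -> bool:
    """Return True if the total number of up replicas meets min_size."""
    return sum(replicas_per_site.values()) >= MIN_SIZE

# The PG goes active with both up replicas in Data Center A and none in B,
# so new writes land only in site A.
placement = {"A": 2, "B": 0}
assert pg_active(placement)

# Data Center A is then lost: every up-to-date replica is gone.
after_site_loss = {"A": 0, "B": 0}
assert not pg_active(after_site_loss)
```

A placement of one replica per site (``{"A": 1, "B": 1}``) would also satisfy ``min_size``, which is why the two-sites-with-copies picture seems safe; the problem is that the all-in-one-site placement satisfies it equally well.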