@@ -76,90 +76,102 @@ the progression of the read position to compute the expected time to complete.
 Avoiding recovery roadblocks
 ----------------------------

-When trying to urgently restore your file system during an outage, here are some
-things to do:
+Do the following when restoring your file system:

-* **Deny all reconnect to clients.** This effectively blocklists all existing
-  CephFS sessions so all mounts will hang or become unavailable.
+* **Deny all reconnection to clients.** Blocklist all existing CephFS sessions,
+  causing all mounts to hang or become unavailable:

-  .. code:: bash
+  .. prompt:: bash #

      ceph config set mds mds_deny_all_reconnect true

   Remember to undo this after the MDS becomes active.

-  .. note:: This does not prevent new sessions from connecting. For that, see the ``refuse_client_session`` file system setting.
+  .. note:: This does not prevent new sessions from connecting. Use the
+     ``refuse_client_session`` file-system setting to prevent new sessions from
+     connecting to the CephFS.
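+
+  For example, once the MDS is ``up:active`` again, this setting can be undone
+  by switching it back to ``false`` (assumed here to be its ordinary value):
+
+  .. prompt:: bash #
+
+     ceph config set mds mds_deny_all_reconnect false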

-* **Extend the MDS heartbeat grace period**. This avoids replacing an MDS that appears
-  "stuck" doing some operation. Sometimes recovery of an MDS may involve an
-  operation that may take longer than expected (from the programmer's
-  perspective). This is more likely when recovery is already taking a longer than
-  normal amount of time to complete (indicated by your reading this document).
-  Avoid unnecessary replacement loops by extending the heartbeat graceperiod:
+* **Extend the MDS heartbeat grace period.** This avoids replacing an MDS that
+  appears "stuck" during some operation. Sometimes recovery of an MDS may
+  involve an operation that takes longer than expected (from the programmer's
+  perspective). This is more likely when recovery is already taking longer than
+  normal to complete (indicated by your reading this document). Avoid
+  unnecessary replacement loops by running the following command and extending
+  the heartbeat grace period:

-  .. code:: bash
+  .. prompt:: bash #

-     ceph config set mds mds_heartbeat_grace 3600
+     ceph config set mds mds_heartbeat_grace 3600

-  .. note:: This has the effect of having the MDS continue to send beacons to the monitors
-     even when its internal "heartbeat" mechanism has not been reset (beat) in one
-     hour. The previous mechanism for achieving this was via the
-     `mds_beacon_grace` monitor setting.
+  .. note:: This causes the MDS to continue to send beacons to the monitors
+     even when its internal "heartbeat" mechanism has not been reset (it has
+     not beaten) in one hour. In the past, this was achieved with the
+     ``mds_beacon_grace`` monitor setting.
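+
+  As a sketch of the cleanup step, the temporary override can be removed once
+  recovery is done (this assumes the generic ``ceph config rm`` command, which
+  is not part of the procedure above):
+
+  .. prompt:: bash #
+
+     ceph config rm mds mds_heartbeat_grace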

-* **Disable open file table prefetch.** Normally, the MDS will prefetch
-  directory contents during recovery to heat up its cache. During long
-  recovery, the cache is probably already hot **and large**. So this behavior
-  can be undesirable. Disable using:
+* **Disable open-file-table prefetch.** Under normal circumstances, the MDS
+  prefetches directory contents during recovery as a way of heating up its
+  cache. During a long recovery, the cache is probably already hot **and
+  large**. So this behavior is unnecessary and can be undesirable. Disable
+  open-file-table prefetching by running the following command:

-  .. code:: bash
+  .. prompt:: bash #

      ceph config set mds mds_oft_prefetch_dirfrags false
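+
+  If prefetching is wanted again after recovery, the option can presumably be
+  set back to ``true``:
+
+  .. prompt:: bash #
+
+     ceph config set mds mds_oft_prefetch_dirfrags true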

-* **Turn off clients.** Clients reconnecting to the newly ``up:active`` MDS may
-  cause new load on the file system when it's just getting back on its feet.
-  There will likely be some general maintenance to do before workloads should be
-  resumed. For example, expediting journal trim may be advisable if the recovery
-  took a long time because replay was reading a overly large journal.
+* **Turn off clients.** Clients that reconnect to the newly ``up:active`` MDS
+  can create new load on the file system just as it is becoming operational.
+  Maintenance is often necessary before allowing clients to connect to the file
+  system and resuming a regular workload. For example, expediting the trimming
+  of journals may be advisable if the recovery took a long time because replay
+  was reading a very large journal.

-  You can do this manually or use the new file system tunable:
+  Client sessions can be refused manually, or by using the
+  ``refuse_client_session`` tunable as in the following command:

-  .. code:: bash
+  .. prompt:: bash #

      ceph fs set <fs_name> refuse_client_session true

-  That prevents any clients from establishing new sessions with the MDS.
+  This command has the effect of preventing clients from establishing new
+  sessions with the MDS.
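+
+  When the file system is ready to accept workloads again, the same tunable can
+  be set back to ``false`` so that new client sessions are allowed:
+
+  .. prompt:: bash #
+
+     ceph fs set <fs_name> refuse_client_session false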

-* **Dont tweak max_mds** Modifying the FS setting variable ``max_mds`` is
-  sometimes perceived as a good step during troubleshooting or recovery effort.
-  Instead, doing so might further destabilize the cluster. If ``max_mds`` must
-  be changed in such circumstances, run the command to change ``max_mds`` with
-  the confirmation flag (``--yes-i-really-mean-it``)
+* **Do not tweak max_mds.** Modifying the file system setting variable
+  ``max_mds`` is sometimes thought to be a good step during troubleshooting or
+  recovery. But modifying ``max_mds`` might have the effect of further
+  destabilizing the cluster. If ``max_mds`` must be changed in such
+  circumstances, run the command to change ``max_mds`` with the confirmation
+  flag (``--yes-i-really-mean-it``).
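+
+  For example, to set the file system to a single active MDS (``<fs_name>`` and
+  the value ``1`` are illustrative placeholders):
+
+  .. prompt:: bash #
+
+     ceph fs set <fs_name> max_mds 1 --yes-i-really-mean-it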

 .. _pause-purge-threads:

-* **Turn off async purge threads** The volumes plugin spawns threads for
-  asynchronously purging trashed/deleted subvolumes. To help troubleshooting or
-  recovery effort, these purge threads can be disabled using:
+* **Turn off async purge threads.** The volumes plugin spawns threads that
+  asynchronously purge trashed or deleted subvolumes. During troubleshooting or
+  recovery, these purge threads can be disabled by running the following
+  command:

-  .. code:: bash
+  .. prompt:: bash #

      ceph config set mgr mgr/volumes/pause_purging true

-  To resume purging run::
+  To resume purging, run the following command:
+
+  .. prompt:: bash #

      ceph config set mgr mgr/volumes/pause_purging false

 .. _pause-clone-threads:

-* **Turn off async cloner threads** The volumes plugin spawns threads for
-  asynchronously cloning subvolume snapshots. To help troubleshooting or
-  recovery effort, these cloner threads can be disabled using :
+* **Turn off async cloner threads.** The volumes plugin spawns threads that
+  asynchronously clone subvolume snapshots. During troubleshooting or recovery,
+  these cloner threads can be disabled by running the following command:

-  .. code:: bash
+  .. prompt:: bash #

      ceph config set mgr mgr/volumes/pause_cloning true

-  To resume cloning run::
+  To resume cloning, run the following command:
+
+  .. prompt:: bash #

      ceph config set mgr mgr/volumes/pause_cloning false