
Commit 2f6e4f7

Merge pull request ceph#52327 from batrick/i61865
doc: add information on expediting MDS recovery

Reviewed-by: Zac Dover <[email protected]>
2 parents 91af689 + 0a15144 commit 2f6e4f7

File tree

1 file changed: +127, -0 lines changed


doc/cephfs/troubleshooting.rst

Lines changed: 127 additions & 0 deletions
@@ -21,6 +21,133 @@ We can get hints about what's going on by dumping the MDS cache ::

If high logging levels are set on the MDS, that will almost certainly hold the
information we need to diagnose and solve the issue.

Stuck during recovery
=====================

Stuck in up:replay
------------------

If your MDS is stuck in ``up:replay`` then it is likely that the journal is
very long. Did you see ``MDS_HEALTH_TRIM`` cluster warnings saying the MDS is
behind on trimming its journal? If the journal has grown very large, it can
take hours to read. There is no way to work around this, but there are things
you can do to speed it along:
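
If you are unsure whether the journal had grown, the cluster health output is
one place to check. A minimal sketch, assuming the standard ``ceph health
detail`` output and GNU grep:

.. code:: bash

   # Look for the trim warning and its detail lines in the cluster health output.
   ceph health detail | grep -A 2 MDS_HEALTH_TRIM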

Reduce MDS debugging to 0. Even at the default settings, the MDS logs some
messages to memory for dumping if a fatal error is encountered. You can avoid
this:

.. code:: bash

   ceph config set mds debug_mds 0
   ceph config set mds debug_ms 0
   ceph config set mds debug_monc 0

Note that if the MDS fails, there will be virtually no information to determine
why. If you can calculate when ``up:replay`` will complete, you should restore
these configs just prior to entering the next state:

.. code:: bash

   ceph config rm mds debug_mds
   ceph config rm mds debug_ms
   ceph config rm mds debug_monc

Once you've got replay moving along faster, you can calculate when the MDS will
complete. This is done by examining the journal replay status:

.. code:: bash

   $ ceph tell mds.<fs_name>:0 status | jq .replay_status
   {
     "journal_read_pos": 4195244,
     "journal_write_pos": 4195244,
     "journal_expire_pos": 4194304,
     "num_events": 2,
     "num_segments": 2
   }

Replay completes when the ``journal_read_pos`` reaches the
``journal_write_pos``. The write position will not change during replay. Track
the progression of the read position to compute the expected time to complete.
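
As a rough way to turn that into an estimate, you could sample the read
position twice and extrapolate. This is a minimal sketch: the file system name
``cephfs``, the 60-second sampling window, and the use of ``jq`` are
assumptions to adjust for your cluster:

.. code:: bash

   # Sample journal_read_pos twice, 60 seconds apart, then extrapolate how long
   # until it reaches journal_write_pos.
   fs=cephfs
   r1=$(ceph tell mds.${fs}:0 status | jq .replay_status.journal_read_pos)
   sleep 60
   r2=$(ceph tell mds.${fs}:0 status | jq .replay_status.journal_read_pos)
   w=$(ceph tell mds.${fs}:0 status | jq .replay_status.journal_write_pos)
   rate=$(( (r2 - r1) / 60 ))    # bytes replayed per second
   if [ "$rate" -gt 0 ]; then
       echo "estimated seconds remaining: $(( (w - r2) / rate ))"
   else
       echo "no measurable progress during the sampling window"
   fi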

Avoiding recovery roadblocks
----------------------------

When trying to urgently restore your file system during an outage, here are
some things to do:

* **Deny all reconnect to clients.** This effectively blocklists all existing
  CephFS sessions so all mounts will hang or become unavailable.

  .. code:: bash

     ceph config set mds mds_deny_all_reconnect true

  Remember to undo this after the MDS becomes active (see the cleanup sketch
  after this list).

  .. note:: This does not prevent new sessions from connecting. For that, see the ``refuse_client_session`` file system setting.

* **Extend the MDS heartbeat grace period.** This avoids replacing an MDS that
  appears "stuck" doing some operation. Sometimes recovery of an MDS may
  involve an operation that takes longer than expected (from the programmer's
  perspective). This is more likely when recovery is already taking longer than
  normal to complete (indicated by your reading this document). Avoid
  unnecessary replacement loops by extending the heartbeat grace period:

  .. code:: bash

     ceph config set mds mds_heartbeat_reset_grace 3600

  This has the effect of having the MDS continue to send beacons to the
  monitors even when its internal "heartbeat" mechanism has not been reset
  (beat) in one hour. Note that the previous mechanism for achieving this was
  the ``mds_beacon_grace`` monitor setting.

* **Disable open file table prefetch.** Normally, the MDS will prefetch
  directory contents during recovery to heat up its cache. During a long
  recovery, the cache is probably already hot **and large**, so this behavior
  can be undesirable. Disable it using:

  .. code:: bash

     ceph config set mds mds_oft_prefetch_dirfrags false

* **Turn off clients.** Clients reconnecting to the newly ``up:active`` MDS may
  cause new load on the file system when it's just getting back on its feet.
  There will likely be some general maintenance to do before workloads should
  be resumed. For example, expediting journal trim may be advisable if the
  recovery took a long time because replay was reading an overly large journal.

  You can do this manually or use the new file system tunable:

  .. code:: bash

     ceph fs set <fs_name> refuse_client_session true

  That prevents any clients from establishing new sessions with the MDS.
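
Once the MDS is stable in ``up:active`` and you are ready to let clients back
in, the temporary settings above should be reverted. Below is a minimal cleanup
sketch, assuming you applied every setting in this list; the ``false`` value
for ``refuse_client_session`` is an assumption based on the ``true`` shown
above:

.. code:: bash

   # Revert the temporary recovery settings; ``ceph config rm`` restores each
   # option to its default.
   ceph config rm mds mds_deny_all_reconnect
   ceph config rm mds mds_heartbeat_reset_grace
   ceph config rm mds mds_oft_prefetch_dirfrags
   # Allow clients to establish new sessions again.
   ceph fs set <fs_name> refuse_client_session false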

Expediting MDS journal trim
===========================

If your MDS journal grew too large (maybe your MDS was stuck in ``up:replay``
for a long time!), you will want to have the MDS trim its journal more
frequently. You will know the journal is too large because of
``MDS_HEALTH_TRIM`` warnings.

The main tunable available to do this is the MDS tick interval. The "tick"
interval drives several upkeep activities in the MDS. It is strongly
recommended that no significant file system load be present when modifying
this tick interval. This setting only affects an MDS in ``up:active``. The MDS
does not trim its journal during recovery.

.. code:: bash

   ceph config set mds mds_tick_interval 2
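
Once the ``MDS_HEALTH_TRIM`` warning clears, you will probably want to return
the tick interval to its default. A minimal sketch, following the same ``ceph
config rm`` pattern used above:

.. code:: bash

   # Return the tick interval to its default after the journal has been trimmed.
   ceph config rm mds mds_tick_interval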

RADOS Health
============
