@@ -21,6 +21,133 @@ We can get hints about what's going on by dumping the MDS cache ::
If high logging levels are set on the MDS, that will almost certainly hold the
information we need to diagnose and solve the issue.

Stuck during recovery
=====================

Stuck in up:replay
------------------

If your MDS is stuck in ``up:replay`` then it is likely that the journal is
very long. Did you see ``MDS_HEALTH_TRIM`` cluster warnings saying the MDS is
behind on trimming its journal? If the journal has grown very large, it can
take hours to read. There is no way to work around this, but there are things
you can do to speed things along:

Reduce MDS debugging to 0. Even at the default settings, the MDS logs some
messages to memory for dumping if a fatal error is encountered. You can avoid
this with:

.. code:: bash

   ceph config set mds debug_mds 0
   ceph config set mds debug_ms 0
   ceph config set mds debug_monc 0

Note that if the MDS fails, there will be virtually no information available to
determine why. If you can calculate when ``up:replay`` will complete, you
should restore these configs just prior to entering the next state:

.. code:: bash

   ceph config rm mds debug_mds
   ceph config rm mds debug_ms
   ceph config rm mds debug_monc

Once you've got replay moving along faster, you can estimate when the MDS will
complete. This is done by examining the journal replay status:

.. code:: bash

   $ ceph tell mds.<fs_name>:0 status | jq .replay_status
   {
     "journal_read_pos": 4195244,
     "journal_write_pos": 4195244,
     "journal_expire_pos": 4194304,
     "num_events": 2,
     "num_segments": 2
   }

Replay completes when the ``journal_read_pos`` reaches the
``journal_write_pos``. The write position will not change during replay. Track
the progression of the read position to compute the expected time to complete.

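For example, a rough estimate of the remaining replay time can be obtained by
sampling ``journal_read_pos`` twice and extrapolating. This is only a sketch:
it assumes ``jq`` is available, that ``<fs_name>`` is replaced with your file
system's name, and that the read position advances between the two samples:

.. code:: bash

   # Sample the replay read position twice, 60 seconds apart.
   READ1=$(ceph tell mds.<fs_name>:0 status | jq .replay_status.journal_read_pos)
   sleep 60
   READ2=$(ceph tell mds.<fs_name>:0 status | jq .replay_status.journal_read_pos)
   WRITE=$(ceph tell mds.<fs_name>:0 status | jq .replay_status.journal_write_pos)
   # Remaining bytes divided by bytes replayed per minute gives minutes left.
   echo "estimated minutes remaining: $(( (WRITE - READ2) / (READ2 - READ1) ))"
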
Avoiding recovery roadblocks
----------------------------

When trying to urgently restore your file system during an outage, here are some
things to do:

* **Deny all reconnect to clients.** This effectively blocklists all existing
  CephFS sessions so all mounts will hang or become unavailable.

  .. code:: bash

     ceph config set mds mds_deny_all_reconnect true

  Remember to undo this after the MDS becomes active.

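  For example, once the MDS is active again, one way to clear the setting is
  with the same ``ceph config`` mechanism used above:

  .. code:: bash

     ceph config rm mds mds_deny_all_reconnect
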
  .. note:: This does not prevent new sessions from connecting. For that, see
     the ``refuse_client_session`` file system setting.

* **Extend the MDS heartbeat grace period**. This avoids replacing an MDS that
  appears "stuck" doing some operation. Sometimes recovery of an MDS may
  involve an operation that takes longer than expected (from the programmer's
  perspective). This is more likely when recovery is already taking longer
  than normal to complete (indicated by your reading this document). Avoid
  unnecessary replacement loops by extending the heartbeat grace period:

  .. code:: bash

     ceph config set mds mds_heartbeat_reset_grace 3600

  This has the effect of having the MDS continue to send beacons to the
  monitors even when its internal "heartbeat" mechanism has not been reset
  (beat) in one hour. Note that the previous mechanism for achieving this was
  via the ``mds_beacon_grace`` monitor setting.

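  As with the debug settings earlier in this document, you may want to remove
  this override once the MDS is stable in ``up:active``, for example:

  .. code:: bash

     ceph config rm mds mds_heartbeat_reset_grace
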
* **Disable open file table prefetch.** Normally, the MDS will prefetch
  directory contents during recovery to heat up its cache. During a long
  recovery, the cache is probably already hot **and large**, so this behavior
  can be undesirable. Disable it using:

  .. code:: bash

     ceph config set mds mds_oft_prefetch_dirfrags false

* **Turn off clients.** Clients reconnecting to the newly ``up:active`` MDS may
  cause new load on the file system when it's just getting back on its feet.
  There will likely be some general maintenance to do before workloads should
  be resumed. For example, expediting journal trim may be advisable if the
  recovery took a long time because replay was reading an overly large journal.

  You can do this manually or use the new file system tunable:

  .. code:: bash

     ceph fs set <fs_name> refuse_client_session true

  That prevents any clients from establishing new sessions with the MDS.


Expediting MDS journal trim
===========================

If your MDS journal grew too large (maybe your MDS was stuck in ``up:replay``
for a long time!), you will want to have the MDS trim its journal more
frequently. You will know the journal is too large because of
``MDS_HEALTH_TRIM`` warnings.

The main tunable available for this is the MDS tick interval. The "tick"
interval drives several upkeep activities in the MDS. It is strongly
recommended that no significant file system load be present when modifying
this tick interval. This setting only affects an MDS in ``up:active``. The MDS
does not trim its journal during recovery.

.. code:: bash

   ceph config set mds mds_tick_interval 2

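Once the journal has been trimmed back down and the ``MDS_HEALTH_TRIM``
warnings have cleared, you may want to restore the default tick interval. One
way to do that, using the same ``ceph config`` mechanism as above:

.. code:: bash

   ceph config rm mds mds_tick_interval
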
RADOS Health
============
