Commit b570e54
DAOS-9576 chk: aggregated patch for cat_recovery (#13718)
* DAOS-15084 chk: aggregated patch for cat_recovery
DAOS employs various fault tolerance mechanisms to cope with regular temporary
or permanent hardware failures, such as the Raft engine used for pool/container metadata,
EC/Replica used for user data, etc. These mechanisms ensure the system survives
most regular failures, and automatic self-healing mechanisms are in place to
bring back redundancy once a tolerable failure happens. However, several factors can
challenge this design:
- Users can control (and lower) the data protection on a per-container basis.
- The system may face unexpected events that cause more failures than it is
designed to tolerate.
- The self-healing mechanism may fail in some specific conditions (e.g., ENOSPC
on the surviving nodes).
- Hardware bugs (e.g., broken flush, firmware issue, data corruption, ...) can cause
massive corruption.
- Software bugs (e.g., corner cases, overflow, ...).
- Human errors.
The DAOS catastrophic recovery feature is introduced to address the failure cases
above. While it is unreasonable to assume that all cases can be covered, this feature
covers the most likely ones. The first goal is to detect corruptions and distributed
consistency issues and then offer a remediation path whenever possible.
Remediation options can range from a transparent, automatic fix to a manual repair
or deletion of a pool or container. If catastrophic recovery fails, then the system will
ultimately have to be reformatted. Another aspect to take into account is that the
check and repair should complete in a reasonable amount of time and the framework
should provide estimates on how long it is expected to take for each pool and allow
the administrator to prioritize some pools over others.
This patch allows offline check & repair (when possible) of a DAOS system.
Signed-off-by: Fan Yong <[email protected]>
Signed-off-by: Dalton Bohning <[email protected]>
Signed-off-by: Kris Jacque <[email protected]>
Co-authored-by: Dalton Bohning <[email protected]>
1 parent 9de9b71 commit b570e54
File tree
344 files changed, +65808 -1165 lines changed
- ci
- debian
- src
- bio
- cart
- chk
- common
- container
- control
- cmd
- daos
- ddb
- dmg
- pretty
- common/proto
- chk
- ctl
- mgmt
- srv
- drpc
- fault/code
- lib
- control
- ui
- security
- server
- engine
- system
- checker
- raft
- testdata/raft_recovery
- snapshots
- 2-11-1665528548388
- 2-19-1665528549936
- 2-20-1710888709486
- 2-44-1710888711237
- vendor
- github.com
- desertbit
- closer/v3
- columnize
- go-shlex
- grumble
- readline
- hashicorp
- errwrap
- go-multierror
- ddb
- tests
- dtx
- engine
- tests
- gurt/tests
- include
- cart
- daos_srv
- daos
- mgmt
- tests
- object
- pool
- proto
- chk
- ctl
- mgmt
- srv
- rdb
- tests
- rebuild
- rsvc
- tests
- ftest
- daos_test
- recovery
- util
- suite
- vea
- vos
- utils
- completion
- cq
- cr_demo
- rpms