|
| 1 | +--- |
| 2 | +title: "Ozone Repair" |
| 3 | +date: 2025-07-22 |
| 4 | +summary: Advanced tool to repair Ozone. |
| 5 | +--- |
| 6 | +<!--- |
| 7 | + Licensed to the Apache Software Foundation (ASF) under one or more |
| 8 | + contributor license agreements. See the NOTICE file distributed with |
| 9 | + this work for additional information regarding copyright ownership. |
| 10 | + The ASF licenses this file to You under the Apache License, Version 2.0 |
| 11 | + (the "License"); you may not use this file except in compliance with |
| 12 | + the License. You may obtain a copy of the License at |
| 13 | +
|
| 14 | + http://www.apache.org/licenses/LICENSE-2.0 |
| 15 | +
|
| 16 | + Unless required by applicable law or agreed to in writing, software |
| 17 | + distributed under the License is distributed on an "AS IS" BASIS, |
| 18 | + WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. |
| 19 | + See the License for the specific language governing permissions and |
| 20 | + limitations under the License. |
| 21 | +--> |
| 22 | + |
| 23 | +Ozone Repair (`ozone repair`) is an advanced tool to repair Ozone. The nodes being repaired must be stopped before the tool is run. |
| 24 | +Note: All repair commands support a `--dry-run` option, which shows what repair the command would perform without making any changes to the cluster. |
| 25 | +Use the `--force` flag to override the running service check in false-positive cases. |
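A minimal sketch of previewing a repair before applying it. The `repair_preview` helper and the `/data/metadata/om.db` path are illustrative assumptions, not part of Ozone. The `om fso-tree` subcommand and its `--db` option come from the usage text in this document. The helper prints the command and runs it only if the `ozone` CLI is on the `PATH`.

```bash
# Hypothetical helper: append --dry-run to any "ozone repair" subcommand.
repair_preview() {
  # Rebuild the positional parameters as the full command line.
  set -- ozone repair "$@" --dry-run
  echo "$*"
  # Execute only when the ozone CLI is actually installed.
  if command -v ozone >/dev/null 2>&1; then
    "$@"
  fi
}

# Hypothetical DB path; replace with the OM RocksDB location on the stopped node.
repair_preview om fso-tree --db=/data/metadata/om.db
```

Once the preview output looks right, the same command can be rerun without `--dry-run` to apply the repair.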
| 26 | + |
| 27 | +```bash |
| 28 | +Usage: ozone repair [-hV] [--verbose] [-conf=<configurationPath>] |
| 29 | + [-D=<String=String>]... [COMMAND] |
| 30 | +Advanced tool to repair Ozone. The nodes being repaired must be stopped before |
| 31 | +the tool is run. |
| 32 | + -conf=<configurationPath> |
| 33 | + |
| 34 | + -D, --set=<String=String> |
| 35 | + |
| 36 | + -h, --help Show this help message and exit. |
| 37 | + -V, --version Print version information and exit. |
| 38 | + --verbose More verbose output. Show the stack trace of the errors. |
| 39 | +Commands: |
| 40 | + datanode Tools to repair Datanode |
| 41 | + ldb Operational tool to repair ldb. |
| 42 | + om Operational tool to repair OM. |
| 43 | + scm Operational tool to repair SCM. |
| 44 | +``` |
| 45 | +For more detailed usage see the output of `--help` for each of the subcommands. |
| 46 | + |
| 47 | +## ozone repair datanode |
| 48 | +Operational tool to repair datanode. |
| 49 | + |
| 50 | +### upgrade-container-schema |
| 51 | +Upgrade all schema V2 containers to schema V3 for a datanode in offline mode. |
| 52 | +Optionally takes a `--volume` option to specify which volume needs the upgrade. |
| 53 | + |
| 54 | +## ozone repair ldb |
| 55 | +Operational tool to repair ldb. |
| 56 | + |
| 57 | +### compact |
| 58 | +Compact a column family in the DB to clean up tombstones while the service is offline. |
| 59 | +```bash |
| 60 | +Usage: ozone repair ldb compact [-hV] [--dry-run] [--force] [--verbose] |
| 61 | + --cf=<columnFamilyName> --db=<dbPath> |
| 62 | +CLI to compact a column-family in the DB while the service is offline. |
| 63 | +Note: If om.db is compacted with this tool then it will negatively impact the |
| 64 | +Ozone Manager's efficient snapshot diff. |
| 65 | + --cf, --column-family, --column_family=<columnFamilyName> |
| 66 | + Column family name |
| 67 | + --db=<dbPath> Database File Path |
| 68 | +``` |
| 69 | +
|
| 70 | +## ozone repair om |
| 71 | +Operational tool to repair OM. |
| 72 | +
|
| 73 | +#### Subcommands under OM |
| 74 | +- fso-tree |
| 75 | +- snapshot |
| 76 | +- update-transaction |
| 77 | +- quota |
| 78 | +- compact |
| 79 | +- skip-ratis-transaction |
| 80 | +
|
| 81 | +### fso-tree |
| 82 | +Identify and repair a disconnected FSO tree by marking unreferenced entries for deletion. |
| 83 | +Reports the reachable, unreachable (pending delete) and unreferenced (orphaned) directories and files. |
| 84 | +OM should be stopped while this tool is run. |
| 85 | +```bash |
| 86 | +Usage: ozone repair om fso-tree [-hV] [--dry-run] [--force] [--verbose] |
| 87 | + [-b=<bucketFilter>] --db=<omDBPath> |
| 88 | + [-v=<volumeFilter>] |
| 89 | +Identify and repair a disconnected FSO tree by marking unreferenced entries for |
| 90 | +deletion. OM should be stopped while this tool is run. |
| 91 | + -b, --bucket=<bucketFilter> |
| 92 | + Filter by bucket name |
| 93 | + --db=<omDBPath> Path to OM RocksDB |
| 94 | + -v, --volume=<volumeFilter> |
| 95 | + Filter by volume name. Add '/' before the volume name. |
| 96 | +``` |
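As a sketch of how the filters combine, the following restricts an `fso-tree` preview to one bucket. The volume name `vol1`, bucket name `bucket1`, and DB path are hypothetical. The leading `/` on the volume filter follows the help text above.

```bash
OM_DB=/data/metadata/om.db   # hypothetical path to the OM RocksDB
VOLUME=vol1                  # hypothetical volume name
BUCKET=bucket1               # hypothetical bucket name

# The volume filter expects a '/' before the volume name.
set -- ozone repair om fso-tree --db="$OM_DB" \
  --volume="/$VOLUME" --bucket="$BUCKET" --dry-run
fso_cmd="$*"
echo "$fso_cmd"

# Run the preview only when the ozone CLI is installed.
if command -v ozone >/dev/null 2>&1; then
  "$@"
fi
```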
| 97 | +
|
| 98 | +### snapshot |
| 99 | +Subcommand for all snapshot-related repairs. |
| 100 | +
|
| 101 | +#### chain |
| 102 | +Update the global previous and path previous snapshot IDs of a snapshot when the snapshot chain is corrupted. |
| 103 | +```bash |
| 104 | +Usage: ozone repair om snapshot chain [-hV] [--dry-run] [--force] [--verbose] |
| 105 | + --db=<dbPath> |
| 106 | + --gp=<globalPreviousSnapshotId> |
| 107 | + --pp=<pathPreviousSnapshotId> <value> |
| 108 | + <snapshotName> |
| 109 | +CLI to update global and path previous snapshot for a snapshot in case snapshot |
| 110 | +chain is corrupted. |
| 111 | + <value> URI of the bucket (format: volume/bucket). |
| 112 | + <snapshotName> Snapshot name to update |
| 113 | + --db=<dbPath> Database File Path |
| 114 | + --gp, --global-previous=<globalPreviousSnapshotId> |
| 115 | + Global previous snapshotId to set for the given snapshot |
| 116 | + --pp, --path-previous=<pathPreviousSnapshotId> |
| 117 | + Path previous snapshotId to set for the given snapshot |
| 118 | +``` |
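A hedged example of assembling a `snapshot chain` invocation. The bucket URI, snapshot name, DB path, and both snapshot IDs are placeholders, and the IDs are assumed here to be UUID-formatted. The two positional parameters follow the order shown in the usage text: bucket URI first, then snapshot name.

```bash
OM_DB=/data/metadata/om.db                           # hypothetical DB path
BUCKET_URI=vol1/bucket1                              # format: volume/bucket
SNAPSHOT_NAME=snap3                                  # hypothetical snapshot to fix
GLOBAL_PREV_ID=11111111-1111-1111-1111-111111111111  # placeholder ID
PATH_PREV_ID=22222222-2222-2222-2222-222222222222    # placeholder ID

# Options first, then the bucket URI and snapshot name as positionals.
set -- ozone repair om snapshot chain --db="$OM_DB" \
  --gp="$GLOBAL_PREV_ID" --pp="$PATH_PREV_ID" --dry-run \
  "$BUCKET_URI" "$SNAPSHOT_NAME"
chain_cmd="$*"
echo "$chain_cmd"

# Run the preview only when the ozone CLI is installed.
if command -v ozone >/dev/null 2>&1; then
  "$@"
fi
```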
| 119 | +
|
| 120 | +### update-transaction |
| 121 | +To avoid modifying Ratis logs and only update the latest applied transaction, use the `update-transaction` command. |
| 122 | +This updates the highest transaction index in the OM transaction info table. |
| 123 | +```bash |
| 124 | +Usage: ozone repair om update-transaction [-hV] [--dry-run] [--force] |
| 125 | + [--verbose] --db=<dbPath> --index=<highestTransactionIndex> |
| 126 | + --term=<highestTransactionTerm> |
| 127 | +CLI to update the highest index in transaction info table. |
| 128 | + --db=<dbPath> Database File Path |
| 129 | + --index=<highestTransactionIndex> |
| 130 | + Highest index to set. The input should be non-zero long |
| 131 | + integer. |
| 132 | + --term=<highestTransactionTerm> |
| 133 | + Highest term to set. The input should be non-zero long |
| 134 | + integer. |
| 135 | +``` |
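Since both `--index` and `--term` must be non-zero, a wrapper script might check them before calling the tool. The sketch below is one way to do that. `is_nonzero_int` is a hypothetical helper that accepts only positive decimal digits (a simplification of "non-zero long integer"), and the path, index, and term values are made up.

```bash
# Hypothetical check: succeed only for a non-zero integer made of decimal digits.
is_nonzero_int() {
  case "$1" in
    ''|*[!0-9]*) return 1 ;;  # reject empty input or any non-digit character
  esac
  [ "$1" -ne 0 ]
}

OM_DB=/data/metadata/om.db   # hypothetical DB path
TXN_INDEX=12345              # hypothetical highest transaction index
TXN_TERM=3                   # hypothetical highest transaction term

if is_nonzero_int "$TXN_INDEX" && is_nonzero_int "$TXN_TERM"; then
  set -- ozone repair om update-transaction --db="$OM_DB" \
    --index="$TXN_INDEX" --term="$TXN_TERM" --dry-run
  update_cmd="$*"
  echo "$update_cmd"
  # Run the preview only when the ozone CLI is installed.
  if command -v ozone >/dev/null 2>&1; then
    "$@"
  fi
else
  echo "index and term must be non-zero integers" >&2
fi
```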
| 136 | +
|
| 137 | +### quota |
| 138 | +Operational tool to repair quota in OM DB. |
| 139 | +
|
| 140 | +#### start |
| 141 | +To trigger quota repair, use the `start` command. |
| 142 | +```bash |
| 143 | +Usage: ozone repair om quota start [-hV] [--dry-run] [--force] [--verbose] |
| 144 | + [--buckets=<buckets>] |
| 145 | + [--service-host=<omHost>] |
| 146 | + [--service-id=<omServiceId>] |
| 147 | +CLI to trigger quota repair. |
| 148 | + --buckets=<buckets> start quota repair for specific buckets. Input will |
| 149 | + be list of uri separated by comma as |
| 150 | + /<volume>/<bucket>[,...] |
| 151 | + --service-host=<omHost> |
| 152 | + Ozone Manager Host. If OM HA is enabled, use |
| 153 | + --service-id instead. If you must use |
| 154 | + --service-host with OM HA, this must point |
| 155 | + directly to the leader OM. This option is |
| 156 | + required when --service-id is not provided or |
| 157 | + when HA is not enabled. |
| 158 | + --service-id, --om-service-id=<omServiceId> |
| 159 | + Ozone Manager Service ID |
| 160 | +``` |
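The `--buckets` value is a comma-separated list of `/<volume>/<bucket>` URIs. A small sketch of building that list follows. `join_buckets`, the bucket names, and the service ID `omservice` are assumptions made for illustration.

```bash
# Hypothetical helper: join its arguments with commas.
join_buckets() {
  list=""
  for b in "$@"; do
    # Add a comma separator only when the list is already non-empty.
    list="${list:+$list,}$b"
  done
  echo "$list"
}

BUCKETS=$(join_buckets /vol1/bucket1 /vol1/bucket2)   # hypothetical buckets
OM_SERVICE_ID=omservice                               # hypothetical OM service ID

set -- ozone repair om quota start --service-id="$OM_SERVICE_ID" \
  --buckets="$BUCKETS" --dry-run
quota_cmd="$*"
echo "$quota_cmd"

# Run the preview only when the ozone CLI is installed.
if command -v ozone >/dev/null 2>&1; then
  "$@"
fi
```

For a non-HA deployment, `--service-host` would replace `--service-id`, as described in the help text.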
| 161 | +
|
| 162 | +#### status |
| 163 | +Get the status of the last triggered quota repair. |
| 164 | +```bash |
| 165 | +Usage: ozone repair om quota status [-hV] [--verbose] [--service-host=<omHost>] |
| 166 | + [--service-id=<omServiceId>] |
| 167 | +CLI to get the status of last trigger quota repair if available. |
| 168 | + --service-host=<omHost> |
| 169 | + Ozone Manager Host. If OM HA is enabled, use --service-id |
| 170 | + instead. If you must use --service-host with OM HA, this |
| 171 | + must point directly to the leader OM. This option is |
| 172 | + required when --service-id is not provided or when HA is |
| 173 | + not enabled. |
| 174 | + --service-id, --om-service-id=<omServiceId> |
| 175 | + Ozone Manager Service ID |
| 176 | +``` |
| 177 | +
|
| 178 | +### compact |
| 179 | +Compact a column family in the OM DB to clean up tombstones. The compaction happens asynchronously. Requires admin privileges. |
| 180 | +```bash |
| 181 | +Usage: ozone repair om compact [-hV] [--dry-run] [--force] [--verbose] |
| 182 | + --cf=<columnFamilyName> [--node-id=<nodeId>] |
| 183 | + [--service-id=<omServiceId>] |
| 184 | +CLI to compact a column family in the om.db. The compaction happens |
| 185 | +asynchronously. Requires admin privileges. |
| 186 | + --cf, --column-family, --column_family=<columnFamilyName> |
| 187 | + Column family name |
| 188 | + --node-id=<nodeId> NodeID of the OM for which db needs to be compacted. |
| 189 | + --service-id, --om-service-id=<omServiceId> |
| 190 | + Ozone Manager Service ID |
| 191 | +``` |
| 192 | +
|
| 193 | +### skip-ratis-transaction, srt |
| 194 | +Omit a Raft log entry in a Ratis segment file by replacing the entry at the specified index with a dummy EchoOM command. |
| 195 | +This is an offline tool meant to be used only when all 3 OMs crash on the same transaction. |
| 196 | +If the issue is isolated to one OM, manually copy the DB from a healthy OM instead. |
| 197 | +```bash |
| 198 | +Usage: ozone repair om skip-ratis-transaction [-hV] [--dry-run] [--force] |
| 199 | + [--verbose] -b=<backupDir> --index=<index> (-s=<segmentFile> | |
| 200 | + -d=<logDir>) |
| 201 | +CLI to omit a raft log in a ratis segment file. The raft log at the index |
| 202 | +specified is replaced with an EchoOM command (which is a dummy command). It is |
| 203 | +an offline command i.e., doesn't require OM to be running. The command should |
| 204 | +be run for the same transaction on all 3 OMs only when all the OMs are crashing |
| 205 | +while applying the same transaction. If only one OM is crashing and the other |
| 206 | +OMs have executed the log successfully, then the DB should be manually copied |
| 207 | +from one of the good OMs to the crashing OM instead. |
| 208 | + -b, --backup=<backupDir> Directory to put the backup of the original |
| 209 | + repaired segment file before the repair. |
| 210 | + -d, --ratis-log-dir=<logDir> |
| 211 | + Path of the ratis log directory |
| 212 | + --index=<index> Index of the failing transaction that should be |
| 213 | + removed |
| 214 | + -s, --segment-path=<segmentFile> |
| 215 | + Path of the input segment file |
| 216 | +``` |
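A sketch of previewing the skip of a single transaction, using the log-directory form of the command. The directory paths and the index value are hypothetical. The same invocation would need to be repeated on each OM host, since the tool works on local files.

```bash
RATIS_LOG_DIR=/data/metadata/om-ratis-log   # hypothetical Ratis log directory
BACKUP_DIR=/data/backup/ratis-segments      # hypothetical backup location
FAILING_INDEX=4567                          # hypothetical failing transaction index

# -d/--ratis-log-dir and -s/--segment-path are alternatives; this uses the former.
set -- ozone repair om skip-ratis-transaction \
  --ratis-log-dir="$RATIS_LOG_DIR" --backup="$BACKUP_DIR" \
  --index="$FAILING_INDEX" --dry-run
srt_cmd="$*"
echo "$srt_cmd"

# Run the preview only when the ozone CLI is installed.
if command -v ozone >/dev/null 2>&1; then
  "$@"
fi
```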
| 217 | +
|
| 218 | +## ozone repair scm |
| 219 | +Operational tool to repair SCM. |
| 220 | +
|
| 221 | +#### Subcommands under SCM |
| 222 | +- cert |
| 223 | +- update-transaction |
| 224 | +
|
| 225 | +### cert |
| 226 | +Subcommand for all certificate-related repairs on SCM. |
| 227 | +
|
| 228 | +#### recover |
| 229 | +Recover a deleted SCM certificate from RocksDB. |
| 230 | +```bash |
| 231 | +Usage: ozone repair scm cert recover [-hV] [--dry-run] [--force] [--verbose] |
| 232 | + --db=<dbPath> |
| 233 | +Recover Deleted SCM Certificate from RocksDB |
| 234 | + --db=<dbPath> SCM DB Path |
| 235 | +``` |
| 236 | +
|
| 237 | +### update-transaction |
| 238 | +To avoid modifying Ratis logs and only update the latest applied transaction, use the `update-transaction` command. |
| 239 | +This updates the highest transaction index in the SCM transaction info table. |
| 240 | +```bash |
| 241 | +Usage: ozone repair scm update-transaction [-hV] [--dry-run] [--force] |
| 242 | + [--verbose] --db=<dbPath> --index=<highestTransactionIndex> |
| 243 | + --term=<highestTransactionTerm> |
| 244 | +CLI to update the highest index in transaction info table. |
| 245 | + --db=<dbPath> Database File Path |
| 246 | + --index=<highestTransactionIndex> |
| 247 | + Highest index to set. The input should be non-zero long |
| 248 | + integer. |
| 249 | + --term=<highestTransactionTerm> |
| 250 | + Highest term to set. The input should be non-zero long |
| 251 | + integer. |
| 252 | +``` |