Commit ec1df90: fix linting (parent c423c31)

1 file changed: docs/setup_installation/admin/ha-dr/dr.md (+38, −23 lines)

# Disaster Recovery

## Backup
The state of a Hopsworks cluster is split between data and metadata and distributed across multiple services. This section explains how to take consistent backups for the offline and online feature stores as well as cluster metadata.

In Hopsworks, a consistent backup should include the following services:

- **RonDB**: cluster metadata and the online feature store data.
- **HopsFS**: offline feature store data plus checkpoints and logs for feature engineering applications.
- **Opensearch**: search metadata, logs, dashboards, and user embeddings.
- **Kubernetes objects**: cluster credentials, backup metadata, serving metadata, and project namespaces with service accounts, roles, secrets, and configmaps.
- **Python environments**: custom project environments are stored in your configured container registry. Back up the registry separately. If a project and its environment are deleted, you must recreate the environment after restore.

Besides the services above, Hopsworks also uses Apache Kafka, which carries in-flight data headed to the online feature store. In the event of a total cluster loss, running jobs with in-flight data must be replayed.

### Prerequisites
When enabling backup in Hopsworks, cron jobs are configured for RonDB and Opensearch. For HopsFS, backups rely on versioning in the object store. For Kubernetes objects, Hopsworks uses Velero to snapshot the required resources. Before enabling backups:

- Enable versioning on the S3-compatible bucket used for HopsFS.
- Install and configure Velero with the AWS plugin (S3).
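
For AWS S3, for example, versioning can be enabled from the CLI. This is a sketch: `my-hopsfs-bucket` is a placeholder for the bucket backing HopsFS, and S3-compatible stores may expose a similar API.

```bash
# Enable object versioning on the bucket backing HopsFS
# (bucket name is a placeholder for your installation)
aws s3api put-bucket-versioning \
  --bucket my-hopsfs-bucket \
  --versioning-configuration Status=Enabled

# Verify that versioning is now enabled
aws s3api get-bucket-versioning --bucket my-hopsfs-bucket
```
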

#### Install Velero
Velero provides backup and restore for Kubernetes resources. Install it with either the Velero CLI or Helm (Velero docs: [Velero basic install guide](https://velero.io/docs/v1.17/basic-install/)).

- Using the Velero CLI, set up the CRDs and deployment:

```bash
velero install \
  --plugins velero/velero-plugin-for-aws:v1.13.0 \
  ...
```

- Using the Velero Helm chart:

```bash
helm repo add vmware-tanzu https://vmware-tanzu.github.io/helm-charts
helm repo update
helm install velero vmware-tanzu/velero \
  ...
```

### Configuring Backup
!!! Note
    Backup is only supported for clusters that use S3-compatible object storage.

You can enable backups during installation or a later upgrade. Set the schedule with a cron expression in the values file:

```yaml
global:
  _hopsworks:
    backups:
      enabled: true
      schedule: "@weekly"
```
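
`@weekly` is a standard cron macro. If you prefer an explicit five-field cron expression, the following is equivalent (00:00 every Sunday):

```yaml
global:
  _hopsworks:
    backups:
      enabled: true
      # "0 0 * * 0" fires at 00:00 every Sunday, equivalent to "@weekly"
      schedule: "0 0 * * 0"
```
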
After configuring backups, go to the cluster settings and open the Backup tab. You should see `enabled` at the top level and for all services if everything is configured correctly.
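
Assuming backups were enabled, you can also check from the command line that the backup cron jobs were created. This is a sketch; the namespace and exact job names depend on your installation.

```bash
# List cron jobs in all namespaces and filter for backup-related ones
kubectl get cronjobs --all-namespaces | grep -i backup
```
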

Use the backup time-to-live (`ttl`) flag to automatically prune backups older than the configured period:

```yaml
global:
  _hopsworks:
    backups:
      enabled: true
      schedule: "@weekly"
      ttl: 60d
```
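
To make the `ttl` semantics concrete: with `ttl: 60d`, any backup created more than 60 days ago becomes eligible for pruning. A small sketch (assumes GNU `date`) that prints the current cutoff:

```bash
# Compute the pruning cutoff date for a 60-day TTL (assumes GNU date)
TTL_DAYS=60
cutoff=$(date -u -d "-${TTL_DAYS} days" +%Y-%m-%d)
echo "backups created before ${cutoff} are eligible for pruning"
```
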
9399

For S3 object storage, you can also configure a bucket lifecycle policy to expire old object versions.
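
An illustrative sketch of such a policy for AWS S3 (the rule ID and the 60-day retention are placeholders; align the retention with your backup `ttl`):

```json
{
  "Rules": [
    {
      "ID": "expire-old-backup-versions",
      "Status": "Enabled",
      "Filter": {},
      "NoncurrentVersionExpiration": {
        "NoncurrentDays": 60
      }
    }
  ]
}
```

It can be applied with `aws s3api put-bucket-lifecycle-configuration --bucket <bucket> --lifecycle-configuration file://policy.json`.
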

## Restore
!!! Note
    Restore is only supported in a newly created cluster; in-place restore is not supported.

The restore process has two phases:

- Restore the Kubernetes objects that were backed up using Velero.
- Install the cluster with Helm using the correct backup IDs.

### Restore Kubernetes objects
Restore the Kubernetes objects that were backed up using Velero.

- Ensure that Velero is installed and configured with the AWS plugin as described in the [prerequisites](#prerequisites).
- Set up a [Velero backup storage location](https://velero.io/docs/v1.17/api-types/backupstoragelocation/) to point to the S3 bucket.
- If you are using AWS S3:

```bash
kubectl apply -f - <<EOF
apiVersion: velero.io/v1
kind: BackupStorageLocation
...
    prefix: k8s_backup
EOF
```
- If you are using an S3-compatible object storage, provide credentials and endpoint:

```bash
cat << EOF > hopsworks-bsl-credentials
[default]
...
EOF

kubectl apply -f - <<EOF
...
    prefix: k8s_backup
EOF
```
- After the backup storage location becomes available, restore the backups. The following script restores the latest available backup. To restore a specific backup, set `backupName` instead of `scheduleName`.

```bash
...
kubectl apply -f - <<EOF
apiVersion: velero.io/v1
kind: Restore
metadata:
  name: k8s-backups-users-resources-restore-$RESTORE_SUFFIX
  namespace: velero
spec:
  scheduleName: k8s-backups-users-resources
...
EOF
until [ "$(kubectl get restore k8s-backups-users-resources-restore-$RESTORE_SUFFIX ...)" ... ]; do
  ...
done
```
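
Once the restores have been created, their progress can be followed with the Velero CLI or kubectl (a sketch; the restore names follow the schedule names above):

```bash
# List restores and their phases via the Velero CLI
velero restore get

# Or inspect the Restore resources directly in the velero namespace
kubectl get restore -n velero
```
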

### Restore on Cluster installation
To restore a cluster during installation, configure the backup ID in the values YAML file:

```yaml
global:
  _hopsworks:
    backups:
      enabled: true
      schedule: "@weekly"
    restoreFromBackup:
      backupId: "254811200"
```
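
The values above are then passed to the Helm installation of the new cluster. This is a sketch: the release name, chart reference, and namespace are placeholders for your installation's usual install command.

```bash
# Install the new cluster, passing the values file that contains
# the restoreFromBackup configuration (chart reference is a placeholder)
helm install hopsworks <hopsworks-chart> \
  --namespace hopsworks \
  -f values.yaml
```
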

#### Customizations
!!! Warning
    Even if you override the backup IDs for RonDB and Opensearch, you must still set `.global._hopsworks.restoreFromBackup.backupId` to ensure HopsFS is restored.

To restore a different backup ID for RonDB:

```yaml
global:
  _hopsworks:
    backups:
      enabled: true
      schedule: "@weekly"
    restoreFromBackup:
      backupId: "254811200"
      # ... (RonDB-specific backup ID override)
```
To restore a different backup for Opensearch:

```yaml
global:
  _hopsworks:
    backups:
      enabled: true
      schedule: "@weekly"
    restoreFromBackup:
      backupId: "254811200"
      # ... (Opensearch-specific backup ID override)
```
You can also customize the Opensearch restore process to skip specific indices:

```yaml
global:
  _hopsworks:
    backups:
      enabled: true
      schedule: "@weekly"
    restoreFromBackup:
      backupId: "254811200"
# ...
olk:
  default:
    snapshot_name: "254811140"
    payload:
      indices: "-myindex"
```
