# Disaster Recovery
## Backup
The state of a Hopsworks cluster is split between data and metadata and distributed across multiple services. This section explains how to take consistent backups for the offline and online feature stores as well as cluster metadata.
In Hopsworks, a consistent backup should back up the following services:
- **RonDB**: cluster metadata and the online feature store data.
- **HopsFS**: offline feature store data plus checkpoints and logs for feature engineering applications.
- **Opensearch**: search metadata, logs, dashboards, and user embeddings.
- **Kubernetes objects**: cluster credentials, backup metadata, serving metadata, and project namespaces with service accounts, roles, secrets, and configmaps.
- **Python environments**: custom project environments are stored in your configured container registry. Back up the registry separately. If a project and its environment are deleted, you must recreate the environment after restore.
Besides the above services, Hopsworks also uses Apache Kafka, which carries in-flight data heading to the online feature store. In the event of a total cluster loss, running jobs with in-flight data must be replayed.
### Prerequisites
When enabling backup in Hopsworks, cron jobs are configured for RonDB and Opensearch. For HopsFS, backups rely on versioning in the object store. For Kubernetes objects, Hopsworks uses Velero to snapshot the required resources. Before enabling backups:
- Enable versioning on the S3-compatible bucket used for HopsFS (see the example after this list).
- Install and configure Velero with the AWS plugin (S3).
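On AWS S3, for example, versioning can be enabled with the AWS CLI; the bucket name below is a placeholder:

```bash
# Enable object versioning on the bucket backing HopsFS
aws s3api put-bucket-versioning \
  --bucket <hopsworks-bucket> \
  --versioning-configuration Status=Enabled
```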
#### Install Velero
Velero provides backup and restore for Kubernetes resources. Install it with either the Velero CLI or Helm (Velero docs: [Velero basic install guide](https://velero.io/docs/v1.17/basic-install/)).
- Using the Velero CLI, set up the CRDs and deployment:
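  The exact invocation depends on your environment; a minimal sketch of a CLI-based install with the AWS plugin, where the plugin version, bucket, region, and credentials file are placeholders, is:

```bash
# Install the Velero server components and CRDs into the cluster
velero install \
  --provider aws \
  --plugins velero/velero-plugin-for-aws:<version> \
  --bucket <velero-bucket> \
  --secret-file ./credentials-velero \
  --backup-location-config region=<region>
```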
Backup is only supported for clusters that use S3-compatible object storage.
You can enable backups during installation or a later upgrade. Set the schedule with a cron expression in the values file:
```yaml
global:
  _hopsworks:
    backups:
      enabled: true
      schedule: "@weekly"
```
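If the `@weekly` shorthand is too coarse, a standard five-field cron expression should work as well, e.g. `schedule: "0 3 * * 0"` to run the backup every Sunday at 03:00.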
After configuring backups, go to the cluster settings and open the Backup tab. You should see `enabled` at the top level and for all services if everything is configured correctly.
Use the backup time-to-live (`ttl`) flag to automatically prune backups older than the configured retention period:
```yaml
global:
  _hopsworks:
    backups:
      enabled: true
      schedule: "@weekly"
      ttl: 60d
```
For S3 object storage, you can also configure a bucket lifecycle policy to expire old object versions. Example for AWS S3:
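A minimal sketch using the AWS CLI, where the bucket name and the 60-day retention window are illustrative assumptions:

```bash
# Expire noncurrent object versions 60 days after they are superseded
aws s3api put-bucket-lifecycle-configuration \
  --bucket <hopsworks-bucket> \
  --lifecycle-configuration '{
    "Rules": [{
      "ID": "ExpireOldBackupVersions",
      "Status": "Enabled",
      "Filter": {"Prefix": ""},
      "NoncurrentVersionExpiration": {"NoncurrentDays": 60}
    }]
  }'
```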
## Restore
!!! Note
    Restore is only supported in a newly created cluster; in-place restore is not supported.
The restore process has two phases:

- Restore the Kubernetes objects that were backed up with Velero.
- Install the cluster with Helm using the correct backup IDs.
### Restore Kubernetes objects
Restore the Kubernetes objects that were backed up using Velero.
- Ensure that Velero is installed and configured with the AWS plugin as described in the [prerequisites](#prerequisites).
- Set up a [Velero backup storage location](https://velero.io/docs/v1.17/api-types/backupstoragelocation/) to point to the S3 bucket.
- If you are using AWS S3:
```bash
kubectl apply -f - <<EOF
apiVersion: velero.io/v1
kind: BackupStorageLocation
metadata:
  name: hopsworks-bsl   # assumed resource name; use a name of your choosing
  namespace: velero
spec:
  provider: aws
  objectStorage:
    bucket: <bucket-name>   # placeholder: the S3 bucket holding the backups
    prefix: k8s_backup
  config:
    region: <region>        # placeholder: the bucket's AWS region
EOF
```
- If you are using an S3-compatible object storage, provide credentials and endpoint:
```bash
# Write the credentials of the S3-compatible object store to a file
# (the access key values are placeholders)
cat << EOF > hopsworks-bsl-credentials
[default]
aws_access_key_id=<access-key-id>
aws_secret_access_key=<secret-access-key>
EOF

# Assumed intermediate step: expose the credentials to Velero as a secret
kubectl create secret generic hopsworks-bsl-credentials \
  --namespace velero \
  --from-file=cloud=hopsworks-bsl-credentials

kubectl apply -f - <<EOF
apiVersion: velero.io/v1
kind: BackupStorageLocation
metadata:
  name: hopsworks-bsl   # assumed resource name
  namespace: velero
spec:
  provider: aws
  credential:
    name: hopsworks-bsl-credentials
    key: cloud
  config:
    region: <region>        # placeholder; many S3-compatible stores accept any value
    s3ForcePathStyle: "true"
    s3Url: <endpoint-url>   # placeholder: the object store endpoint
  objectStorage:
    bucket: <bucket-name>   # placeholder
    prefix: k8s_backup
EOF
```
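- You can verify that the backup storage location has become available with the Velero CLI, e.g. `velero backup-location get`.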
- After the backup storage location becomes available, restore the backups. The following script restores the latest available backup. To restore a specific backup, set `backupName` instead of `scheduleName`.
```bash
# Sketch of the restore flow; the schedule name and resource names are assumptions
RESTORE_SUFFIX=$(date +%s)

kubectl apply -f - <<EOF
apiVersion: velero.io/v1
kind: Restore
metadata:
  name: k8s-backups-users-resources-restore-$RESTORE_SUFFIX
  namespace: velero
spec:
  scheduleName: <velero-schedule-name>   # restores the latest backup of this schedule
EOF

# Wait for the restore to reach the Completed phase
until [ "$(kubectl get restore k8s-backups-users-resources-restore-$RESTORE_SUFFIX \
  --namespace velero -o jsonpath='{.status.phase}')" = "Completed" ]; do
  sleep 5
done
```
### Restore on Cluster installation
To restore a cluster during installation, configure the backup ID in the values YAML file:
```yaml
global:
  _hopsworks:
    backups:
      enabled: true
      schedule: "@weekly"
    restoreFromBackup:
      backupId: "254811200"
```
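For illustration, assuming a release name, chart reference, and namespace of your choosing, the install command then simply consumes this values file:

```bash
# Placeholders: release name, chart reference, and namespace are assumptions
helm install <release-name> <hopsworks-chart> \
  --namespace hopsworks --create-namespace \
  --values values.yaml
```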
#### Customizations
!!! Warning
    Even if you override the backup IDs for RonDB and Opensearch, you must still set `.global._hopsworks.restoreFromBackup.backupId` to ensure HopsFS is restored.
To restore a different backup ID for RonDB:
```yaml
global:
  _hopsworks:
    backups:
      enabled: true
      schedule: "@weekly"
    restoreFromBackup:
      backupId: "254811200"
      # ...
```

To restore a different backup for Opensearch:
```yaml
global:
  _hopsworks:
    backups:
      enabled: true
      schedule: "@weekly"
    restoreFromBackup:
      backupId: "254811200"
      # ...
```

You can also customize the Opensearch restore process to skip specific indices.