modules/manage/partials/whole-cluster-restore.adoc
16 additions & 60 deletions
@@ -12,14 +12,14 @@ endif::[]
include::shared:partial$enterprise-license.adoc[]
====
-With xref:{link-tiered-storage}[Tiered Storage] enabled, you can use Whole Cluster Restore to restore data from a failed cluster (source cluster), including its metadata, onto a new cluster (target cluster). This is a simpler and cheaper alternative to active-active replication, for example with xref:migrate:data-migration.adoc[MirrorMaker 2]. Use this recovery method to restore your application to the latest functional state as quickly as possible.
+With xref:{link-tiered-storage}[Tiered Storage] enabled, you can use Whole Cluster Restore to restore data from a failed cluster (source cluster you are restoring from), including its metadata, onto a new cluster (target cluster you are restoring to). This is a simpler and cheaper alternative to active-active replication, for example with xref:migrate:data-migration.adoc[MirrorMaker 2]. Use this recovery method to restore your application to the latest functional state as quickly as possible.
[CAUTION]
====
Whole Cluster Restore is not a fully functional disaster recovery solution. It does not provide snapshot-style consistency. Some partitions in some topics will be more up-to-date than others. Committed transactions are not guaranteed to be atomic.
====
-TIP: If you need to restore only a subset of topic data, consider using xref:deploy:redpanda/manual/disaster-recovery/topic-recovery.adoc[topic recovery] instead of a Whole Cluster Restore.
+TIP: If you need to restore only a subset of topic data, consider using xref:manage:disaster-recovery/topic-recovery.adoc[topic recovery] instead of a Whole Cluster Restore.
The following metadata is included in a Whole Cluster Restore:
@@ -227,74 +227,34 @@ endif::[]
When the cluster restore is successfully completed, you can redirect your application workload to the new cluster. Make sure to update your application code to use the new addresses of your brokers.
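Before cutting production traffic over, it can help to confirm that the target cluster responds at its new addresses. The following is a minimal sketch, assuming `rpk` is installed and using a placeholder seed broker address for the target cluster:

[,bash]
----
# Placeholder address; substitute a seed broker of your target cluster.
rpk cluster info -X brokers=new-broker-0.example.com:9092
----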
-== (Advanced) Restore data when multiple clusters share data
+== Advanced: Restore data when multiple clusters share data
[CAUTION]
====
-This is an advanced use case and should be performed only after consulting with Redpanda support.
+This is an advanced use case that should be performed only by Redpanda support.
====
-Typically, there is a one-to-one mapping between a Redpanda cluster and its object storage bucket. However, you can also run multiple clusters that share the same bucket. This allows you to move tenants between clusters without moving data, as the data remains in the same bucket. For example, you can mount topics to multiple clusters in the same bucket.
+Typically, you have a one-to-one mapping between a Redpanda cluster and its object storage bucket. However, it's possible to run multiple clusters that share the same storage bucket. Sharing a bucket allows you to move tenants between clusters without moving data, because the data stays in the same bucket. For example, you might mount topics to multiple clusters in the same bucket.
-Running multiple clusters that share the same storage bucket presents unique challenges during Whole Cluster Restore operations. To manage these challenges, you must first understand how Redpanda uses <<the-role-of-cluster-uuids-in-whole-cluster-restore,UUIDs>> (universal unique identifiers) to identify clusters during Whole Cluster Restore.
+Running multiple clusters that share the same storage bucket presents unique challenges during Whole Cluster Restore operations. To manage these challenges, you must understand how Redpanda uses <<the-role-of-cluster-uuids-in-whole-cluster-restore,UUIDs>> (universally unique identifiers) to identify clusters during a Whole Cluster Restore.
=== The role of cluster UUIDs in Whole Cluster Restore
-Every time a Redpanda cluster (single node or more) starts, it is automatically assigned a random UUID. From that moment forward, all entities created by the cluster are identifiable using that cluster UUID. Such entities include:
+Each Redpanda cluster (single node or more) receives a unique UUID every time it starts. From that moment forward, all entities created by the cluster are identifiable using this cluster UUID. These entities include:
- Topic data
- Topic metadata
- Whole Cluster Restore manifests
- Controller log snapshots for Whole Cluster Restore
- Consumer offsets for Whole Cluster Restore
-However, not all entities _managed_ by the cluster are identifiable using this cluster UUID. In fact, Redpanda can recover a different cluster in lieu of the existing cluster, or mount topics from different clusters. For a cluster that has been running for some time, your object storage may look like this:
-
-[source,bash]
-----
-/
-+- cluster_metadata/
-   +- <uuid-a>/manifests/
-   |  +- 0/cluster_manifest.json
-   |  +- 1/cluster_manifest.json
-   |  +- 2/cluster_manifest.json
-   + <uuid-b>/manifests/
-   |  +- 3/cluster_manifest.json
-   |  +- 4/cluster_manifest.json
-   + <uuid-c>/manifests/ # Previously active but not restored.
-   |                     # Still, the manifest number starts at
-   |                     # highest found in the bucket plus one.
-   |  +- 5/cluster_manifest.json
-   |  +- 6/cluster_manifest.json
-   + <uuid-d>/manifests/ # active cluster (not restored)
-      +- 7/cluster_manifest.json
-      +- 8/cluster_manifest.json
-----
-
-Redpanda's algorithm lists all objects (cluster manifests) from object storage and during a Whole Cluster Restore, picks the object with the _highest ID available_, not the current UUID. In this case, if you attempt to restore you would recover `/cluster_metadata/<uuid-c>/manifests/6/cluster_manifest.json`, even though the active cluster is `<uuid-d>`.
-
-However, this algorithm does not work if you have multiple clusters sharing the same object storage bucket. For example, your object storage might look like:
-
-[source,bash]
-----
-/
-+- cluster_metadata/
-   + <uuid-a>/manifests/
-   |  +- 0/cluster_manifest.json
-   |  +- 1/cluster_manifest.json
-   |  +- 2/cluster_manifest.json
-   + <uuid-b>/manifests/
-      +- 0/cluster_manifest.json
-      +- 1/cluster_manifest.json (lost cluster)
-----
-
-Here, if you've lost the cluster `uuid-b` and wish to recover it, the recovery process will select the metadata for `uuid-a`, which will lead to a split-brain/data corruption scenario. For troubleshooting details, see <<resolve-repeated-recovery-failures,Resolve repeated recovery failures>>
+However, not all entities _managed_ by the cluster are identifiable using this cluster UUID. Each time a cluster uploads its metadata, the object name has two parts: the cluster UUID, which is unique each time you create a cluster (even after a restore, the cluster has a new UUID), and a metadata (sequence) ID. When performing a restore, Redpanda scans the bucket to find the highest sequence ID uploaded by the cluster. If the highest sequence ID was uploaded by another cluster, it can be ambiguous which metadata to restore, which can result in a split-brain scenario, where two independent clusters both believe they are the “rightful owner” of the same logical data.
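To see which cluster UUID currently owns the highest sequence ID, you can list the metadata prefix in the shared bucket before restoring. This is a minimal sketch, assuming an S3-compatible bucket with a placeholder name and a configured AWS CLI; the key layout follows the `cluster_metadata/<cluster-uuid>/manifests/<sequence-id>/cluster_manifest.json` structure described in this doc:

[,bash]
----
# Placeholder bucket name; substitute your shared bucket.
# The listing shows every cluster manifest, so you can compare
# cluster UUIDs and their sequence IDs before restoring.
aws s3 ls s3://redpanda-shared-bucket/cluster_metadata/ --recursive | grep cluster_manifest.json
----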
=== Configure cluster names for multiple source clusters
-To disambiguate cluster metadata from multiple clusters, use the xref:reference:properties/object-storage-properties.adoc#cloud_storage_cluster_name[`cloud_storage_cluster_name`] property (off by default), which allows you to assign a unique name to each cluster sharing the same object storage bucket. This name must be unique within the bucket, 1-64 characters, and use only letters, numbers, underscores, and hyphens. Do not change this value once set. Once set, your object storage bucket may look like this:
+To disambiguate cluster metadata from multiple clusters, use the xref:reference:properties/object-storage-properties.adoc#cloud_storage_cluster_name[`cloud_storage_cluster_name`] property (off by default), which allows you to assign a unique name to each cluster sharing the same object storage bucket. Redpanda uses this name to organize the cluster metadata within the shared object storage bucket, which keeps each cluster's data distinct and prevents conflicts during recovery operations. The name must be unique within the bucket, 1-64 characters, and use only letters, numbers, underscores, and hyphens. Do not change this value once set (a short sketch of setting the property follows the layout example below). After setting it, your object storage bucket organization may look like the following:
-[source,bash]
+[,bash]
----
/
+- cluster_metadata/
@@ -310,9 +270,9 @@ To disambiguate cluster metadata from multiple clusters, use the xref:reference:
+- rp-qux/uuid/<uuid-b>
----
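The following is a minimal sketch of assigning the example name `rp-qux` to a cluster with `rpk`; it assumes you run it against the cluster before that cluster uploads metadata to the shared bucket:

[,bash]
----
# Assign the example cluster name used in this section.
rpk cluster config set cloud_storage_cluster_name rp-qux

# Confirm the value.
rpk cluster config get cloud_storage_cluster_name
----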
-When a new cluster is created, and you have specified its `cloud_storage_cluster_name` (here, `rp-qux`), your object storage bucket may look like this:
+During a Whole Cluster Restore, Redpanda looks for the cluster name specified in `cloud_storage_cluster_name` and only considers manifests associated with that name. Because the name specified here is `rp-qux`, Redpanda only considers manifests for the clusters `<uuid-b>` and `<uuid-c>`, ignoring cluster `<uuid-a>` entirely. In this case, your object storage bucket may look like the following:
-[source,bash]
+[,bash]
----
+- cluster_metadata/
| + <uuid-a>/manifests/
@@ -332,15 +292,11 @@ When a new cluster is created, and you have specified its `cloud_storage_cluster
+- <uuid-c> # reference to new cluster
----
-During a Whole Cluster Restore, Redpanda will look for the cluster name specified in `cloud_storage_cluster_name` and only consider manifests associated with that name. In this example, if you start a cluster with `cloud_storage_cluster_name` set to `rp-qux`, Redpanda will only consider manifests under `<uuid-b>` and `<uuid-c>`, ignoring `<uuid-a>` entirely.
-
-Redpanda uses this name to organize the cluster metadata within the shared object storage bucket. This ensures that each cluster's data remains distinct and prevents conflicts during recovery operations.
-
=== Resolve repeated recovery failures
-If you are experiencing repeated failures when a cluster is lost and recreated, the automated recovery algorithm may have selected the manifest with the highest sequence number, which might be the most recent one with no data, instead of the original one that contains the data. Your object storage bucket might look like this:
+If you experience repeated failures when a cluster is lost and recreated, the automated recovery algorithm may have selected the manifest with the highest sequence number, which might be the most recent one with no data, instead of the original one containing the data. In such a scenario, your object storage bucket might be organized like the following:
-[source,bash]
+[,bash]
----
/
+- cluster_metadata/
@@ -356,11 +312,11 @@ If you are experiencing repeated failures when a cluster is lost and recreated,
In such cases, you can explicitly run a POST request using the Admin API: