Commit f79fb8c

Introduce repository integrity verification API (#112348)
Adds an API which scans all the metadata (and optionally the raw data) in a snapshot repository to look for corruptions or other inconsistencies. Closes #52622 Closes ES-8560
1 parent 2d09423 commit f79fb8c

26 files changed, +3606 -3 lines changed

docs/changelog/112348.yaml

Lines changed: 6 additions & 0 deletions
@@ -0,0 +1,6 @@
+pr: 112348
+summary: Introduce repository integrity verification API
+area: Snapshot/Restore
+type: enhancement
+issues:
+ - 52622

docs/reference/snapshot-restore/apis/snapshot-restore-apis.asciidoc

Lines changed: 1 addition & 0 deletions
@@ -28,6 +28,7 @@ For more information, see <<snapshot-restore>>.
 include::put-repo-api.asciidoc[]
 include::verify-repo-api.asciidoc[]
 include::repo-analysis-api.asciidoc[]
+include::verify-repo-integrity-api.asciidoc[]
 include::get-repo-api.asciidoc[]
 include::delete-repo-api.asciidoc[]
 include::clean-up-repo-api.asciidoc[]
Lines changed: 232 additions & 0 deletions
@@ -0,0 +1,232 @@
+[role="xpack"]
+[[verify-repo-integrity-api]]
+=== Verify repository integrity API
+++++
+<titleabbrev>Verify repository integrity</titleabbrev>
+++++
+
+Verifies the integrity of the contents of a snapshot repository.
+
+////
+[source,console]
+----
+PUT /_snapshot/my_repository
+{
+  "type": "fs",
+  "settings": {
+    "location": "my_backup_location"
+  }
+}
+----
+// TESTSETUP
+////
+
+[source,console]
+----
+POST /_snapshot/my_repository/_verify_integrity
+----
+
+[[verify-repo-integrity-api-request]]
+==== {api-request-title}
+
+`POST /_snapshot/<repository>/_verify_integrity`
+
+[[verify-repo-integrity-api-prereqs]]
+==== {api-prereq-title}
+
+* If the {es} {security-features} are enabled, you must have the `manage`
+<<privileges-list-cluster,cluster privilege>> to use this API. For more
+information, see <<security-privileges>>.
+
+[[verify-repo-integrity-api-desc]]
+==== {api-description-title}
+
+This API allows you to perform a comprehensive check of the contents of a
+repository, looking for any anomalies in its data or metadata which might
+prevent you from restoring snapshots from the repository or which might cause
+future snapshot create or delete operations to fail.
+
+If you suspect the integrity of the contents of one of your snapshot
+repositories, cease all write activity to this repository immediately, set its
+`read_only` option to `true`, and use this API to verify its integrity. Until
+you do so:
+
+* It may not be possible to <<snapshots-restore-snapshot,restore some
+snapshots>> from this repository.
+
+* <<searchable-snapshots>> may report errors when searched, or may have
+unassigned shards.
+
+* <<snapshots-take-snapshot,Taking snapshots>> into this repository may fail,
+or may appear to succeed having created a snapshot which cannot be restored.
+
+* <<delete-snapshot-api,Deleting snapshots>> from this repository may fail, or
+may appear to succeed leaving the underlying data on disk.
+
+* Continuing to write to the repository while it is in an invalid state may
+cause additional damage to its contents.
+
+If the <<verify-repo-integrity-api>> API finds any problems with the integrity
+of the contents of your repository, {es} will not be able to repair the damage.
+The only way to bring the repository back into a fully working state after its
+contents have been damaged is by restoring its contents from a
+<<snapshots-repository-backup,repository backup>> which was taken before the
+damage occurred. You must also identify what caused the damage and take action
+to prevent it from happening again.
+
+If you cannot restore a repository backup,
+<<snapshots-register-repository,register a new repository>> and use this for
+all future snapshot operations. In some cases it may be possible to recover
+some of the contents of a damaged repository, either by
+<<snapshots-restore-snapshot,restoring>> as many of its snapshots as needed and
+<<snapshots-take-snapshot,taking new snapshots>> of the restored data, or by
+using the <<docs-reindex>> API to copy data from any <<searchable-snapshots>>
+mounted from the damaged repository.
+
+Avoid all operations which write to the repository while the
+<<verify-repo-integrity-api>> API is running. If something changes the
+repository contents while an integrity verification is running then {es} may
+incorrectly report having detected some anomalies in its contents due to the
+concurrent writes. It may also incorrectly fail to report some anomalies that
+the concurrent writes prevented it from detecting.
+
+NOTE: This API is intended for exploratory use by humans. You should expect the
+request parameters and the response format to vary in future versions.
+
+NOTE: This API may not work correctly in a mixed-version cluster.
+
+[[verify-repo-integrity-api-path-params]]
+==== {api-path-parms-title}
+
+`<repository>`::
+(Required, string)
+Name of the snapshot repository whose integrity to verify.
+
+[[verify-repo-integrity-api-query-params]]
+==== {api-query-parms-title}
+
+The default values for the parameters of this API are designed to limit the
+impact of the integrity verification on other activities in your cluster. For
+instance, by default it will only use at most half of the `snapshot_meta`
+threads to verify the integrity of each snapshot, allowing other snapshot
+operations to use the other half of this thread pool.
+
+If you modify these parameters to speed up the verification process, you risk
+disrupting other snapshot-related operations in your cluster. For large
+repositories, consider setting up a separate single-node {es} cluster just for
+running the integrity verification API.
+
+`snapshot_verification_concurrency`::
+(Optional, integer) Specifies the number of snapshots to verify concurrently.
+Defaults to `0` which means to use at most half of the `snapshot_meta` thread
+pool at once.
+
+`index_verification_concurrency`::
+(Optional, integer) Specifies the number of indices to verify concurrently.
+Defaults to `0` which means to use the entire `snapshot_meta` thread pool.
+
+`meta_thread_pool_concurrency`::
+(Optional, integer) Specifies the maximum number of snapshot metadata
+operations to execute concurrently. Defaults to `0` which means to use at most
+half of the `snapshot_meta` thread pool at once.
+
+`index_snapshot_verification_concurrency`::
+(Optional, integer) Specifies the maximum number of index snapshots to verify
+concurrently within each index verification. Defaults to `1`.
+
+`max_failed_shard_snapshots`::
+(Optional, integer) Limits the number of shard snapshot failures to track
+during integrity verification, in order to avoid excessive resource usage. If
+your repository contains more than this number of shard snapshot failures then
+the verification will fail. Defaults to `10000`.
+
+`verify_blob_contents`::
+(Optional, boolean) Specifies whether to verify the checksum of every data blob
+in the repository. Defaults to `false`. If this feature is enabled, {es} will
+read the entire repository contents, which may be extremely slow and expensive.
+
+`blob_thread_pool_concurrency`::
+(Optional, integer) If `?verify_blob_contents` is `true`, this parameter
+specifies how many blobs to verify at once. Defaults to `1`.
+
+`max_bytes_per_sec`::
+(Optional, <<size-units, size units>>)
+If `?verify_blob_contents` is `true`, this parameter specifies the maximum
+amount of data that {es} will read from the repository every second. Defaults
+to `10mb`.
+
+[role="child_attributes"]
+[[verify-repo-integrity-api-response-body]]
+==== {api-response-body-title}
+
+The response exposes implementation details of the analysis which may change
+from version to version. The response body format is therefore not considered
+stable and may be different in newer versions.
+
+`log`::
+(array) A sequence of objects that report the progress of the analysis.
++
+.Properties of `log`
+[%collapsible%open]
+====
+`timestamp_in_millis`::
+(integer) The timestamp of this log entry, represented as the number of
+milliseconds since the {wikipedia}/Unix_time[Unix epoch].
+
+`timestamp`::
+(string) The timestamp of this log entry, represented as a string formatted
+according to {wikipedia}/ISO_8601[ISO 8601]. Only included if the
+<<common-options,`?human`>> flag is set.
+
+`snapshot`::
+(object) If the log entry pertains to a particular snapshot then the snapshot
+will be described in this object.
+
+`index`::
+(object) If the log entry pertains to a particular index then the index will be
+described in this object.
+
+`snapshot_restorability`::
+(object) If the log entry pertains to the restorability of an index then the
+details will be described in this object.
+
+`anomaly`::
+(string) If the log entry pertains to an anomaly in the repository contents then
+this string will describe the anomaly.
+
+`exception`::
+(object) If the log entry pertains to an exception that {es} encountered during
+the verification then the details will be included in this object.
+
+====
+
+`results`::
+(object) An object which describes the final results of the analysis.
++
+.Properties of `results`
+[%collapsible%open]
+====
+`status`::
+(object) The final status of the analysis task.
+
+`final_repository_generation`::
+(integer) The repository generation at the end of the analysis. If there were
+any writes to the repository during the analysis then this value will be
+different from the `generation` reported in the task status, and the analysis
+may have detected spurious anomalies due to the concurrent writes, or may even
+have failed to detect some anomalies in the repository contents.
+
+`total_anomalies`::
+(integer) The total number of anomalies detected during the analysis.
+
+`result`::
+(string) The final result of the analysis. If the repository contents appear to
+be intact then this will be the string `pass`. If this field is missing, or
+contains some other value, then the repository contents were not fully
+verified.
+
+====
+
+`exception`::
+(object) If the analysis encountered an exception which prevented it from
+completing successfully then this exception will be reported here.
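
Reviewer's illustration (not part of the commit): the response fields documented above suggest a simple way to decide whether a verification run can be trusted. The Python sketch below is only one possible reading of that documentation. The function name `summarize_verification` and its return labels are hypothetical, and the caller is assumed to supply the repository generation observed at the start of the run, because the exact location of `generation` inside the task status is not shown in this diff.

```python
def summarize_verification(response, initial_generation=None):
    """Classify a verify-integrity response body (hypothetical helper).

    Returns "pass", "anomalies", "concurrent_writes" or "incomplete".
    """
    # A top-level exception means the verification did not complete.
    if "exception" in response:
        return "incomplete"
    results = response.get("results", {})
    final_generation = results.get("final_repository_generation")
    # A changed generation indicates writes during the run, so reported
    # anomalies may be spurious and real ones may have been missed.
    if initial_generation is not None and final_generation != initial_generation:
        return "concurrent_writes"
    if results.get("result") == "pass":
        return "pass"
    if results.get("total_anomalies", 0) > 0:
        return "anomalies"
    # Missing or unexpected `result`: contents were not fully verified.
    return "incomplete"
```

Checking the generation before the verdict is a deliberate choice in this sketch: the documentation warns that concurrent writes can make either outcome unreliable, so a `pass` obtained while the generation moved is treated as inconclusive.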

docs/reference/snapshot-restore/register-repository.asciidoc

Lines changed: 3 additions & 1 deletion
@@ -272,7 +272,9 @@ filesystem snapshot of this repository.
 When restoring a repository from a backup, you must not register the repository
 with {es} until the repository contents are fully restored. If you alter the
 contents of a repository while it is registered with {es} then the repository
-may become unreadable or may silently lose some of its contents.
+may become unreadable or may silently lose some of its contents. After
+restoring a repository from a backup, use the <<verify-repo-integrity-api>> API
+to verify its integrity before you start to use the repository.
 
 include::repository-azure.asciidoc[]
 include::repository-gcs.asciidoc[]
Lines changed: 65 additions & 0 deletions
@@ -0,0 +1,65 @@
+{
+  "snapshot.repository_verify_integrity":{
+    "documentation":{
+      "url":"https://www.elastic.co/guide/en/elasticsearch/reference/master/modules-snapshots.html",
+      "description":"Verifies the integrity of the contents of a snapshot repository"
+    },
+    "stability":"experimental",
+    "visibility":"private",
+    "headers": {
+      "accept": [
+        "application/json"
+      ]
+    },
+    "url":{
+      "paths":[
+        {
+          "path":"/_snapshot/{repository}/_verify_integrity",
+          "methods":[
+            "POST"
+          ],
+          "parts":{
+            "repository":{
+              "type":"string",
+              "description":"A repository name"
+            }
+          }
+        }
+      ]
+    },
+    "params":{
+      "meta_thread_pool_concurrency":{
+        "type":"number",
+        "description":"Number of threads to use for reading metadata"
+      },
+      "blob_thread_pool_concurrency":{
+        "type":"number",
+        "description":"Number of threads to use for reading blob contents"
+      },
+      "snapshot_verification_concurrency":{
+        "type":"number",
+        "description":"Number of snapshots to verify concurrently"
+      },
+      "index_verification_concurrency":{
+        "type":"number",
+        "description":"Number of indices to verify concurrently"
+      },
+      "index_snapshot_verification_concurrency":{
+        "type":"number",
+        "description":"Number of snapshots to verify concurrently within each index"
+      },
+      "max_failed_shard_snapshots":{
+        "type":"number",
+        "description":"Maximum permitted number of failed shard snapshots"
+      },
+      "verify_blob_contents":{
+        "type":"boolean",
+        "description":"Whether to verify the contents of individual blobs"
+      },
+      "max_bytes_per_sec":{
+        "type":"string",
+        "description":"Rate limit for individual blob verification"
+      }
+    }
+  }
+}
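
Reviewer's illustration (not part of the commit): to show how the path template and the query parameters declared in this specification combine into a request line, here is a minimal Python sketch. The helper `build_verify_integrity_request` is hypothetical and does not belong to any Elasticsearch client; it only assembles the HTTP method and URL path and never contacts a cluster. Rendering booleans as `true`/`false` is an assumption about what the REST layer accepts.

```python
from urllib.parse import quote, urlencode

# Query parameters declared in the REST API specification above.
ALLOWED_PARAMS = {
    "meta_thread_pool_concurrency",
    "blob_thread_pool_concurrency",
    "snapshot_verification_concurrency",
    "index_verification_concurrency",
    "index_snapshot_verification_concurrency",
    "max_failed_shard_snapshots",
    "verify_blob_contents",
    "max_bytes_per_sec",
}


def build_verify_integrity_request(repository, **params):
    """Return (method, path) for a verify-integrity call (hypothetical helper)."""
    unknown = set(params) - ALLOWED_PARAMS
    if unknown:
        raise ValueError(f"unsupported parameters: {sorted(unknown)}")
    # Percent-encode the repository name so it is safe as a path segment.
    path = f"/_snapshot/{quote(repository, safe='')}/_verify_integrity"
    # Assumption: booleans are sent as lowercase true/false strings.
    rendered = {
        name: str(value).lower() if isinstance(value, bool) else str(value)
        for name, value in sorted(params.items())
    }
    if rendered:
        path += "?" + urlencode(rendered)
    return "POST", path
```

For example, `build_verify_integrity_request("my_repository", verify_blob_contents=True, max_bytes_per_sec="10mb")` yields a `POST` to `/_snapshot/my_repository/_verify_integrity?max_bytes_per_sec=10mb&verify_blob_contents=true`.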

server/src/main/java/org/elasticsearch/repositories/RepositoryData.java

Lines changed: 7 additions & 0 deletions
@@ -281,6 +281,13 @@ public Collection<SnapshotId> getSnapshotIds() {
         return snapshotIds.values();
     }
 
+    /**
+     * @return the number of index snapshots (i.e. the sum of the index count of each snapshot)
+     */
+    public long getIndexSnapshotCount() {
+        return indexSnapshots.values().stream().mapToLong(List::size).sum();
+    }
+
     /**
      * @return whether some of the {@link SnapshotDetails} of the given snapshot are missing, due to BwC, so that they must be loaded from
      * the {@link SnapshotInfo} blob instead.
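
Reviewer's illustration (not part of the commit): the Javadoc words the count as the sum of the index count of each snapshot, while the implementation sums the sizes of the per-index lists. The two agree, as the Python sketch below shows on made-up data. It assumes that `indexSnapshots` maps each index to the list of snapshots containing it, which this hunk does not show; all names in the sketch are hypothetical.

```python
from collections import defaultdict


def index_snapshot_count(index_snapshots):
    """Sum of list sizes, mirroring the stream expression in the Java method."""
    return sum(len(snapshots) for snapshots in index_snapshots.values())


def count_by_snapshot(index_snapshots):
    """Same total, computed as the Javadoc words it: indices per snapshot, summed."""
    indices_per_snapshot = defaultdict(int)
    for snapshots in index_snapshots.values():
        for snapshot in snapshots:
            indices_per_snapshot[snapshot] += 1
    return sum(indices_per_snapshot.values())


# Made-up example: two indices spread over three snapshots.
example = {
    "index-a": ["snap-1", "snap-2", "snap-3"],
    "index-b": ["snap-2", "snap-3"],
}
```

Both functions count the same set of (index, snapshot) pairs, just grouped differently, so they return the same total for any input of this shape.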

x-pack/plugin/ml/qa/native-multi-node-tests/build.gradle

Lines changed: 1 addition & 0 deletions
@@ -17,6 +17,7 @@ dependencies {
   javaRestTestImplementation project(path: xpackModule('rank-rrf'))
   javaRestTestImplementation project(path: xpackModule('esql-core'))
   javaRestTestImplementation project(path: xpackModule('esql'))
+  javaRestTestImplementation project(path: xpackModule('snapshot-repo-test-kit'))
 }
 
 // location for keys and certificates

x-pack/plugin/security/qa/operator-privileges-tests/src/javaRestTest/java/org/elasticsearch/xpack/security/operator/Constants.java

Lines changed: 1 addition & 0 deletions
@@ -98,6 +98,7 @@ public class Constants {
         "cluster:admin/snapshot/restore",
         "cluster:admin/snapshot/status",
         "cluster:admin/snapshot/status[nodes]",
+        "cluster:admin/repository/verify_integrity",
         "cluster:admin/features/get",
         "cluster:admin/features/reset",
         "cluster:admin/tasks/cancel",
Lines changed: 39 additions & 0 deletions
@@ -0,0 +1,39 @@
+---
+setup:
+  - requires:
+      cluster_features: "snapshot.repository_verify_integrity"
+      reason: "required feature"
+
+  - do:
+      snapshot.create_repository:
+        repository: test_repo
+        body:
+          type: fs
+          settings:
+            location: "test_repo_loc"
+
+  - do:
+      bulk:
+        index: test
+        refresh: true
+        body:
+          - '{"index":{}}'
+          - '{}'
+
+  - do:
+      snapshot.create:
+        repository: test_repo
+        snapshot: snap
+        wait_for_completion: true
+
+---
+"Integrity verification":
+  - do:
+      snapshot.repository_verify_integrity:
+        repository: test_repo
+
+  - match: {results.result: pass}
+  - match: {results.status.snapshots.total: 1}
+  - match: {results.status.snapshots.verified: 1}
+  - match: {results.status.indices.total: 1}
+  - match: {results.status.indices.verified: 1}
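
Reviewer's illustration (not part of the commit): the `match` assertions address response fields by dotted path. A simplified Python equivalent is sketched below. The `lookup` helper is hypothetical and ignores the escaping rules of the real YAML test runner, and `sample_response` is hand-written to mirror the assertions rather than captured from a cluster.

```python
from functools import reduce


def lookup(body, dotted_path):
    """Resolve a dotted path such as 'results.status.snapshots.total' in nested dicts."""
    return reduce(lambda node, key: node[key], dotted_path.split("."), body)


# Hand-written sample shaped after the assertions above, not real cluster output.
sample_response = {
    "results": {
        "result": "pass",
        "status": {
            "snapshots": {"total": 1, "verified": 1},
            "indices": {"total": 1, "verified": 1},
        },
    }
}

# Expected values taken from the five `match` steps in the test.
expected = {
    "results.result": "pass",
    "results.status.snapshots.total": 1,
    "results.status.snapshots.verified": 1,
    "results.status.indices.total": 1,
    "results.status.indices.verified": 1,
}

# Collect every path whose actual value differs from the expected one.
mismatches = {
    path: lookup(sample_response, path)
    for path, want in expected.items()
    if lookup(sample_response, path) != want
}
```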
