Skip to content

Commit 25d8845

Browse files
authored
Avoid counting snapshot failures twice in SLM (#136759) (#136849)
We came across a scenario where 3 snapshot failures were counted as 5 "invocations since last success", resulting in a premature yellow SLM health indicator. The three snapshot failures completed at virtually the same time. Our theory is that the listener of the first snapshot failure already processed the other two snapshot failures (incrementing the `invocationsSinceLastSuccess`), but the listeners of those other two snapshots then incremented that field too. There we two warning logs indicating that the snapshots weren't found in the registered set, confirming our hypothesis. We simply avoid incrementing `invocationsSinceLastSuccess` if the listener failed with an exception and the snapshot isn't registered anymore; assuming that another listener has already incremented the field.
1 parent 822f5dc commit 25d8845

File tree

3 files changed

+14
-3
lines changed

3 files changed

+14
-3
lines changed

docs/changelog/136759.yaml

Lines changed: 5 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,5 @@
1+
pr: 136759
2+
summary: Avoid counting snapshot failures twice in SLM
3+
area: ILM+SLM
4+
type: bug
5+
issues: []

x-pack/plugin/slm/src/main/java/org/elasticsearch/xpack/slm/SnapshotLifecycleTask.java

Lines changed: 7 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -497,7 +497,8 @@ public ClusterState execute(ClusterState currentState) throws Exception {
497497
final SnapshotLifecyclePolicyMetadata.Builder newPolicyMetadata = SnapshotLifecyclePolicyMetadata.builder(policyMetadata);
498498
SnapshotLifecycleStats newStats = snapMeta.getStats();
499499

500-
if (registeredSnapshots.contains(snapshotId) == false) {
500+
final boolean snapshotIsRegistered = registeredSnapshots.contains(snapshotId);
501+
if (snapshotIsRegistered == false) {
501502
logger.warn(
502503
"Snapshot [{}] not found in registered set after snapshot completion. This means snapshot was"
503504
+ " recorded as a failure by another snapshot's cleanup run.",
@@ -564,7 +565,11 @@ public ClusterState execute(ClusterState currentState) throws Exception {
564565
exception.map(SnapshotLifecycleTask::exceptionToString).orElse(null)
565566
)
566567
);
567-
newPolicyMetadata.incrementInvocationsSinceLastSuccess();
568+
// If the snapshot was not registered, it means it was already counted as a failure by another snapshot's cleanup run
569+
// so we should not increment the invocationsSinceLastSuccess again.
570+
if (snapshotIsRegistered) {
571+
newPolicyMetadata.incrementInvocationsSinceLastSuccess();
572+
}
568573
} else {
569574
newStats = newStats.withTakenIncremented(policyName);
570575
newPolicyMetadata.setLastSuccess(

x-pack/plugin/slm/src/test/java/org/elasticsearch/xpack/slm/SnapshotLifecycleTaskTests.java

Lines changed: 2 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -528,7 +528,8 @@ public void testCleanUpRegisteredInitiatedByFailure() throws Exception {
528528
inferredFailureSnapshot,
529529
snapshotInfoSuccess.snapshotId(),
530530
snapshotInfoFailure1.snapshotId(),
531-
snapshotInfoFailure2.snapshotId()
531+
snapshotInfoFailure2.snapshotId(),
532+
initiatingSnapshot
532533
)
533534
);
534535
var inProgress = Map.of(policyId, List.of(stillRunning));

0 commit comments

Comments
 (0)