-
Notifications
You must be signed in to change notification settings - Fork 25.6k
Improve SLM Health Indicator to cover missing snapshot #121370
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
x-pack/plugin/slm/src/main/java/org/elasticsearch/xpack/slm/SlmHealthIndicatorService.java
Outdated
Show resolved
Hide resolved
| if (policy.getLastSuccess() != null) { | ||
| // prefer snapshotStartTimestamp over snapshotFinishTimestamp in case of a very long-running snapshot | ||
| // that started a long time ago | ||
| SnapshotInvocationRecord lastSuccess = policy.getLastSuccess(); | ||
| return lastSuccess.getSnapshotStartTimestamp() != null | ||
| ? lastSuccess.getSnapshotStartTimestamp() | ||
| : lastSuccess.getSnapshotFinishTimestamp(); | ||
| } | ||
| // SX TODO: handle first snapshot (i.e. no prior success of failure), maybe record the policy first trigger timestamp | ||
|
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This doesn't yet handle the case for the first snapshot (i.e. no prior successful snapshot). To handle that, I am thinking to record the first SLM trigger time in SLM metadata SnapshotLifecyclePolicyMetadata so here we can check against that. I can do that in a follow up PR if the idea makes sense.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it'd be good to handle that particular case, so +1 to investigate it after this PR
|
Pinging @elastic/es-data-management (Team:Data Management) |
|
Hi @samxbr, I've created a changelog YAML for you. |
|
Due to current doc freeze, the doc change for this PR is not added yet. It will likely be in a separate PR after the doc freeze. |
dakrone
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I left some comments, but this generally looks good! What do you think of the naming and validation questions?
test/framework/src/main/java/org/elasticsearch/test/ESTestCase.java
Outdated
Show resolved
Hide resolved
x-pack/plugin/core/src/main/java/org/elasticsearch/xpack/core/slm/SnapshotLifecyclePolicy.java
Outdated
Show resolved
Hide resolved
| "invalid missingSnapshotUnhealthyThreshold [" | ||
| + missingSnapshotUnhealthyThreshold.getStringRep() | ||
| + "]: " | ||
| + "time is too short, expecting at least more than the interval between snapshots [" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm on the fence about this validation, on one hand, should we actually prevent someone from making the threshold too small, given that they can always manually execute the SLM policy to take a snapshot? I'm not sure it's worth it. What do you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree that this validation is not strictly necessary, the worse case without it is that the indicator will always be YELLOW if the threshold is set too small.
I am slightly leaning towards keeping it, my philosophy here is to providing quick feedback for user over letting them find out through health indicator. I can't think of a use case where user would set this threshold smaller than the snapshot interval. I also can't think of a drawback for keeping this validation, it's ok if user occasionally execute SLM manually, it just resets the threshold time, and doesn't defeat the purpose of this setting. We can definitely remove this validation if it can have potential negative impact I didn't think of.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm leaning towards keeping the validation as well. I also don't really see a reason why we would allow users to do this, but I do see value in avoiding support tickets where users to do this. Plus, we get unhealthyIfNoSnapshotWithin > 0 validation for free.
x-pack/plugin/slm/src/main/java/org/elasticsearch/xpack/slm/SlmHealthIndicatorService.java
Outdated
Show resolved
Hide resolved
| if (policy.getLastSuccess() != null) { | ||
| // prefer snapshotStartTimestamp over snapshotFinishTimestamp in case of a very long-running snapshot | ||
| // that started a long time ago | ||
| SnapshotInvocationRecord lastSuccess = policy.getLastSuccess(); | ||
| return lastSuccess.getSnapshotStartTimestamp() != null | ||
| ? lastSuccess.getSnapshotStartTimestamp() | ||
| : lastSuccess.getSnapshotFinishTimestamp(); | ||
| } | ||
| // SX TODO: handle first snapshot (i.e. no prior success of failure), maybe record the policy first trigger timestamp | ||
|
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it'd be good to handle that particular case, so +1 to investigate it after this PR
| ? lastSuccess.getSnapshotStartTimestamp() | ||
| : lastSuccess.getSnapshotFinishTimestamp(); | ||
| } | ||
| // SX TODO: handle first snapshot (i.e. no prior success of failure), maybe record the policy first trigger timestamp |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think you can change this to just a regular TODO instead of one specific to you :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Don't forget about this one :)
….java Co-authored-by: Lee Hinman <[email protected]>
|
Hi @nielsbauman, I added you as a backup reviewer in case @leehinman may not have time to review this week 😄 |
nielsbauman
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've had a look at most code, I only have the two large test files left to check, which I'll do later today. Looking great so far!
| * Runs the code block for the provided interval, waiting for no assertions to trip. | ||
| * Runs the code block for the provided interval, waiting for no assertions to trip. Retries on AssertionError | ||
| * with exponential backoff until provided time runs out |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this change really necessary? I don't really feel like this change is super valuable (it's not wrong, just not absolutely necessary) and since it's the only change in this file, I'd prefer to revert it to reduce the scope of the PR a bit.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think adding the comment makes it clearer that it will retry until timeout, I wasn't sure about whether it will retry or not from the original comment, and had to read the code. Maybe it's just me though
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm yeah that's why I said you're not wrong. I'm just not a huge fan of PRs changing unrelated files. I prefer PRs to have a more defined scope to make them easier to read and for commit history to be a bit more understandable.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I get what you mean, and I would usually agree with you for changes that are less trivial than this. It's pretty common for people to find opportunities for tiny improvements while working on unrelated code. I think it's adding unnecessary friction if we were to split every tiny improvements to separate PRs (like a PR just to change this line of comment). To me it's more important to continuously make tiny improvements to the code base than trying to keep every PR within its scope.
That being said, I agree with you 100% on keeping PR in scope. It's just this one feels too trivial to be "out-of-scope".
x-pack/plugin/core/src/main/java/org/elasticsearch/xpack/core/slm/SnapshotLifecyclePolicy.java
Outdated
Show resolved
Hide resolved
x-pack/plugin/core/src/main/java/org/elasticsearch/xpack/core/slm/SnapshotLifecyclePolicy.java
Outdated
Show resolved
Hide resolved
| "invalid missingSnapshotUnhealthyThreshold [" | ||
| + missingSnapshotUnhealthyThreshold.getStringRep() | ||
| + "]: " | ||
| + "time is too short, expecting at least more than the interval between snapshots [" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm leaning towards keeping the validation as well. I also don't really see a reason why we would allow users to do this, but I do see value in avoiding support tickets where users to do this. Plus, we get unhealthyIfNoSnapshotWithin > 0 validation for free.
x-pack/plugin/slm/src/main/java/org/elasticsearch/xpack/slm/SlmHealthIndicatorService.java
Outdated
Show resolved
Hide resolved
x-pack/plugin/slm/src/main/java/org/elasticsearch/xpack/slm/SlmHealthIndicatorService.java
Outdated
Show resolved
Hide resolved
x-pack/plugin/slm/src/test/java/org/elasticsearch/xpack/slm/SnapshotLifecyclePolicyTests.java
Show resolved
Hide resolved
...ster-restart/src/javaRestTest/java/org/elasticsearch/xpack/restart/FullClusterRestartIT.java
Show resolved
Hide resolved
nielsbauman
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry, I hadn't come around to checking the test classes yet. I've reviewed everything now, so we're almost good to go from my end :)
...slm/src/internalClusterTest/java/org/elasticsearch/xpack/slm/SLMHealthBlockedSnapshotIT.java
Outdated
Show resolved
Hide resolved
...slm/src/internalClusterTest/java/org/elasticsearch/xpack/slm/SLMHealthBlockedSnapshotIT.java
Show resolved
Hide resolved
...slm/src/internalClusterTest/java/org/elasticsearch/xpack/slm/SLMHealthBlockedSnapshotIT.java
Outdated
Show resolved
Hide resolved
...slm/src/internalClusterTest/java/org/elasticsearch/xpack/slm/SLMHealthBlockedSnapshotIT.java
Outdated
Show resolved
Hide resolved
...slm/src/internalClusterTest/java/org/elasticsearch/xpack/slm/SLMHealthBlockedSnapshotIT.java
Outdated
Show resolved
Hide resolved
...slm/src/internalClusterTest/java/org/elasticsearch/xpack/slm/SLMHealthBlockedSnapshotIT.java
Show resolved
Hide resolved
...slm/src/internalClusterTest/java/org/elasticsearch/xpack/slm/SLMHealthBlockedSnapshotIT.java
Outdated
Show resolved
Hide resolved
| * Runs the code block for the provided interval, waiting for no assertions to trip. | ||
| * Runs the code block for the provided interval, waiting for no assertions to trip. Retries on AssertionError | ||
| * with exponential backoff until provided time runs out |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm yeah that's why I said you're not wrong. I'm just not a huge fan of PRs changing unrelated files. I prefer PRs to have a more defined scope to make them easier to read and for commit history to be a bit more understandable.
x-pack/plugin/slm/src/test/java/org/elasticsearch/xpack/slm/SlmHealthIndicatorServiceTests.java
Outdated
Show resolved
Hide resolved
nielsbauman
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, thanks for iterating on this, @samxbr!
Currently the SLM health indicator in health report turns YELLOW when snapshots fail for a number of times. However, the SLM health indicator stays GREEN if snapshot is not completed (no success or failure) for a long time. This change adds a new optional setting
unhealthy_if_no_snapshot_withinto SLM policy, that sets a time threshold. If the SLM policy has not had a successful snapshot for longer than the threshold, the SLM health indicator will turn YELLOW.