Fix SnapshotMetricsIT race conditions #132780
Co-authored-by: Jeremy Dahlgren <[email protected]>
Pinging @elastic/es-distributed-coordination (Team:Distributed Coordination)
// wait for snapshot to finish to test the other metrics
awaitNumberOfSnapshotsInProgress(0);
safeGet(snapshotFuture);
I think if we wait for the response to be received that should mean the metrics should be written? (note waitForCompletion=true above now)
I'm running the test in a loop to try to confirm that now.
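A minimal sketch of the reasoning in the comment above, using only the JDK. `completedCounter` and `startSnapshot()` are hypothetical stand-ins, not Elasticsearch APIs, and the sketch assumes the metric is recorded before the response is completed, which is exactly the property the looped test run is meant to confirm:

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;

public class WaitForResponseSketch {
    // Hypothetical stand-in for a "snapshots completed" metric.
    static final AtomicLong completedCounter = new AtomicLong();

    // Hypothetical stand-in for a snapshot request sent with waitForCompletion=true.
    // Assumption: the metric is written before the response is produced.
    static CompletableFuture<String> startSnapshot() {
        return CompletableFuture.supplyAsync(() -> {
            completedCounter.incrementAndGet(); // metric recorded first
            return "SUCCESS";                   // response completes afterwards
        });
    }

    // Returns how much the counter grew between sending the request and receiving the response.
    static long counterDeltaAfterResponse() {
        long before = completedCounter.get();
        // Block until the response arrives (bounded, similar in spirit to a safeGet-style helper).
        startSnapshot().orTimeout(10, TimeUnit.SECONDS).join();
        return completedCounter.get() - before;
    }

    public static void main(String[] args) {
        System.out.println("counter delta after response: " + counterDeltaAfterResponse());
    }
}
```

If the increment happened after the response was completed instead, reading the counter right after `join()` could still race, so the ordering assumption matters.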
This introduced a source of flakiness in the tension between the throttling and safeGet. If the setting was too high, we exceeded the safeGet timeout; if the setting was too low, we didn't see any throttling.
For this reason I've moved the throttling metrics checks out into a separate test. It has more complexity to ensure we see throttling and we don't run too long.
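The tension described above can be illustrated with some arithmetic. The thread does not say which setting was being tuned, so this sketch assumes, purely for illustration, a fixed bytes-per-second cap and a varying amount of data; every name and number here is hypothetical:

```java
public class ThrottleWindowSketch {
    // Expected transfer time in milliseconds under a fixed rate cap (rounded up).
    static long expectedTransferMillis(long totalBytes, long bytesPerSec) {
        return (totalBytes * 1000 + bytesPerSec - 1) / bytesPerSec;
    }

    // A value is usable only if the transfer is slow enough for throttling to be
    // observed, yet fast enough to finish before the test's wait times out.
    static boolean isUsable(long totalBytes, long bytesPerSec, long minObservableMillis, long timeoutMillis) {
        long expected = expectedTransferMillis(totalBytes, bytesPerSec);
        return expected >= minObservableMillis && expected < timeoutMillis;
    }

    public static void main(String[] args) {
        long rate = 100 * 1024;  // hypothetical cap: 100 KiB/s
        long minMillis = 1_000;  // hypothetical: shortest run in which throttling shows up
        long timeout = 10_000;   // hypothetical wait timeout in milliseconds
        System.out.println(isUsable(51_200, rate, minMillis, timeout));    // too little data
        System.out.println(isUsable(512_000, rate, minMillis, timeout));   // inside the window
        System.out.println(isUsable(2_048_000, rate, minMillis, timeout)); // too much data
    }
}
```

The usable window can be narrow, and on a slow CI machine it may shift, which is one plausible reason a dedicated test with extra setup is easier to keep reliable.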
mhl-b
left a comment
LGTM
@JeremyDahlgren I think the additional ceremony is necessary to make a timely and reliable throttling metrics test. I broke it out into its own test because I didn't want to muddy the waters in the existing one with all the extra steps. I'm running See now we set the
…ics_it # Conflicts: # muted-tests.yml
JeremyDahlgren
left a comment
LGTM. I had this in a loop also and couldn't get a failure. @mhl-b had also reviewed earlier, he might want to double check since the updates.
createRepository(
    repositoryName,
    "mock",
    repositorySettings.put(BlobStoreRepository.MAX_RESTORE_BYTES_PER_SEC.getKey(), ByteSizeValue.ZERO)
);
Is it significant that verify is omitted here, which will default to true, while the previous createRepository() calls explicitly set it to false? Is this to save time and just verify at the end?
I see this is a PUT when I drill down, but createOrUpdateRepository() or putRepository() might help when reading. Just an observation though, this method and overrides have been around a long time.
Yeah I set it to false because verify makes it take longer, when all we're trying to do is turn off the throttling. Good pick up, I've set it to false in 62f1935
Sadly, overnight the test failed on line 105: another race condition, where the snapshot thread blocks in the repository before the SNAPSHOTS_STARTED counter is incremented. I'll fix that and run it for another day 🙄
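A small JDK-only sketch of the kind of ordering problem described above. It does not reproduce the actual test or the eventual fix; the `started` counter and the latches are hypothetical stand-ins for the SNAPSHOTS_STARTED counter and the blocked repository:

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.atomic.AtomicLong;

public class BlockBeforeIncrementSketch {
    // Returns {counter value while the worker is blocked, counter value after release}.
    static long[] observe() {
        AtomicLong started = new AtomicLong();          // hypothetical "started" counter
        CountDownLatch blocked = new CountDownLatch(1); // worker has reached the block point
        CountDownLatch release = new CountDownLatch(1); // lets the worker continue

        Thread worker = new Thread(() -> {
            blocked.countDown();
            try {
                release.await();           // worker parks here, before the increment
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return;
            }
            started.incrementAndGet();     // only reached after release
        });
        worker.start();

        try {
            blocked.await();
            long whileBlocked = started.get(); // an assertion here would see 0
            release.countDown();
            worker.join();
            long afterRelease = started.get(); // the increment is visible only now
            return new long[] { whileBlocked, afterRelease };
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        long[] result = observe();
        System.out.println("while blocked: " + result[0] + ", after release: " + result[1]);
    }
}
```

An assertion placed at the "while blocked" point can only rely on counters that are incremented before the block point, so a test that asserts there fails whenever the increment comes later.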
> I see this is a PUT when I drill down, but createOrUpdateRepository() or putRepository() might help when reading. Just an observation though, this method and overrides have been around a long time.

Agree on the naming. I won't do that as part of this PR, but I might follow up with a second one (it will probably add lots of noise).
Attempt at fixing race condition(s) in the SnapshotMetricsIT
Thanks @JeremyDahlgren for troubleshooting this
Fixes #132731