Skip to content

Conversation

@nicktindall
Copy link
Contributor

@nicktindall nicktindall commented Aug 13, 2025

Attempt at fixing race condition(s) in the SnapshotMetricsIT

Thanks @JeremyDahlgren for troubleshooting this

Fixes #132731

@nicktindall nicktindall added >test Issues or PRs that are addressing/adding tests :Distributed Coordination/Snapshot/Restore Anything directly related to the `_snapshot/*` APIs labels Aug 13, 2025
@elasticsearchmachine elasticsearchmachine added Team:Distributed Coordination Meta label for Distributed Coordination team v9.2.0 labels Aug 13, 2025
@elasticsearchmachine
Copy link
Collaborator

Pinging @elastic/es-distributed-coordination (Team:Distributed Coordination)


// wait for snapshot to finish to test the other metrics
awaitNumberOfSnapshotsInProgress(0);
safeGet(snapshotFuture);
Copy link
Contributor Author

@nicktindall nicktindall Aug 13, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think if we wait for the response to be received that should mean the metrics should be written? (note waitForCompletion=true above now)

I'm running in a loop to try and confirm that now

Copy link
Contributor Author

@nicktindall nicktindall Aug 13, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This introduced a source of flakiness in the tension between the throttling and safeGet. If the setting was too high, we exceeded the safeGet timeout, if the setting was too low, we didn't see any throttling.

For this reason I've moved out the test for the throttling metrics to another test. It has more complexity to ensure we see throttling and we don't run too long.

Copy link
Contributor

@mhl-b mhl-b left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@nicktindall
Copy link
Contributor Author

nicktindall commented Aug 14, 2025

@JeremyDahlgren I think the additional ceremony is necessary to make a timely and reliable throttling metrics test. I broke it out into its own test because I didn't want to muddy the waters in the existing one with all the extra steps. I'm running SnapshotMetricsIT on repeat on a GCP machine to see if it holds.

See now we set the max_bytes_per_sec much lower to ensure throttling, but then once we've seen throttling we remove the limits to allow the snapshot/restore to complete in a reasonable amount of time.

Copy link
Contributor

@JeremyDahlgren JeremyDahlgren left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. I had this in a loop also and couldn't get a failure. @mhl-b had also reviewed earlier, he might want to double check since the updates.

Comment on lines 273 to 277
createRepository(
repositoryName,
"mock",
repositorySettings.put(BlobStoreRepository.MAX_RESTORE_BYTES_PER_SEC.getKey(), ByteSizeValue.ZERO)
);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it significant that verify is omitted here, which will default to true, while the previous createRepository() calls explicitly set it to false? Is this to save time and just verify at the end?

I see this is a PUT when I drill down, but createOrUpdateRepository() or putRepository() might help when reading. Just an observation though, this method and overrides have been around a long time.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah I set it to false because verify makes it take longer, when all we're trying to do is turn off the throttling. Good pick up, I've set it to false in 62f1935

Sadly overnight the test failed on line 105, another race condition where the snapshot thread blocks in the repo. before the SNAPSHOTS_STARTED counter is incremented. I'll fix that and run for another day again 🙄

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see this is a PUT when I drill down, but createOrUpdateRepository() or putRepository() might help when reading. Just an observation though, this method and overrides have been around a long time.

Agree on the naming, I won't do that as part of this PR but I might follow up with a second one (it will probably add lots of noise)

@nicktindall nicktindall merged commit e469514 into elastic:main Aug 20, 2025
34 checks passed
@nicktindall nicktindall deleted the alt_fix_snapshot_metrics_it branch August 21, 2025 01:38
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

:Distributed Coordination/Snapshot/Restore Anything directly related to the `_snapshot/*` APIs Team:Distributed Coordination Meta label for Distributed Coordination team >test Issues or PRs that are addressing/adding tests v9.2.0

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[CI] SnapshotMetricsIT testSnapshotAPMMetrics failing

4 participants