Improve OSU benchmark to better detect system jitters #7190
Description of changes
This PR includes improvements and fixes to the OSU benchmark integration tests:
Remove devsettings to disable in_place_update_on_fleet in OSU test
Retry OSU benchmark once in case of failure
Run OSU alltoall only on small clusters
Improve performance test README
Skip OSU pt2pt benchmarks on large clusters
Add barrier benchmark to OSU test
Run OSU on large clusters with 500 compute nodes
Only compare with baseline if baseline file exists
Remove OSU result upload to CloudWatch metrics
See the commit descriptions for more details.
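As a rough illustration of three of the behaviors listed above (size-dependent benchmark selection, a single retry on failure, and baseline comparison only when a baseline file exists), here is a minimal Python sketch. It is not the code from this PR: the function names, the node-count threshold, the JSON baseline layout, and the 20% tolerance are all assumptions made for the example.

```python
import json
from pathlib import Path
from typing import Callable, List, Optional

# Assumed threshold: the PR description does not say where "small" ends
# and "large" begins, so this value is purely illustrative.
LARGE_CLUSTER_MIN_NODES = 100

# Benchmarks run at every scale vs. only on small clusters (hypothetical grouping).
ALWAYS_RUN = ["osu_barrier"]
SMALL_CLUSTER_ONLY = ["osu_alltoall", "osu_latency", "osu_bw"]


def select_benchmarks(num_nodes: int) -> List[str]:
    """Return the benchmarks to run for a cluster with num_nodes compute nodes."""
    if num_nodes >= LARGE_CLUSTER_MIN_NODES:
        # Large clusters skip alltoall and the point-to-point benchmarks.
        return list(ALWAYS_RUN)
    return ALWAYS_RUN + SMALL_CLUSTER_ONLY


def run_with_one_retry(run: Callable[[], float]) -> float:
    """Run a benchmark; if the first attempt raises, try exactly once more."""
    try:
        return run()
    except Exception:
        # A second failure propagates to the caller and fails the test.
        return run()


def within_baseline(
    benchmark: str,
    latency_us: float,
    baseline_file: Path,
    tolerance: float = 0.2,
) -> Optional[bool]:
    """Compare a latency with its baseline; return None if there is no baseline."""
    if not baseline_file.is_file():
        return None
    # Assumed layout: a JSON object mapping benchmark name to latency in microseconds.
    baseline = json.loads(baseline_file.read_text())
    if benchmark not in baseline:
        return None
    return latency_us <= baseline[benchmark] * (1 + tolerance)
```

Returning `None` instead of `False` when no baseline is available lets the caller distinguish "nothing to compare against" from "performance regressed", so a new instance type without a recorded baseline does not fail the test.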
Tests
The following test has passed:
References
This change can reproduce the following issues:
#7095
#6449
Checklist
If the PR targets a branch other than develop, add the branch name as a prefix in the PR title (e.g. [release-3.6]).
Please review the guidelines for contributing and Pull Request Instructions.
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.