@hanwen-cluster hanwen-cluster commented Jan 9, 2026

Description of changes

This PR includes improvements and fixes to the OSU benchmark integration tests:

  • Remove devsettings to disable in_place_update_on_fleet in OSU test

  • Retry OSU benchmark once in case of failure

  • Run OSU alltoall only on small clusters

  • Improve performance test README

  • Skip OSU pt2pt benchmarks on large clusters

  • Add barrier benchmark to OSU test

  • Run OSU on large clusters with 500 compute nodes

  • Only compare with baseline if baseline file exists

  • Remove OSU result upload to CloudWatch metrics

See the commit descriptions for more details.
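As an illustration of the "retry once" item above, here is a minimal sketch of a retry wrapper. It is not the code from this PR: the helper name `run_with_one_retry` and the `run_benchmark` callable are hypothetical, and the real test may retry at a different level (for example around the job submission).

```python
def run_with_one_retry(run_benchmark, max_attempts=2):
    """Call run_benchmark(), retrying once by default if it raises.

    run_benchmark is a hypothetical zero-argument callable that runs one
    OSU benchmark and returns its output; it is not a name from this PR.
    """
    last_error = None
    for attempt in range(1, max_attempts + 1):
        try:
            return run_benchmark()
        except Exception as error:  # broad on purpose: any failure triggers a retry
            last_error = error
            print(f"attempt {attempt} of {max_attempts} failed: {error}")
    # Every attempt failed: surface the last error to the caller.
    raise last_error
```

With the default `max_attempts=2`, a transient failure on the first run is absorbed, while a benchmark that fails twice still fails the test.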

Tests

The following test has passed:

test-suites:
  performance_tests:
    test_osu.py::test_osu:
      dimensions:
        - regions: [ {{ c5_xlarge_CAPACITY_RESERVATION_510_INSTANCES_2_HOURS_YESPG_alinux2023 }} ]
          instances: [ "c5.xlarge" ]
          oss: ["alinux2023"] # ParallelCluster does not release official Rocky images. Skip the test.
          schedulers: [ "slurm" ]

References

This change is able to reproduce these issues:
#7095
#6449

Checklist

  • Make sure you are pointing to the right branch.
  • If you're creating a patch for a branch other than develop add the branch name as prefix in the PR title (e.g. [release-3.6]).
  • Check all commits' messages are clear, describing what and why vs how.
  • Make sure to have added unit tests or integration tests to cover the new/modified code.
  • Check if documentation is impacted by this change.

Please review the guidelines for contributing and Pull Request Instructions.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

@hanwen-cluster hanwen-cluster requested review from a team as code owners January 9, 2026 16:06
@hanwen-cluster hanwen-cluster added the skip-changelog-update Disables the check that enforces changelog updates in PRs label Jan 9, 2026
The results are already uploaded to DynamoDB, so we no longer need to upload them to CloudWatch metrics.
This commit makes the test flexible enough to run on different instance types. Even without a baseline comparison, the test is meaningful because it outputs the results to the console and to DynamoDB.
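A minimal sketch of the "compare only if a baseline file exists" behavior, under stated assumptions: the baseline is a JSON file mapping message size to latency in microseconds, and a result counts as a regression when it exceeds the baseline by more than a relative `tolerance`. The file format, the function name, and the tolerance value are all hypothetical and are not taken from this PR.

```python
import json
from pathlib import Path


def compare_with_baseline(results, baseline_path, tolerance=0.2):
    """Return a list of regressions, or None when there is no baseline file.

    results: dict mapping message size -> measured latency (us).
    The JSON layout {"<size>": <latency>} is an assumption for this sketch.
    """
    path = Path(baseline_path)
    if not path.is_file():
        # No baseline for this instance type: skip the comparison entirely.
        return None
    baseline = json.loads(path.read_text())
    regressions = []
    for size, latency in results.items():
        expected = baseline.get(str(size))
        if expected is not None and latency > expected * (1 + tolerance):
            regressions.append((size, latency, expected))
    return regressions
```

Returning `None` rather than an empty list lets the caller distinguish "no baseline available" from "compared and found no regressions".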
This helps us better detect system jitter.

This commit uses a smaller instance type for large clusters, because system jitter is independent of instance type.
The barrier benchmark is quick and effective at detecting system jitter.

This commit also changes the regex pattern to correctly recognize the output of the barrier benchmark.

Example output for OSU barrier:
```
# OSU MPI Barrier Latency Test v5.7.1
# Avg Latency(us)
          5397.91
```

Example output for other benchmarks:
```
# OSU MPI Allgather Latency Test v5.7.1
# Size       Avg Latency(us)
1                     973.61
2                    1045.82
4                     942.89
8                    1334.42
16                   1298.60
32                   1517.23
64                   1658.20
128                  5892.50
256                 10734.44
512                  2216.81
1024                 2453.63
2048                 2635.08
4096                 4636.03
8192                 9921.76
16384               22506.78
32768               46332.33
65536               98798.96
131072             136354.77
262144             237340.72
524288             465504.41
```
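To show how one pattern can cover both output shapes above, here is a hedged sketch of a parser. The regex and the function name `parse_osu_output` are illustrative assumptions, not the pattern introduced by this PR. The message-size column is made optional, so a barrier row (latency only) and a sized row (size plus latency) both match.

```python
import re

# A data row is an optional integer message size followed by a latency value.
# Barrier output has no size column, so the size group is optional.
ROW_PATTERN = re.compile(r"^\s*(?:(\d+)\s+)?(\d+(?:\.\d+)?)\s*$")


def parse_osu_output(text):
    """Return a list of (size, latency_us) tuples; size is None for barrier."""
    rows = []
    for line in text.splitlines():
        if line.lstrip().startswith("#"):
            continue  # header lines such as "# Size  Avg Latency(us)"
        match = ROW_PATTERN.match(line)
        if match:
            size, latency = match.groups()
            rows.append((int(size) if size else None, float(latency)))
    return rows
```

Applied to the two examples above, the barrier output yields a single row with size `None`, and the allgather output yields one row per message size.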
The pt2pt benchmarks are meaningful only on small clusters.
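The cluster-size rules in this PR (alltoall and pt2pt only on small clusters, barrier added) could be expressed roughly as follows. This is a sketch only: the cutoff `SMALL_CLUSTER_MAX_NODES`, the function name, and the exact benchmark list are assumptions for illustration and do not come from the PR.

```python
# Hypothetical cutoff between "small" and "large" clusters; the PR does not
# state the actual threshold used by the test.
SMALL_CLUSTER_MAX_NODES = 32


def select_benchmarks(num_nodes):
    """Pick which OSU benchmarks to run for a cluster of num_nodes nodes."""
    # Assumed to run at every scale in this sketch.
    benchmarks = ["osu_barrier", "osu_allgather"]
    if num_nodes <= SMALL_CLUSTER_MAX_NODES:
        # alltoall and point-to-point (pt2pt) latency run only on small clusters.
        benchmarks += ["osu_alltoall", "osu_latency"]
    return benchmarks
```

Under these assumptions, a 500-node run would execute only the barrier and allgather benchmarks.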
…fleet in OSU test

This commit prepares the OSU test for validating the long-term fix.
@hanwen-cluster hanwen-cluster changed the title from "Improve OSU benchmark to better detect system jitter" to "Improve OSU benchmark to better detect system jitters" Jan 9, 2026
@hanwen-cluster hanwen-cluster merged commit 356daf7 into aws:develop Jan 9, 2026
24 checks passed