Support b300 Instance#7210

Merged
hgreebe merged 8 commits into aws:develop from hgreebe:develop-test
Jan 29, 2026

@hgreebe hgreebe commented Jan 27, 2026

Description of changes

  • Support b300 Instance
  • Default the network interface configuration for b300 so that the primary interface is an ENA network interface and all other interfaces are efa-only (this configuration is use case 1 from this documentation)
  • Add support for running test_efa with b300 instance
  • Upgrade NCCL version to 2.28.9
  • Upgrade OFI NCCL plugin version to 1.18.0
  • Upgrade Fabtest version to 2.4.0

Tests

  • Ran test_efa with ubuntu2404, alinux2023, and rocky9
  • Ran gpu health checks
  • Compared NCCL benchmark results with OFI plugin team

Please review the guidelines for contributing and Pull Request Instructions.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

@hgreebe hgreebe marked this pull request as ready for review January 27, 2026 14:15
@hgreebe hgreebe requested review from a team as code owners January 27, 2026 14:15
+ is_b300 = instance_family == P6_B300
  efa_enabled = compute_resource.efa and compute_resource.efa.enabled
- interface_type = "efa" if efa_enabled and not is_gb200 else None
+ interface_type = "efa" if efa_enabled and not is_gb200 and not is_b300 else None

[minor] Documentation: I suggest making this logic more self-explanatory by either:

  1. adding a comment here specifying that these instance families require the first interface to be ENA even if EFA is enabled, or
  2. defining a constant INSTANCE_TYPES_WITH_FIRST_INTERFACE_ENA = [P6E_GB200, P6_B300] and referencing that constant in this logic.
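
Option 2 could look roughly like the sketch below. The constant name is the one proposed above and the family strings mirror the P6E_GB200 and P6_B300 constants; `resolve_primary_interface_type` is a hypothetical helper name used only for illustration, not a function in the codebase.

```python
from typing import Optional

P6E_GB200 = "p6e-gb200"
P6_B300 = "p6-b300"

# Instance families whose first network interface must be ENA even when EFA is enabled.
INSTANCE_TYPES_WITH_FIRST_INTERFACE_ENA = [P6E_GB200, P6_B300]


def resolve_primary_interface_type(instance_family: str, efa_enabled: bool) -> Optional[str]:
    """Hypothetical helper: "efa" for the primary interface, or None when it should not be EFA."""
    if efa_enabled and instance_family not in INSTANCE_TYPES_WITH_FIRST_INTERFACE_ENA:
        return "efa"
    return None
```

Adding a future family with the same requirement would then be a one-line change to the list rather than another `and not is_...` clause.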

  Networking:
    PlacementGroup:
-     Enabled: {% if instance not in ["p4d.24xlarge", "p6-b200.48xlarge"] %}true{% else %}false{% endif %}
+     Enabled: {% if instance not in ["p4d.24xlarge", "p6-b200.48xlarge", "p6-b300.48xlarge"] %}true{% else %}false{% endif %}

[minor] The test code already has the logic to determine the capacity reservation and placement group. We could improve config readability by using a placement_group_enabled variable here and setting it in the test code accordingly.

I know that this approach was here before your PR, so not a blocker for this PR.
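
A minimal sketch of that idea, assuming the test code is free to compute the value before rendering the config. The function and set names are hypothetical; the instance list is copied from the template condition above.

```python
# Instance types for which the template above renders "false" (copied from the diff).
INSTANCES_WITHOUT_PLACEMENT_GROUP = frozenset(
    {"p4d.24xlarge", "p6-b200.48xlarge", "p6-b300.48xlarge"}
)


def placement_group_enabled(instance: str) -> str:
    """Return the lowercase "true"/"false" string that the rendered YAML expects."""
    return "false" if instance in INSTANCES_WITHOUT_PLACEMENT_GROUP else "true"
```

With the value passed in as a template variable, the config line could shrink to `Enabled: {{ placement_group_enabled }}`, keeping the instance list in one place in the test code.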

PCLUSTER_BUILD_IMAGE_CLEANUP_ROLE_BOOTSTRAP_TAG_KEY = "parallelcluster:build-image-cleanup-role-bootstrapped"

P6E_GB200 = "p6e-gb200"
P6_B300 = "p6-b300"

@gmarciani commented Jan 27, 2026

When we introduced GB200, we did so for a limited set of OSes. Are we sure there are no OS limitations for B300? For instance, the documentation for NVIDIA 580 does not mention AL2 as a supported OS: https://docs.nvidia.com/datacenter/tesla/tesla-release-notes-580-126-09/index.html

Contributor Author

I added a validator to prevent al2 and b300 from being used together.

  Networking:
    PlacementGroup:
-     Enabled: {% if instance not in ["p4d.24xlarge", "p6-b200.48xlarge"] %}true{% else %}false{% endif %}
+     Enabled: {% if instance not in ["p4d.24xlarge", "p6-b200.48xlarge", "p6-b300.48xlarge"] %}true{% else %}false{% endif %}

[Minor, non-blocking] Let's change this to check for the p-family instance instead of the whole instance type.
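
One possible reading of this suggestion is to compare instance families (the part before the dot) rather than full instance types. The sketch below assumes that reading; the helper names and the family set are hypothetical, with the families derived from the instance types in the condition above.

```python
# Families derived from the instance types listed in the template condition.
FAMILIES_WITHOUT_PLACEMENT_GROUP = frozenset({"p4d", "p6-b200", "p6-b300"})


def instance_family(instance_type: str) -> str:
    """Return the family part of an instance type, e.g. "p6-b300" for "p6-b300.48xlarge"."""
    return instance_type.split(".", 1)[0]


def uses_placement_group(instance_type: str) -> bool:
    """True unless the instance family is one that runs without a placement group."""
    return instance_family(instance_type) not in FAMILIES_WITHOUT_PLACEMENT_GROUP
```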

+ is_b300 = instance_family == P6_B300
  efa_enabled = compute_resource.efa and compute_resource.efa.enabled
- interface_type = "efa" if efa_enabled and not is_gb200 else None
+ interface_type = "efa" if efa_enabled and not is_gb200 and not is_b300 else None

@gmarciani commented Jan 27, 2026

[Test] I suggest covering the logic that determines the interface types with unit tests. Better still, capture that logic in a dedicated helper function and unit test that function rather than the resulting template.
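
As a rough illustration of such a helper and its test, the sketch below models only what the PR description states for b300 (ENA on the primary interface, efa-only on the others). Everything else is an assumption made for the example: the function name, the `interface_count` parameter, the use of None for the default interface type, and "efa" on every interface for other families.

```python
from typing import List, Optional

P6_B300 = "p6-b300"


def interface_types(instance_family: str, efa_enabled: bool, interface_count: int) -> List[Optional[str]]:
    """Hypothetical helper: one interface type per network interface, primary first.

    None stands for the default (non-EFA) interface type.
    """
    if interface_count < 1:
        raise ValueError("interface_count must be at least 1")
    if not efa_enabled:
        return [None] * interface_count
    if instance_family == P6_B300:
        # Per the PR description: ENA on the primary interface, efa-only on the rest.
        return [None] + ["efa-only"] * (interface_count - 1)
    # Assumption for this sketch: other families use "efa" on every interface.
    return ["efa"] * interface_count


def test_interface_types():
    assert interface_types("p6-b300", True, 3) == [None, "efa-only", "efa-only"]
    assert interface_types("p5", True, 2) == ["efa", "efa"]
    assert interface_types("p6-b300", False, 2) == [None, None]
```

Testing the helper directly keeps the cases readable and avoids asserting on a fully rendered template.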

CHANGELOG.md Outdated
- Add validator that warns against the downsides of disabling in-place updates on compute and login nodes through DevSettings.
- Upgrade jmespath to ~=1.0 (from ~=0.10).
- Upgrade tabulate to <=0.9.0 (from <=0.8.10).
- Add support for p6-b300 instances by having a default network configuration for those instances.

We should state in the changelog that support covers every OS except AL2.

" Please use one of the following OS: {2}".format(instance_type, os, SUPPORTED_OSES_FOR_P6E_GB200),
FailureLevel.ERROR,
)
if instance_type.startswith("p6-b300") and os in UNSUPPORTED_OSES_FOR_P6_B300:

[Test] We should capture this change with unit tests.

" Please use one of the following OS: {2}".format(instance_type, os, SUPPORTED_OSES_FOR_P6E_GB200),
FailureLevel.ERROR,
)
if instance_type.startswith("p6-b300") and os in UNSUPPORTED_OSES_FOR_P6_B300:

[minor] We can refer to the constant P6_B300 rather than redefining the string.
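
A minimal sketch of the check written against the constant, outside the validator class so that it runs on its own. The function name is hypothetical, and the contents of UNSUPPORTED_OSES_FOR_P6_B300 are an assumption based on the AL2 discussion in this PR; the message text follows the line in the diff.

```python
from typing import Optional

P6_B300 = "p6-b300"
# Assumption: the list holds the AL2 identifier; the real contents are not shown in this diff.
UNSUPPORTED_OSES_FOR_P6_B300 = ["alinux2"]


def p6_b300_os_failure(instance_type: str, os: str) -> Optional[str]:
    """Hypothetical stand-in for the validator branch: return a failure message or None."""
    if instance_type.startswith(P6_B300) and os in UNSUPPORTED_OSES_FOR_P6_B300:
        return "The instance type {0} is not officially supported with OS {1}.".format(instance_type, os)
    return None
```

Reusing the constant keeps the family string defined in a single place, so the validator and the interface logic cannot drift apart.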

)
if instance_type.startswith("p6-b300") and os in UNSUPPORTED_OSES_FOR_P6_B300:
self._add_failure(
"The instance type {0} is not officially supported with OS {1}."

I know we use the term "officially" elsewhere. However, I think we should avoid it to prevent misunderstanding. Let's simply say that it is not supported.

None,
),
(
"p6-b300.48xlarge",

[minor] If you use p6-b300.WHATEVER_SIZE rather than p6-b300.48xlarge, it will be clearer that the behavior applies whatever the size is. Otherwise, there could be a doubt that it is something specific to 48xlarge.

You applied this testing best practice in the unit test above. I suggest applying it here as well.
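
For illustration, table-driven cases in that style might look like the sketch below. `is_p6_b300` is a hypothetical stand-in for whatever logic the real test exercises; only the placeholder-size convention is the point here.

```python
P6_B300 = "p6-b300"


def is_p6_b300(instance_type: str) -> bool:
    """Hypothetical stand-in: match on the family part only, ignoring the size."""
    return instance_type.split(".", 1)[0] == P6_B300


# A placeholder size makes it explicit that the behavior does not depend on 48xlarge.
CASES = [
    ("p6-b300.WHATEVER_SIZE", True),
    ("p6-b200.WHATEVER_SIZE", False),
    ("p6e-gb200.WHATEVER_SIZE", False),
]


def test_family_detection_ignores_size():
    for instance_type, expected in CASES:
        assert is_p6_b300(instance_type) is expected, instance_type
```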

codecov bot commented Jan 28, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 90.27%. Comparing base (6bd8e1d) to head (dd86c39).
⚠️ Report is 27 commits behind head on develop.

Additional details and impacted files
@@             Coverage Diff             @@
##           develop    #7210      +/-   ##
===========================================
+ Coverage    90.21%   90.27%   +0.05%     
===========================================
  Files          183      183              
  Lines        16566    16645      +79     
===========================================
+ Hits         14945    15026      +81     
+ Misses        1621     1619       -2     
Flag Coverage Δ
unittests 90.27% <100.00%> (+0.05%) ⬆️

@hgreebe hgreebe merged commit 12d911f into aws:develop Jan 29, 2026
24 checks passed

3 participants