@arnaldo2792
Contributor

Issue number:

Closes #

Description of changes:

There can be a delay between the release of the installer and its EFA driver changes landing in the Amazon Linux kernel tree. This adds a build step that updates the driver from the installer, so the latest changes are picked up as soon as the installer is available.

As part of this change, the efa.cmake file was updated to run the tests serially when validating the available configurations in the kernel sources. The need for serial execution is unique to Bottlerocket because of how out-of-tree kmod builds are treated.

Testing done:

Pending

Terms of contribution:

By submitting this pull request, I agree that this contribution is dual-licensed under the terms of both the Apache License, version 2.0, and the MIT license.

Comment on lines 260 to 261
Contributor

We need the other fix, where `cd %{_builddir}` moves out of this arch conditional.

@arnaldo2792
Contributor Author

Force push includes:

  • Move the `cd %{_builddir}` line out of the conditional
  • Rework the modules_prepare patch, and use `$(PREPARE) $(Q)$(MAKE)` to support parallel test execution in the EFA driver
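
For readers unfamiliar with the first item, the fragment below is a minimal sketch of what moving a directory change out of an arch conditional can look like in an RPM spec. It is not the spec from this pull request: the section name, the architecture, and the placeholder command are assumptions chosen for illustration, and only the `cd %{_builddir}` line comes from the discussion above.

```spec
%build
# Hypothetical layout: the section, architecture, and placeholder command
# are assumptions, not the actual spec from this pull request.
%ifarch x86_64
# Steps that apply to one architecture only stay inside the conditional.
echo "arch-specific driver steps"
%endif

# Outside the conditional, so the directory change runs on every architecture.
cd %{_builddir}
```

With the `cd` inside the conditional, builds for other architectures would skip it and later steps would run from whatever directory the previous step left behind.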

The EFA driver in the Amazon Linux kernel can have some delay between
when the installer is released and it making it into the kernel tree.
This adds a build step to update the driver from the installer to get
the latest changes as soon as the installer is available.

As part of this change, the `efa.cmake` file was updated to do a serial
execution of the tests to validate the available configurations in the
kernel sources. This is a problem unique to Bottlerocket due to how out
of tree kmod builds are treated.

Signed-off-by: Arnaldo Garcia Rincon <[email protected]>
@arnaldo2792
Contributor Author

(Forgot to remove the unnecessary patch for the EFA sources)

@yeazelm
Contributor

yeazelm commented Nov 17, 2025

I was able to run MPI/RDMA/EFA testing to exercise the same test that produced the error for bottlerocket-os/bottlerocket#4681 and with this change the test passes:

--- PASS: TestMPIJobPytorchTraining (300.36s)
    --- SKIP: TestMPIJobPytorchTraining/single-node (0.00s)
    --- PASS: TestMPIJobPytorchTraining/multi-node:all_reduce_perf (215.11s)
        --- PASS: TestMPIJobPytorchTraining/multi-node:all_reduce_perf/MPIJob_succeeds (215.08s)
    --- PASS: TestMPIJobPytorchTraining/multi-node:all_gather_perf (45.11s)
        --- PASS: TestMPIJobPytorchTraining/multi-node:all_gather_perf/MPIJob_succeeds (45.08s)
    --- PASS: TestMPIJobPytorchTraining/multi-node:alltoall_perf (40.14s)
        --- PASS: TestMPIJobPytorchTraining/multi-node:alltoall_perf/MPIJob_succeeds (40.10s)
=== RUN   TestSingleNodeUnitTest
=== RUN   TestSingleNodeUnitTest/unit-test
    unit_test.go:139: Skipping feature "unit-test": name not matched
=== RUN   TestSingleNodeUnitTest/hpc-benckmarks
    unit_test.go:139: Skipping feature "hpc-benckmarks": name not matched
--- PASS: TestSingleNodeUnitTest (0.00s)
    --- SKIP: TestSingleNodeUnitTest/unit-test (0.00s)
    --- SKIP: TestSingleNodeUnitTest/hpc-benckmarks (0.00s)
PASS
ok      github.com/aws/aws-k8s-tester/test/cases/nvidia 317.053s
