
fix(neuron): install aws-neuronx-dkms-2.21 at boot on inf1 #2579

Merged
mselim00 merged 1 commit into awslabs:main from mselim00:neuron-install
Jan 7, 2026


@mselim00 mselim00 commented Jan 6, 2026

Issue #, if available:

N/A but related PR: #2486

Description of changes:

This adds a new service that runs on boot and forcefully downgrades aws-neuronx-dkms to a cached 2.21.x version if the instance is detected to be an inf1 instance type with a different version installed.
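The decision described above can be sketched as a small shell function. This is only an illustration, not the script added by this PR: the function name `needs_neuron_downgrade` and the example version strings are hypothetical, and how the real service obtains the instance type and installed version is not shown here.

```shell
# Hypothetical helper (not the PR's actual script).
# $1 = instance type, e.g. "inf1.xlarge"
# $2 = installed aws-neuronx-dkms version (illustrative format)
# Returns 0 when a downgrade to the cached 2.21.x package would be needed.
needs_neuron_downgrade() {
  case "$1" in
    inf1.*) ;;           # only inf1 instance types are affected
    *) return 1 ;;       # any other instance type: leave the package alone
  esac
  case "$2" in
    2.21.*) return 1 ;;  # already on the 2.21.x series, nothing to do
    *) return 0 ;;       # any other version on inf1: downgrade
  esac
}

# Example usage with made-up version strings:
#   needs_neuron_downgrade inf1.xlarge 2.25.4.0 && echo "downgrade needed"
```

Returning early for non-inf1 types matches caveat 1 below: a non-inf1 instance launched from an inf1 snapshot would keep whatever version is already installed.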

This should be functionally equivalent to the current AMIs with two caveats:

  1. Non-inf1 instances launched from a snapshot of an inf1 instance using this AMI will also use the downgraded driver version. This could be made equivalent by also caching the latest driver version and always ensuring that it is the one installed on those instances, but that introduces potentially unnecessary complexity, as the need for this use case is unclear.
  2. inf1 instances will now have the v2.21.x package installed, which is a requirement detailed in https://awsdocs-neuron.readthedocs-hosted.com/en/latest/about-neuron/announcements/neuron2.x/announce-eos-neuron-driver-support-inf1.html. Startup will take a bit longer on these instance types because of the additional time required to remove the newer loaded module and then build and load the v2.21.x version.

Hypothetical users who rely on a snapshot and are impacted by 1) can restore the previous behavior by updating their snapshotting process to include

sudo dnf upgrade -y aws-neuronx-dkms
sudo systemctl disable neuron-package-install

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

Testing Done

Launched inf1.xlarge and inf2.xlarge to confirm that dkms status and modinfo neuron show the v2.21.x and latest driver versions respectively. Also checked the time it takes for neuron-package-install to complete: approximately 45s across four launches on inf1, and less than 1 second on inf2.
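A check like the one described could be scripted roughly as follows. This is a sketch under assumptions: `driver_in_series` is a hypothetical helper, and the example relies on `modinfo -F version neuron` printing the loaded module's version field, which should be confirmed on the target AMI.

```shell
# Hypothetical helper: succeed when a version string belongs to a given series.
# $1 = version string, e.g. the output of `modinfo -F version neuron`
# $2 = expected series prefix, e.g. "2.21"
driver_in_series() {
  case "$1" in
    "$2".*) return 0 ;;  # prefix followed by a dot, e.g. 2.21.x
    *) return 1 ;;
  esac
}

# Example usage on an inf1 instance (assumes the neuron module is loaded):
#   driver_in_series "$(modinfo -F version neuron)" 2.21 && echo "2.21.x driver loaded"
```

Matching on the prefix plus a trailing dot avoids treating a hypothetical 2.210.x release as part of the 2.21 series.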

Additionally launched an inf1.xlarge in a private subnet with no egress route to the public internet as well as a node in a public subnet with security groups allowing no egress to confirm that the script completes within a similar timeframe. This is to ensure that all network calls are bypassed, which is just the update-pciids override in this case.

See this guide for recommended testing for PRs. Some tests may not apply. Completing tests and providing additional validation steps are not required, but it is recommended and may reduce review time and time to merge.

@mselim00 mselim00 force-pushed the neuron-install branch 3 times, most recently from 6fa7f3b to adac3db Compare January 6, 2026 20:57
Comment on lines +3 to +5
# Run before cloud-init so packages are installed
# before user data that may query the installed information
Before=cloud-init.service

We need some ordering relative to user data script execution to ensure deterministic behavior. Each option has its trade-offs:

  • Before user data: node joining and SSM agent registration are delayed by the time it takes to execute this service, which is ~45s in my testing. The advantage is that customers can query the installed package version and act on it.
  • After user data: node joining is not delayed, but the neuron device plugin may be scheduled before the driver is loaded, sending it into a crashloop. There's no clean way to establish an ordering there, and because of the exponential backoff that time could add up to quite a bit. Customer user data queries of the package version would also always return latest, regardless of what this service would load.
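To give a rough sense of how the backoff in the "after user data" case could add up, here is a small sketch. The function `cumulative_backoff` is hypothetical, and the defaults (10s initial delay, doubling on each restart, capped at 300s) are assumptions about the kubelet's crashloop backoff that should be checked against the cluster's actual version and configuration.

```shell
# Hypothetical helper: total seconds spent waiting in backoff after N restarts.
# $1 = number of restarts
# $2 = initial delay in seconds (assumed default: 10)
# $3 = maximum delay in seconds (assumed default: 300)
# The body runs in a subshell so its variables do not leak into the caller.
cumulative_backoff() (
  restarts="$1"
  delay="${2:-10}"
  cap="${3:-300}"
  total=0
  i=0
  while [ "$i" -lt "$restarts" ]; do
    total=$((total + delay))   # wait out the current delay
    delay=$((delay * 2))       # double it for the next restart
    if [ "$delay" -gt "$cap" ]; then
      delay="$cap"             # never exceed the cap
    fi
    i=$((i + 1))
  done
  echo "$total"
)

# Example usage:
#   cumulative_backoff 3   # prints 70 (10 + 20 + 40)
```

Under these assumed defaults, a plugin that needs three restarts before the driver is ready would have waited longer than the ~45s delay of running the service before user data.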

@mselim00 mselim00 marked this pull request as ready for review January 6, 2026 21:11
@mselim00 mselim00 merged commit 9663c4f into awslabs:main Jan 7, 2026
12 checks passed
@mselim00 mselim00 deleted the neuron-install branch January 7, 2026 18:41
2 participants