fix(neuron): install aws-neuronx-dkms-2.21 at boot on inf1 #2579
Merged
mselim00 merged 1 commit into awslabs:main on Jan 7, 2026
6fa7f3b to adac3db
mselim00 commented on Jan 6, 2026

Comment on lines +3 to +5

    # Run before cloud-init so packages are installed
    # before user data that may query the installed information
    Before=cloud-init.service

Contributor
Author
we need some ordering relative to user data script execution to ensure deterministic behavior, and each option has its trade-offs.
- before user data: node joining and SSM agent registration are delayed by the time it takes to execute this service, which is ~45s from my testing. the advantage is that customers can query the installed package version and act on it
- after user data: node joining is not delayed, but the neuron device plugin may be scheduled before the driver is loaded, sending it into a crashloop. there's no clean way to establish an ordering there, and because of the exponential backoff that time could add up to quite a bit. customer user data queries of the package version would also always return latest, regardless of what this service would load
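
To give a rough sense of how the crashloop delay in the second option could add up, here is a small sketch that sums restart delays. It assumes the commonly documented kubelet behaviour of a 10s initial back-off that doubles on each restart up to a 300s cap; those numbers and the function name `total_backoff` are assumptions for illustration, not measurements from this PR.

```shell
# Sum the back-off delays a crashlooping container waits through.
# ASSUMPTION: 10s initial delay, doubling per restart, capped at 300s
# (typical kubelet defaults; not taken from this PR).
total_backoff() (
  restarts=$1
  delay=10
  cap=300
  total=0
  i=0
  while [ "$i" -lt "$restarts" ]; do
    total=$((total + delay))
    delay=$((delay * 2))
    if [ "$delay" -gt "$cap" ]; then
      delay=$cap
    fi
    i=$((i + 1))
  done
  echo "$total"
)

# Cumulative wait in seconds after five restarts: 10+20+40+80+160
total_backoff 5   # prints 310
```

Under these assumptions, a plugin that keeps crashing for a handful of restarts can wait several minutes in total, which is considerably more than the ~45s cost of running the service before user data.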
adac3db to 9fe061a
9fe061a to 2b9e76c
fletcherw approved these changes on Jan 7, 2026
Issue #, if available:
N/A but related PR: #2486
Description of changes:
This adds a new service to run on boot to forcefully downgrade `aws-neuronx-dkms` to a cached `2.21.x` version if it's detected to be running on an `inf1` instance type with a different version installed.

This should be functionally equivalent to the current AMIs with two caveats:

1. `inf1` instance using this AMI will also use the downgraded driver version. This can be made equivalent by also caching the latest driver version and always ensuring that is the one installed on those instances, but this introduces potentially unnecessary complexity as the need for this use case is unclear.
2. `inf1` instances will now have the `v2.21.x` package installed, which is a requirement detailed in https://awsdocs-neuron.readthedocs-hosted.com/en/latest/about-neuron/announcements/neuron2.x/announce-eos-neuron-driver-support-inf1.html. Startup will take a bit longer on these instance types because of the additional time required to remove the newer loaded module and then build and load the `v2.21.x` version.

For the hypothetical users relying on a snapshot and impacted by 1), they can restore parity to before by updating their snapshotting process to include
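
As an illustration of the check described in the first paragraph above, a minimal shell sketch of the downgrade decision might look like the following. The function name `needs_downgrade` and the sample version strings are hypothetical and not taken from the PR; the real service's detection logic may differ.

```shell
# Decide whether the cached 2.21.x driver should replace the installed one.
# ASSUMPTION: the caller supplies the instance type and the installed
# aws-neuronx-dkms version as plain strings; how the real service obtains
# them is not shown in this PR page.
needs_downgrade() {
  instance_type=$1
  installed_version=$2

  # Only inf1 instance types are affected.
  case "$instance_type" in
    inf1.*) ;;
    *) return 1 ;;
  esac

  # Nothing to do if a 2.21.x version is already installed.
  case "$installed_version" in
    2.21.*) return 1 ;;
    *) return 0 ;;
  esac
}

# Illustrative call with a made-up version string.
if needs_downgrade "inf1.xlarge" "2.99.0"; then
  echo "downgrade to cached 2.21.x"
else
  echo "keep installed driver"
fi
```

Keeping the decision in a pure function like this makes it easy to exercise without an actual `inf1` instance; the package removal and DKMS rebuild would sit behind it.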
By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.
Testing Done
Launched `inf1.xlarge` and `inf2.xlarge` to confirm that `dkms status` and `modinfo neuron` show the `v2.21.x` and latest driver versions respectively. Also checked the time it takes for `neuron-package-install` to complete, and it was approximately 45s across four launches on `inf1`, and less than 1 second on `inf2`.

Additionally launched an `inf1.xlarge` in a private subnet with no egress route to the public internet as well as a node in a public subnet with security groups allowing no egress to confirm that the script completes within a similar timeframe. This is to ensure that all network calls are bypassed, which is just the `update-pciids` override in this case.

See this guide for recommended testing for PRs. Some tests may not apply. Completing tests and providing additional validation steps are not required, but it is recommended and may reduce review time and time to merge.