-
Notifications
You must be signed in to change notification settings - Fork 37
Add dependency on the GRID license for kubelet #294
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: develop
Are you sure you want to change the base?
Conversation
Add a unit that checks for the license to be valid for GRID. Kubelet requires this unit so if the license is not present, then the node never joins the cluster. This prevents a situation where a node could fail to get a license, join the cluster, and then later have workloads start to fail due to the unlicensed status. Signed-off-by: Matthew Yeazel <[email protected]>
Add a unit that checks for the license to be valid for GRID. Kubelet requires this unit so if the license is not present, then the node never joins the cluster. This prevents a situation where a node could fail to get a license, join the cluster, and then later have workloads start to fail due to the unlicensed status. Signed-off-by: Matthew Yeazel <[email protected]>
Add a unit that checks for the license to be valid for GRID. Kubelet requires this unit so if the license is not present, then the node never joins the cluster. This prevents a situation where a node could fail to get a license, join the cluster, and then later have workloads start to fail due to the unlicensed status. Signed-off-by: Matthew Yeazel <[email protected]>
| [Service] | ||
| Type=oneshot | ||
| ExecCondition=/usr/bin/ghostdog match-nvidia-driver grid | ||
| # Otherwise, attempt to load the module. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm confused, will this line actually load the module or are you just generating output that will be greped later?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This comment might make more sense above the ExecCondition
| # Otherwise, attempt to load the module. | |
| # Confirm GRID is required, then attempt to check the license |
| ExecCondition=/usr/bin/ghostdog match-nvidia-driver grid | ||
| # Otherwise, attempt to load the module. | ||
| ExecStart=/usr/bin/nvidia-smi -q | ||
| # Ensure that the stderr file exists. Otherwise, grep fails on an empty file. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's the STDOUT file what you are creating, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, this is a forgotten update, I moved to STDOUT but forgot to update the comment.
| [Install] | ||
| RequiredBy=kubelet.service |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Below it's required by nvidia-k8s-device-plugin.service which seems more correct.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry, this was originally nvidia-k8s-device-plugin.service and I missed the refactor. In my mind we have several options for what experience we want:
- Don't prevent anything but log it. Sort of the behavior we have today just without logs (not ideal)
- Prevent kubelet from starting if the license isn't valid. This means the node never becomes ready and cannot accept any orchestrated work. (This is what I was intending to enforce in the PR)
- Prevent the device plugin from running, the node will become ready but not advertise GPU resources. It could take some work but not GPU work. (This might be confusing to users so I leaned away from this but was my original approach)
- Attempt to block boot entirely so the node doesn't even reach multi-user. (Seems harsh but would be fine too).
| # Ensure that the stderr file exists. Otherwise, grep fails on an empty file. | ||
| ExecStart=-/usr/bin/touch /tmp/.nvidia-gridd-license |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
truncate may be better than touch to reset the file between iterations
(would have to move this before the nvidia-smi call)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like that, I'll give it a shot to see if it still gives me the behavior I want with things rearranged to use truncate
| [Install] | ||
| RequiredBy=nvidia-k8s-device-plugin.service |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does this requirement cause the k8s device plugin to fail if the license check fails? I'm kind of worried about cluttering up the logs with a lot of failures.
This could possibly be modeled as:
- a timer unit that runs and creates a marker file when the license check passes
- a path unit that activates
nvidia-k8s-device-plugin.service - a fallback unit that runs when we don't match the grid driver that also creates the marker
- a condition in the k8s device plugin that requires the marker to exist
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This might be much cleaner in the logs, otherwise the unit is very angry and noisy in the journal when its failing. I'll play with that as a potential alternative to this. FWIW though I haven't seen this fail yet before the next unit runs when we are going to get a license, so it might be a situation where the only time its noisy, is when the node is already in a bad state. Nonetheless, I think making it cleaner is worth it.
Description of changes:
The GRID driver requires a license to fully function. This currently is fetched as a best effort but a node can come up without a "Licensed" driver and fail later after some amount of time. This change adds a
RequiredByto kubelet for this license so the node never becomes ready. This prevents a node from accepting work before the driver is fully configured.Testing done:
Built this change and confirmed that it does not run on a G6 (which uses open-gpu, not GRID so this isn't required), fails to have the node become ready when
nvidia-griddis misconfigured to not start. And the nodes become ready as expected whennvidia-griddstarts running normally.Normal g6f.xlarge
Broken nvidia-gridd results in the node not becoming ready:
Terms of contribution:
By submitting this pull request, I agree that this contribution is dual-licensed under the terms of both the Apache License, version 2.0, and the MIT license.