1 change: 1 addition & 0 deletions packages/kmod-6.1-nvidia-r570/.gitignore
@@ -1,3 +1,4 @@
NVidiaEULAforAWS.pdf
COPYING
*.rpm
NvidiaGridAWSUserLicenseAgreement.DOCX
26 changes: 26 additions & 0 deletions packages/kmod-6.1-nvidia-r570/grid-license-check.service
@@ -0,0 +1,26 @@
[Unit]
Description=GRID License Check
RefuseManualStart=true
RefuseManualStop=true
DefaultDependencies=no
Before=kubelet.service
After=nvidia-gridd.service
Requires=nvidia-gridd.service

[Service]
Type=oneshot
ExecCondition=/usr/bin/ghostdog match-nvidia-driver grid
# Otherwise, attempt to load the module.
Contributor

I'm confused: will this line actually load the module, or are you just generating output that will be grepped later?

Contributor Author

This comment might make more sense above the ExecCondition line.

Suggested change:
- # Otherwise, attempt to load the module.
+ # Confirm GRID is required, then attempt to check the license

ExecStart=/usr/bin/nvidia-smi -q
# Ensure that the stderr file exists. Otherwise, grep fails on an empty file.
ExecStart=-/usr/bin/touch /tmp/.nvidia-gridd-license
Comment on lines +15 to +16
Contributor

truncate may be better than touch, since it would reset the file between iterations.

(This line would have to move before the nvidia-smi call.)

Contributor Author

I like that. I'll give it a shot and see whether I still get the behavior I want with things rearranged to use truncate.
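
To illustrate why resetting the file matters: StandardOutput=append: keeps output from earlier attempts, and grep -z reads the whole file as a single record (as long as it contains no NUL bytes), so one stale Unlicensed line would keep the check failing on every retry. The following is a minimal shell sketch of that behavior, not the unit itself. The printf lines are stand-ins for nvidia-smi -q output (their exact wording is an assumption), and the temporary file stands in for /tmp/.nvidia-gridd-license.

```shell
#!/bin/sh
# Temporary stand-in for /tmp/.nvidia-gridd-license.
log=$(mktemp)

# Same grep invocation as the unit: exit 0 only if the file, read as one
# NUL-delimited record, does not contain the word "Unlicensed".
check() { grep -Fqvzw Unlicensed "$log"; }

# Attempt 1: the license has not been acquired yet (simulated output).
printf 'License Status : Unlicensed\n' >> "$log"
check; first=$?

# Attempt 2, append only: the stale line from attempt 1 is still present.
printf 'License Status : Licensed\n' >> "$log"
check; appended=$?

# Attempt 2, with the file reset first, as the review suggests.
truncate -s 0 "$log"
printf 'License Status : Licensed\n' >> "$log"
check; truncated=$?

echo "first=$first appended=$appended truncated=$truncated"
rm -f "$log"
```

With GNU grep and coreutils this should print first=1 appended=1 truncated=0: only the run that resets the file before writing fresh output lets the check pass once the license is in place.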

# Succeed unless there was a fatal error.
ExecStart=/usr/bin/grep -Fqvzw Unlicensed /tmp/.nvidia-gridd-license
RemainAfterExit=true
StandardOutput=append:/tmp/.nvidia-gridd-license
Restart=on-failure
RestartSec=1
StartLimitBurst=120

[Install]
RequiredBy=kubelet.service
Comment on lines +25 to +26
Contributor

Below, it's required by nvidia-k8s-device-plugin.service, which seems more correct.

Contributor Author

Sorry, this was originally nvidia-k8s-device-plugin.service and I missed the refactor. In my mind, we have several options for the experience we want:

  • Don't prevent anything, but log it. This is roughly the behavior we have today, except that today nothing is logged (not ideal).
  • Prevent kubelet from starting if the license isn't valid. The node never becomes ready and cannot accept any orchestrated work. (This is what I intended to enforce in the PR.)
  • Prevent the device plugin from running. The node becomes ready but does not advertise GPU resources, so it could take some work, just not GPU work. (This might confuse users, so I leaned away from it, but it was my original approach.)
  • Attempt to block boot entirely so the node doesn't even reach multi-user. (This seems harsh, but would be fine too.)

4 changes: 3 additions & 1 deletion packages/kmod-6.1-nvidia-r570/kmod-6.1-nvidia-r570.spec
@@ -49,6 +49,7 @@ Source206: nvidia-persistenced.service
Source207: fabricmanager.env
Source208: gridd.conf
Source209: nvidia-gridd.service
Source210: grid-license-check.service

# NVIDIA tesla conf files from 300 to 399
Source300: nvidia-tesla-tmpfiles.conf
@@ -394,7 +395,7 @@ install kernel-open/nvidia-drm.ko %{buildroot}%{_cross_datadir}/nvidia/grid/driv
# Install nvidia-gridd and related files
install -m 755 nvidia-gridd %{buildroot}%{_cross_bindir}/nvidia-gridd
install -m 644 %{S:208} %{buildroot}%{_cross_factorydir}%{_cross_sysconfdir}/nvidia/gridd.conf
install -p -m 0644 %{S:209} %{buildroot}%{_cross_unitdir}
install -p -m 0644 %{S:209} %{S:210} %{buildroot}%{_cross_unitdir}
popd
# End GRID driver
%endif
@@ -722,6 +723,7 @@ popd
%{_cross_bindir}/nvidia-gridd
%{_cross_factorydir}%{_cross_sysconfdir}/nvidia/gridd.conf
%{_cross_unitdir}/nvidia-gridd.service
%{_cross_unitdir}/grid-license-check.service

%{_cross_datadir}/nvidia/grid/drivers/nvidia.ko
%{_cross_datadir}/nvidia/grid/drivers/nvidia-uvm.ko
1 change: 1 addition & 0 deletions packages/kmod-6.12-nvidia-r570/.gitignore
@@ -1,3 +1,4 @@
NVidiaEULAforAWS.pdf
COPYING
*.rpm
NvidiaGridAWSUserLicenseAgreement.DOCX
26 changes: 26 additions & 0 deletions packages/kmod-6.12-nvidia-r570/grid-license-check.service
@@ -0,0 +1,26 @@
[Unit]
Description=GRID License Check
RefuseManualStart=true
RefuseManualStop=true
DefaultDependencies=no
Before=kubelet.service
After=nvidia-gridd.service
Requires=nvidia-gridd.service

[Service]
Type=oneshot
ExecCondition=/usr/bin/ghostdog match-nvidia-driver grid
# Otherwise, attempt to load the module.
ExecStart=/usr/bin/nvidia-smi -q
# Ensure that the stderr file exists. Otherwise, grep fails on an empty file.
Contributor

It's the STDOUT file that you are creating, right?

Contributor Author

Yes, this is a forgotten update: I moved to STDOUT but didn't update the comment.

ExecStart=-/usr/bin/touch /tmp/.nvidia-gridd-license
# Succeed unless there was a fatal error.
ExecStart=/usr/bin/grep -Fqvzw Unlicensed /tmp/.nvidia-gridd-license
RemainAfterExit=true
StandardOutput=append:/tmp/.nvidia-gridd-license
Restart=on-failure
RestartSec=1
StartLimitBurst=120

[Install]
RequiredBy=nvidia-k8s-device-plugin.service
Comment on lines +25 to +26
Contributor

Does this requirement cause the k8s device plugin to fail if the license check fails? I'm kind of worried about cluttering up the logs with a lot of failures.

This could possibly be modeled as:

  1. a timer unit that runs and creates a marker file when the license check passes
  2. a path unit that activates nvidia-k8s-device-plugin.service
  3. a fallback unit that runs when we don't match the grid driver that also creates the marker
  4. a condition in the k8s device plugin that requires the marker to exist

Contributor Author

This might be much cleaner in the logs; otherwise, the unit is very angry and noisy in the journal when it's failing. I'll play with that as a potential alternative. FWIW, though, in cases where we do end up getting a license, I haven't yet seen this fail before the next unit runs, so it might be that the only time it's noisy is when the node is already in a bad state. Nonetheless, I think making it cleaner is worth it.
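
For reasoning about the marker-file idea before writing any units, the gating logic can be simulated in plain shell. This is only a sketch under stated assumptions: the marker path, the DRIVER and LICENSE variables, and the helper function names are all hypothetical, and the path unit from step 2 is not modeled. In real units, the gate in step 4 would presumably be a ConditionPathExists= line on the device plugin.

```shell
#!/bin/sh
# Hypothetical marker location; the thread does not name a path.
dir=$(mktemp -d)
marker="$dir/grid-license-ok"

# Stand-ins for the real checks (assumptions for illustration only):
#   driver_is_grid ~ `ghostdog match-nvidia-driver grid`
#   license_ok     ~ nvidia-smi output no longer reporting Unlicensed
driver_is_grid() { [ "$DRIVER" = "grid" ]; }
license_ok()     { [ "$LICENSE" = "Licensed" ]; }

# Steps 1 and 3: each timer tick creates the marker once the license check
# passes, or right away when the GRID driver is not in use.
tick() {
    if ! driver_is_grid || license_ok; then
        : > "$marker"
    fi
}

# Step 4: the device plugin may start only if the marker exists.
plugin_may_start() { [ -e "$marker" ]; }

# GRID driver, license still pending: the plugin stays gated, with no failure.
DRIVER=grid LICENSE=Unlicensed
tick; plugin_may_start; before=$?

# License acquired on a later tick: the marker appears.
LICENSE=Licensed
tick; plugin_may_start; after=$?

# Non-GRID driver: the fallback creates the marker without any license.
rm -f "$marker"
DRIVER=tesla LICENSE=Unlicensed
tick; plugin_may_start; fallback=$?

echo "before=$before after=$after fallback=$fallback"
rm -rf "$dir"
```

This should print before=1 after=0 fallback=0. The point of the design is that a pending license leaves the plugin waiting on a missing marker rather than producing repeated unit failures in the journal.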

4 changes: 3 additions & 1 deletion packages/kmod-6.12-nvidia-r570/kmod-6.12-nvidia-r570.spec
@@ -56,6 +56,7 @@ Source206: nvidia-persistenced.service
Source207: fabricmanager.env
Source208: gridd.conf
Source209: nvidia-gridd.service
Source210: grid-license-check.service

# NVIDIA tesla conf files from 300 to 399
Source300: nvidia-tesla-tmpfiles.conf
@@ -410,7 +411,7 @@ install kernel-open/nvidia-drm.ko %{buildroot}%{_cross_datadir}/nvidia/grid/driv
# Install nvidia-gridd and related files
install -m 755 nvidia-gridd %{buildroot}%{_cross_bindir}/nvidia-gridd
install -m 644 %{S:208} %{buildroot}%{_cross_factorydir}%{_cross_sysconfdir}/nvidia/gridd.conf
install -p -m 0644 %{S:209} %{buildroot}%{_cross_unitdir}
install -p -m 0644 %{S:209} %{S:210} %{buildroot}%{_cross_unitdir}
popd
# End GRID driver
%endif
@@ -748,6 +749,7 @@ popd
%{_cross_bindir}/nvidia-gridd
%{_cross_factorydir}%{_cross_sysconfdir}/nvidia/gridd.conf
%{_cross_unitdir}/nvidia-gridd.service
%{_cross_unitdir}/grid-license-check.service

%{_cross_datadir}/nvidia/grid/drivers/nvidia.ko
%{_cross_datadir}/nvidia/grid/drivers/nvidia-uvm.ko
1 change: 1 addition & 0 deletions packages/kmod-6.12-nvidia-r580/.gitignore
@@ -1,3 +1,4 @@
NVidiaEULAforAWS.pdf
COPYING
*.rpm
NvidiaGridAWSUserLicenseAgreement.DOCX
26 changes: 26 additions & 0 deletions packages/kmod-6.12-nvidia-r580/grid-license-check.service
@@ -0,0 +1,26 @@
[Unit]
Description=GRID License Check
RefuseManualStart=true
RefuseManualStop=true
DefaultDependencies=no
Before=kubelet.service
After=nvidia-gridd.service
Requires=nvidia-gridd.service

[Service]
Type=oneshot
ExecCondition=/usr/bin/ghostdog match-nvidia-driver grid
# Otherwise, attempt to load the module.
ExecStart=/usr/bin/nvidia-smi -q
# Ensure that the stderr file exists. Otherwise, grep fails on an empty file.
ExecStart=-/usr/bin/touch /tmp/.nvidia-gridd-license
# Succeed unless there was a fatal error.
ExecStart=/usr/bin/grep -Fqvzw Unlicensed /tmp/.nvidia-gridd-license
RemainAfterExit=true
StandardOutput=append:/tmp/.nvidia-gridd-license
Restart=on-failure
RestartSec=1
StartLimitBurst=120

[Install]
RequiredBy=kubelet.service
4 changes: 3 additions & 1 deletion packages/kmod-6.12-nvidia-r580/kmod-6.12-nvidia-r580.spec
@@ -56,6 +56,7 @@ Source206: nvidia-persistenced.service
Source207: fabricmanager.env
Source208: gridd.conf
Source209: nvidia-gridd.service
Source210: grid-license-check.service

# NVIDIA tesla conf files from 300 to 399
Source300: nvidia-tesla-tmpfiles.conf
@@ -410,7 +411,7 @@ install kernel-open/nvidia-drm.ko %{buildroot}%{_cross_datadir}/nvidia/grid/driv
# Install nvidia-gridd and related files
install -m 755 nvidia-gridd %{buildroot}%{_cross_bindir}/nvidia-gridd
install -m 644 %{S:208} %{buildroot}%{_cross_factorydir}%{_cross_sysconfdir}/nvidia/gridd.conf
install -p -m 0644 %{S:209} %{buildroot}%{_cross_unitdir}
install -p -m 0644 %{S:209} %{S:210} %{buildroot}%{_cross_unitdir}
popd
# End GRID driver
%endif
@@ -754,6 +755,7 @@ popd
%{_cross_bindir}/nvidia-gridd
%{_cross_factorydir}%{_cross_sysconfdir}/nvidia/gridd.conf
%{_cross_unitdir}/nvidia-gridd.service
%{_cross_unitdir}/grid-license-check.service

%{_cross_datadir}/nvidia/grid/drivers/nvidia.ko
%{_cross_datadir}/nvidia/grid/drivers/nvidia-uvm.ko