packages/kmod-6.12-nvidia-r570/.gitignore: 1 change (1 addition, 0 deletions)
@@ -1,3 +1,4 @@
NVidiaEULAforAWS.pdf
COPYING
*.rpm
NvidiaGridAWSUserLicenseAgreement.DOCX
packages/kmod-6.12-nvidia-r570/grid-license-check.service: 26 changes (26 additions, 0 deletions)
@@ -0,0 +1,26 @@
[Unit]
Description=GRID License Check
RefuseManualStart=true
RefuseManualStop=true
DefaultDependencies=no
Before=kubelet.service
After=nvidia-gridd.service
Requires=nvidia-gridd.service

[Service]
Type=oneshot
ExecCondition=/usr/bin/ghostdog match-nvidia-driver grid
# Otherwise, query the driver; its report, including the license status, is appended to the StandardOutput= file.
ExecStart=/usr/bin/nvidia-smi -q
# Ensure that the stdout file exists. Otherwise, grep fails on a missing file.
Review comment (Contributor):

It's the STDOUT file that you are creating, right?

Reply (Contributor Author):

Yes, that's a leftover: I switched to STDOUT but forgot to update the comment.

ExecStart=-/usr/bin/touch /tmp/.nvidia-gridd-license
# Succeed only if the captured output does not contain the word "Unlicensed".
ExecStart=/usr/bin/grep -Fqvzw Unlicensed /tmp/.nvidia-gridd-license
RemainAfterExit=true
StandardOutput=append:/tmp/.nvidia-gridd-license
Restart=on-failure
RestartSec=1
StartLimitBurst=120

[Install]
RequiredBy=nvidia-k8s-device-plugin.service
Review comment on lines +25 to +26 (Contributor):

Does this requirement cause the k8s device plugin to fail if the license check fails? I'm kind of worried about cluttering up the logs with a lot of failures.

This could possibly be modeled as:

  1. a timer unit that runs and creates a marker file when the license check passes
  2. a path unit that activates nvidia-k8s-device-plugin.service
  3. a fallback unit that also creates the marker when the driver is not the GRID driver
  4. a condition in the k8s device plugin that requires the marker to exist

Reply (Contributor Author):

This might be much cleaner in the logs; as it stands, the unit is very angry and noisy in the journal while it is failing. I'll try that as an alternative to this approach. FWIW, though, on nodes that do end up getting a license, I haven't yet seen this check still failing by the time the next unit runs, so the only time it is noisy may be when the node is already in a bad state. Nonetheless, I think making it cleaner is worth it.
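The reviewer's proposal can be sketched as systemd units. This is a minimal, untested sketch under several assumptions: the unit and drop-in file names, the marker path `/run/grid-license-ok`, the retry intervals, and the install target are all hypothetical, and it assumes the `RequiredBy=nvidia-k8s-device-plugin.service` link in `[Install]` is removed so that the path unit, rather than a hard dependency, starts the device plugin. Step 3 (the fallback for non-GRID drivers) is left out because it needs the inverse of `ghostdog match-nvidia-driver grid`, and this PR does not show how to express that; without it, the condition at the end would keep the device plugin from starting on non-GRID nodes.

```ini
# grid-license-check.timer (hypothetical): retry the check instead of Restart=on-failure.
[Unit]
Description=Retry the GRID license check

[Timer]
OnBootSec=5s
OnUnitActiveSec=10s
# The default AccuracySec= is 1min, which would stretch a 10s retry interval.
AccuracySec=1s
Unit=grid-license-check.service

[Install]
WantedBy=timers.target

# grid-license-check.service.d/10-marker.conf (hypothetical drop-in).
[Service]
# Let the timer drive retries rather than restarting every second.
Restart=no
# Appended after the unit's existing ExecStart= lines, so it only runs once grep has succeeded.
ExecStart=/usr/bin/touch /run/grid-license-ok

# nvidia-k8s-device-plugin.path (hypothetical): start the plugin once the marker appears.
[Path]
PathExists=/run/grid-license-ok
Unit=nvidia-k8s-device-plugin.service

[Install]
# The target is an assumption; use whichever target normally pulls in the device plugin.
WantedBy=multi-user.target

# nvidia-k8s-device-plugin.service.d/10-license.conf (hypothetical drop-in).
[Unit]
# Skip (rather than fail) the plugin until the license check has passed.
ConditionPathExists=/run/grid-license-ok
```

Because the check is `Type=oneshot`, its `ExecStart=` commands run in order and the unit stops at the first unprefixed command that fails, so the marker is written only after a successful check. Failed attempts are still logged, but at the timer's interval instead of once per second, and the device plugin is skipped rather than failed while it waits.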

packages/kmod-6.12-nvidia-r570/kmod-6.12-nvidia-r570.spec: 4 changes (3 additions, 1 deletion)
@@ -56,6 +56,7 @@ Source206: nvidia-persistenced.service
Source207: fabricmanager.env
Source208: gridd.conf
Source209: nvidia-gridd.service
Source210: grid-license-check.service

# NVIDIA tesla conf files from 300 to 399
Source300: nvidia-tesla-tmpfiles.conf
@@ -410,7 +411,7 @@ install kernel-open/nvidia-drm.ko %{buildroot}%{_cross_datadir}/nvidia/grid/driv
# Install nvidia-gridd and related files
install -m 755 nvidia-gridd %{buildroot}%{_cross_bindir}/nvidia-gridd
install -m 644 %{S:208} %{buildroot}%{_cross_factorydir}%{_cross_sysconfdir}/nvidia/gridd.conf
install -p -m 0644 %{S:209} %{buildroot}%{_cross_unitdir}
install -p -m 0644 %{S:209} %{S:210} %{buildroot}%{_cross_unitdir}
popd
# End GRID driver
%endif
@@ -748,6 +749,7 @@ popd
%{_cross_bindir}/nvidia-gridd
%{_cross_factorydir}%{_cross_sysconfdir}/nvidia/gridd.conf
%{_cross_unitdir}/nvidia-gridd.service
%{_cross_unitdir}/grid-license-check.service

%{_cross_datadir}/nvidia/grid/drivers/nvidia.ko
%{_cross_datadir}/nvidia/grid/drivers/nvidia-uvm.ko