
Commit 90e757a

Run nvidia-smi after modules are loaded in driver ds startup probe
This commit eliminates the race condition where the startup probe in the driver daemonset runs after the kernel modules are built (and installed) but before the modules are loaded into the kernel. In this case, the invocation of nvidia-smi (by the startup probe) is what is actually loading the nvidia kernel module and not the modprobe we perform in our driver container scripts.

As a result, the nvidia driver will be loaded with a default configuration -- none of the custom kernel module parameters provided by users (via a configmap) or set by our driver container will get applied.

Signed-off-by: Christopher Desiniotis <cdesiniotis@nvidia.com>
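The gating logic described in the commit message can be sketched as a small shell function. This is an illustrative sketch, not the operator's actual script: `driver_probe`, `SYSFS_ROOT`, `SMI_CMD`, and `READY_FILE` are hypothetical names introduced here so the check can be exercised on a machine without a GPU. It assumes, as the probe change does, that `/sys/module/nvidia/refcnt` is present only once the nvidia module is loaded.

```shell
#!/bin/sh
# Sketch of the gated startup check. SYSFS_ROOT, SMI_CMD and READY_FILE are
# hypothetical overrides for testing; the probe in this commit hard-codes
# /sys/module/nvidia/refcnt, nvidia-smi and the .driver-ctr-ready path.
driver_probe() {
    sysfs_root="${SYSFS_ROOT:-/sys}"
    smi_cmd="${SMI_CMD:-nvidia-smi}"
    ready_file="${READY_FILE:-/run/nvidia/validations/.driver-ctr-ready}"

    # Fail early if the module is not loaded yet, so that nvidia-smi is
    # never the thing that triggers the module load.
    if [ ! -f "${sysfs_root}/module/nvidia/refcnt" ]; then
        echo "nvidia module not loaded yet; skipping nvidia-smi"
        return 1
    fi

    # The module is already loaded (by the driver container's modprobe),
    # so querying the driver here cannot change how it was configured.
    "${smi_cmd}" >/dev/null && touch "${ready_file}"
}
```

Calling `driver_probe` with no overrides performs the same three steps, in the same order, as the one-line probe command in the diff.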
1 parent 526cc24 commit 90e757a

15 files changed: +15 -15 lines changed

assets/state-driver/0500_daemonset.yaml

Lines changed: 1 addition & 1 deletion
@@ -137,7 +137,7 @@ spec:
   startupProbe:
     exec:
       command:
-        [sh, -c, 'nvidia-smi && touch /run/nvidia/validations/.driver-ctr-ready']
+        [sh, -c, '[ -f /sys/module/nvidia/refcnt ] && nvidia-smi && touch /run/nvidia/validations/.driver-ctr-ready']
     initialDelaySeconds: 60
     failureThreshold: 120
     successThreshold: 1

internal/state/testdata/golden/driver-additional-configs.yaml

Lines changed: 1 addition & 1 deletion
@@ -168,7 +168,7 @@ spec:
       command:
       - sh
       - -c
-      - nvidia-smi && touch /run/nvidia/validations/.driver-ctr-ready
+      - '[ -f /sys/module/nvidia/refcnt ] && nvidia-smi && touch /run/nvidia/validations/.driver-ctr-ready'
     failureThreshold: 120
     initialDelaySeconds: 60
     periodSeconds: 10
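The probe fails on every attempt until the module is loaded, so the timing fields shown in this hunk bound how long the driver container has to finish. A rough worked calculation, assuming the usual Kubernetes reading that the probe waits `initialDelaySeconds` and then tolerates `failureThreshold` failed attempts spaced `periodSeconds` apart (an approximation that ignores per-probe execution time):

```shell
#!/bin/sh
# Values copied from the startupProbe fields in the golden file above.
initial_delay=60
failure_threshold=120
period=10

# Approximate time allowed before the kubelet treats startup as failed.
max_wait=$(( initial_delay + failure_threshold * period ))
echo "startup budget: ${max_wait}s"   # prints "startup budget: 1260s"
```

That is about 21 minutes for the modules to be built, installed, and loaded before the container is restarted.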

internal/state/testdata/golden/driver-full-spec.yaml

Lines changed: 1 addition & 1 deletion
@@ -182,7 +182,7 @@ spec:
       command:
       - sh
       - -c
-      - nvidia-smi && touch /run/nvidia/validations/.driver-ctr-ready
+      - '[ -f /sys/module/nvidia/refcnt ] && nvidia-smi && touch /run/nvidia/validations/.driver-ctr-ready'
     failureThreshold: 120
     initialDelaySeconds: 60
     periodSeconds: 10

internal/state/testdata/golden/driver-gdrcopy-openshift.yaml

Lines changed: 1 addition & 1 deletion
@@ -238,7 +238,7 @@ spec:
       command:
       - sh
       - -c
-      - nvidia-smi && touch /run/nvidia/validations/.driver-ctr-ready
+      - '[ -f /sys/module/nvidia/refcnt ] && nvidia-smi && touch /run/nvidia/validations/.driver-ctr-ready'
     failureThreshold: 120
     initialDelaySeconds: 60
     periodSeconds: 10

internal/state/testdata/golden/driver-gdrcopy.yaml

Lines changed: 1 addition & 1 deletion
@@ -168,7 +168,7 @@ spec:
       command:
       - sh
       - -c
-      - nvidia-smi && touch /run/nvidia/validations/.driver-ctr-ready
+      - '[ -f /sys/module/nvidia/refcnt ] && nvidia-smi && touch /run/nvidia/validations/.driver-ctr-ready'
     failureThreshold: 120
     initialDelaySeconds: 60
     periodSeconds: 10

internal/state/testdata/golden/driver-gds.yaml

Lines changed: 1 addition & 1 deletion
@@ -168,7 +168,7 @@ spec:
       command:
       - sh
       - -c
-      - nvidia-smi && touch /run/nvidia/validations/.driver-ctr-ready
+      - '[ -f /sys/module/nvidia/refcnt ] && nvidia-smi && touch /run/nvidia/validations/.driver-ctr-ready'
     failureThreshold: 120
     initialDelaySeconds: 60
     periodSeconds: 10

internal/state/testdata/golden/driver-minimal.yaml

Lines changed: 1 addition & 1 deletion
@@ -168,7 +168,7 @@ spec:
       command:
       - sh
       - -c
-      - nvidia-smi && touch /run/nvidia/validations/.driver-ctr-ready
+      - '[ -f /sys/module/nvidia/refcnt ] && nvidia-smi && touch /run/nvidia/validations/.driver-ctr-ready'
     failureThreshold: 120
     initialDelaySeconds: 60
     periodSeconds: 10

internal/state/testdata/golden/driver-openshift-drivertoolkit.yaml

Lines changed: 1 addition & 1 deletion
@@ -238,7 +238,7 @@ spec:
       command:
       - sh
       - -c
-      - nvidia-smi && touch /run/nvidia/validations/.driver-ctr-ready
+      - '[ -f /sys/module/nvidia/refcnt ] && nvidia-smi && touch /run/nvidia/validations/.driver-ctr-ready'
     failureThreshold: 120
     initialDelaySeconds: 60
     periodSeconds: 10

internal/state/testdata/golden/driver-precompiled.yaml

Lines changed: 1 addition & 1 deletion
@@ -170,7 +170,7 @@ spec:
       command:
       - sh
       - -c
-      - nvidia-smi && touch /run/nvidia/validations/.driver-ctr-ready
+      - '[ -f /sys/module/nvidia/refcnt ] && nvidia-smi && touch /run/nvidia/validations/.driver-ctr-ready'
     failureThreshold: 120
     initialDelaySeconds: 60
     periodSeconds: 10

internal/state/testdata/golden/driver-rdma-hostmofed.yaml

Lines changed: 1 addition & 1 deletion
@@ -172,7 +172,7 @@ spec:
       command:
       - sh
       - -c
-      - nvidia-smi && touch /run/nvidia/validations/.driver-ctr-ready
+      - '[ -f /sys/module/nvidia/refcnt ] && nvidia-smi && touch /run/nvidia/validations/.driver-ctr-ready'
     failureThreshold: 120
     initialDelaySeconds: 60
     periodSeconds: 10
