[GPU Driver Container Resiliency] Implement userspace-only fast path when kernel modules already loaded with matching config #534
base: main
7bf634d to f959234
f959234 to a9bde48
rhel9/nvidia-driver
Outdated
    }

    _userspace_only_install() {
        echo "Detected matching loaded driver & config (${DRIVER_VERSION}); performing userspace-only install"
nit:
    - echo "Detected matching loaded driver & config (${DRIVER_VERSION}); performing userspace-only install"
    + echo "The NVIDIA driver is already loaded with the desired configuration, performing userspace-only install"
Updated.
rhel9/ocp_dtk_entrypoint
Outdated
    # Check if fast path is being used - if so, skip building and signal completion
    if _should_use_fast_path; then
        echo "Fast path detected in DTK container: driver already loaded with matching config, skipping build"
        echo "Signaling driver_built and sleeping forever..."
    - echo "Signaling driver_built and sleeping forever..."
    + echo "Signaling driver_built to the main container and sleeping forever..."
Updated.
ubuntu22.04/nvidia-driver
Outdated
    if [ -f /sys/module/nvidia/refcnt ] && [ -f /run/nvidia/nvidia-driver.state ]; then
        current_digest=$(_get_current_digest)
        stored_digest=$(_read_stored_digest)

        if [ -n "${current_digest}" ] && [ "${current_digest}" = "${stored_digest}" ]; then
Similar to the changes you made in rhel9, can we create a helper function that checks whether the fast-path install should be taken? That will simplify the call-site here.
Updated the call-site; it should look a bit cleaner now.
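For readers following the thread, the helper being requested could look roughly like the sketch below. It is not the code from this PR. `_get_current_digest` and `_read_stored_digest` are named in the diff, but their bodies are not shown, so the versions here are assumptions: the current digest is taken from the `DRIVER_CONFIG_DIGEST` environment variable, and the state file is assumed to contain nothing but the stored digest. `NVIDIA_REFCNT_FILE` and `DRIVER_STATE_FILE` are hypothetical variables introduced only so the paths can be overridden.

```shell
#!/bin/sh
# Hypothetical overridable paths; the defaults are the paths shown in the diff.
NVIDIA_REFCNT_FILE="${NVIDIA_REFCNT_FILE:-/sys/module/nvidia/refcnt}"
DRIVER_STATE_FILE="${DRIVER_STATE_FILE:-/run/nvidia/nvidia-driver.state}"

# Assumption: the desired configuration digest is provided via an env var.
_get_current_digest() {
    printf '%s' "${DRIVER_CONFIG_DIGEST:-}"
}

# Assumption: the state file holds only the digest of the last install.
_read_stored_digest() {
    if [ -f "${DRIVER_STATE_FILE}" ]; then
        cat "${DRIVER_STATE_FILE}"
    fi
}

# Succeeds (exit status 0) only when the nvidia module is loaded, a state
# file exists, and the non-empty current digest equals the stored one.
_should_use_fast_path() {
    [ -f "${NVIDIA_REFCNT_FILE}" ] || return 1
    [ -f "${DRIVER_STATE_FILE}" ] || return 1
    current_digest=$(_get_current_digest)
    stored_digest=$(_read_stored_digest)
    [ -n "${current_digest}" ] && [ "${current_digest}" = "${stored_digest}" ]
}
```

With a helper of this shape, the call-site reduces to a single `if _should_use_fast_path; then` guard around the userspace-only install.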
a9bde48 to 82123af
…fig digest matches Signed-off-by: Karthik Vetrivel <[email protected]>
        exit 0
    }

    _userspace_only_install() {
I believe this name would imply that it is "userspace only"
    - _userspace_only_install() {
    + _userspace_install() {
    # Check if fast path should be used (driver already loaded with matching config)
    # Compares current digest from DRIVER_CONFIG_DIGEST env var with stored digest
    _should_use_fast_path() {
How about this as the method name?
    - _should_use_fast_path() {
    + _should_skip_kernel_module_reload() {
OR invert the conditional statement and use the name below
    - _should_use_fast_path() {
    + _should_rebuild_kernel_module() {
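To illustrate the second option, here is a minimal sketch of what the inverted check might look like. It is only an illustration, not the PR's implementation: it assumes the current digest comes from `DRIVER_CONFIG_DIGEST` (as the comment in the diff states) and that the state file contains just the stored digest. `NVIDIA_REFCNT_FILE` and `DRIVER_STATE_FILE` are hypothetical variables added so the paths can be overridden.

```shell
#!/bin/sh
# Hypothetical overridable paths; the defaults are the paths shown in the diff.
NVIDIA_REFCNT_FILE="${NVIDIA_REFCNT_FILE:-/sys/module/nvidia/refcnt}"
DRIVER_STATE_FILE="${DRIVER_STATE_FILE:-/run/nvidia/nvidia-driver.state}"

# Succeeds (exit status 0) when a full kernel module rebuild is needed.
# Every uncertain case falls through to "rebuild", so the userspace-only
# path is taken only on a positive digest match.
_should_rebuild_kernel_module() {
    # Module not loaded, or no record of a previous install.
    [ -f "${NVIDIA_REFCNT_FILE}" ] || return 0
    [ -f "${DRIVER_STATE_FILE}" ] || return 0

    current_digest="${DRIVER_CONFIG_DIGEST:-}"
    # Assumption: the state file holds only the stored digest.
    stored_digest=$(cat "${DRIVER_STATE_FILE}")

    # An empty digest can never prove a match.
    if [ -z "${current_digest}" ]; then
        return 0
    fi
    [ "${current_digest}" != "${stored_digest}" ]
}
```

One consequence of the inverted form is that the default outcome is the full install: the function has to reach its last line and find equal digests before the rebuild is skipped.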
Relevant PRs:
Description
/sys/module/nvidia/refcnt) and compares configuration digest