Enable GPU operator to install GRID driver on Azure NV instances #6
COPY ubuntu22.04/precompiled/nvidia-driver /opt/nvidia-driver/bin/nvidia-driver
COPY nvidia-driver-wrapper.sh /usr/local/bin/nvidia-driver

ADD download_azure_grid_driver.sh /tmp
Nit: for consistency, could you use COPY here instead of ADD? COPY is also the officially recommended instruction.
DEP_PACKAGES=$(apt-rdepends $BASE_PACKAGES_NAMES | grep -v "^ " | grep -v "^debconf-2.0$" | grep -v "^linux-image-unsigned-") && \
apt-get install -y --download-only --no-install-recommends --reinstall $BASE_PACKAGES $DEP_PACKAGES

# Remove cuda repository before downloading dkms to avoid version conflicts
Could you gather all the steps required for the build into a single block and make them run only on Azure?
echo "Available versions: $AVAILABLE_VERSIONS"
}

get_grid_azure_url() {
I don't think we need to support all of those versions, especially since they are hardcoded anyway. Keeping only one (the latest) per driver branch would shorten the script a bit.
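A minimal sketch of what "one version per driver branch" could look like. The branch numbers and URLs below are placeholders chosen for illustration, not the values used in this PR and not real download links:

```shell
#!/usr/bin/env bash
# Hypothetical sketch: one pinned GRID driver per driver branch.
# The branch numbers and the example.invalid URLs are placeholders only.
get_grid_azure_url() {
    local branch="$1"
    case "$branch" in
        535) echo "https://example.invalid/grid/535-latest.run" ;;
        550) echo "https://example.invalid/grid/550-latest.run" ;;
        *)
            # Unknown branch: report on stderr and fail so callers can abort.
            echo "Unsupported driver branch: $branch" >&2
            return 1
            ;;
    esac
}
```

With a single entry per branch, bumping a driver becomes a one-line change, and the "Available versions" message can be derived from the same `case` labels.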
@@ -19,6 +19,7 @@ NVIDIA_PEERMEM_MODULE_PARAMS=()
TARGETARCH=${TARGETARCH:?"Missing TARGETARCH env"}
This is more or less the upstream nvidia-driver script. To keep it easy to rebase, could you please reduce all Azure-specific changes here to a bare minimum (a single line that imports a script) and put everything you add in a separate script?
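One way to keep the upstream script nearly untouched is a small loader that sources the Azure additions only when the file is present. This is a sketch under assumptions: the function name, the `AZURE_HOOKS_FILE` variable, and the default path are hypothetical and do not come from this PR.

```shell
#!/usr/bin/env bash
# Hypothetical helper: source Azure-specific functions from a separate file.
# The default path below is an assumption, not a path used in this PR.
load_azure_hooks() {
    local hooks="${1:-/usr/local/bin/azure-grid.sh}"
    if [ -r "$hooks" ]; then
        # shellcheck source=/dev/null
        . "$hooks"
    fi
}

# The only line the upstream-derived script would need to carry:
load_azure_hooks "${AZURE_HOOKS_FILE:-}"
```

When the file is absent, the call is a no-op, so the script behaves like upstream on non-Azure images.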
exit 1
fi

# Updating gridd.conf
Maybe add a link to the documentation here, because it's not obvious why we are doing this.
# CUDA repo has dkms 1:3.3.0 but Ubuntu has 2.8.7 - we need Ubuntu version for runtime
# Note: We remove repo files but don't run apt-get update to preserve package cache
# for runtime installation of precompiled driver packages
RUN rm -f /etc/apt/sources.list.d/cuda*
I know that removing the /etc/apt/sources.list.d/cuda* files has been an issue in some cases, where we then could not find some packages. Can you try running apt install nvlsm, for instance?
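A quick way to check this without installing anything is to look at the candidate version that apt reports. The helper below is a hypothetical sketch, not part of this PR: it reads `apt-cache policy <package>` output on stdin and succeeds only when a candidate version is listed. Whether cached package lists still influence the result after the repo file is removed is something to verify in the actual image.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeed only if `apt-cache policy` output (on stdin)
# lists a candidate version other than "(none)".
has_install_candidate() {
    awk '$1 == "Candidate:" { found = ($2 != "(none)") }
         END { exit(found ? 0 : 1) }'
}

# Example usage inside the image (not executed here):
#   apt-cache policy nvlsm | has_install_candidate || echo "nvlsm has no install candidate" >&2
```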
No description provided.