Skip to content

Enable GPU operator to install GRID driver on Azure NV instances#6

Open
vinser52 wants to merge 8 commits intodatadogfrom
sergei.vinogradov/nvidia-grid-driver
Open

Enable GPU operator to install GRID driver on Azure NV instances#6
vinser52 wants to merge 8 commits intodatadogfrom
sergei.vinogradov/nvidia-grid-driver

Conversation

@vinser52
Copy link

No description provided.

COPY ubuntu22.04/precompiled/nvidia-driver /opt/nvidia-driver/bin/nvidia-driver
COPY nvidia-driver-wrapper.sh /usr/local/bin/nvidia-driver

ADD download_azure_grid_driver.sh /tmp

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit but for consistency reasons could you use COPY (also it is officially recommended to use COPY)

DEP_PACKAGES=$(apt-rdepends $BASE_PACKAGES_NAMES | grep -v "^ " | grep -v "^debconf-2.0$" | grep -v "^linux-image-unsigned-") && \
apt-get install -y --download-only --no-install-recommends --reinstall $BASE_PACKAGES $DEP_PACKAGES

# Remove cuda repository before downloading dkms to avoid version conflicts

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

could you gather all the build required steps in a single block and make them only run on Azure?

echo "Available versions: $AVAILABLE_VERSIONS"
}

get_grid_azure_url() {

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we need to support all those versions, especially since they are hardcoded anyway. Only keeping 1 (the latest) per driver branch would shorten the script a little bit

@@ -19,6 +19,7 @@ NVIDIA_PEERMEM_MODULE_PARAMS=()
TARGETARCH=${TARGETARCH:?"Missing TARGETARCH env"}

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is more or less the upstream nvidia-driver script. For the sake of keeping it easy to rebase, could you please reduce to a bare minimum (a line of script import) all changes that are related to Azure specificities and put everything you add in a separate script?

exit 1
fi

# Updating gridd.conf

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe add a link to the doc here because it's not obvious why we are doing this here

# CUDA repo has dkms 1:3.3.0 but Ubuntu has 2.8.7 - we need Ubuntu version for runtime
# Note: We remove repo files but don't run apt-get update to preserve package cache
# for runtime installation of precompiled driver packages
RUN rm -f /etc/apt/sources.list.d/cuda*

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I know removing the /etc/apt/sources.list.d/cuda* file has been an issue in some cases where we could not find some packages. Can you try doing apt install nvlsm for instance?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants

Comments