[rhoai-2.25] RHAIENG-2645: add loop to retry package installation when it fails#1883
daniellutz wants to merge 2 commits into red-hat-data-services:rhoai-2.25 from
Conversation
[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing
/build-konflux
Jiri did make some suggestions and I'm working through some details; an update is coming in a couple of moments, then I will check what is going on with the build.
For the record, my suggestion was to somehow avoid duplicating the bash looping code (about 15 lines) for every RUN dnf. |
For more clarification: the suggestion was to take the entire loop sections and simplify them into a shell script, aiming for reuse and avoiding huge changes in the Dockerfiles (hundreds of lines because of the loops), i.e., move this:

```dockerfile
RUN --mount=type=cache,target=/var/cache/dnf /bin/bash <<'EOF'
set -Eeuxo pipefail
MAX_RETRIES=3
RETRY_COUNT=0
until dnf install -y perl mesa-libGL skopeo || [ $RETRY_COUNT -ge $MAX_RETRIES ]; do
    RETRY_COUNT=$((RETRY_COUNT + 1))
    ...
done
if [ $RETRY_COUNT -ge $MAX_RETRIES ]; then
    echo "ERROR: dnf install failed after $MAX_RETRIES attempts"
    exit 1
fi
EOF
```

to something like this:

```dockerfile
COPY jupyter/utils utils/
RUN --mount=type=cache,target=/var/cache/dnf /bin/bash -c '
    ./utils/install_with_retry.sh dnf-install perl mesa-libGL skopeo
'
```

Then the same script would be reused across all images, the loops would be simplified, and there are numerous possibilities in terms of what commands could use the loop (dnf install, dnf upgrade, wget, texlive-install, etc.).

Thanks again, Jiri
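A sketch of what that shared helper could look like. Note that `install_with_retry.sh`, the `dnf-install` subcommand, and the defaults below are assumptions based on this thread, not the code actually merged:

```shell
#!/usr/bin/env bash
# Hypothetical install_with_retry.sh sketch (names and defaults are
# assumed from the discussion, not taken from the merged PR).
set -Eeuo pipefail

MAX_RETRIES="${MAX_RETRIES:-3}"
SLEEP_SECONDS="${SLEEP_SECONDS:-30}"

# Retry an arbitrary command until it succeeds or MAX_RETRIES is reached.
retry() {
    local attempt=0
    until "$@"; do
        attempt=$((attempt + 1))
        if [ "$attempt" -ge "$MAX_RETRIES" ]; then
            echo "ERROR: '$*' failed after $MAX_RETRIES attempts" >&2
            return 1
        fi
        echo "'$*' failed (attempt $attempt/$MAX_RETRIES); retrying in ${SLEEP_SECONDS}s" >&2
        sleep "$SLEEP_SECONDS"
    done
}

main() {
    case "$1" in
        dnf-install)
            shift
            retry dnf install -y "$@"
            ;;
        *)
            # Generic fallback: retry any command as given.
            retry "$@"
            ;;
    esac
}

# Only dispatch when invoked with arguments (keeps the functions sourceable).
if [ "$#" -gt 0 ]; then
    main "$@"
fi
```

Any flaky network step (npm install, wget, texlive-install) could then go through the same `retry` path.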
```diff
 # Install useful OS packages
-RUN dnf install -y perl mesa-libGL skopeo && dnf clean all && rm -rf /var/cache/yum
+RUN --mount=type=cache,target=/var/cache/dnf /bin/bash <<'EOF'
```
@tiran does this with sharing=locked and sets an id=, https://github.com/tiran/instructlab-containers/blob/ff083370d929cb00c599fd751db3e2741f8d5348/containers/rocm/Containerfile.c9s#L39C24-L39C38
Based on the code snippet and the discussion provided, here is an analysis of the RUN instruction and the suggestion made by jiridanek.
Summary
The original code uses a cache mount to speed up DNF operations, but it lacks safeguards against concurrency issues and cache collisions. The comment by jiridanek correctly identifies that adding sharing=locked and a specific id is a best practice, particularly in CI/CD environments or when building multiple images on the same host.
Detailed Analysis
1. The Original Code
```dockerfile
RUN --mount=type=cache,target=/var/cache/dnf /bin/bash <<'EOF'
```

- Purpose: This tells the container builder (BuildKit or Buildah) to mount a persistent cache volume at `/var/cache/dnf` during the build step.
- Benefit: Subsequent builds can reuse downloaded package metadata and RPMs, significantly speeding up `dnf install` operations.
- The Risk: By default, cache mounts usually have a sharing mode of `shared`. This means multiple build processes running simultaneously on the same machine could try to write to this cache at the same time.
**2. The Suggestion: sharing=locked**
Package managers like dnf and yum are generally not designed for concurrent access to their databases.
- The Problem: If two parallel builds try to update the DNF metadata or install packages using the same shared cache directory, they may corrupt the RPM database or cause race conditions (e.g., one process deletes a file while another tries to read it).
- The Fix: Setting `sharing=locked` forces the builder to lock the cache for the duration of the `RUN` command. If another build needs that cache, it must wait until the first one finishes. This ensures data integrity at the cost of slight serialization.
**3. The Suggestion: id=...**
The id field defines a unique namespace for the cache.
- The Problem: The default ID for a cache is the value of the `target` path (e.g., `/var/cache/dnf`). If you are building multiple images based on different distributions (e.g., Fedora 40 vs. CentOS Stream 9) on the same host, they will both try to use the same cache folder because the path inside the container is identical. This can lead to conflicts where one OS overwrites the cache of another incompatible OS.
- The Fix: Setting an explicit `id` (e.g., `id=dnf-c9s`) ensures that the CentOS Stream 9 build uses a completely different cache bucket than a Fedora build, even if they map to the same directory inside the container.
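As a hypothetical illustration (the `dnf-f40` id is invented for this example), two images built on the same host can keep separate cache buckets even though the in-container path is identical:

```dockerfile
# CentOS Stream 9 build: its own locked cache bucket
RUN --mount=type=cache,target=/var/cache/dnf,sharing=locked,id=dnf-c9s \
    dnf install -y perl

# Fedora build: same target path, but a different bucket, so the two
# distributions never mix RPM metadata
RUN --mount=type=cache,target=/var/cache/dnf,sharing=locked,id=dnf-f40 \
    dnf install -y perl
```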
Recommendation
You should adopt the suggestion to improve stability and reliability.
Improved Code:
```dockerfile
RUN --mount=type=cache,target=/var/cache/dnf,sharing=locked,id=dnf-c9s \
    /bin/bash <<'EOF'
# ... your dnf commands ...
EOF
```
Would you like me to generate a specific id convention for your other container files as well?
Force-pushed from 112c3d0 to 83c4262
/build-konflux
Hey Daniel, the changes look good to me; however, it would be nice to see them in action.
Niah... now it fails with
```diff
@@ -31,7 +31,7 @@ if [[ "$ARCH" == "amd64" || "$ARCH" == "arm64" || "$ARCH" == "ppc64le" ]]; then
 # install build dependencies
```
On my earlier trials I had timeout issues on the npm install step; can you wrap that line with the retry script?
https://github.com/daniellutz/odh-notebooks/blob/83c426252f148afac76f61b5e221ef823e21efdd/codeserver/ubi9-python-3.12/get_code_server_rpm.sh#L68C2-L68C13
The idea of the script (thanks Jiri, again) was to improve that as well: npm install, dnf upgrade, and anything else that could require the retry loop.
Let me wrap it as well.
Added the npm install retry loop as well for the codeserver image.
On purpose, I did not add the loop to npm run build; let's see if retrying npm install will be enough.
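A minimal sketch of what that wrapping could look like. The function name, defaults, and placement are assumptions; the PR may route this through the shared retry script instead:

```shell
# Hypothetical wrapper for the flaky npm step, mirroring the
# install_with_retry.sh idea discussed above: retry `npm install`
# up to MAX_RETRIES times with a pause between attempts.
retry_npm_install() {
    local max_retries="${MAX_RETRIES:-3}"
    local sleep_seconds="${SLEEP_SECONDS:-30}"
    local attempt=0
    until npm install "$@"; do
        attempt=$((attempt + 1))
        if [ "$attempt" -ge "$max_retries" ]; then
            echo "ERROR: npm install failed after $max_retries attempts" >&2
            return 1
        fi
        echo "npm install failed (attempt $attempt/$max_retries); retrying in ${sleep_seconds}s" >&2
        sleep "$sleep_seconds"
    done
}
```

In the codeserver build script this would replace a bare `npm install` call.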
Force-pushed from c8480b3 to 40198ac
This PR addresses intermittent build failures (flakiness) caused by transient network issues during dnf install commands.
- Targets the `rhoai-2.25` branch; not going forward to ODH main or RHDS main / stable, as the team already has hermetic builds in the works to solve this kind of issue in newer versions (`rhoai-3.3`, and only `rhoai-3.3`, for now).

Description
Here is a summary of the changes proposed in this PR:
- Retry loops around `dnf install` or `texlive-install` (`install_pdf_deps`) commands.
- A `MAX_RETRIES` limit (default: 3) with a 30-second `sleep` between attempts to allow network congestion or CDN hiccups to clear.
- `dnf clean metadata` within the retry block, so subsequent attempts start with a fresh SSL/OCSP state.
- `set -Eeuxo pipefail` in heredoc blocks to ensure the build fails immediately and loudly if a non-retryable error occurs.
- `RUN --mount=type=cache` to ensure partial downloads are preserved between retries, reducing bandwidth and build time.

The logic ensures that commands return an exit code of 0 (success). If a command returns a non-zero exit code after 3 attempts, the script will explicitly `exit 1`, preventing the creation of a "poisoned" or incomplete Docker image.
Here is an example of the working code in the codeserver build steps:
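Roughly, the pattern is the loop quoted earlier in the thread plus the `dnf clean metadata` refresh described above. A sketch (package list and messages are assumptions), wrapped in a function so the retry logic can be exercised outside a container build:

```shell
set -Eeuo pipefail

# Sketch of the retry pattern used inside the heredoc RUN blocks
# (not verbatim from the PR): retry dnf install, refreshing metadata
# before each new attempt, and fail loudly after MAX_RETRIES.
install_with_retries() {
    local MAX_RETRIES=3
    local RETRY_COUNT=0
    until dnf install -y "$@" || [ "$RETRY_COUNT" -ge "$MAX_RETRIES" ]; do
        RETRY_COUNT=$((RETRY_COUNT + 1))
        echo "dnf install failed, retrying ($RETRY_COUNT/$MAX_RETRIES) in 30s..." >&2
        dnf clean metadata   # fresh metadata / SSL state for the next attempt
        sleep 30
    done
    if [ "$RETRY_COUNT" -ge "$MAX_RETRIES" ]; then
        echo "ERROR: dnf install failed after $MAX_RETRIES attempts" >&2
        return 1
    fi
}
```

Note the `until cmd || [ count -ge max ]` shape: the command is attempted once more after the last increment, so a `MAX_RETRIES` of 3 means one initial attempt plus three retries.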
Due to architectural differences between images, the retry logic was implemented individually to accommodate unique package requirements. Each image has been manually verified through local builds to ensure the proposed changes are stable and reliable.
How Has This Been Tested?
These images have been built manually to ensure that the commands are running properly, with the following instructions to build and test:
codeserver (cpu)
To build the image:
```shell
make codeserver-ubi9-python-3.12 \
    -e RELEASE="2025b" \
    -e IMAGE_REGISTRY="quay.io/$USER/workbench-images" \
    -e RELEASE_PYTHON_VERSION="3.12" \
    -e CONTAINER_BUILD_CACHE_ARGS="--no-cache" \
    -e PUSH_IMAGES="no"
```

To run the image:
jupyter/datascience (cpu)
To build the image:
```shell
make jupyter-datascience-ubi9-python-3.12 \
    -e RELEASE="2025b" \
    -e IMAGE_REGISTRY="quay.io/$USER/workbench-images" \
    -e RELEASE_PYTHON_VERSION="3.12" \
    -e CONTAINER_BUILD_CACHE_ARGS="--no-cache" \
    -e PUSH_IMAGES="no"
```

To run the image:
Self checklist (all need to be checked):
- Run `make test` (`gmake` on macOS) before asking for review.
- `Dockerfile.konflux` changes should be done in `odh/notebooks` and automatically synced to `rhds/notebooks`. For Konflux-specific changes, modify `Dockerfile.konflux` files directly in `rhds/notebooks`, as these require special attention in the downstream repository and flow to the upcoming RHOAI release.

Merge criteria: