
[rhoai-2.25] RHAIENG-2645: add loop to retry package installation when it fails #1883

Open
daniellutz wants to merge 2 commits into red-hat-data-services:rhoai-2.25 from daniellutz:dnf-loop-retry

Conversation


@daniellutz daniellutz commented Feb 9, 2026

This PR addresses intermittent build failures (flakiness) caused by transient network issues during dnf install commands.

  • This PR aims to fix the network issues only in the rhoai-2.25 branch; it will not go forward to ODH main or RHDS main/stable, since the team already has hermetic builds in the works to solve this kind of issue in newer versions;
  • This fix will also be applied to rhoai-3.3, and only rhoai-3.3 for now;
  • The fix introduces a shell script that acts as a wrapper: it receives the command and its parameters, then runs them inside a retry loop to avoid flakiness;
  • This PR follows the same coding style as another PR, since the functionality aims to do the same thing.

Description

Here is a summary of the changes proposed in this PR:

  • Shell Script Wrapper: Added a shell script that encapsulates the retry-loop functionality;
  • Implemented Retry Loops: Wrapped commands that can cause flakiness, such as dnf install or texlive-install (install_pdf_deps), in retry loops;
  • Fail-Safe Mechanism: Added a MAX_RETRIES limit (default: 3) with a 30-second sleep between attempts to allow network congestion or CDN hiccups to clear.
  • Metadata Reset: Ensured dnf clean metadata runs within the retry block so subsequent attempts start with a fresh SSL/OCSP state.
  • Adopted "Strict Mode": Introduced set -Eeuxo pipefail in Heredoc blocks to ensure the build fails immediately and loudly if a non-retryable error occurs.
  • Optimized Caching: Leveraged RUN --mount=type=cache to ensure partial downloads are preserved between retries, reducing bandwidth and build time.

The logic ensures that commands return an exit code of 0 (success). If a command returns a non-zero exit code after 3 attempts, the script will explicitly exit 1, preventing the creation of a "poisoned" or incomplete Docker image.
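The loop, delay, and fail-safe described above can be sketched as a small bash function. This is a sketch of the pattern, not the PR's literal code; the SLEEP_SECONDS override is added here purely for illustration, while the PR uses a fixed 30-second sleep.

```shell
#!/bin/bash
set -Eeuo pipefail

# retry: run a command, retrying up to MAX_RETRIES extra times on failure.
# MAX_RETRIES (default 3) matches the PR description; SLEEP_SECONDS defaults
# to the PR's 30s delay but can be overridden for testing.
retry() {
    local MAX_RETRIES=${MAX_RETRIES:-3}
    local SLEEP_SECONDS=${SLEEP_SECONDS:-30}
    local RETRY_COUNT=0

    until "$@" || [ "$RETRY_COUNT" -ge "$MAX_RETRIES" ]; do
        RETRY_COUNT=$((RETRY_COUNT + 1))
        echo "Attempt $RETRY_COUNT/$MAX_RETRIES failed, retrying in ${SLEEP_SECONDS}s..." >&2
        sleep "$SLEEP_SECONDS"
    done

    # fail-safe: refuse to continue (and poison the image) once retries are exhausted
    if [ "$RETRY_COUNT" -ge "$MAX_RETRIES" ]; then
        echo "ERROR: command failed after $MAX_RETRIES attempts" >&2
        return 1
    fi
}

# example: retry dnf install -y perl mesa-libGL skopeo
```

A transient failure that clears within the retry budget succeeds silently; a persistent failure returns a non-zero status so the build step aborts.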

Here is an example of the working code in the codeserver build steps:

--> 4fc54495ca15
...
[3/6] STEP 6/10: RUN --mount=type=cache,target=/var/cache/dnf /bin/bash <<'EOF' (set -Eeuxo pipefail...)
+ MAX_RETRIES=3
+ RETRY_COUNT=0
+ dnf install -y perl mesa-libGL skopeo
Updating Subscription Management repositories.
Unable to read consumer identity

This system is not registered with an entitlement server. You can use subscription-manager to register.
...

Due to architectural differences between images, the retry logic was implemented individually to accommodate unique package requirements. Each image has been manually verified through local builds to ensure the proposed changes are stable and reliable.

How Has This Been Tested?

These images have been built manually to ensure that the commands are running properly, with the following instructions to build and test:

Before running the following commands, please set the USER environment variable to your username: export USER="myuser"

 codeserver (cpu)


To build the image:

make codeserver-ubi9-python-3.12 \
        -e RELEASE="2025b" \
        -e IMAGE_REGISTRY="quay.io/$USER/workbench-images" \
        -e RELEASE_PYTHON_VERSION="3.12" \
        -e CONTAINER_BUILD_CACHE_ARGS="--no-cache" \
        -e PUSH_IMAGES="no"

To run the image:

export IMG=$(podman images --format "{{.Repository}}:{{.Tag}}" | grep "codeserver-ubi9-python-3.12" | sort -r | head -n1) && \
    (until curl -s localhost:8787 > /dev/null; do sleep 1; done && open http://localhost:8787 &) && \
    podman run --rm --platform linux/amd64 -p 8787:8787 "$IMG"
 jupyter/datascience (cpu)


To build the image:

make jupyter-datascience-ubi9-python-3.12 \
        -e RELEASE="2025b" \
        -e IMAGE_REGISTRY="quay.io/$USER/workbench-images" \
        -e RELEASE_PYTHON_VERSION="3.12" \
        -e CONTAINER_BUILD_CACHE_ARGS="--no-cache" \
        -e PUSH_IMAGES="no"

To run the image:

export IMG=$(podman images --format "{{.Repository}}:{{.Tag}}" | grep "jupyter-datascience-ubi9-python-3.12" | sort -r | head -n1) && \
    (until podman logs jupyter_test 2>&1 | grep -q "token="; do sleep 1; done && \
    open $(podman logs jupyter_test 2>&1 | grep -o "http://localhost:[0-9]*/lab?token=[a-zA-Z0-9]*" | head -n1) &) && \
    podman run --rm --name jupyter_test --platform linux/amd64 -p 8888:8888 "$IMG"

Self checklist (all need to be checked):

  • Ensure that you have run make test (gmake on macOS) before asking for review
  • Changes to everything except Dockerfile.konflux files should be done in odh/notebooks and automatically synced to rhds/notebooks. For Konflux-specific changes, modify Dockerfile.konflux files directly in rhds/notebooks as these require special attention in the downstream repository and flow to the upcoming RHOAI release.

Merge criteria:

  • The commits are squashed in a cohesive manner and have meaningful messages.
  • Testing instructions have been added in the PR body (for PRs involving changes that are not immediately obvious).
  • The developer has manually tested the changes and verified that the changes work

@openshift-ci openshift-ci bot requested a review from dibryant February 9, 2026 03:03

openshift-ci bot commented Feb 9, 2026

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign caponetto for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details: Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@daniellutz daniellutz requested review from ysok and removed request for dibryant February 9, 2026 03:11
@daniellutz
Author

/build-konflux

@daniellutz
Author

Jiri made some suggestions and I'm working through some details; an update is coming shortly, and then I will check what is going on with the build

@jiridanek jiridanek changed the title RHAIENG-2645: add loop to retry package installation when it fails [rhoai-2.25] RHAIENG-2645: add loop to retry package installation when it fails Feb 9, 2026
@jiridanek
Member

jiridanek commented Feb 9, 2026

For the record, my suggestion was to somehow avoid duplicating the bash looping code (about 15 lines) for every RUN dnf.

@daniellutz
Author

daniellutz commented Feb 9, 2026

For clarification: the suggestion was to take the entire loop sections and consolidate them into a shell script, aiming for reuse and avoiding huge changes in the Dockerfiles (hundreds of lines because of the loops), i.e.,

move this:

RUN --mount=type=cache,target=/var/cache/dnf /bin/bash <<'EOF'
set -Eeuxo pipefail

MAX_RETRIES=3
RETRY_COUNT=0

until dnf install -y perl mesa-libGL skopeo || [ $RETRY_COUNT -ge $MAX_RETRIES ]; do
    RETRY_COUNT=$((RETRY_COUNT + 1))
    ...
done

if [ $RETRY_COUNT -ge $MAX_RETRIES ]; then
    echo "ERROR: dnf install failed after $MAX_RETRIES attempts"
    exit 1
fi
EOF

to something like this:

COPY jupyter/utils utils/
RUN --mount=type=cache,target=/var/cache/dnf /bin/bash -c '
    ./utils/install_with_retry.sh dnf-install perl mesa-libGL skopeo
'

then the same script would be reused across all images, the loops would be simplified, and there are numerous possibilities in terms of what commands could use the loop (dnf install, dnf upgrade, wget, texlive-install, etc.)
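A hedged sketch of what such a reusable install_with_retry.sh could look like, written as a function purely for illustration (the real script's top-level body would do the same work). Only the dnf-install action name comes from the example above; the rest of the action table and the generic "run" action are assumptions.

```shell
#!/bin/bash
set -Eeuo pipefail

# Hypothetical sketch of the proposed utils/install_with_retry.sh.
# First argument selects the command; the remaining arguments are passed
# through. MAX_RETRIES and SLEEP_SECONDS are overridable for testing.
install_with_retry() {
    local action=$1
    shift
    local cmd
    case "$action" in
        dnf-install) cmd=(dnf install -y "$@") ;;
        dnf-upgrade) cmd=(dnf upgrade -y "$@") ;;
        npm-install) cmd=(npm install "$@") ;;
        run)         cmd=("$@") ;;  # generic escape hatch, an assumption
        *) echo "Unknown action: $action" >&2; return 1 ;;
    esac

    local max=${MAX_RETRIES:-3}
    local count=0
    until "${cmd[@]}" || [ "$count" -ge "$max" ]; do
        count=$((count + 1))
        echo "'$action' attempt $count/$max failed, retrying..." >&2
        # start the next dnf attempt from fresh metadata, as the PR describes
        case "$action" in dnf-*) dnf clean metadata || true ;; esac
        sleep "${SLEEP_SECONDS:-30}"
    done

    if [ "$count" -ge "$max" ]; then
        echo "ERROR: '$action' failed after $max attempts" >&2
        return 1
    fi
}
```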

thanks again, Jiri


# Install useful OS packages
RUN dnf install -y perl mesa-libGL skopeo && dnf clean all && rm -rf /var/cache/yum
RUN --mount=type=cache,target=/var/cache/dnf /bin/bash <<'EOF'
@jiridanek jiridanek Feb 10, 2026


@coderabbitai analyze and explain


Based on the code snippet and the discussion provided, here is an analysis of the RUN instruction and the suggestion made by jiridanek.

Summary

The original code uses a cache mount to speed up DNF operations, but it lacks safeguards against concurrency issues and cache collisions. The comment by jiridanek correctly identifies that adding sharing=locked and a specific id is a best practice, particularly in CI/CD environments or when building multiple images on the same host.


Detailed Analysis

1. The Original Code

RUN --mount=type=cache,target=/var/cache/dnf /bin/bash <<'EOF'
  • Purpose: This tells the container builder (BuildKit or Buildah) to mount a persistent cache volume at /var/cache/dnf during the build step.
  • Benefit: Subsequent builds can reuse downloaded package metadata and RPMs, significantly speeding up dnf install operations.
  • The Risk: By default, cache mounts usually have a sharing mode of shared. This means multiple build processes running simultaneously on the same machine could try to write to this cache at the same time.

2. The Suggestion: sharing=locked

Package managers like dnf and yum are generally not designed for concurrent access to their databases.

  • The Problem: If two parallel builds try to update the DNF metadata or install packages using the same shared cache directory, they may corrupt the RPM database or cause race conditions (e.g., one process deletes a file while another tries to read it).
  • The Fix: Setting sharing=locked forces the builder to lock the cache for the duration of the RUN command. If another build needs that cache, it must wait until the first one finishes. This ensures data integrity at the cost of slight serialization.

3. The Suggestion: id=...

The id field defines a unique namespace for the cache.

  • The Problem: The default ID for a cache is the value of the target path (e.g., /var/cache/dnf). If you are building multiple images based on different distributions (e.g., Fedora 40 vs. CentOS Stream 9) on the same host, they will both try to use the same cache folder because the path inside the container is identical. This can lead to conflicts where one OS overwrites the cache of another incompatible OS.
  • The Fix: Setting an explicit id (e.g., id=dnf-c9s) ensures that the CentOS Stream 9 build uses a completely different cache bucket than a Fedora build, even if they map to the same directory inside the container.

Recommendation

You should adopt the suggestion to improve stability and reliability.

Improved Code:

RUN --mount=type=cache,target=/var/cache/dnf,sharing=locked,id=dnf-c9s \
    /bin/bash <<'EOF'
    # ... your dnf commands ...
EOF

Would you like me to generate a specific id convention for your other container files as well?

@daniellutz
Author

/build-konflux

@atheo89
Member

atheo89 commented Feb 11, 2026

Hey Daniel, the changes look good to me; however, it would be nice to see them in action.
I would propose pinging devops to check the pipeline issues, and also rebasing your branch to fix the GHAs after that fix: #1889. I will close and reopen; maybe that will do the trick

@atheo89 atheo89 closed this Feb 11, 2026
@atheo89 atheo89 reopened this Feb 11, 2026
@atheo89
Member

atheo89 commented Feb 11, 2026

Niah... now it fails with
Error: Can't find 'action.yml', 'action.yaml' or 'Dockerfile' under '/home/runner/work/notebooks/notebooks/.github/actions/install-podman-action'. Did you forget to run actions/checkout before running your local action? needs some ci syncs

@@ -31,7 +31,7 @@ if [[ "$ARCH" == "amd64" || "$ARCH" == "arm64" ||"$ARCH" == "ppc64le" ]]; then
# install build dependencies

On my trials the other time I had timeout issues on the npm install step; can you wrap that line with the retry script?
https://github.com/daniellutz/odh-notebooks/blob/83c426252f148afac76f61b5e221ef823e21efdd/codeserver/ubi9-python-3.12/get_code_server_rpm.sh#L68C2-L68C13

Author

the idea of the script (thanks Jiri, again) was to improve that as well: npm install, dnf upgrade, and anything that could require the retry loop

let me wrap it as well

@daniellutz daniellutz Feb 12, 2026

added the npm install retry loop for the codeserver image as well

on purpose, I did not add the loop to npm run build; let's see if retrying npm install will be enough
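A hedged sketch of what the npm install retry could look like, written as a function purely for illustration; the real change wraps the npm install line in get_code_server_rpm.sh, and the NPM_INSTALL and SLEEP_SECONDS overrides here are assumptions added only so the sketch is easy to exercise.

```shell
#!/bin/bash
set -Eeuo pipefail

# Hypothetical shape of the npm install retry, mirroring the dnf loops
# elsewhere in this PR. NPM_INSTALL defaults to the real command.
npm_install_with_retry() {
    local max=${MAX_RETRIES:-3}
    local count=0
    until ${NPM_INSTALL:-npm install} || [ "$count" -ge "$max" ]; do
        count=$((count + 1))
        echo "npm install attempt $count/$max failed, retrying..." >&2
        sleep "${SLEEP_SECONDS:-30}"
    done
    if [ "$count" -ge "$max" ]; then
        echo "ERROR: npm install failed after $max attempts" >&2
        return 1
    fi
}
```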
