
[rhoai-2.25] RHAIENG-2645: add loop to retry package installation when it fails #1883

Open
daniellutz wants to merge 2 commits into red-hat-data-services:rhoai-2.25 from daniellutz:dnf-loop-retry

Conversation


@daniellutz daniellutz commented Feb 9, 2026

This PR addresses intermittent build failures (flakiness) caused by transient network issues during dnf install commands.

  • This PR aims to fix the network issues only in the rhoai-2.25 branch; it will not go forward to ODH main or RHDS main/stable, since the team already has hermetic builds in the works to solve this kind of issue in newer versions;
  • This fix will also be applied to rhoai-3.3, and only rhoai-3.3 for now;
  • The fix introduces a shell script that acts as a wrapper: it receives the command and its parameters, then runs them inside a retry loop to avoid flakiness;
  • This PR follows the same coding style as another PR, since the functionality aims to do the same thing.

Description

Here is a summary of the changes proposed in this PR:

  • Shell Script Wrapper: Added a shell script that encapsulates the retry-loop functionality;
  • Implemented Retry Loops: Wrapped commands that can cause flakiness, such as dnf install or texlive-install (install_pdf_deps), in retry loops;
  • Fail-Safe Mechanism: Added a MAX_RETRIES limit (default: 3) with a 30-second sleep between attempts to allow network congestion or CDN hiccups to clear.
  • Metadata Reset: Ensured dnf clean metadata runs within the retry block so subsequent attempts start with a fresh SSL/OCSP state.
  • Adopted "Strict Mode": Introduced set -Eeuxo pipefail in Heredoc blocks to ensure the build fails immediately and loudly if a non-retryable error occurs.
  • Optimized Caching: Leveraged RUN --mount=type=cache to ensure partial downloads are preserved between retries, reducing bandwidth and build time.

The logic ensures that commands return an exit code of 0 (success). If a command returns a non-zero exit code after 3 attempts, the script will explicitly exit 1, preventing the creation of a "poisoned" or incomplete Docker image.
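The loop, delay, and fail-safe described above can be sketched as a small bash function. This is a sketch of the pattern, not the PR's literal code; the SLEEP_SECONDS override is added here purely for illustration, while the PR uses a fixed 30-second sleep.

```shell
#!/bin/bash
set -Eeuo pipefail

# retry: run a command, retrying up to MAX_RETRIES extra times on failure.
# MAX_RETRIES (default 3) matches the PR description; SLEEP_SECONDS defaults
# to the PR's 30s delay but can be overridden for testing.
retry() {
    local MAX_RETRIES=${MAX_RETRIES:-3}
    local SLEEP_SECONDS=${SLEEP_SECONDS:-30}
    local RETRY_COUNT=0

    until "$@" || [ "$RETRY_COUNT" -ge "$MAX_RETRIES" ]; do
        RETRY_COUNT=$((RETRY_COUNT + 1))
        echo "Attempt $RETRY_COUNT/$MAX_RETRIES failed, retrying in ${SLEEP_SECONDS}s..." >&2
        sleep "$SLEEP_SECONDS"
    done

    # fail-safe: refuse to continue (and poison the image) once retries are exhausted
    if [ "$RETRY_COUNT" -ge "$MAX_RETRIES" ]; then
        echo "ERROR: command failed after $MAX_RETRIES attempts" >&2
        return 1
    fi
}

# example: retry dnf install -y perl mesa-libGL skopeo
```

A transient failure that clears within the retry budget succeeds silently; a persistent failure returns a non-zero status so the build step aborts.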

Here is an example of the working code in the codeserver build steps:

--> 4fc54495ca15
...
[3/6] STEP 6/10: RUN --mount=type=cache,target=/var/cache/dnf /bin/bash <<'EOF' (set -Eeuxo pipefail...)
+ MAX_RETRIES=3
+ RETRY_COUNT=0
+ dnf install -y perl mesa-libGL skopeo
Updating Subscription Management repositories.
Unable to read consumer identity

This system is not registered with an entitlement server. You can use subscription-manager to register.
...

Due to architectural differences between images, the retry logic was implemented individually to accommodate unique package requirements. Each image has been manually verified through local builds to ensure the proposed changes are stable and reliable.

How Has This Been Tested?

These images have been built manually to ensure that the commands are running properly, with the following instructions to build and test:

Before running the following commands, please set the USER environment variable to your username: export USER="myuser"

 codeserver (cpu)


To build the image:

make codeserver-ubi9-python-3.12 \
        -e RELEASE="2025b" \
        -e IMAGE_REGISTRY="quay.io/$USER/workbench-images" \
        -e RELEASE_PYTHON_VERSION="3.12" \
        -e CONTAINER_BUILD_CACHE_ARGS="--no-cache" \
        -e PUSH_IMAGES="no"

To run the image:

export IMG=$(podman images --format "{{.Repository}}:{{.Tag}}" | grep "codeserver-ubi9-python-3.12" | sort -r | head -n1) && \
    (until curl -s localhost:8787 > /dev/null; do sleep 1; done && open http://localhost:8787 &) && \
    podman run --rm --platform linux/amd64 -p 8787:8787 "$IMG"
 jupyter/datascience (cpu)


To build the image:

make jupyter-datascience-ubi9-python-3.12 \
        -e RELEASE="2025b" \
        -e IMAGE_REGISTRY="quay.io/$USER/workbench-images" \
        -e RELEASE_PYTHON_VERSION="3.12" \
        -e CONTAINER_BUILD_CACHE_ARGS="--no-cache" \
        -e PUSH_IMAGES="no"

To run the image:

export IMG=$(podman images --format "{{.Repository}}:{{.Tag}}" | grep "jupyter-datascience-ubi9-python-3.12" | sort -r | head -n1) && \
    (until podman logs jupyter_test 2>&1 | grep -q "token="; do sleep 1; done && \
    open $(podman logs jupyter_test 2>&1 | grep -o "http://localhost:[0-9]*/lab?token=[a-zA-Z0-9]*" | head -n1) &) && \
    podman run --rm --name jupyter_test --platform linux/amd64 -p 8888:8888 "$IMG"

Self checklist (all need to be checked):

  • Ensure that you have run make test (gmake on macOS) before asking for review
  • Changes to everything except Dockerfile.konflux files should be done in odh/notebooks and automatically synced to rhds/notebooks. For Konflux-specific changes, modify Dockerfile.konflux files directly in rhds/notebooks as these require special attention in the downstream repository and flow to the upcoming RHOAI release.

Merge criteria:

  • The commits are squashed in a cohesive manner and have meaningful messages.
  • Testing instructions have been added in the PR body (for PRs involving changes that are not immediately obvious).
  • The developer has manually tested the changes and verified that the changes work

@openshift-ci openshift-ci bot requested a review from dibryant February 9, 2026 03:03

openshift-ci bot commented Feb 9, 2026

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign caponetto for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details: Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@daniellutz daniellutz requested review from ysok and removed request for dibryant February 9, 2026 03:11
@daniellutz
Author

/build-konflux

@daniellutz
Author

Jiri made some suggestions and I'm working through some details; an update is coming shortly, and then I will check what is going on with the build

@jiridanek jiridanek changed the title RHAIENG-2645: add loop to retry package installation when it fails [rhoai-2.25] RHAIENG-2645: add loop to retry package installation when it fails Feb 9, 2026
@jiridanek
Member

jiridanek commented Feb 9, 2026

For the record, my suggestion was to somehow avoid duplicating the bash looping code (about 15 lines) for every RUN dnf.

@daniellutz
Author

daniellutz commented Feb 9, 2026

For clarification: the suggestion was to take the entire loop sections and consolidate them into a shell script, aiming for reuse and avoiding huge changes in the Dockerfiles (hundreds of lines because of the loops), i.e.,

move this:

RUN --mount=type=cache,target=/var/cache/dnf /bin/bash <<'EOF'
set -Eeuxo pipefail

MAX_RETRIES=3
RETRY_COUNT=0

until dnf install -y perl mesa-libGL skopeo || [ $RETRY_COUNT -ge $MAX_RETRIES ]; do
    RETRY_COUNT=$((RETRY_COUNT + 1))
    ...
done

if [ $RETRY_COUNT -ge $MAX_RETRIES ]; then
    echo "ERROR: dnf install failed after $MAX_RETRIES attempts"
    exit 1
fi
EOF

to something like this:

COPY jupyter/utils utils/
RUN --mount=type=cache,target=/var/cache/dnf /bin/bash -c '
    ./utils/install_with_retry.sh dnf-install perl mesa-libGL skopeo
'

then the same script would be reused across all images, the loops would be simplified, and there are numerous possibilities in terms of what commands could use the loop (dnf install, dnf upgrade, wget, texlive-install, etc.)
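A hedged sketch of what such a reusable install_with_retry.sh could look like, written as a function purely for illustration (the real script's top-level body would do the same work). Only the dnf-install action name comes from the example above; the rest of the action table and the generic "run" action are assumptions.

```shell
#!/bin/bash
set -Eeuo pipefail

# Hypothetical sketch of the proposed utils/install_with_retry.sh.
# First argument selects the command; the remaining arguments are passed
# through. MAX_RETRIES and SLEEP_SECONDS are overridable for testing.
install_with_retry() {
    local action=$1
    shift
    local cmd
    case "$action" in
        dnf-install) cmd=(dnf install -y "$@") ;;
        dnf-upgrade) cmd=(dnf upgrade -y "$@") ;;
        npm-install) cmd=(npm install "$@") ;;
        run)         cmd=("$@") ;;  # generic escape hatch, an assumption
        *) echo "Unknown action: $action" >&2; return 1 ;;
    esac

    local max=${MAX_RETRIES:-3}
    local count=0
    until "${cmd[@]}" || [ "$count" -ge "$max" ]; do
        count=$((count + 1))
        echo "'$action' attempt $count/$max failed, retrying..." >&2
        # start the next dnf attempt from fresh metadata, as the PR describes
        case "$action" in dnf-*) dnf clean metadata || true ;; esac
        sleep "${SLEEP_SECONDS:-30}"
    done

    if [ "$count" -ge "$max" ]; then
        echo "ERROR: '$action' failed after $max attempts" >&2
        return 1
    fi
}
```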

thanks again, Jiri


# Install useful OS packages
RUN dnf install -y perl mesa-libGL skopeo && dnf clean all && rm -rf /var/cache/yum
RUN --mount=type=cache,target=/var/cache/dnf /bin/bash <<'EOF'
@jiridanek jiridanek Feb 10, 2026


@coderabbitai analyze and explain


Based on the code snippet and the discussion provided, here is an analysis of the RUN instruction and the suggestion made by jiridanek.

Summary

The original code uses a cache mount to speed up DNF operations, but it lacks safeguards against concurrency issues and cache collisions. The comment by jiridanek correctly identifies that adding sharing=locked and a specific id is a best practice, particularly in CI/CD environments or when building multiple images on the same host.


Detailed Analysis

1. The Original Code

RUN --mount=type=cache,target=/var/cache/dnf /bin/bash <<'EOF'
  • Purpose: This tells the container builder (BuildKit or Buildah) to mount a persistent cache volume at /var/cache/dnf during the build step.
  • Benefit: Subsequent builds can reuse downloaded package metadata and RPMs, significantly speeding up dnf install operations.
  • The Risk: By default, cache mounts usually have a sharing mode of shared. This means multiple build processes running simultaneously on the same machine could try to write to this cache at the same time.

2. The Suggestion: sharing=locked

Package managers like dnf and yum are generally not designed for concurrent access to their databases.

  • The Problem: If two parallel builds try to update the DNF metadata or install packages using the same shared cache directory, they may corrupt the RPM database or cause race conditions (e.g., one process deletes a file while another tries to read it).
  • The Fix: Setting sharing=locked forces the builder to lock the cache for the duration of the RUN command. If another build needs that cache, it must wait until the first one finishes. This ensures data integrity at the cost of slight serialization.

3. The Suggestion: id=...

The id field defines a unique namespace for the cache.

  • The Problem: The default ID for a cache is the value of the target path (e.g., /var/cache/dnf). If you are building multiple images based on different distributions (e.g., Fedora 40 vs. CentOS Stream 9) on the same host, they will both try to use the same cache folder because the path inside the container is identical. This can lead to conflicts where one OS overwrites the cache of another incompatible OS.
  • The Fix: Setting an explicit id (e.g., id=dnf-c9s) ensures that the CentOS Stream 9 build uses a completely different cache bucket than a Fedora build, even if they map to the same directory inside the container.

Recommendation

You should adopt the suggestion to improve stability and reliability.

Improved Code:

RUN --mount=type=cache,target=/var/cache/dnf,sharing=locked,id=dnf-c9s \
    /bin/bash <<'EOF'
    # ... your dnf commands ...
EOF

Would you like me to generate a specific id convention for your other container files as well?

@daniellutz
Author

/build-konflux

@atheo89
Member

atheo89 commented Feb 11, 2026

Hey Daniel, the changes look good to me; however, it would be nice to see them in action.
I would propose pinging devops to check the pipeline issues, and also rebasing your branch to fix the GHAs after that fix: #1889. I will close and reopen; maybe that will do the trick

@atheo89 atheo89 closed this Feb 11, 2026
@atheo89 atheo89 reopened this Feb 11, 2026
@atheo89
Member

atheo89 commented Feb 11, 2026

Niah... now it fails with
Error: Can't find 'action.yml', 'action.yaml' or 'Dockerfile' under '/home/runner/work/notebooks/notebooks/.github/actions/install-podman-action'. Did you forget to run actions/checkout before running your local action? needs some ci syncs

@@ -31,7 +31,7 @@ if [[ "$ARCH" == "amd64" || "$ARCH" == "arm64" ||"$ARCH" == "ppc64le" ]]; then
# install build dependencies

On my trials the other time I had timeout issues on the npm install step; can you wrap that line with the retry script?
https://github.com/daniellutz/odh-notebooks/blob/83c426252f148afac76f61b5e221ef823e21efdd/codeserver/ubi9-python-3.12/get_code_server_rpm.sh#L68C2-L68C13

Author

the idea of the script (thanks Jiri, again) was to improve that as well: npm install, dnf upgrade, and anything that could require the retry loop

let me wrap it as well

@daniellutz daniellutz Feb 12, 2026

added the npm install retry loop for the codeserver image as well

on purpose, I did not add the loop to npm run build; let's see if retrying npm install will be enough
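A hedged sketch of what the npm install retry could look like, written as a function purely for illustration; the real change wraps the npm install line in get_code_server_rpm.sh, and the NPM_INSTALL and SLEEP_SECONDS overrides here are assumptions added only so the sketch is easy to exercise.

```shell
#!/bin/bash
set -Eeuo pipefail

# Hypothetical shape of the npm install retry, mirroring the dnf loops
# elsewhere in this PR. NPM_INSTALL defaults to the real command.
npm_install_with_retry() {
    local max=${MAX_RETRIES:-3}
    local count=0
    until ${NPM_INSTALL:-npm install} || [ "$count" -ge "$max" ]; do
        count=$((count + 1))
        echo "npm install attempt $count/$max failed, retrying..." >&2
        sleep "${SLEEP_SECONDS:-30}"
    done
    if [ "$count" -ge "$max" ]; then
        echo "ERROR: npm install failed after $max attempts" >&2
        return 1
    fi
}
```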
