Skip to content

[RHAIENG-2819]: chore(actions) Mitigate ARM64 disk space failures and fix .linuxbrew cache permissions#3050

Open
mtchoum1 wants to merge 1 commit intoopendatahub-io:mainfrom
mtchoum1:linux/arm64
Open

[RHAIENG-2819]: chore(actions) Mitigate ARM64 disk space failures and fix .linuxbrew cache permissions#3050
mtchoum1 wants to merge 1 commit intoopendatahub-io:mainfrom
mtchoum1:linux/arm64

Conversation

@mtchoum1
Copy link
Contributor

@mtchoum1 mtchoum1 commented Mar 3, 2026

  • ARM64 / resource-constrained runners (RHAIENG-2819):

    • Pre-flight cleanup: New step at the start of the “Free up disk space” path that runs docker system prune -af --volumes, apt-get clean, and rm -rf /tmp/* to remove orphaned assets from crashed runs and free space before heavy builds (same if condition as the existing free-up-disk-space step).
    • Cache compression: Set job-level env GHA_CACHE_ZSTD_LEVEL: "3" to reduce cache size and speed up restore on constrained runners (e.g. arm64).
  • Install Podman action – .linuxbrew cache permissions:

    • Prepare parent for cache: New step before actions/cache: create /home/linuxbrew and chown -R to the runner so cache restore can set file modes without “Operation not permitted”.
    • Fix after cache restore: When cache hit, run chattr -R -i (clear immutable), chown -R to runner, and chmod -R u+rwX,go=rX on /home/linuxbrew/.linuxbrew so brew/podman work and “Cannot change mode to rwxr-xr-x: Operation not permitted” is resolved.

How Has This Been Tested?

  • Pre-flight cleanup / ARM64 loopback / cache env: Trigger the “Build Notebooks” workflow (e.g. push or workflow_dispatch) and confirm jobs that use the disk-pressure path (cuda, tensorflow, pytorch, rocm, or linux/arm64) run the new steps and complete without “No space left on device” or LVM failures on arm64 where applicable.
  • .linuxbrew permissions: On a matrix that uses the Linuxbrew cache (e.g. install-podman-action with cache hit), confirm the “Prepare .linuxbrew parent for cache” and “Fix .linuxbrew ownership and permissions after cache restore” steps run and that subsequent brew/podman steps do not report “Cannot change mode: Operation not permitted”.

Self checklist (all need to be checked):

  • Ensure that you have run make test (gmake on macOS) before asking for review
  • Changes to everything except Dockerfile.konflux files should be done in odh/notebooks and automatically synced to rhds/notebooks. For Konflux-specific changes, modify Dockerfile.konflux files directly in rhds/notebooks as these require special attention in the downstream repository and flow to the upcoming RHOAI release.

Merge criteria:

  • The commits are squashed in a cohesive manner and have meaningful messages.
  • Testing instructions have been added in the PR body (for PRs involving changes that are not immediately obvious).
  • The developer has manually tested the changes and verified that the changes work

Summary by CodeRabbit

  • Chores
    • Strengthened pre-build disk cleanup: added docker system prune, apt-get clean, /tmp cleanup and other cache removals; removed background docker image prune.
    • Faster, smaller cache restores via adjusted cache compression level.
    • Prepare and fix ownership/permissions for .linuxbrew before/after cache restore.
    • Safer disk checks with guarded filesystem-size commands to avoid failures.
    • Detect and skip complex volume setup on single-disk runners.

@openshift-ci openshift-ci bot requested review from daniellutz and ysok March 3, 2026 18:27
@openshift-ci openshift-ci bot added size/m and removed size/m labels Mar 3, 2026
@coderabbitai
Copy link
Contributor

coderabbitai bot commented Mar 3, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review
📝 Walkthrough

Walkthrough

Widened disk-cleanup commands in a GitHub Action, added linuxbrew cache pre/post steps, introduced a cache compression env var and guarded btrfs compsize calls in a workflow template, annotated LVM storage config, and added an early-exit when /mnt shares root’s block device to skip LVM setup.

Changes

Cohort / File(s) Summary
Disk/cleanup GitHub Action
\.github/actions/free-up-disk-space/action.yml
Step renamed; removed sudo docker image prune --all --force &; added pre-update cleanup commands (`sudo docker system prune -af
Podman/install cache action
\.github/actions/install-podman-action/action.yml
Added preparatory step to create/chown /home/linuxbrew before cache restore and a cache-hit post-step that removes immutable flags and fixes ownership/permissions of /home/linuxbrew/.linuxbrew.
Workflow template
\.github/workflows/build-notebooks-TEMPLATE.yaml
Added GHA_CACHE_ZSTD_LEVEL=3; renamed disk step; expanded df -h targets; replaced direct compsize calls with guarded btrfs checks that run compsize only on btrfs mountpoints and emit skip messages otherwise.
LVM build script
ci/cached-builds/gha_lvm_overlay.sh
Added early-detection for single-disk runners (when /mnt is on same block device as /): create/fix build mount path and GITHUB_WORKSPACE ownership and exit early to bypass LVM/swap creation.
Storage config
ci/cached-builds/storage.conf
Inserted comments noting graphroot/runroot point to the LVM-backed build volume; no functional changes.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

🚥 Pre-merge checks | ✅ 4
✅ Passed checks (4 passed)
Check name Status Explanation
Title check ✅ Passed The title follows imperative mood ("Mitigate", "fix"), includes the ticket reference [RHAIENG-2819], uses the preferred format with scope "chore(actions)", and has no trailing period.
Docstring Coverage ✅ Passed No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.
Branch Prefix Policy ✅ Passed PR title correctly starts with JIRA reference [RHAIENG-2819], which is explicitly allowed by policy. No prohibited branch prefix detected for main branch.
Description check ✅ Passed The PR description comprehensively covers all required sections with detailed information about changes, testing approach, and merge criteria checklist completion.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@github-actions github-actions bot added the review-requested GitHub Bot creates notification on #pr-review-ai-ide-team slack channel label Mar 3, 2026
@openshift-ci openshift-ci bot added size/m and removed size/m labels Mar 3, 2026
Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In @.github/workflows/build-notebooks-TEMPLATE.yaml:
- Around line 198-203: The static 20G fallocate for SCRATCH_IMG can fail on
constrained runners; update the block around SCRATCH_IMG / SCRATCH_SIZE_G to
compute available root free space (e.g., from df or /proc/partition) and set
SCRATCH_SIZE_G to the lesser of that free space and a configured max, but never
below a safe minimum (e.g., 2G); if available free space is below the minimum,
skip creating the loopback image and proceed to the LVM fallback path. Modify
the sequence that calls fallocate, losetup, mkfs.ext4 and mount to use the
computed SCRATCH_SIZE_G, and add clear checks and early exits/error handling
when the computed size is insufficient so the workflow doesn't hard-fail before
LVM setup.

ℹ️ Review info

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Cache: Disabled due to data retention organization setting

Knowledge base: Disabled due to data retention organization setting

📥 Commits

Reviewing files that changed from the base of the PR and between 3b069d4 and 292c1e0.

📒 Files selected for processing (4)
  • .github/actions/free-up-disk-space/action.yml
  • .github/actions/install-podman-action/action.yml
  • .github/workflows/build-notebooks-TEMPLATE.yaml
  • ci/cached-builds/storage.conf

Comment on lines +198 to +203
SCRATCH_IMG=/scratch_arm64.img
SCRATCH_SIZE_G=20
sudo touch "$SCRATCH_IMG" && sudo fallocate -l "${SCRATCH_SIZE_G}G" "$SCRATCH_IMG"
LOOP=$(sudo losetup --find --show "$SCRATCH_IMG")
sudo mkfs.ext4 -q "$LOOP"
sudo mount "$LOOP" /mnt
Copy link
Contributor

@coderabbitai coderabbitai bot Mar 3, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Guard loopback image size against available root space.

Using a fixed 20G fallocate can hard-fail on constrained ARM64 runners, which defeats the fallback path before LVM setup runs. Please size dynamically from current free space with a minimum threshold.

Proposed fix
-          SCRATCH_IMG=/scratch_arm64.img
-          SCRATCH_SIZE_G=20
-          sudo touch "$SCRATCH_IMG" && sudo fallocate -l "${SCRATCH_SIZE_G}G" "$SCRATCH_IMG"
+          SCRATCH_IMG=/scratch_arm64.img
+          TARGET_SCRATCH_G=20
+          MIN_SCRATCH_G=10
+          ROOT_AVAIL_G=$(df --block-size=1G --output=avail / | awk 'NR==2 {print int($1)}')
+          # keep 2G headroom on root
+          SCRATCH_SIZE_G=$(( ROOT_AVAIL_G > (TARGET_SCRATCH_G + 2) ? TARGET_SCRATCH_G : ROOT_AVAIL_G - 2 ))
+          if (( SCRATCH_SIZE_G < MIN_SCRATCH_G )); then
+            echo "Insufficient root space for loopback scratch (need >=${MIN_SCRATCH_G}G)." >&2
+            exit 1
+          fi
+          sudo rm -f "$SCRATCH_IMG"
+          sudo fallocate -l "${SCRATCH_SIZE_G}G" "$SCRATCH_IMG"
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.github/workflows/build-notebooks-TEMPLATE.yaml around lines 198 - 203, The
static 20G fallocate for SCRATCH_IMG can fail on constrained runners; update the
block around SCRATCH_IMG / SCRATCH_SIZE_G to compute available root free space
(e.g., from df or /proc/partition) and set SCRATCH_SIZE_G to the lesser of that
free space and a configured max, but never below a safe minimum (e.g., 2G); if
available free space is below the minimum, skip creating the loopback image and
proceed to the LVM fallback path. Modify the sequence that calls fallocate,
losetup, mkfs.ext4 and mount to use the computed SCRATCH_SIZE_G, and add clear
checks and early exits/error handling when the computed size is insufficient so
the workflow doesn't hard-fail before LVM setup.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@coderabbitai explain why the loopback for /mnt helps, it seems to me if the problem is total available disk space being insufficient, no amount of loopback tricks should make any difference

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Check the gha_lvm_overlay.sh script to understand how it uses /mnt
cat ci/cached-builds/gha_lvm_overlay.sh

Unrecoverable error: Failed to clone repository


@jiridanek Great question! You're absolutely right that the loopback doesn't create additional disk space—it's using space already on the root filesystem. The benefit is architectural, not capacity-based.

Why LVM needs two "disks"

The gha_lvm_overlay.sh script sets up LVM thin provisioning for podman's storage, which works best (or in some configurations, requires) multiple physical volumes (PVs) to create a volume group (VG). Here's what's happening:

  1. AMD64 runners have /mnt as a real separate disk/partition—so LVM can span across root and /mnt as two distinct PVs
  2. ARM64 runners have no separate /mnt disk—everything is on root
  3. The loopback creates a second block device (/dev/loopX mounted at /mnt) that LVM recognizes as a distinct physical volume

The actual benefit

  • Device topology: LVM's vgcreate expects multiple PVs for redundancy or thin pool striping; the loopback satisfies this requirement
  • Script compatibility: The same gha_lvm_overlay.sh can run on both AMD64 and ARM64 without conditional logic about device count
  • Podman storage: The LVM thin pool can now be configured with separate metadata and data volumes, which some storage configurations require

You're correct that this doesn't solve a raw space shortage—it solves an LVM device-count requirement. If the root filesystem truly has insufficient space, the loopback won't help; that's why my earlier suggestion to dynamically size based on available space is still important to avoid hard failures.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@coderabbitai the only reason we have the LVM overlay is to utilize disk capacity in /mnt; is the loopback really the simplest way to make things fit together when /mnt is not a separate partition?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Examine the gha_lvm_overlay.sh script to understand what it requires
cat ci/cached-builds/gha_lvm_overlay.sh

Length of output: 3332


@jiridanek You're absolutely right to question this! After examining gha_lvm_overlay.sh, I can see the loopback is not the simplest solution.

What the script actually does

The gha_lvm_overlay.sh script is hardcoded to:

  1. Create two physical volumes (one on root at /pv.img, one on /mnt at /mnt/tmp-pv.img)
  2. Create a RAID0 logical volume striped across both with --stripes 2

On ARM64 where /mnt isn't separate, the loopback creates a fake second disk just to satisfy this two-PV requirement. But RAID0 striping across two loopbacks on the same physical disk provides zero benefit—it's pure complexity.

Simpler alternatives

Since the goal is just to have a larger build volume:

  1. Skip LVM entirely on ARM64 — just use ~/.local/share/containers directly on root with some cleanup

  2. Modify gha_lvm_overlay.sh conditionally — detect single-disk case and create a single-PV LVM without RAID0

  3. Create one large loopback directly at the build mount path — simpler than the two-disk charade

The current loopback approach is essentially jumping through hoops to avoid modifying gha_lvm_overlay.sh. You're correct that if /mnt isn't adding capacity, the complexity isn't justified.

Would you prefer to:

  • Modify the LVM script to handle single-disk ARM64 runners differently?
  • Skip LVM on ARM64 altogether and rely on pre-flight cleanup + careful space management?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧩 Analysis chain

Let me pull up the full script to propose a concrete, minimal diff.


🏁 Script executed:

#!/bin/bash
cat ci/cached-builds/gha_lvm_overlay.sh

Length of output: 3332


@jiridanek Agreed — on a single disk there's nothing to combine, so the right answer is to skip LVM entirely and simply ensure the build directory exists on root. Here is the proposed change:

  • Detect single-disk vs dual-disk at the top of gha_lvm_overlay.sh
  • Single-disk branch: swapoff, mkdir -p + chown the build path, handle GITHUB_WORKSPACE — nothing else
  • Dual-disk branch: keep existing code verbatim
  • Remove the "Ensure /mnt scratch for arm64 (loopback)" step from the workflow template entirely

Net result: the loopback step (15+ lines in the workflow) is deleted, the script gains a ~15-line guard that exits early for single-disk runners, and dual-disk AMD64 runners are completely unaffected since their ROOT_DEV != MNT_DEV.


💡 Use @coderabbitai plan to create a plan for code changes

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the build is later printing some df stats that will break without the LVM

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧩 Analysis chain

Let me pull up the full script to see what paths are referenced.


🏁 Script executed:

#!/bin/bash
cat ci/cached-builds/gha_lvm_overlay.sh

Length of output: 3332


@jiridanek Good catch! There are two concrete breakages if the single-disk branch does nothing and just uses the root filesystem bare:

  1. sudo compsize -x "${HOME}/.local/share/containers" at the end of the workflow — btrfs-compsize requires a btrfs filesystem; it will error out on the root ext4 partition.
  2. df -h | grep "${HOME}/.local/share/containers" in the 30s monitoring loop — the grep finds no mount-point line and exits 1, silently killing the background subshell (cosmetic but loses disk telemetry).

The fix that keeps those working while still dropping the two-PV RAID0 complexity: in the single-disk branch of gha_lvm_overlay.sh, skip LVM entirely but still create a single loopback and mount it as btrfs at build_mount_path. No VG, no striping, no LVM at all — just a plain loopback + mkfs.btrfs + mount.

if [[ "$ROOT_DEV" == "$MNT_DEV" ]]; then
  echo "Single-disk runner — no RAID0 LVM; using a single loopback btrfs volume."
  sudo swapoff -a || true
  sudo rm -f /mnt/swapfile || true

  # Swap (still useful for codeserver builds)
  SWAP_IMG=/swap.img
  sudo fallocate -l "${swap_size_mb}M" "$SWAP_IMG"
  sudo mkswap "$SWAP_IMG"
  sudo swapon "$SWAP_IMG"

  # Single btrfs loopback at build_mount_path (preserves compsize + df grep)
  ROOT_RESERVE_KB=$(expr ${root_reserve_mb} \* 1024)
  ROOT_FREE_KB=$(df --block-size=1024 --output=avail / | tail -1)
  ROOT_LVM_SIZE_KB=$(expr $ROOT_FREE_KB - $ROOT_RESERVE_KB)
  ROOT_LVM_SIZE_BYTES=$(expr $ROOT_LVM_SIZE_KB \* 1024)
  sudo fallocate -z -l "${ROOT_LVM_SIZE_BYTES}" "${pv_loop_path}"
  LOOP_DEV=$(sudo losetup --find --show "${pv_loop_path}")
  sudo mkfs.btrfs "$LOOP_DEV"
  mkdir -p "${build_mount_path}"
  sudo mount -t btrfs -o defaults,noatime,noautodefrag,nobarrier,compress=zstd:3,nodatasum,nodiscard,nodatacow,space_cache=v2 "$LOOP_DEV" "${build_mount_path}"
  sudo chown -R "${build_mount_path_ownership}" "${build_mount_path}"

  if [[ -n "${GITHUB_WORKSPACE:-}" && ! -d "${GITHUB_WORKSPACE}" ]]; then
    sudo mkdir -p "${GITHUB_WORKSPACE}"
    sudo chown -R "${WORKSPACE_OWNER}" "${GITHUB_WORKSPACE}"
  fi
  exit 0
fi

This means:

  • compsize ✅ (still btrfs)
  • df -h | grep ~/.local/share/containers ✅ (still a real mountpoint)
  • storage.conf graphroot/runroot ✅ (still backed by the loopback volume)
  • No LVM, no striping, no second fake disk — the loopback step in the workflow YAML can be removed entirely

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmm, I think I'd much rather update the two telemetry/logging commands than have that loopback, all that size arithmetics look complicated

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
cat ci/cached-builds/gha_lvm_overlay.sh

Length of output: 3332


@jiridanek Agreed, that's the cleanest approach. Here are the three minimal changes:


1. ci/cached-builds/gha_lvm_overlay.sh — early-exit on single-disk

Insert right after the variable declarations (before swapoff -a):

 VG_NAME=buildvg
 
+# Detect if /mnt is a separate block device from /
+ROOT_DEV=$(df --output=source / | tail -1)
+MNT_DEV=$(df --output=source /mnt | tail -1)
+if [[ "$ROOT_DEV" == "$MNT_DEV" ]]; then
+  echo "Single-disk runner (/mnt on same fs as root) — skipping LVM, using root fs directly."
+  mkdir -p "${build_mount_path}"
+  sudo chown -R "${build_mount_path_ownership}" "${build_mount_path}"
+  if [[ ! -d "${GITHUB_WORKSPACE}" ]]; then
+    sudo mkdir -p "${GITHUB_WORKSPACE}"
+    sudo chown -R "${WORKSPACE_OWNER}" "${GITHUB_WORKSPACE}"
+  fi
+  exit 0
+fi
+
 # github runners have an active swap file in /mnt/swapfile

No swap setup needed for the single-disk case — swapoff -a was only there to free /mnt/swapfile, which doesn't exist on these runners anyway.


2. build-notebooks-TEMPLATE.yaml — fix monitoring loop

-          df -h | grep "${HOME}/.local/share/containers"
+          df -h

3. build-notebooks-TEMPLATE.yaml — fix compsize step

-        run: sudo compsize -x "${HOME}/.local/share/containers"
+        run: |
+          mountpoint -q "${HOME}/.local/share/containers" && \
+            sudo compsize -x "${HOME}/.local/share/containers" || \
+            echo "Skipping compsize: ${HOME}/.local/share/containers is not a btrfs mountpoint."

4. build-notebooks-TEMPLATE.yaml — remove the loopback step entirely

-      # On arm64 there is no separate /mnt scratch disk; create a loopback so LVM can use two "disks". RHAIENG-2819
-      - name: Ensure /mnt scratch for arm64 (loopback)
-        if: inputs.platform == 'linux/arm64'
-        run: |
-          set -Eeuxo pipefail
-          ...
-          sudo mount "$LOOP" /mnt
-          df -h /mnt
-

Net result: the 20-line loopback step in the workflow is gone, gha_lvm_overlay.sh gets a ~10-line guard with no size arithmetic, and the two telemetry commands degrade gracefully on root-fs builds. AMD64 dual-disk runners are completely unaffected since ROOT_DEV != MNT_DEV.

Comment on lines +23 to +24
sudo docker image prune --all --force &
sudo docker system prune --all --force &
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@coderabbitai what's the difference going from image to system?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jiridanek, great question! The key difference is that docker system prune is more comprehensive than docker image prune:

docker image prune --all --force removes:

  • All unused images (not just dangling ones, due to --all)
  • Images that aren't referenced by any container

docker system prune --all --force removes:

  • All stopped containers
  • All networks not used by at least one container
  • All unused images (same as above, due to --all)
  • All build cache

So docker system prune cleans up containers, networks, and build cache in addition to images. This makes it more effective for freeing up disk space, especially after crashed or interrupted runs that may leave orphaned containers and networks behind. That's why it's a better fit for the pre-flight cleanup step mentioned in the comment above this line.

This comment was marked as duplicate.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@coderabbitai ok, i'm convinced, this change in the PR is indeed beneficial, thanks @mtchoum1

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jiridanek, glad the explanation was helpful! The switch to docker system prune is a solid improvement for freeing up disk space on resource-constrained runners. 🐇

ᓚᘏᗢ (◕‿◕✿)

Comment on lines +71 to +72
# Faster/smaller cache restore on resource-constrained runners (e.g. arm64). RHAIENG-2819
GHA_CACHE_ZSTD_LEVEL: "3"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@coderabbitai isn't this the github actions default, actually?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧩 Analysis chain

🌐 Web query:

GitHub Actions cache GHA_CACHE_ZSTD_LEVEL default compression level

💡 Result:

actions/cache / @actions/cache does not set a zstd -# level by default (it runs zstd/zstdmt with flags like -T0 and optionally --long=30, but no compression level flag) so the effective compression level is zstd’s own default. In zstd, the default compression level is 3. [1] [2]

Therefore, if GHA_CACHE_ZSTD_LEVEL isn’t set/used in your environment, the default effective zstd level is 3. [1] [2]

References:
[1] @actions/cache tar implementation (no -# passed to zstd): (raw.githubusercontent.com)
[2] zstd manual: “Default level is … ==3”: (facebook.github.io)

Citations:


@jiridanek You're absolutely right! According to the GitHub Actions cache implementation, actions/cache doesn't explicitly set a zstd compression level flag, which means it uses zstd's own default level — and zstd's default compression level is 3.

So setting GHA_CACHE_ZSTD_LEVEL: "3" is redundant and has no actual effect. If the goal is truly to optimize for resource-constrained runners, you might want to either:

  1. Remove this setting entirely (since it's the default anyway), or
  2. Use a different level like "1" or "2" if you want faster compression at the cost of slightly larger cache sizes (lower levels = faster/less compression), or a higher level like "6" if you want better compression but slower processing

The comment says "Faster/smaller cache restore" — but since level 3 is already the default, this change doesn't actually achieve that goal.

runs:
using: "composite"
steps:
# Caller workflow should run pre-flight cleanup (docker prune, apt clean, /tmp) first for crashed-run orphans. RHAIENG-2819
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@coderabbitai this useless comment seems useless?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jiridanek, I agree — the comment is misplaced and adds little value in its current location. Here's why:

  1. Wrong location: It documents an expectation about the caller workflow, but lives inside the callee action file (free-up-disk-space/action.yml). A developer maintaining the calling workflow (build-notebooks-TEMPLATE.yaml) would not naturally look here for guidance on ordering.

  2. Already self-documented at the call site: The actual pre-flight step added to the calling workflow is named "Pre-flight cleanup (prune, apt, temp)" and is positioned directly before the free-up-disk-space call, making the ordering obvious from context.

  3. Not actionable: Since this is a composite action, the comment cannot enforce or hint at ordering — it's purely noise for anyone reading this file.

Suggestion: Remove the comment from free-up-disk-space/action.yml. If documentation of the ordering rationale is needed, a brief inline comment in build-notebooks-TEMPLATE.yaml next to the pre-flight step would be more appropriate. The RHAIENG-2819 ticket reference is already captured in the PR itself.

@openshift-ci openshift-ci bot removed the size/m label Mar 4, 2026
Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Nitpick comments (1)
.github/actions/free-up-disk-space/action.yml (1)

12-12: Avoid running Docker prune twice in one action run.

Line 12 and Line 26 both prune Docker state; this is mostly duplicate work and can add avoidable runtime on constrained runners.

Proposed simplification
         docker system prune -af --volumes || true
@@
-        sudo docker system prune --all --force &
         wait

Also applies to: 26-27

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.github/actions/free-up-disk-space/action.yml at line 12, The action
currently runs the exact command "docker system prune -af --volumes" twice (once
at the earlier step and again at lines 26-27); remove the duplicate invocation
and keep a single prune where it's most appropriate (either at the beginning or
end of the job) to avoid redundant work and extra runtime; update the action.yml
to delete the second occurrence of "docker system prune -af --volumes || true"
(or the first one) and ensure any surrounding steps still rely on that single
prune location for cleanup.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In @.github/workflows/build-notebooks-TEMPLATE.yaml:
- Around line 572-575: The current one-liner uses "mountpoint -q ... && sudo
compsize -x ... || echo ..." which conflates a failed compsize with the path not
being a btrfs mountpoint; change the guard to an explicit if/then/else: call
mountpoint -q "${HOME}/.local/share/containers" and only when it returns success
run sudo compsize -x "${HOME}/.local/share/containers"; if compsize fails,
surface a clear error (and non-zero exit) instead of falling through to the
"Skipping compsize" message, otherwise print the skip message in the else
branch. Use the existing commands mountpoint and compsize and the same path
variable to locate the code to update.

---

Nitpick comments:
In @.github/actions/free-up-disk-space/action.yml:
- Line 12: The action currently runs the exact command "docker system prune -af
--volumes" twice (once at the earlier step and again at lines 26-27); remove the
duplicate invocation and keep a single prune where it's most appropriate (either
at the beginning or end of the job) to avoid redundant work and extra runtime;
update the action.yml to delete the second occurrence of "docker system prune
-af --volumes || true" (or the first one) and ensure any surrounding steps still
rely on that single prune location for cleanup.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 1566a986-f132-4e76-bedb-91bcb24aafb7

📥 Commits

Reviewing files that changed from the base of the PR and between aea40e4 and 93d327b.

📒 Files selected for processing (5)
  • .github/actions/free-up-disk-space/action.yml
  • .github/actions/install-podman-action/action.yml
  • .github/workflows/build-notebooks-TEMPLATE.yaml
  • ci/cached-builds/gha_lvm_overlay.sh
  • ci/cached-builds/storage.conf
🚧 Files skipped from review as they are similar to previous changes (3)
  • ci/cached-builds/gha_lvm_overlay.sh
  • .github/actions/install-podman-action/action.yml
  • ci/cached-builds/storage.conf

@mtchoum1
Copy link
Contributor Author

mtchoum1 commented Mar 4, 2026

@coderabbitai why is ci/prow/images failing

@coderabbitai
Copy link
Contributor

coderabbitai bot commented Mar 4, 2026

Tip

For best results, initiate chat on the files or code changes.

🧩 Analysis chain

Let me fetch the Prow CI log to investigate the failure.


🏁 Script executed:

#!/bin/bash
# Try to fetch the Prow job status and logs
curl -s "https://prow.ci.openshift.org/view/gs/test-platform-results/pr-logs/pull/opendatahub-io_notebooks/3050/pull-ci-opendatahub-io-notebooks-main-images/2029242580880855040" | head -200

echo "---ARTIFACTS---"
# Try to get the build log from GCS
curl -s "https://storage.googleapis.com/test-platform-results/pr-logs/pull/opendatahub-io_notebooks/3050/pull-ci-opendatahub-io-notebooks-main-images/2029242580880855040/build-log.txt" | tail -200

Length of output: 29642


@mtchoum1 The failure is unrelated to the changes in this PR. Here's the root cause:

Root Cause: PYLOCK_FLAVOR ARG is empty

The build log reveals this error:

error: build error: building at STEP "COPY ${PYTORCH_SOURCE_CODE}/uv.lock.d/pylock.${PYLOCK_FLAVOR}.toml ./pylock.toml":
  copier: stat: "/jupyter/pytorch/ubi9-python-3.12/uv.lock.d/pylock..toml": no such file or directory

Notice pylock..toml — there are two dots, meaning ${PYLOCK_FLAVOR} is resolving to an empty string. The Dockerfile step:

COPY ${PYTORCH_SOURCE_CODE}/uv.lock.d/pylock.${PYLOCK_FLAVOR}.toml ./pylock.toml

…is trying to copy pylock..toml instead of something like pylock.cpu.toml or pylock.cuda.toml.

Impact

This cascades into failures across virtually all images (pytorch, tensorflow, datascience, minimal — for both ubi9-python-3.11 and ubi9-python-3.12), with two error categories:

  • DockerBuildFailed — for images that reached the build step and hit the missing file
  • ManageDockerfileFailed — for downstream images that depend on the failed ones

What to check

  1. Is PYLOCK_FLAVOR a newly required ARG that needs to be wired into the Prow job configuration (e.g., the ci-operator config for this repo)?
  2. Does this failure also exist on main? If so, it's a pre-existing issue independent of this PR. You can verify by checking a recent main branch build.
  3. Is this a merge conflict issue? If main recently added PYLOCK_FLAVOR support to the Dockerfiles but the Prow job config hasn't been updated to pass the build arg, that would explain the empty value.

This PR's changes (disk space, ARM64 loopback, .linuxbrew permissions) are not the cause of this failure.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick comments (1)
ci/cached-builds/gha_lvm_overlay.sh (1)

29-37: Preserve swapfile cleanup before the single-disk early exit.

On Line 29, the early-exit path skips the swap cleanup done later in the script. Add a best-effort swapoff/rm here too, so constrained single-disk runners reclaim the same space before build starts.

Proposed patch
 if [[ "$ROOT_DEV" == "$MNT_DEV" ]]; then
   echo "Single-disk runner (/mnt on same fs as root) — skipping LVM, using root fs directly."
+  # Keep disk-reclaim behavior consistent with the dual-disk path.
+  sudo swapoff -a || true
+  sudo rm -f /mnt/swapfile || true
   mkdir -p "${build_mount_path}"
   sudo chown -R "${build_mount_path_ownership}" "${build_mount_path}"
   if [[ ! -d "${GITHUB_WORKSPACE}" ]]; then
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@ci/cached-builds/gha_lvm_overlay.sh` around lines 29 - 37, The early-exit
branch skips the later swap cleanup; before the existing exit 0 in that block,
add a best-effort swap cleanup sequence that mirrors the later logic: detect
active swap entries (e.g. parse /proc/swaps or check for a known swapfile path),
run swapoff on each active swap device/file (ignore failures), and remove the
swapfile(s) if present (rm -f), all guarded so failures don't abort the script;
place this cleanup just before the exit 0 in the same block that manipulates
${build_mount_path} and ${GITHUB_WORKSPACE} so single-disk runners reclaim
space.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Nitpick comments:
In `@ci/cached-builds/gha_lvm_overlay.sh`:
- Around line 29-37: The early-exit branch skips the later swap cleanup; before
the existing exit 0 in that block, add a best-effort swap cleanup sequence that
mirrors the later logic: detect active swap entries (e.g. parse /proc/swaps or
check for a known swapfile path), run swapoff on each active swap device/file
(ignore failures), and remove the swapfile(s) if present (rm -f), all guarded so
failures don't abort the script; place this cleanup just before the exit 0 in
the same block that manipulates ${build_mount_path} and ${GITHUB_WORKSPACE} so
single-disk runners reclaim space.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: bbf1a0de-3428-4295-8269-141b7bf371bf

📥 Commits

Reviewing files that changed from the base of the PR and between 93d327b and 0b15ffd.

📒 Files selected for processing (5)
  • .github/actions/free-up-disk-space/action.yml
  • .github/actions/install-podman-action/action.yml
  • .github/workflows/build-notebooks-TEMPLATE.yaml
  • ci/cached-builds/gha_lvm_overlay.sh
  • ci/cached-builds/storage.conf
🚧 Files skipped from review as they are similar to previous changes (2)
  • ci/cached-builds/storage.conf
  • .github/actions/install-podman-action/action.yml

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@ci/cached-builds/storage.conf`:
- Around line 6-8: Update the header comment about graphroot/runroot to avoid
implying LVM is always used; mention that the LVM-backed build volume is used
when gha_lvm_overlay.sh configures it, but single-disk runners skip LVM and use
the root filesystem directly. Reference the script name gha_lvm_overlay.sh and
the graphroot/runroot concept so maintainers know the behavior is conditional
rather than unconditional.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 076b970e-44ab-4ffa-a8bf-5f48d9bb29cb

📥 Commits

Reviewing files that changed from the base of the PR and between 0b15ffd and 852229d.

📒 Files selected for processing (5)
  • .github/actions/free-up-disk-space/action.yml
  • .github/actions/install-podman-action/action.yml
  • .github/workflows/build-notebooks-TEMPLATE.yaml
  • ci/cached-builds/gha_lvm_overlay.sh
  • ci/cached-builds/storage.conf
🚧 Files skipped from review as they are similar to previous changes (2)
  • .github/actions/install-podman-action/action.yml
  • .github/actions/free-up-disk-space/action.yml

@openshift-ci
Copy link
Contributor

openshift-ci bot commented Mar 4, 2026

@mtchoum1: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
ci/prow/images 27cc176 link true /test images

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@openshift-ci openshift-ci bot added size/m and removed size/m labels Mar 4, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

review-requested GitHub Bot creates notification on #pr-review-ai-ide-team slack channel size/m

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants