[RHAIENG-2819]: chore(actions) Mitigate ARM64 disk space failures and fix .linuxbrew cache permissions #3050
mtchoum1 wants to merge 1 commit into opendatahub-io:main from
Conversation
Note: Reviews paused. It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the corresponding setting. Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
📝 Walkthrough
Widened disk-cleanup commands in a GitHub Action, added linuxbrew cache pre/post steps, introduced a cache compression env var, guarded btrfs compsize calls in a workflow template, annotated the LVM storage config, and added an early exit that skips LVM setup when /mnt shares root's block device.

Changes
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes
🚥 Pre-merge checks: ✅ 4 passed
✏️ Tip: You can configure your own custom pre-merge checks in the settings.
✨ Finishing Touches
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In @.github/workflows/build-notebooks-TEMPLATE.yaml:
- Around line 198-203: The static 20G fallocate for SCRATCH_IMG can fail on
constrained runners; update the block around SCRATCH_IMG / SCRATCH_SIZE_G to
compute available root free space (e.g., from df or /proc/partitions) and set
SCRATCH_SIZE_G to the lesser of that free space and a configured max, but never
below a safe minimum (e.g., 2G); if available free space is below the minimum,
skip creating the loopback image and proceed to the LVM fallback path. Modify
the sequence that calls fallocate, losetup, mkfs.ext4 and mount to use the
computed SCRATCH_SIZE_G, and add clear checks and early exits/error handling
when the computed size is insufficient so the workflow doesn't hard-fail before
LVM setup.
ℹ️ Review info
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Cache: Disabled due to data retention organization setting
Knowledge base: Disabled due to data retention organization setting
📒 Files selected for processing (4)
- .github/actions/free-up-disk-space/action.yml
- .github/actions/install-podman-action/action.yml
- .github/workflows/build-notebooks-TEMPLATE.yaml
- ci/cached-builds/storage.conf
```shell
SCRATCH_IMG=/scratch_arm64.img
SCRATCH_SIZE_G=20
sudo touch "$SCRATCH_IMG" && sudo fallocate -l "${SCRATCH_SIZE_G}G" "$SCRATCH_IMG"
LOOP=$(sudo losetup --find --show "$SCRATCH_IMG")
sudo mkfs.ext4 -q "$LOOP"
sudo mount "$LOOP" /mnt
```
Guard loopback image size against available root space.
Using a fixed 20G fallocate can hard-fail on constrained ARM64 runners, which defeats the fallback path before LVM setup runs. Please size dynamically from current free space with a minimum threshold.
Proposed fix
```diff
-SCRATCH_IMG=/scratch_arm64.img
-SCRATCH_SIZE_G=20
-sudo touch "$SCRATCH_IMG" && sudo fallocate -l "${SCRATCH_SIZE_G}G" "$SCRATCH_IMG"
+SCRATCH_IMG=/scratch_arm64.img
+TARGET_SCRATCH_G=20
+MIN_SCRATCH_G=10
+ROOT_AVAIL_G=$(df --block-size=1G --output=avail / | awk 'NR==2 {print int($1)}')
+# keep 2G headroom on root
+SCRATCH_SIZE_G=$(( ROOT_AVAIL_G > (TARGET_SCRATCH_G + 2) ? TARGET_SCRATCH_G : ROOT_AVAIL_G - 2 ))
+if (( SCRATCH_SIZE_G < MIN_SCRATCH_G )); then
+  echo "Insufficient root space for loopback scratch (need >=${MIN_SCRATCH_G}G)." >&2
+  exit 1
+fi
+sudo rm -f "$SCRATCH_IMG"
+sudo fallocate -l "${SCRATCH_SIZE_G}G" "$SCRATCH_IMG"
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In @.github/workflows/build-notebooks-TEMPLATE.yaml around lines 198 - 203, The
static 20G fallocate for SCRATCH_IMG can fail on constrained runners; update the
block around SCRATCH_IMG / SCRATCH_SIZE_G to compute available root free space
(e.g., from df or /proc/partitions) and set SCRATCH_SIZE_G to the lesser of that
free space and a configured max, but never below a safe minimum (e.g., 2G); if
available free space is below the minimum, skip creating the loopback image and
proceed to the LVM fallback path. Modify the sequence that calls fallocate,
losetup, mkfs.ext4 and mount to use the computed SCRATCH_SIZE_G, and add clear
checks and early exits/error handling when the computed size is insufficient so
the workflow doesn't hard-fail before LVM setup.
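The sizing rule the prompt describes (cap at a target, keep headroom, never go below a minimum, otherwise fall back) can be sketched as a small bash helper. The function name `compute_scratch_g` and the 2G headroom constant are illustrative assumptions, not code from the PR:

```shell
#!/usr/bin/env bash
# Hypothetical helper mirroring the suggested sizing logic (values in whole GiB).
compute_scratch_g() {
  local avail_g=$1 target_g=${2:-20} min_g=${3:-10} headroom_g=2 size_g
  if (( avail_g > target_g + headroom_g )); then
    size_g=$target_g                    # plenty of space: use the full target
  else
    size_g=$(( avail_g - headroom_g ))  # constrained: leave headroom on root
  fi
  if (( size_g < min_g )); then
    echo "insufficient space for loopback scratch" >&2
    return 1                            # caller falls through to the LVM path
  fi
  echo "$size_g"
}

compute_scratch_g 50                                     # prints 20
compute_scratch_g 15                                     # prints 13
compute_scratch_g 5 2>/dev/null || echo "use fallback"   # prints "use fallback"
```

In the real step, the available-space input would come from something like `df --block-size=1G --output=avail /`.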
@coderabbitai explain why the loopback for /mnt helps, it seems to me if the problem is total available disk space being insufficient, no amount of loopback tricks should make any difference
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
# Check the gha_lvm_overlay.sh script to understand how it uses /mnt
cat ci/cached-builds/gha_lvm_overlay.sh
Unrecoverable error: Failed to clone repository
@jiridanek Great question! You're absolutely right that the loopback doesn't create additional disk space—it's using space already on the root filesystem. The benefit is architectural, not capacity-based.
Why LVM needs two "disks"
The gha_lvm_overlay.sh script sets up LVM thin provisioning for podman's storage, which works best with (or, in some configurations, requires) multiple physical volumes (PVs) to create a volume group (VG). Here's what's happening:
- AMD64 runners have `/mnt` as a real separate disk/partition, so LVM can span across root and `/mnt` as two distinct PVs
- ARM64 runners have no separate `/mnt` disk; everything is on root
- The loopback creates a second block device (`/dev/loopX` mounted at `/mnt`) that LVM recognizes as a distinct physical volume

The actual benefit
- Device topology: LVM's `vgcreate` expects multiple PVs for redundancy or thin-pool striping; the loopback satisfies this requirement
- Script compatibility: the same `gha_lvm_overlay.sh` can run on both AMD64 and ARM64 without conditional logic about device count
- Podman storage: the LVM thin pool can now be configured with separate metadata and data volumes, which some storage configurations require
You're correct that this doesn't solve a raw space shortage—it solves an LVM device-count requirement. If the root filesystem truly has insufficient space, the loopback won't help; that's why my earlier suggestion to dynamically size based on available space is still important to avoid hard failures.
@coderabbitai the only reason we have the LVM overlay is to utilize disk capacity in /mnt; is the loopback really the simplest way to make things fit together when /mnt is not a separate partition?
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
# Examine the gha_lvm_overlay.sh script to understand what it requires
cat ci/cached-builds/gha_lvm_overlay.sh
Length of output: 3332
@jiridanek You're absolutely right to question this! After examining gha_lvm_overlay.sh, I can see the loopback is not the simplest solution.
What the script actually does
The gha_lvm_overlay.sh script is hardcoded to:
- Create two physical volumes (one on root at `/pv.img`, one on /mnt at `/mnt/tmp-pv.img`)
- Create a RAID0 logical volume striped across both with `--stripes 2`
On ARM64 where /mnt isn't separate, the loopback creates a fake second disk just to satisfy this two-PV requirement. But RAID0 striping across two loopbacks on the same physical disk provides zero benefit—it's pure complexity.
Simpler alternatives
Since the goal is just to have a larger build volume:
- Skip LVM entirely on ARM64 — just use `~/.local/share/containers` directly on root with some cleanup
- Modify gha_lvm_overlay.sh conditionally — detect the single-disk case and create a single-PV LVM without RAID0
- Create one large loopback directly at the build mount path — simpler than the two-disk charade
The current loopback approach is essentially jumping through hoops to avoid modifying gha_lvm_overlay.sh. You're correct that if /mnt isn't adding capacity, the complexity isn't justified.
Would you prefer to:
- Modify the LVM script to handle single-disk ARM64 runners differently?
- Skip LVM on ARM64 altogether and rely on pre-flight cleanup + careful space management?
🧩 Analysis chain
Let me pull up the full script to propose a concrete, minimal diff.
🏁 Script executed:
#!/bin/bash
cat ci/cached-builds/gha_lvm_overlay.sh
Length of output: 3332
@jiridanek Agreed — on a single disk there's nothing to combine, so the right answer is to skip LVM entirely and simply ensure the build directory exists on root. Here is the proposed change:
- Detect single-disk vs dual-disk at the top of `gha_lvm_overlay.sh`
- Single-disk branch: `swapoff`, `mkdir -p` + `chown` the build path, handle `GITHUB_WORKSPACE` — nothing else
- Dual-disk branch: keep the existing code verbatim
- Remove the "Ensure /mnt scratch for arm64 (loopback)" step from the workflow template entirely
Net result: the loopback step (15+ lines in the workflow) is deleted, the script gains a ~15-line guard that exits early for single-disk runners, and dual-disk AMD64 runners are completely unaffected since their ROOT_DEV != MNT_DEV.
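Assuming GNU `df` (as on GitHub's Ubuntu runners), the single-disk detection in the plan above reduces to a few lines; this is a sketch, not the final patch:

```shell
#!/usr/bin/env bash
# Compare the filesystem source backing / and /mnt.
ROOT_DEV=$(df --output=source / | tail -1)
# df fails if /mnt doesn't exist; MNT_DEV is then empty and counts as dual-disk.
MNT_DEV=$(df --output=source /mnt 2>/dev/null | tail -1)
if [[ -n "$MNT_DEV" && "$ROOT_DEV" == "$MNT_DEV" ]]; then
  echo "single-disk"   # /mnt shares root's device: skip the LVM overlay
else
  echo "dual-disk"     # separate /mnt disk: keep the existing two-PV path
fi
```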
💡 Use @coderabbitai plan to create a plan for code changes
the build is later printing some df stats that will break without the LVM
🧩 Analysis chain
Let me pull up the full script to see what paths are referenced.
🏁 Script executed:
#!/bin/bash
cat ci/cached-builds/gha_lvm_overlay.sh
Length of output: 3332
@jiridanek Good catch! There are two concrete breakages if the single-disk branch does nothing and just uses the root filesystem bare:
- `sudo compsize -x "${HOME}/.local/share/containers"` at the end of the workflow — `btrfs-compsize` requires a btrfs filesystem; it will error out on the root ext4 partition.
- `df -h | grep "${HOME}/.local/share/containers"` in the 30s monitoring loop — the grep finds no mount-point line and exits 1, silently killing the background subshell (cosmetic, but it loses disk telemetry).
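The second failure mode can be reproduced standalone. This illustration is not code from the PR; it just demonstrates how a non-matching `grep` (exit status 1) kills a subshell running under errexit:

```shell
#!/usr/bin/env bash
# Three ticks are intended, but the first non-matching grep exits 1,
# and 'set -e' aborts the subshell right there: only "tick 1" survives.
out=$( (set -e
        for i in 1 2 3; do
          echo "tick $i"
          echo "nothing here" | grep "containers"
        done) 2>/dev/null
       true )
echo "$out"   # prints only: tick 1
```

Appending `|| true` to the grep, or dropping the grep as the thread later settles on, keeps such a loop alive.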
The fix that keeps those working while still dropping the two-PV RAID0 complexity: in the single-disk branch of gha_lvm_overlay.sh, skip LVM entirely but still create a single loopback and mount it as btrfs at build_mount_path. No VG, no striping, no LVM at all — just a plain loopback + mkfs.btrfs + mount.
```shell
if [[ "$ROOT_DEV" == "$MNT_DEV" ]]; then
  echo "Single-disk runner — no RAID0 LVM; using a single loopback btrfs volume."
  sudo swapoff -a || true
  sudo rm -f /mnt/swapfile || true
  # Swap (still useful for codeserver builds)
  SWAP_IMG=/swap.img
  sudo fallocate -l "${swap_size_mb}M" "$SWAP_IMG"
  sudo mkswap "$SWAP_IMG"
  sudo swapon "$SWAP_IMG"
  # Single btrfs loopback at build_mount_path (preserves compsize + df grep)
  ROOT_RESERVE_KB=$(expr ${root_reserve_mb} \* 1024)
  ROOT_FREE_KB=$(df --block-size=1024 --output=avail / | tail -1)
  ROOT_LVM_SIZE_KB=$(expr $ROOT_FREE_KB - $ROOT_RESERVE_KB)
  ROOT_LVM_SIZE_BYTES=$(expr $ROOT_LVM_SIZE_KB \* 1024)
  sudo fallocate -z -l "${ROOT_LVM_SIZE_BYTES}" "${pv_loop_path}"
  LOOP_DEV=$(sudo losetup --find --show "${pv_loop_path}")
  sudo mkfs.btrfs "$LOOP_DEV"
  mkdir -p "${build_mount_path}"
  sudo mount -t btrfs -o defaults,noatime,noautodefrag,nobarrier,compress=zstd:3,nodatasum,nodiscard,nodatacow,space_cache=v2 "$LOOP_DEV" "${build_mount_path}"
  sudo chown -R "${build_mount_path_ownership}" "${build_mount_path}"
  if [[ -n "${GITHUB_WORKSPACE:-}" && ! -d "${GITHUB_WORKSPACE}" ]]; then
    sudo mkdir -p "${GITHUB_WORKSPACE}"
    sudo chown -R "${WORKSPACE_OWNER}" "${GITHUB_WORKSPACE}"
  fi
  exit 0
fi
```

This means:
- `compsize` ✅ (still btrfs)
- `df -h | grep ~/.local/share/containers` ✅ (still a real mountpoint)
- `storage.conf` graphroot/runroot ✅ (still backed by the loopback volume)
- No LVM, no striping, no second fake disk — the loopback step in the workflow YAML can be removed entirely
hmm, I think I'd much rather update the two telemetry/logging commands than have that loopback, all that size arithmetics look complicated
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
cat ci/cached-builds/gha_lvm_overlay.sh
Length of output: 3332
@jiridanek Agreed, that's the cleanest approach. Here are the three minimal changes:
1. ci/cached-builds/gha_lvm_overlay.sh — early-exit on single-disk
Insert right after the variable declarations (before swapoff -a):
```diff
 VG_NAME=buildvg
+# Detect if /mnt is a separate block device from /
+ROOT_DEV=$(df --output=source / | tail -1)
+MNT_DEV=$(df --output=source /mnt | tail -1)
+if [[ "$ROOT_DEV" == "$MNT_DEV" ]]; then
+  echo "Single-disk runner (/mnt on same fs as root) — skipping LVM, using root fs directly."
+  mkdir -p "${build_mount_path}"
+  sudo chown -R "${build_mount_path_ownership}" "${build_mount_path}"
+  if [[ ! -d "${GITHUB_WORKSPACE}" ]]; then
+    sudo mkdir -p "${GITHUB_WORKSPACE}"
+    sudo chown -R "${WORKSPACE_OWNER}" "${GITHUB_WORKSPACE}"
+  fi
+  exit 0
+fi
+
 # github runners have an active swap file in /mnt/swapfile
```

No swap setup is needed for the single-disk case: `swapoff -a` was only there to free /mnt/swapfile, which doesn't exist on these runners anyway.
2. build-notebooks-TEMPLATE.yaml — fix monitoring loop
```diff
-            df -h | grep "${HOME}/.local/share/containers"
+            df -h
```

3. build-notebooks-TEMPLATE.yaml — fix compsize step

```diff
-      run: sudo compsize -x "${HOME}/.local/share/containers"
+      run: |
+        mountpoint -q "${HOME}/.local/share/containers" && \
+        sudo compsize -x "${HOME}/.local/share/containers" || \
+        echo "Skipping compsize: ${HOME}/.local/share/containers is not a btrfs mountpoint."
```

4. build-notebooks-TEMPLATE.yaml — remove the loopback step entirely
```diff
-      # On arm64 there is no separate /mnt scratch disk; create a loopback so LVM can use two "disks". RHAIENG-2819
-      - name: Ensure /mnt scratch for arm64 (loopback)
-        if: inputs.platform == 'linux/arm64'
-        run: |
-          set -Eeuxo pipefail
-          ...
-          sudo mount "$LOOP" /mnt
-          df -h /mnt
```

Net result: the 20-line loopback step in the workflow is gone, gha_lvm_overlay.sh gets a ~10-line guard with no size arithmetic, and the two telemetry commands degrade gracefully on root-fs builds. AMD64 dual-disk runners are completely unaffected since ROOT_DEV != MNT_DEV.
```shell
sudo docker image prune --all --force &
sudo docker system prune --all --force &
```
@coderabbitai what's the difference going from image to system?
@jiridanek, great question! The key difference is that docker system prune is more comprehensive than docker image prune:
`docker image prune --all --force` removes:
- All unused images (not just dangling ones, due to `--all`)
- Images that aren't referenced by any container

`docker system prune --all --force` removes:
- All stopped containers
- All networks not used by at least one container
- All unused images (same as above, due to `--all`)
- All build cache
So docker system prune cleans up containers, networks, and build cache in addition to images. This makes it more effective for freeing up disk space, especially after crashed or interrupted runs that may leave orphaned containers and networks behind. That's why it's a better fit for the pre-flight cleanup step mentioned in the comment above this line.
@coderabbitai ok, i'm convinced, this change in the PR is indeed beneficial, thanks @mtchoum1
@jiridanek, glad the explanation was helpful! The switch to docker system prune is a solid improvement for freeing up disk space on resource-constrained runners. 🐇
ᓚᘏᗢ (◕‿◕✿)
```yaml
# Faster/smaller cache restore on resource-constrained runners (e.g. arm64). RHAIENG-2819
GHA_CACHE_ZSTD_LEVEL: "3"
```
@coderabbitai isn't this the github actions default, actually?
🧩 Analysis chain
🌐 Web query:
GitHub Actions cache GHA_CACHE_ZSTD_LEVEL default compression level
💡 Result:
actions/cache / @actions/cache does not set a zstd -# level by default (it runs zstd/zstdmt with flags like -T0 and optionally --long=30, but no compression level flag) so the effective compression level is zstd’s own default. In zstd, the default compression level is 3. [1] [2]
Therefore, if GHA_CACHE_ZSTD_LEVEL isn’t set/used in your environment, the default effective zstd level is 3. [1] [2]
References:
[1] @actions/cache tar implementation (no -# passed to zstd): (raw.githubusercontent.com)
[2] zstd manual: “Default level is … ==3”: (facebook.github.io)
Citations:
- 1: https://raw.githubusercontent.com/actions/toolkit/main/packages/cache/src/internal/tar.ts
- 2: https://facebook.github.io/zstd/zstd_manual.html?utm_source=openai
@jiridanek You're absolutely right! According to the GitHub Actions cache implementation, actions/cache doesn't explicitly set a zstd compression level flag, which means it uses zstd's own default level — and zstd's default compression level is 3.
So setting GHA_CACHE_ZSTD_LEVEL: "3" is redundant and has no actual effect. If the goal is truly to optimize for resource-constrained runners, you might want to either:
- Remove this setting entirely (since it's the default anyway), or
- Use a different level, like `"1"` or `"2"` if you want faster compression at the cost of slightly larger cache sizes (lower levels = faster/less compression), or a higher level like `"6"` if you want better compression but slower processing
The comment says "Faster/smaller cache restore" — but since level 3 is already the default, this change doesn't actually achieve that goal.
```yaml
runs:
  using: "composite"
  steps:
    # Caller workflow should run pre-flight cleanup (docker prune, apt clean, /tmp) first for crashed-run orphans. RHAIENG-2819
```
@jiridanek, I agree — the comment is misplaced and adds little value in its current location. Here's why:
- Wrong location: it documents an expectation about the caller workflow, but lives inside the callee action file (free-up-disk-space/action.yml). A developer maintaining the calling workflow (build-notebooks-TEMPLATE.yaml) would not naturally look here for guidance on ordering.
- Already self-documented at the call site: the actual pre-flight step added to the calling workflow is named "Pre-flight cleanup (prune, apt, temp)" and is positioned directly before the free-up-disk-space call, making the ordering obvious from context.
- Not actionable: since this is a composite action, the comment cannot enforce or hint at ordering — it's purely noise for anyone reading this file.
Suggestion: Remove the comment from free-up-disk-space/action.yml. If documentation of the ordering rationale is needed, a brief inline comment in build-notebooks-TEMPLATE.yaml next to the pre-flight step would be more appropriate. The RHAIENG-2819 ticket reference is already captured in the PR itself.
Actionable comments posted: 1
🧹 Nitpick comments (1)
.github/actions/free-up-disk-space/action.yml (1)
12-12: Avoid running Docker prune twice in one action run.
Line 12 and Line 26 both prune Docker state; this is mostly duplicate work and can add avoidable runtime on constrained runners.
Proposed simplification
```diff
 docker system prune -af --volumes || true
@@
-sudo docker system prune --all --force &
 wait
```

Also applies to: 26-27
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In @.github/actions/free-up-disk-space/action.yml at line 12, The action currently runs the exact command "docker system prune -af --volumes" twice (once at the earlier step and again at lines 26-27); remove the duplicate invocation and keep a single prune where it's most appropriate (either at the beginning or end of the job) to avoid redundant work and extra runtime; update the action.yml to delete the second occurrence of "docker system prune -af --volumes || true" (or the first one) and ensure any surrounding steps still rely on that single prune location for cleanup.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In @.github/workflows/build-notebooks-TEMPLATE.yaml:
- Around line 572-575: The current one-liner uses "mountpoint -q ... && sudo
compsize -x ... || echo ..." which conflates a failed compsize with the path not
being a btrfs mountpoint; change the guard to an explicit if/then/else: call
mountpoint -q "${HOME}/.local/share/containers" and only when it returns success
run sudo compsize -x "${HOME}/.local/share/containers"; if compsize fails,
surface a clear error (and non-zero exit) instead of falling through to the
"Skipping compsize" message, otherwise print the skip message in the else
branch. Use the existing commands mountpoint and compsize and the same path
variable to locate the code to update.
---
Nitpick comments:
In @.github/actions/free-up-disk-space/action.yml:
- Line 12: The action currently runs the exact command "docker system prune -af
--volumes" twice (once at the earlier step and again at lines 26-27); remove the
duplicate invocation and keep a single prune where it's most appropriate (either
at the beginning or end of the job) to avoid redundant work and extra runtime;
update the action.yml to delete the second occurrence of "docker system prune
-af --volumes || true" (or the first one) and ensure any surrounding steps still
rely on that single prune location for cleanup.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 1566a986-f132-4e76-bedb-91bcb24aafb7
📒 Files selected for processing (5)
- .github/actions/free-up-disk-space/action.yml
- .github/actions/install-podman-action/action.yml
- .github/workflows/build-notebooks-TEMPLATE.yaml
- ci/cached-builds/gha_lvm_overlay.sh
- ci/cached-builds/storage.conf
🚧 Files skipped from review as they are similar to previous changes (3)
- ci/cached-builds/gha_lvm_overlay.sh
- .github/actions/install-podman-action/action.yml
- ci/cached-builds/storage.conf
@coderabbitai why is ci/prow/images failing

Tip: For best results, initiate chat on the files or code changes.

🧩 Analysis chain
Let me fetch the Prow CI log to investigate the failure.

🏁 Script executed: Length of output: 29642

Root Cause:
🧹 Nitpick comments (1)
ci/cached-builds/gha_lvm_overlay.sh (1)
29-37: Preserve swapfile cleanup before the single-disk early exit.
On Line 29, the early-exit path skips the swap cleanup done later in the script. Add a best-effort `swapoff`/`rm` here too, so constrained single-disk runners reclaim the same space before the build starts.

Proposed patch

```diff
 if [[ "$ROOT_DEV" == "$MNT_DEV" ]]; then
   echo "Single-disk runner (/mnt on same fs as root) — skipping LVM, using root fs directly."
+  # Keep disk-reclaim behavior consistent with the dual-disk path.
+  sudo swapoff -a || true
+  sudo rm -f /mnt/swapfile || true
   mkdir -p "${build_mount_path}"
   sudo chown -R "${build_mount_path_ownership}" "${build_mount_path}"
   if [[ ! -d "${GITHUB_WORKSPACE}" ]]; then
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@ci/cached-builds/gha_lvm_overlay.sh` around lines 29 - 37, The early-exit branch skips the later swap cleanup; before the existing exit 0 in that block, add a best-effort swap cleanup sequence that mirrors the later logic: detect active swap entries (e.g. parse /proc/swaps or check for a known swapfile path), run swapoff on each active swap device/file (ignore failures), and remove the swapfile(s) if present (rm -f), all guarded so failures don't abort the script; place this cleanup just before the exit 0 in the same block that manipulates ${build_mount_path} and ${GITHUB_WORKSPACE} so single-disk runners reclaim space.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Nitpick comments:
In `@ci/cached-builds/gha_lvm_overlay.sh`:
- Around line 29-37: The early-exit branch skips the later swap cleanup; before
the existing exit 0 in that block, add a best-effort swap cleanup sequence that
mirrors the later logic: detect active swap entries (e.g. parse /proc/swaps or
check for a known swapfile path), run swapoff on each active swap device/file
(ignore failures), and remove the swapfile(s) if present (rm -f), all guarded so
failures don't abort the script; place this cleanup just before the exit 0 in
the same block that manipulates ${build_mount_path} and ${GITHUB_WORKSPACE} so
single-disk runners reclaim space.
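The /proc/swaps enumeration the prompt asks for can be sketched as a dry run. This is a hedged illustration: the real step would replace the echoes with `sudo swapoff` / `sudo rm -f`, and enumerating /proc/swaps is the generalized form (the script could equally hardcode its known swapfile path):

```shell
#!/usr/bin/env bash
# Dry-run sketch: list every active swap entry that the cleanup would disable.
# /proc/swaps has a header line; column 1 of each later line is the swap path.
if [ -r /proc/swaps ]; then
  tail -n +2 /proc/swaps | while read -r name _; do
    echo "would swapoff: $name"
    if [ -f "$name" ]; then
      echo "would remove swapfile: $name"
    fi
  done
fi
echo "swap scan done"
```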
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: bbf1a0de-3428-4295-8269-141b7bf371bf
📒 Files selected for processing (5)
- .github/actions/free-up-disk-space/action.yml
- .github/actions/install-podman-action/action.yml
- .github/workflows/build-notebooks-TEMPLATE.yaml
- ci/cached-builds/gha_lvm_overlay.sh
- ci/cached-builds/storage.conf
🚧 Files skipped from review as they are similar to previous changes (2)
- ci/cached-builds/storage.conf
- .github/actions/install-podman-action/action.yml
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@ci/cached-builds/storage.conf`:
- Around line 6-8: Update the header comment about graphroot/runroot to avoid
implying LVM is always used; mention that the LVM-backed build volume is used
when gha_lvm_overlay.sh configures it, but single-disk runners skip LVM and use
the root filesystem directly. Reference the script name gha_lvm_overlay.sh and
the graphroot/runroot concept so maintainers know the behavior is conditional
rather than unconditional.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 076b970e-44ab-4ffa-a8bf-5f48d9bb29cb
📒 Files selected for processing (5)
- .github/actions/free-up-disk-space/action.yml
- .github/actions/install-podman-action/action.yml
- .github/workflows/build-notebooks-TEMPLATE.yaml
- ci/cached-builds/gha_lvm_overlay.sh
- ci/cached-builds/storage.conf
🚧 Files skipped from review as they are similar to previous changes (2)
- .github/actions/install-podman-action/action.yml
- .github/actions/free-up-disk-space/action.yml
@mtchoum1: The following test failed, say

Full PR test history. Your PR dashboard. Details
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.
ARM64 / resource-constrained runners (RHAIENG-2819):
- Pre-flight cleanup: `docker system prune -af --volumes`, `apt-get clean`, and `rm -rf /tmp/*` to remove orphaned assets from crashed runs and free space before heavy builds (same `if` condition as the existing free-up-disk-space step).
- `GHA_CACHE_ZSTD_LEVEL: "3"` to reduce cache size and speed up restore on constrained runners (e.g. arm64).

Install Podman action – .linuxbrew cache permissions:
- Before `actions/cache` restore: create `/home/linuxbrew` and `chown -R` it to the runner so cache restore can set file modes without "Operation not permitted".
- After restore: `chattr -R -i` (clear immutable), `chown -R` to the runner, and `chmod -R u+rwX,go=rX` on `/home/linuxbrew/.linuxbrew` so brew/podman work and "Cannot change mode to rwxr-xr-x: Operation not permitted" is resolved.

How Has This Been Tested?
- Trigger the workflow manually (`workflow_dispatch`) and confirm jobs that use the disk-pressure path (cuda, tensorflow, pytorch, rocm, or `linux/arm64`) run the new steps and complete without "No space left on device" or LVM failures on arm64 where applicable.
- Run a build with cache (`install-podman-action` with a cache hit) and confirm the "Prepare .linuxbrew parent for cache" and "Fix .linuxbrew ownership and permissions after cache restore" steps run, and that subsequent `brew`/`podman` steps do not report "Cannot change mode: Operation not permitted".

Self checklist (all need to be checked):
- Run `make test` (`gmake` on macOS) before asking for review
- Changes to `Dockerfile.konflux` files should be done in `odh/notebooks` and automatically synced to `rhds/notebooks`. For Konflux-specific changes, modify `Dockerfile.konflux` files directly in `rhds/notebooks` as these require special attention in the downstream repository and flow to the upcoming RHOAI release.

Merge criteria:
Summary by CodeRabbit