@Hanarion

Description

Fixes: #12122

This PR resolves the random backup failures observed when using a CIFS (SMB) backup repository with NAS backup. The original issue describes how backups appear to complete (files transferred, file remaining = 0) but the job ends in status FAILED because the subsequent sync + umount step fails: the mount point remains busy and cannot be unmounted cleanly.

What was happening:

After the data copy, the script issues sync, but because CIFS doesn't always flush and close all filesystem handles immediately, the mount remains busy.

The script's umount $mount_point then fails ("target is busy"). The mount and its directory are left behind, and the job fails even though the backup data is present.

The issue is intermittent ("sometimes it fails, sometimes it doesn't") because of timing/race conditions with CIFS.

What this PR implements:

Adds a polling loop (using fuser -m <mount_point>) with a timeout that waits for any active handles on the mount to clear before attempting umount.

If the mount remains busy past the timeout, the script prints an error message and still attempts the umount, since it may yet succeed.

Also ensures that the umount is triggered for backups of stopped VMs.
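The wait-then-unmount behaviour described above can be sketched roughly as follows. This is an illustrative outline, not the exact code in the PR: the function names `is_busy`, `wait_until_idle`, and `unmount_when_idle` are hypothetical, and the 10-second default is taken from the review summary of the change.

```bash
#!/usr/bin/env bash

# Hypothetical helper: succeeds while some process still uses the filesystem.
# fuser -m exits 0 when it finds at least one such process.
is_busy() {
  fuser -m "$1" >/dev/null 2>&1
}

# Hypothetical helper: poll once per second until the mount is idle.
# Returns 0 if it became idle, 1 if the timeout expired first.
wait_until_idle() {
  local mount_point=$1 timeout=${2:-10} elapsed=0
  while is_busy "$mount_point"; do
    if (( elapsed >= timeout )); then
      return 1
    fi
    sleep 1
    elapsed=$((elapsed + 1))
  done
  return 0
}

# Sketch of the unmount step: warn on timeout, try umount anyway,
# and only remove the directory if umount succeeded.
unmount_when_idle() {
  local mount_point=$1 timeout=${2:-10} output
  if ! wait_until_idle "$mount_point" "$timeout"; then
    echo "Timeout for unmounting reached: still busy"
  fi
  if output=$(umount "$mount_point" 2>&1); then
    rmdir "$mount_point"
  else
    echo "Warning: failed to unmount $mount_point, skipping rmdir"
    echo "umount error message: $output"
  fi
}
```

In this sketch a failed umount only produces warnings and the function still returns 0, which matches the exit status shown in the test run in this description.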

Types of changes

  • Breaking change (fix or feature that would cause existing functionality to change)
  • New feature (non-breaking change which adds functionality)
  • Bug fix (non-breaking change which fixes an issue)
  • Enhancement (improves an existing feature and functionality)
  • Cleanup (Code refactoring and cleanup, that may add test cases)
  • Build/CI
  • Test (unit or integration test code)

Feature/Enhancement Scale or Bug Severity

Feature/Enhancement Scale

  • Major
  • Minor

Bug Severity

  • BLOCKER
  • Critical
  • Major
  • Minor
  • Trivial

Screenshots (if appropriate):

How Has This Been Tested?

I ran multiple tests by calling the script directly and checking the return code while blocking the umount:

[root@compute01 ~]# /usr/bin/bash /usr/share/cloudstack-common/scripts/vm/hypervisor/kvm/nasbackup.sh -o backup -v i-12-606-VM -t cifs -s '/XXXXXX.XXXXX/XXX' -m 'vers=3.0,username=XXXXXX,password=XXXXXX' -p 'i-12-606-VM/test' -q false -d ''

Job type:         Completed   
Operation:        Backup      
Time elapsed:     32208        ms
File processed:   23.000 GiB
File remaining:   0.000 B
File total:       23.000 GiB

2770737887
Timeout for unmounting reached: still busy
Warning: failed to unmount /tmp/csbackup.weorL, skipping rmdir
umount error message: umount: /tmp/csbackup.weorL: target is busy.
[root@compute01 ~]# echo $?
0
[root@compute01 ~]# grep -i unmount /var/log/cloudstack/agent/agent.log
2025-11-25 13-42-17> Warning: failed to unmount /tmp/csbackup.weorL, error: umount: /tmp/csbackup.weorL: target is busy.
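
The description does not say how the umount was blocked during these runs. One way to provoke "target is busy" (an assumption, not necessarily the author's method) is to park a background process whose working directory is inside the mount point. `hold_busy` and `release_busy` are hypothetical helper names.

```bash
#!/usr/bin/env bash

# Hypothetical helper: keep a directory "in use" by starting a background
# sleep whose working directory is inside it. The PID is stored in
# holder_pid so the holder can be released later.
hold_busy() {
  local dir=$1 seconds=${2:-60}
  ( cd "$dir" && exec sleep "$seconds" ) >/dev/null 2>&1 &
  holder_pid=$!
}

# Hypothetical helper: stop the holder so the directory becomes idle again.
release_busy() {
  kill "$holder_pid" 2>/dev/null
  wait "$holder_pid" 2>/dev/null
  return 0
}
```

The mount point in the log above (`/tmp/csbackup.weorL`) appears to be created at run time, so a helper like this would have to be pointed at that directory while the backup is still running.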

How did you try to break this feature and the system with this change?

This change should not break anything: it only fixes the wrong return code when umount fails and adds more detail to stdout and the logs.

@boring-cyborg

boring-cyborg bot commented Nov 25, 2025

Congratulations on your first Pull Request and welcome to the Apache CloudStack community! If you have any issues or are unsure about anything, please check our Contribution Guide (https://github.com/apache/cloudstack/blob/main/CONTRIBUTING.md)
@sureshanaparti
Contributor

@blueorangutan package

Copilot finished reviewing on behalf of sureshanaparti November 25, 2025 13:15
@blueorangutan

@sureshanaparti a [SL] Jenkins job has been kicked to build packages. It will be bundled with KVM, XenServer and VMware SystemVM templates. I'll keep you posted as I make progress.

Contributor

Copilot AI left a comment

Pull request overview

This PR addresses intermittent backup failures when using CIFS (SMB) repositories by introducing a new umount_operation() function that waits for mount points to become idle before attempting to unmount. The key improvement is adding a polling mechanism with timeout to handle race conditions where CIFS doesn't immediately flush filesystem handles.

  • Adds umount_operation() function with 10-second timeout and busy-wait logic using fuser
  • Replaces direct umount + rmdir calls with umount_operation() in backup functions
  • Adds error logging and warning messages for unmount failures
Comments suppressed due to low confidence (1)

scripts/vm/hypervisor/kvm/nasbackup.sh:206

  • The delete_backup and get_backup_stats functions still use the old unmount pattern (umount $mount_point followed by rmdir $mount_point) instead of the new umount_operation() function. For consistency and to apply the same safety checks across all operations, these functions should also use umount_operation().
delete_backup() {
  mount_operation

  rm -frv $dest
  sync
  umount $mount_point
  rmdir $mount_point
}

get_backup_stats() {
  mount_operation

  echo $mount_point
  df -P $mount_point 2>/dev/null | awk 'NR==2 {print $2, $3}'
  umount $mount_point
  rmdir $mount_point
}
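
A minimal sketch of what this suggestion could look like, with both functions delegating to `umount_operation`. The `mount_operation` and `umount_operation` bodies below are stand-ins so the example runs without root or a real share; the way `dest` is derived and the `backup_dir` variable are assumptions, not taken from nasbackup.sh.

```bash
#!/usr/bin/env bash

# Stand-in for the script's mount_operation: creates a scratch directory
# instead of mounting a share (assumption, for illustration only).
mount_operation() {
  mount_point=$(mktemp -d "${TMPDIR:-/tmp}/csbackup.XXXXXX")
  dest="$mount_point/${backup_dir:-example-backup}"
  mkdir -p "$dest"
}

# Stand-in for the PR's umount_operation: the real function waits for the
# mount to become idle and unmounts it; here we only remove the directory.
umount_operation() {
  rmdir "$mount_point"
}

delete_backup() {
  mount_operation

  rm -frv "$dest"
  sync
  umount_operation   # replaces: umount $mount_point; rmdir $mount_point
}

get_backup_stats() {
  mount_operation

  echo "$mount_point"
  df -P "$mount_point" 2>/dev/null | awk 'NR==2 {print $2, $3}'
  rmdir "$dest"      # only needed because the stand-in created it
  umount_operation   # replaces: umount $mount_point; rmdir $mount_point
}
```

With this shape, any change to the waiting or logging logic would apply to every operation that mounts the repository.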

exit 1
fi
}

Copilot AI Nov 25, 2025

The umount_operation() function lacks documentation explaining its purpose, behavior, and return value. Consider adding a comment block describing: (1) that it waits up to 10 seconds for the mount point to become idle, (2) that it attempts to unmount and remove the directory, and (3) its error handling behavior (currently does not fail the script on unmount failure).

Suggested change
# umount_operation
# ----------------
# Waits up to 10 seconds for the mount point to become idle (not in use).
# Attempts to unmount the specified mount point. If unmounting succeeds,
# removes the mount directory. If unmounting fails, logs a warning and
# does not remove the directory. This function does not cause the script
# to exit on unmount failure; the script continues execution.

else
echo "Warning: failed to unmount $mount_point, skipping rmdir"
echo "umount error message: $umount_output"
log -ne "Warning: failed to unmount $mount_point, error: $umount_output"

Copilot AI Nov 25, 2025

The function does not return a non-zero exit code when unmount fails. Since the script uses set -eo pipefail at the top, and this function temporarily disables errors with set +e, a failed unmount will not cause the script to exit with an error status. This means backup jobs will report success even when the unmount fails, which contradicts the PR's goal of fixing "random backup failures." Consider adding return 1 or exit 1 in the else branch (lines 244-248) to ensure proper error propagation.

Suggested change
- log -ne "Warning: failed to unmount $mount_point, error: $umount_output"
+ log -ne "Warning: failed to unmount $mount_point, error: $umount_output"
+ return 1
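
The effect of returning non-zero under `set -e` can be illustrated with a small stand-alone demo. Everything in it is hypothetical (`failing_umount`, `umount_step`, `run_variant`, and the `/tmp/example` path); it only shows how the two choices change the script's exit status, and which one is appropriate depends on whether a failed unmount should fail the backup job.

```bash
#!/usr/bin/env bash

# The demo runs in a child bash so that its "set -e" cannot affect the
# caller. failing_umount stands in for a umount that reports a busy target.
demo_script='
set -eo pipefail
failing_umount() { echo "umount: /tmp/example: target is busy." >&2; return 1; }
umount_step() {
  local output rc
  set +e                      # tolerate the failure long enough to log it
  output=$(failing_umount 2>&1)
  rc=$?
  set -e
  if (( rc != 0 )); then
    echo "Warning: failed to unmount, error: $output"
    if [[ $1 == propagate ]]; then return 1; fi
  fi
}
umount_step "$1"
echo "script reached the end"
'

# run_variant swallow   -> warning is printed, script continues, exit status 0
# run_variant propagate -> warning is printed, script stops, exit status 1
run_variant() {
  bash -c "$demo_script" demo "$1"
}
```

Without the `return 1`, the failure is only visible in the output and logs, which is the behaviour shown in the PR's test run (`echo $?` prints 0).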

Comment on lines +224 to +231
elapsed=0
while fuser -m "$mount_point" >/dev/null 2>&1 && (( elapsed < 10 )); do
sleep 1
elapsed=$((elapsed + 1))
done

# Check if timeout was reached
if (( elapsed >= 10 )); then

Copilot AI Nov 25, 2025

The timeout value 10 is hardcoded twice in the function (line 225 and line 231). If the timeout needs to be changed in the future, it must be updated in both places, increasing the risk of inconsistency. Consider using a local variable at the start of the function (e.g., local timeout=10) and referencing it in both locations.

Suggested change
- elapsed=0
- while fuser -m "$mount_point" >/dev/null 2>&1 && (( elapsed < 10 )); do
- sleep 1
- elapsed=$((elapsed + 1))
- done
- # Check if timeout was reached
- if (( elapsed >= 10 )); then
+ local timeout=10
+ elapsed=0
+ while fuser -m "$mount_point" >/dev/null 2>&1 && (( elapsed < timeout )); do
+ sleep 1
+ elapsed=$((elapsed + 1))
+ done
+ # Check if timeout was reached
+ if (( elapsed >= timeout )); then

@codecov

codecov bot commented Nov 25, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 3.58%. Comparing base (8171d95) to head (895d09b).

❗ There is a different number of reports uploaded between BASE (8171d95) and HEAD (895d09b).

HEAD has 3 fewer uploads than BASE:
Flag       BASE (8171d95)  HEAD (895d09b)
uitests    2               1
unittests  2               0
Additional details and impacted files
@@              Coverage Diff              @@
##               main   #12133       +/-   ##
=============================================
- Coverage     17.56%    3.58%   -13.98%     
=============================================
  Files          5912      445     -5467     
  Lines        529383    37536   -491847     
  Branches      64660     6901    -57759     
=============================================
- Hits          92984     1347    -91637     
+ Misses       425941    36025   -389916     
+ Partials      10458      164    -10294     
Flag       Coverage Δ
uitests    3.58% <ø> (ø)
unittests  ?

@blueorangutan

Packaging result [SF]: ✔️ el8 ✔️ el9 ✔️ el10 ✔️ debian ✔️ suse15. SL-JID 15829

@DaanHoogland
Contributor

@Hanarion do you want this on v23 or on the next LTS iteration of 20 or 22? (main will not go in/on those)

@Hanarion
Author

@DaanHoogland I personally use the latest version, and it seems that 4.20 supports only NFS anyway, so sync should work fine in that version.

@DaanHoogland
Contributor

@DaanHoogland I personally use the latest version, and it seems that 4.20 supports only NFS anyway, so sync should work fine in that version.

ok, how about 22.1, though? (rebase on the 4.22 branch)

@Hanarion
Author

ok, how about 22.1, though? (rebase on the 4.22 branch)

Up to you! I'm not very familiar yet with how things usually work in the project, so I'm fine with it if you think rebasing on the 4.22 branch is the better option.

@DaanHoogland
Contributor

ok, how about 22.1, though? (rebase on the 4.22 branch)

Up to you! I'm not very familiar yet with how things usually work in the project, so I'm fine with it if you think rebasing on the 4.22 branch is the better option.

For LTS releases we have release branches that get merged forward once in a while. For the recent 22.0 release we created 4.22 as the release branch. I think your fix is legitimate to add to that branch.

Development

Successfully merging this pull request may close these issues.

CIFS NAS Backups failing "randomly"

5 participants