@jbtrystram
Member

Although mpath-var-lib-containers.service is set to run only on first boot, it sometimes runs twice when the system reboots too early. In low-load CI environments, the reboot in this test sometimes happens before systemd's first-boot-complete.target is reached. This leaves ConditionFirstBoot still true at the next boot, so the mpath service runs again and fails, because it already ran during the actual first boot.

A previous attempt[1] at fixing this reduced the flake, but it happened again, and I noticed that systemd didn't reach this target before the reboot:
"Reached target first-boot-complete.target - First Boot Complete" only appears in the logs after the second boot.

Likely fixes coreos/rhel-coreos-config#66

[1] abd0c18

Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request aims to fix a race condition in multipath tests by waiting for systemd's first-boot-complete.target before rebooting. The approach is sound, but the implementation of the waiting logic has a bug that would prevent it from working as intended. I've provided a detailed comment with a suggested fix for the waiting function. Once that is addressed, the change should effectively resolve the flakiness issue.

@jbtrystram jbtrystram force-pushed the mpath_wait_firstboot_complete branch 2 times, most recently from 87b1989 to 6c14eb0 Compare October 31, 2025 13:49
verifyMultipath(c, m, "/var/lib/containers")
}

func waitForCompleteFirstboot(c cluster.TestCluster) {
Member

optional:

rather than running a retry loop here, you could run `systemd-run --wait` inside a `util.WaitUntilReady`, as is done here.

i.e. rather than worrying about how many times to retry, you can just focus on how much time you want to spend waiting.

Member Author

Thanks for the suggestion. Applied it.

dustymabe previously approved these changes Oct 31, 2025
@jbtrystram jbtrystram force-pushed the mpath_wait_firstboot_complete branch 2 times, most recently from c12a98d to 02bcc86 Compare October 31, 2025 14:27
Although `mpath-var-lib-containers.service` is set to run only on first
boot, it sometimes runs twice when the system reboots too early.
In low-load CI environments, the reboot in this test sometimes happens
before systemd's `first-boot-complete.target` is reached. This leaves
`ConditionFirstBoot` still true at the next boot, so the mpath service
runs again and fails, because it already ran during the actual first
boot.

A previous attempt[1] at fixing this reduced the flake, but it happened
again, and I noticed that systemd didn't reach this target before the
reboot:
`Reached target first-boot-complete.target - First Boot Complete`
only appears in the logs after the second boot.

Likely fixes coreos/rhel-coreos-config#66

[1] coreos@abd0c18
@jbtrystram jbtrystram force-pushed the mpath_wait_firstboot_complete branch from 02bcc86 to 50e5cd6 Compare October 31, 2025 14:36
@jbtrystram jbtrystram enabled auto-merge (rebase) October 31, 2025 14:37
Member

@dustymabe dustymabe left a comment

LGTM

@jbtrystram jbtrystram merged commit 82410e1 into coreos:main Oct 31, 2025
6 checks passed
Development

Successfully merging this pull request may close these issues.

multipath.partition failing on el10
