NO-JIRA: common.yaml: apply dracut fix in postcript#37

Merged
joelcapitao merged 2 commits into coreos:main from joelcapitao:apply_dracut_fix on Sep 10, 2025

@joelcapitao
Member

This is a follow-up to [1]: we have to apply the dracut fix here, as the new dracut release containing the same fix is landing in the F42 repo. More details in [2].

[1] coreos/fedora-coreos-config#3588
[2] coreos/fedora-coreos-tracker#1937

Member

@jbtrystram jbtrystram left a comment

LGTM

@joelcapitao
Member Author

joelcapitao commented Jul 10, 2025

It's currently failing because the postprocess script is executed twice: (1) the one coming from f-c-c and (2) the one defined here.
To unblock it, we have to merge coreos/fedora-coreos-config#3588 and bump the f-c-c submodule.

@joelcapitao joelcapitao changed the title common.yaml: apply dracut fix in postcript NO-JIRA: common.yaml: apply dracut fix in postcript Jul 10, 2025
@openshift-ci-robot

@jcapiitao: This pull request explicitly references no jira issue.

@joelcapitao
Member Author

/retest

@joelcapitao
Member Author

/test scos-10-build-test-metal

@dustymabe
Member

@jcapiitao can we remove the denylist entry for the multipath resilient test as part of this change?

@joelcapitao
Member Author

> @jcapiitao can we remove the denylist entry for the multipath resilient test as part of this change?

Right, I forgot about it. I'll amend my PR once the TestISO tests are fixed on scos-10 (spotted by Adam while monitoring the pipelines), to save CI resources.

@joelcapitao force-pushed the apply_dracut_fix branch 2 times, most recently from 53311bb to 4fa518f, on July 18, 2025 09:00

@joelcapitao
Member Author

> > @jcapiitao can we remove the denylist entry for the multipath resilient test as part of this change?
>
> Right, I forgot about it. I'll amend my PR once the TestISO tests are fixed on scos-10 (spotted by Adam while monitoring the pipelines), to save CI resources.

TestISO tests fixed with #41

@joelcapitao
Member Author

> @jcapiitao: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:
>
> | Test name | Commit | Details | Required | Rerun command |
> | --- | --- | --- | --- | --- |
> | ci/prow/scos-10-build-test-qemu | 4fa518f | link | true | /test scos-10-build-test-qemu |
>
> Full PR test history. Your PR dashboard.

@dustymabe looks like the dracut patch does not fix the mpath.resilient test failure

@dustymabe
Member

> > @jcapiitao: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:
> >
> > | Test name | Commit | Details | Required | Rerun command |
> > | --- | --- | --- | --- | --- |
> > | ci/prow/scos-10-build-test-qemu | 4fa518f | link | true | /test scos-10-build-test-qemu |
> >
> > Full PR test history. Your PR dashboard.
>
> @dustymabe looks like the dracut patch does not fix the mpath.resilient test failure

hmm. I thought that was the whole point of this backport:

@jlebon
Member

jlebon commented Jul 18, 2025

> hmm. I thought that was the whole point of this backport:

Hmm, Prow seems to be down currently so I can't check the logs, but yeah, that was the goal. Was the error message the same? It's entirely possible that a separate bug has come in since then that needs fixing.

@joelcapitao
Member Author

> hmm. I thought that was the whole point of this backport:

The backport was working for f-c-c, as we were able to drop the same denylist entry a few weeks ago [1].
But I think it has never worked in EL.

> Was the error message the same? It's entirely possible that a separate bug has come in since then that needs fixing.

coreos-multipath-wait.target is timing out [2]: "Timed out waiting for device dev-d… - /dev/disk/by-label/dm-mpath-boot."
For some reason, /dev/disk/by-uuid/dm-mpath-<uuid>.device is not used for bootdev; I need to double-check why.

[1] coreos/fedora-coreos-config@9a30544
[2] https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/pr-logs/pull/coreos_rhel-coreos-config/37/pull-ci-coreos-rhel-coreos-config-main-scos-10-build-test-qemu/1946133223624937472/artifacts/test/artifacts/kola/ext.config.shared.multipath.resilient/f8dbc52d-e675-49b8-93ea-fe4a98c9a838/console.txt

@jlebon
Member

jlebon commented Jul 21, 2025

> For some reason, /dev/disk/by-uuid/dm-mpath-<uuid>.device is not used for bootdev, I need to double check why.

Looks like this is happening on first boot. For first boot, we always rely on the boot label in the initramfs.

But yeah, quite odd. It's not clear why multipathd didn't take ownership of the devices.

@joelcapitao
Member Author

I found out that dracut is configuring multipath with find_multipaths strict on first boot, which prevents multipath from taking ownership of the devices:

198.706545 | wwid 0x000000000000000b not in wwids file, skipping sda
198.706546 | sda: orphan path, only one path
198.706565 | wwid 0x000000000000000b not in wwids file, skipping sdd
198.706566 | sdd: orphan path, only one path
198.706575 | wwid 0xe9b0c630525961ba not in wwids file, skipping sdc
198.706576 | sdc: orphan path, only one path
198.706586 | wwid 0xe9b0c630525961ba not in wwids file, skipping sdb
198.706587 | sdb: orphan path, only one path

dracut has set it to strict by default since dracut-ng/dracut-ng@1e802f1, but IIUC we are setting it to on in Fedora.
I'm missing something.
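
One way to read the log above: each WWID shows up on two devices, so two paths per disk are visible, yet every path is skipped because its WWID is not in the wwids file. That matches what multipath.conf(5) documents for find_multipaths strict, where only WWIDs already recorded in the wwids file are accepted. The sketch below groups the skipped devices by WWID to make that visible. The helper name `paths_by_wwid` is made up for illustration, and the regex only targets the exact message format quoted above.

```python
import re
from collections import defaultdict

# Log excerpt copied verbatim from the comment above.
LOG = """\
198.706545 | wwid 0x000000000000000b not in wwids file, skipping sda
198.706546 | sda: orphan path, only one path
198.706565 | wwid 0x000000000000000b not in wwids file, skipping sdd
198.706566 | sdd: orphan path, only one path
198.706575 | wwid 0xe9b0c630525961ba not in wwids file, skipping sdc
198.706576 | sdc: orphan path, only one path
198.706586 | wwid 0xe9b0c630525961ba not in wwids file, skipping sdb
198.706587 | sdb: orphan path, only one path
"""

# Matches only the "not in wwids file" lines; the "orphan path" lines are ignored.
SKIP_RE = re.compile(r"wwid (\S+) not in wwids file, skipping (\S+)")


def paths_by_wwid(log: str) -> dict[str, list[str]]:
    """Group skipped block devices by WWID (hypothetical helper)."""
    groups = defaultdict(list)
    for line in log.splitlines():
        match = SKIP_RE.search(line)
        if match:
            wwid, device = match.groups()
            groups[wwid].append(device)
    return dict(groups)


if __name__ == "__main__":
    for wwid, devices in paths_by_wwid(LOG).items():
        # A WWID with more than one device is a multipath candidate.
        print(wwid, sorted(devices), "candidate" if len(devices) > 1 else "single")
```

Both WWIDs come back with two devices each, which suggests the paths themselves are fine and the configuration is what keeps multipathd from claiming them.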

@joelcapitao
Member Author

It looks like the default /etc/multipath.conf generated by mpathconf from the device-mapper-multipath RPM is overwritten by dracut with:

# Minimal multipath.conf generated by dracut
# Avoid any devices being multipathed by default
defaults {
        find_multipaths strict
}

instead of

# device-mapper-multipath configuration file

# For a complete list of the default configuration values, run either:
# # multipath -t
# or
# # multipathd show config

# For a list of configuration options with descriptions, see the
# multipath.conf man page.

defaults {
        user_friendly_names no
        find_multipaths on
}

blacklist {
}
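
To tell the two files apart quickly (for example when inspecting an initramfs), a small check on the find_multipaths value in the defaults section is enough. The following is a minimal sketch, not a full multipath.conf parser: it assumes one keyword per line and `#` comments, and the helper name `find_multipaths_mode` is hypothetical.

```python
import re

# The two configurations quoted in the comment above (comments trimmed).
DRACUT_CONF = """\
# Minimal multipath.conf generated by dracut
# Avoid any devices being multipathed by default
defaults {
        find_multipaths strict
}
"""

RPM_CONF = """\
defaults {
        user_friendly_names no
        find_multipaths on
}

blacklist {
}
"""


def find_multipaths_mode(conf_text: str):
    """Return the find_multipaths value from the defaults section, or None."""
    in_defaults = False
    for raw in conf_text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line:
            continue
        if re.fullmatch(r"defaults\s*\{", line):
            in_defaults = True
        elif line == "}":
            in_defaults = False
        elif in_defaults:
            parts = line.split(None, 1)
            if len(parts) == 2 and parts[0] == "find_multipaths":
                return parts[1].strip('"')
    return None


if __name__ == "__main__":
    print("dracut-generated:", find_multipaths_mode(DRACUT_CONF))
    print("mpathconf default:", find_multipaths_mode(RPM_CONF))
```

A result of strict on first boot would point at the dracut-generated file being in use rather than the one shipped by the RPM.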

@joelcapitao
Member Author

joelcapitao commented Jul 22, 2025

We are missing https://src.fedoraproject.org/rpms/dracut/c/3baef943aa236562228b0fc34627f08604dfbe20?branch=rawhide in EL10

I've proposed applying part of the patch in the postprocess script.
Also, dracut 107 [1][2] is about to land in EL, so we won't need that for too long (hopefully). BTW, I've already built SCOS locally with this dracut 107 and all the mpath tests pass (i.e. without the denylist entries).

[1] https://gitlab.com/redhat/centos-stream/rpms/dracut/-/merge_requests/59
[2] https://kojihub.stream.rdu2.redhat.com/koji/taskinfo?taskID=5998758

@joelcapitao
Member Author

Hmm, I just realized that for EL9, it's another issue:

[    8.974747] coreos-propagate-multipath-conf[795]: cp: cannot create regular file '/sysroot/etc/': Not a directory

@jlebon
Member

jlebon commented Jul 22, 2025

Ahh yes, see also the discussions in dracut-ng/dracut-ng#509.

> Also, dracut 107 [1][2] is about to land in EL, so we won't need that for too long (hopefully).

I think we do still need to carry it until it gets backported to 9.6.

> Hmm, I just realized that for EL9, it's another issue:
>
> cp: cannot create regular file '/sysroot/etc/': Not a directory

This looks like a race. I saw something similar in coreos/fedora-coreos-config#3572, which we decided to fix with coreos/coreos-installer#1677. One way to check if that patch would fix this locally is to try changing the resilient test to swap the root= karg to root=/dev/disk/by-uuid/dm-mpath-UUID for the second boot.
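
The karg swap suggested above can be sketched as a plain string rewrite of the kernel command line. This is only an illustration of the idea: `swap_root_karg` is a hypothetical helper, the example command line is assumed rather than taken from the test, and UUID is left as the same placeholder used in the comment. How the resilient test would actually apply the new kargs is not shown here.

```python
def swap_root_karg(cmdline: str, new_root: str) -> str:
    """Replace the root= karg in a kernel command line, appending it if absent.

    Note: str.split() does not handle quoted kargs containing spaces.
    """
    out = []
    replaced = False
    for arg in cmdline.split():
        if arg.startswith("root="):
            out.append(f"root={new_root}")
            replaced = True
        else:
            out.append(arg)
    if not replaced:
        out.append(f"root={new_root}")
    return " ".join(out)


# Assumed example command line, for illustration only.
example = "rd.multipath=default root=/dev/disk/by-label/dm-mpath-root rw"

if __name__ == "__main__":
    # UUID is a placeholder, exactly as in the comment above.
    print(swap_root_karg(example, "/dev/disk/by-uuid/dm-mpath-UUID"))
```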

@joelcapitao
Member Author

> This looks like a race. I saw something similar in coreos/fedora-coreos-config#3572, which we decided to fix with coreos/coreos-installer#1677. One way to check if that patch would fix this locally is to try changing the resilient test to swap the root= karg to root=/dev/disk/by-uuid/dm-mpath-UUID for the second boot.

I'm not sure I fully understood the check method, so I took the other path: I backported the coreos-installer patch to the current CentOS c9s branch, scratch-built it [1], built SCOS 9 with that build in the overrides/rpm/ folder, and ran the ext.config.shared.multipath.resilient test. It went well 👍
So I'm going to create an r-c-c issue to track this, submit the MR to the c9s branch, and kola-deny the ext.config.shared.multipath.resilient test only for EL9.

Thank you for the heads-up here, it saved me a bunch of time!

@dustymabe dunno if you're aware, but the rhel10 tests (added in [2]) are failing: rhel-10.1-baseos and rhel-10.1-appstream are not discovered during the cosa fetch operation. I think it's because the content_sets.yaml [3] file does not reflect the rhel-10.1 repos, but I'm not sure. I compared with what we have in our RHCOS pipeline, and we can currently see this in the logs [4]:

[2025-07-18T13:44:34.986Z] Warning: No corresponding entry in content_sets.yaml for rhel-10.1-baseos
[2025-07-18T13:44:34.986Z] Warning: No corresponding entry in content_sets.yaml for rhel-10.1-appstream

[1] https://kojihub.stream.rdu2.redhat.com/koji/taskinfo?taskID=6001337
[2] openshift/release#67305
[3] https://gitlab.cee.redhat.com/coreos/redhat-coreos/-/blob/master/content_sets.yaml?ref_type=heads#L87-L101
[4] https://jenkins-rhcos--prod-pipeline.apps.int.prod-stable-spoke1-dc-iad2.itup.redhat.com/blue/organizations/jenkins/build/detail/build/537/pipeline/180/
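
To list which repos are affected without reading the whole pipeline log, the warnings quoted above can be collected with a short filter. This is a sketch that assumes the warning text keeps exactly the format shown in the comment; `missing_content_sets` is a hypothetical helper name.

```python
import re

# Warning lines copied from the comment above.
LOG = """\
[2025-07-18T13:44:34.986Z] Warning: No corresponding entry in content_sets.yaml for rhel-10.1-baseos
[2025-07-18T13:44:34.986Z] Warning: No corresponding entry in content_sets.yaml for rhel-10.1-appstream
"""

WARN_RE = re.compile(r"No corresponding entry in content_sets\.yaml for (\S+)")


def missing_content_sets(log: str) -> list[str]:
    """Return the sorted, de-duplicated repo names named in the warnings."""
    return sorted({match.group(1) for match in WARN_RE.finditer(log)})


if __name__ == "__main__":
    for repo in missing_content_sets(LOG):
        print(repo)
```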

@joelcapitao
Member Author

/test rhcos-9-build-test-metal

@joelcapitao joelcapitao requested a review from jlebon September 4, 2025 08:46
travier previously approved these changes Sep 4, 2025
Member

@travier travier left a comment

Ugly but it's a hopefully temporary workaround.
/lgtm

@travier
Member

travier commented Sep 4, 2025

Do we have a RHEL bug for the dracut fixes?

@joelcapitao
Member Author

/test rhcos-9-build-test-metal

@joelcapitao
Member Author

/test rhcos-9-build-test-metal

{ test "rhcos-9-build-test-metal" failed: could not watch pod: the pod ci-op-rj5ktf62/rhcos-9-build-test-metal failed after 57m21s (failed containers: ): Evicted The node was low on resource: ephemeral-storage. Threshold quantity: 20525086511, available: 18289916Ki. Container test was using 22728644Ki, request is 0, has larger consumption of ephemeral-storage. }

Hitting the same infra issue again

@joelcapitao
Member Author

> Ugly but it's a hopefully temporary workaround.

Yeah, I tried using patch with coreos/fedora-coreos-config#3508 (comment) for better readability, but it was not available, so I dealt with sed.

> Do we have a RHEL bug for the dracut fixes?

@jlebon wanted to enable multipath user-friendly names, cf. coreos/fedora-coreos-tracker#1937 (comment). For Fedora and EL10, the latest shipped dracut NVR contains the commits enabling it.

@joelcapitao
Member Author

/hold
waiting for answer https://issues.redhat.com/browse/RHEL-96101

This is a follow-up of [1], we have to apply the dracut fix here as
the new dracut release containing the same fix landed in F42 and EL10
repos. More details in [2][3]

[1] coreos/fedora-coreos-config#3588
[2] coreos/fedora-coreos-tracker#1937
[3] https://issues.redhat.com/browse/RHEL-87490

Applying the dracut patch in a postprocess script fixes the test
failure for EL10.
For EL9, we still have to ignore the tests while waiting for the
new coreos-installer NVR to land in the repo. See tracker URL.

@joelcapitao
Member Author

/test scos-9-build-test-qemu
/test rhcos-9-build-test-qemu

Member

@jlebon jlebon left a comment

Thanks!

/approve
/lgtm

@jbtrystram
Member

/unhold

Member

@jbtrystram jbtrystram left a comment

/lgtm

@openshift-ci
Contributor

openshift-ci bot commented Sep 10, 2025

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: jbtrystram, jcapiitao, jlebon, travier

@joelcapitao joelcapitao merged commit c01e124 into coreos:main Sep 10, 2025
13 checks passed
@openshift-ci
Contributor

openshift-ci bot commented Sep 10, 2025

@jcapiitao: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
| --- | --- | --- | --- | --- |
| ci/prow/rhcos-9-build-test-metal | e7cea39 | link | unknown | /test rhcos-9-build-test-metal |
| ci/prow/scos-9-build-test-qemu | e7cea39 | link | unknown | /test scos-9-build-test-qemu |
