MCO-1891: Add enhancement for install time Image Mode #1823
Skipping CI for Draft Pull Request. |
[APPROVALNOTIFIER] This PR is NOT APPROVED.
@umohnani8: This pull request references MCO-1527 which is a valid jira issue. Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.20.0" version, but no target version was set.
enhancements/machine-config/install-time-support-for-image-mode-on-openshift.md
Inactive enhancement proposals go stale after 28d of inactivity. See https://github.com/openshift/enhancements#life-cycle for details. Mark the proposal as fresh by commenting If this proposal is safe to close now please do so with /lifecycle stale |
/remove-lifecycle stale |
Add an enhancement for supporting install time of image mode on OpenShift. Signed-off-by: Urvashi <[email protected]>
/assign @jlebon |
@umohnani8: all tests passed! Full PR test history. Your PR dashboard. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here. |
I feel like this implicitly needs the "eliminate reboots" epic to work. Should we link the two somehow? If that's not needed, maybe we should clarify that in the workflow.
creation-date: 2025-07-25
last-updated: 2025-09-29
tracking-link: # link to the tracking ticket (for example: Jira Feature or Epic ticket) that corresponds to this enhancement
  - https://issues.redhat.com/browse/MCO-1347
nit: I think the intent here is to link to the implementation epic for the feature
tracking-link: # link to the tracking ticket (for example: Jira Feature or Epic ticket) that corresponds to this enhancement
  - https://issues.redhat.com/browse/MCO-1347
see-also:
  - "https://github.com/openshift/enhancements/pull/1515/files?short_path=9f0c5f1#diff-9f0c5f1adabad0dfbdb3c9a5b66e53d4fc6619274d7a4c260508d148de17f5c1"
nit: since the OCL enhancement is merged, maybe we should link to that directly

We want to bring that support to install time so cluster admins can customize and configure their OS from the beginning instead of having to do it post-install.

OpenShift's current approach to OS management provides excellent consistency and supportability through RHCOS (Red Hat CoreOS), but it requires cluster admins to cede certain configurability aspects to the platform. As workloads become more specialized there is an increasing need for OS-level customization that can be applied from Day 0. An obvious example today is how AI workloads require specific hardware drivers and configurations on nodes.
nit: I think this is the only place we refer to "Day 0". Possibly better to be consistent and just call it install time support everywhere
### Goals

- Enable cluster admins to use Image Mode on OpenShift at install time
- User creates a customized OS container image to be used when a cluster is installed
This line feels more like a user requirement in the workflow as opposed to a goal

- Enable cluster admins to use Image Mode on OpenShift at install time
- User creates a customized OS container image to be used when a cluster is installed
- Maintain OpenShift's single-click upgrade experience maintaining customizations across upgrades
Perhaps we don't need to talk about upgrade since this shouldn't touch on that?
- Parse MOSC configurations from the installation directory
- Update the MCD first boot service to use the seeded pre-built custom OS container image

#### Node Deployment Process
This is also the same as we have today right (for non layered pools)?
  name: worker
  annotations:
    # Key annotation that triggers hybrid workflow
    machineconfiguration.openshift.io/pre-built-image: "registry.example.com/custom-rhcos-worker:latest@sha256:abc123def456..."
Following up on the question above: does the user have to provide this? What happens if they provide a MOSC without it? Should we fail the install?
Yes, this is mandatory and needs to be provided by the user. I plan to add some validations to check for this and fail if it is not provided.
Isn't that the case where you want to have a day-1 MOSC defined so that pure OCL kicks in once the cluster is up? (I.e. that annotation is missing, but it's on purpose.)
I guess it's fine to say "don't put that into the install manifests and do it purely as a day 2 operation" which I guess prevents any race conditions from our end
Seems odd to impose limits on this though. (Also possibly technically a compatibility break assuming day-1 MOSCs work today?)
But I also understand wanting to try to catch users missing this. I think in practice, they're probably copying from our docs and they just need to fill in the templated bits? I.e. maybe we don't need to worry about this.
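One way to reconcile the two positions in this thread is sketched below: a MOSC without the annotation is treated as a plain day-1 MOSC (no seeding), while an annotation that is present but not digest-pinned is rejected. This is only an illustration, not the MCO's actual validation. The function name `pre_built_image`, the dict-shaped MOSC, and the strict 64-hex-character sha256 check are all assumptions made for the example.

```python
import re
from typing import Optional

ANNOTATION = "machineconfiguration.openshift.io/pre-built-image"

# Assumption for this sketch: a valid value is a pullspec (optionally with a
# tag) followed by "@sha256:" and exactly 64 lowercase hex characters.
DIGEST_RE = re.compile(r"[^@\s]+@sha256:[0-9a-f]{64}")


def pre_built_image(mosc: dict) -> Optional[str]:
    """Return the pre-built pullspec, or None when the annotation is absent.

    Raises ValueError when the annotation is present but not digest-pinned.
    """
    annotations = mosc.get("metadata", {}).get("annotations") or {}
    image = annotations.get(ANNOTATION)
    if image is None:
        # No annotation: treat as a regular MOSC and skip seeding.
        return None
    if not DIGEST_RE.fullmatch(image):
        raise ValueError(
            f"{ANNOTATION} must be a digest-pinned pullspec, got {image!r}"
        )
    return image
```

Whether a missing annotation should instead fail the install is exactly the open question above; the sketch only shows that both behaviours can be distinguished cheaply at bootstrap.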

We will use the same MachineOSConfig and MachineOSBuild CRD definitions as already used for post-install customizations with some new annotations. The API for this can be found [here](https://github.com/openshift/api/blob/master/machineconfiguration/v1).

To know more about the post-install image mode workflow, please refer to the enhancement [here](https://github.com/openshift/enhancements/pull/1515/files?short_path=9f0c5f1#diff-9f0c5f1adabad0dfbdb3c9a5b66e53d4fc6619274d7a4c260508d148de17f5c1).
(should probably link to the main enhancement)
- Complete install time image mode on OpenShift installation workflow
- Multiple machine config pool scenarios

#### E2E Tests
Testing will be pretty involved I think. Maybe we can expand on that a bit more here?

#### Story 3: Edge Computing Deployment

As a cluster admin deploying edge computing solutions, I want to include edge-specific drivers and monitoring agents in my OS image during installation, so that my edge nodes are immediately functional without requiring network connectivity for post-installation customization.
Reading the rest of the enhancement, I don't think we're covering this use case (this also applies to the "Limited support for specialized hardware requirements during installation" point above).
This sounds more like the custom ISO or built-in registry support that the agent team is working on, and we don't explicitly cover this case without additional support outside of the cluster workflow (but perhaps this is also a prerequisite for that to work end to end?).

### Non-Goals

- Replacing the existing ignition-based configuration system entirely
is this an outcome? just confirming
This is a non-goal, so we are definitely not planning on doing this by adding install-time OCL support.

## Motivation

Image Mode on OpenShift became GA in OCP 4.18.21+ and has been well received as a post-install OS customization option.
maybe state explicitly this enhancement is OCP ≥ X.Y.Z only (since you reference 4.18.21+ for post-install)

The install time Image Mode on OpenShift workflow integrates with the existing OpenShift installation process in the following steps.

#### Step 1: Installation Configuration with Pre-Built Container Image
The proposal assumes users will “just provide” a digest image and a push secret. What happens if they don’t? I’d suggest calling out pre-flight validations (either in openshift-install or bootstrap MCO) that check:
- the registry is reachable
- the digest resolves correctly
- the secret actually works for both pull and push.
This would save admins from failing mid-bootstrap.
Yes, we will document that these resources need to be added to the manifests directory before install. We also plan to add some validation checks during bootstrap to at least test that we can communicate with the registry and that the image is valid.
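The pre-flight idea in this thread could be structured as a small runner that executes every check and reports all failures together, so an admin sees a bad digest and a broken secret in one pass rather than one per failed install. The sketch below is an assumption about shape only: `run_preflight` and the check names are hypothetical, and the real checks (registry reachability, digest resolution, pull/push with the secret) are left as injected callables.

```python
from typing import Callable, Dict, List


def run_preflight(checks: Dict[str, Callable[[], None]]) -> List[str]:
    """Run every check and return one message per failure.

    Each check is a zero-argument callable that raises on failure.
    An empty list means all checks passed.
    """
    failures = []
    for name, check in checks.items():
        try:
            check()
        except Exception as err:
            # Keep going so that every problem is reported at once.
            failures.append(f"{name}: {err}")
    return failures


# Usage with stand-in checks; real ones would talk to the registry.
def registry_reachable() -> None:
    pass


def push_secret_works() -> None:
    raise RuntimeError("401 unauthorized")


problems = run_preflight(
    {"registry reachable": registry_reachable, "push secret": push_secret_works}
)
```

Collecting failures instead of stopping at the first one is a design choice for the example; failing fast would also be reasonable during bootstrap.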
1. MCO controller's syncMachineOSConfigs() function reads MachineOSConfig files from `/etc/mcs/bootstrap/machine-os-configs/`
2. MCO creates MachineOSConfig objects in the Kubernetes API if they don't already exist
3. Build controller detects the `machineconfiguration.openshift.io/pre-built-image` annotation on the newly created MachineOSConfig
4. Build controller triggers seeding workflow by calling `seedMachineOSConfigWithExistingImage()`
We might need a workflow if install-time seeding fails
If install-time seeding is not working, the plan here is to just fail completely. For the initial implementation, it doesn't make sense to fall back onto a non-OCL installation.
In the future, when we have support for building the image during bootstrap, we can fall back to that if the install-time seeding fails. But for the scope of this enhancement, we just fail and report the errors/failures in a detailed and clear way.

- Enable cluster admins to use Image Mode on OpenShift at install time
- User creates a customized OS container image to be used when a cluster is installed
- Maintain OpenShift's single-click upgrade experience maintaining customizations across upgrades
What if a seeded image was built against an older RHCOS (say 4.17), but the cluster is installing 4.18? Will we fail early, or will nodes come up mismatched?
   - `PreBuiltImage` label set to "true"
   - Proper metadata including MOSC name, DigestedImagePushSpec, and MCP reference
   - Success condition with reason "PreBuiltImageSeeded"
6. MCO updates MachineOSConfig status with `CurrentImagePullSpec` pointing to the pre-built image
what happens a week later when the admin scales up new workers? Will they land on the pre-seeded image?
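For readers trying to picture the seeded objects described in the hunk above, here is a rough sketch of the synthetic MachineOSBuild and the MOSC status update as plain dicts. Only the `PreBuiltImage` label, the "PreBuiltImageSeeded" reason, and the `DigestedImagePushSpec` / `CurrentImagePullSpec` names come from the enhancement text. The function name, the `-prebuilt` object name, the condition type, and the exact field layout are assumptions for illustration and may not match the real API.

```python
from typing import Tuple


def seed_from_pre_built(mosc_name: str, pool_name: str, image: str) -> Tuple[dict, dict]:
    """Build a synthetic MOSB and the matching MOSC status for a pre-built image."""
    mosb = {
        "kind": "MachineOSBuild",
        "metadata": {
            # Hypothetical naming scheme, used only in this sketch.
            "name": f"{mosc_name}-prebuilt",
            "labels": {"PreBuiltImage": "true"},
        },
        "spec": {
            "machineOSConfig": {"name": mosc_name},
            "machineConfigPool": {"name": pool_name},
        },
        "status": {
            "digestedImagePushSpec": image,
            "conditions": [
                {
                    "type": "Succeeded",  # assumed condition type
                    "status": "True",
                    "reason": "PreBuiltImageSeeded",
                }
            ],
        },
    }
    # The MOSC status then points at the same image.
    mosc_status = {"currentImagePullSpec": image}
    return mosb, mosc_status
```

The sketch does not answer the scale-up question above; it only shows which recorded fields a later-joining worker's configuration would have to be derived from.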
@umohnani8: This pull request references MCO-1891 which is a valid jira issue. Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.21.0" version, but no target version was set.
- Post-installation customization complexity and timing issues
- Limited support for specialized hardware requirements during installation

### User Stories
Hmm, unless I'm misreading, I think this is missing an important user story, which is if you need drivers to even do the installation, notably storage and network drivers. (I.e. it's not just to optimize bootstrapping, but a hard requirement.)
To clarify, I think there's 2 parts where this kicks in:
- I need the driver to even boot the machine
- I need the driver before the container runtime/application stacks run
Are you talking about the former or the latter? If it's the former, doesn't that have to be built into the ISO?
Edit: I realize you covered that right below 🤦 ignore this comment
5. Cluster admin runs `openshift-install create cluster` which automatically:
   - Creates ignition configs by base64 encoding all YAML files from `manifests/` directory

#### Step 2: Bootstrap Phase
For the user story I suggested above, there is a step between step 1 and step 2, call it step 1b, which is to generate e.g. a custom live ISO from the custom container image.
Today, that can be done via https://github.com/coreos/custom-coreos-disk-images, but in the future, this would be done via bootc-image-builder (see coreos/fedora-coreos-tracker#1906).
If you don't strictly need drivers for bootstrapping, then yeah you can use the stock bootimages as implied here.
What would be great in that case (the custom bootimages case) is if we can avoid that MCO pivot reboot at all if the image is already on the same digest as pointed to by the "pre-built" annotation. Would the existing MCD first boot logic already know to skip that pivot?
After some discussions with the team, we decided to focus the enhancement on supporting pre-built container images only as the MCO currently doesn't handle anything with ISOs and would be a much more complicated process on integrating it as other components handle the ISOs used at installation.
We plan to add this support in a future phase once we have the whole install time story working with pre-built container images.
The current approach won't handle any customization needed to install the cluster, but the installation workflow will match the behavior of current non-OCL installs, where we boot only twice (the initial disk boot, then the boot when the node joins the cluster).
I will add this to the non-goals section of this enhancement.
> After some discussions with the team, we decided to focus the enhancement on supporting pre-built container images only as the MCO currently doesn't handle anything with ISOs and would be a much more complicated process on integrating it as other components handle the ISOs used at installation.

There might be some confusion here. I don't think you need to worry about custom ISOs/bootimages directly in this flow. That's left to the user to build using tools external to the MCO if they need it.
And actually, I think it may be completely transparent to the MCO since the MCO today doesn't care what you boot from on first boot (right?).
Skipping unnecessary reboots is an optimization and can be done later. I think the only requirement though is that the system doesn't temporarily reboot into the vanilla non-customized node image (i.e. the literal `rhel-coreos` image in the payload), because those wouldn't have the necessary drivers.
We're covering the "don't go back to stock image" via a separate epic at this time. This enhancement I think is intended to cover just the install time workflow such that you can provide a pre-built config and have on-cluster layering take over post-install.
OK that's fine. Probably worth spelling this out more in the enhancement. That said AIUI, the customers asking for install time support today want it exactly because of this. Might want to check with Mark.
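The reboot-skipping question raised earlier in this thread comes down to comparing the digest the node booted from with the digest in the pre-built annotation. A minimal sketch of that comparison follows. It is not the MCD's first boot logic, and `digest_of` and `needs_pivot` are hypothetical names. It compares digests only, on the assumption that the same image served from a mirror registry should not trigger a pivot.

```python
def digest_of(pullspec: str) -> str:
    """Return the 'sha256:...' part of a digest-pinned pullspec."""
    _, sep, digest = pullspec.rpartition("@")
    if not sep or not digest.startswith("sha256:"):
        raise ValueError(f"not a digest-pinned pullspec: {pullspec!r}")
    return digest


def needs_pivot(booted: str, target: str) -> bool:
    """True when the booted image differs from the pre-built target image."""
    return digest_of(booted) != digest_of(target)
```

With a custom bootimage built from the same container image, `needs_pivot` would return False and the first-boot pivot could in principle be skipped, which is the optimization being deferred here.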
  name: worker
  annotations:
    # Key annotation that triggers hybrid workflow
    machineconfiguration.openshift.io/pre-built-image: "registry.example.com/custom-rhcos-worker:latest@sha256:abc123def456..."
Shouldn't this annotation get automatically cleaned up once the synthetic MOSB is created? It feels weird/could cause issues to have it hang around indefinitely.
We can clean it up, but leaving it around helps with keeping the history of the fact that this was created at install time with this pre-built image leading to a synthetic MOSB generation etc. I am leaning towards keeping that historical context, but open to cleaning it up as well.
That's still reflected in the MOSB annotation at least.
The MOSC is a longer-lived CRD so it feels funny to have an annotation with transient information live there... potentially forever. And especially that it's not just informational. We have code that checks for it. Which means that e.g. if someone blindly does e.g. a copy/paste of the MOSC definition from one cluster to another, it'll kick in.
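If the cleanup option wins, it could be as small as the sketch below: once the synthetic MOSB exists, drop the annotation from the MOSC so that a copy/paste to another cluster no longer triggers seeding. The function name `strip_seed_annotation` and the dict-shaped MOSC are assumptions for illustration, and the historical record is assumed to live on the MOSB instead, as suggested above.

```python
import copy

ANNOTATION = "machineconfiguration.openshift.io/pre-built-image"


def strip_seed_annotation(mosc: dict, seeded: bool) -> dict:
    """Return a copy of the MOSC without the seeding annotation once seeded.

    The input is never mutated, and calling this repeatedly is harmless.
    """
    if not seeded:
        return mosc
    cleaned = copy.deepcopy(mosc)
    annotations = cleaned.get("metadata", {}).get("annotations")
    if annotations:
        annotations.pop(ANNOTATION, None)
    return cleaned
```

Making the operation idempotent matters because a controller would typically run it on every sync.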
   - Sets `OSImageURL = preBuiltImage` in rendered MachineConfigs for pools with pre-built images and MachineOSConfigs
   - Adds tracking annotation to modified MachineConfigs
4. Bootstrap MCO creates ignition files with modified rendered MCs containing `OSImageURL` pointing to the pre-built container image
5. Bootstrap MCO writes MachineOSConfig manifests to `/etc/mcs/bootstrap/machine-os-configs/` for post-bootstrap API creation
@yuqi-zhang How does the bridging between bootstrap and the final cluster happen for e.g. day-1 MachineConfigs? I would assume that same mechanism should work for MOSCs too.
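To make the bootstrap override in the hunk above concrete, the sketch below applies a per-pool pre-built pullspec to rendered MachineConfigs represented as dicts. It is an assumption-laden illustration rather than the bootstrap MCO code: `apply_pre_built_images` is a hypothetical name, and the tracking annotation key is invented for the example because the enhancement text does not name it.

```python
import copy
from typing import Dict

# Hypothetical key; the enhancement only says a tracking annotation is added.
TRACKING_ANNOTATION = "example.openshift.io/pre-built-image-applied"


def apply_pre_built_images(
    rendered: Dict[str, dict], pre_built: Dict[str, str]
) -> Dict[str, dict]:
    """Return rendered MCs with osImageURL overridden for seeded pools.

    rendered maps pool name to its rendered MachineConfig;
    pre_built maps pool name to the pre-built image pullspec.
    Pools without a pre-built image are returned unchanged.
    """
    result = {}
    for pool, mc in rendered.items():
        mc = copy.deepcopy(mc)
        image = pre_built.get(pool)
        if image:
            mc.setdefault("spec", {})["osImageURL"] = image
            annotations = mc.setdefault("metadata", {}).setdefault("annotations", {})
            annotations[TRACKING_ANNOTATION] = image
        result[pool] = mc
    return result
```

Keeping unseeded pools untouched mirrors the statement that only pools with both a pre-built image and a MachineOSConfig are modified.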