
Conversation

umohnani8
Contributor

Add an enhancement for supporting install-time Image Mode on OpenShift.

@openshift-ci openshift-ci bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jul 25, 2025
Contributor

openshift-ci bot commented Jul 25, 2025

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

Contributor

openshift-ci bot commented Jul 25, 2025

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign sadasu for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@umohnani8 umohnani8 changed the title Add enhancement for install time Image Mode MCO-1527: Add enhancement for install time Image Mode Jul 25, 2025
@openshift-ci-robot openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label Jul 25, 2025
@openshift-ci-robot

openshift-ci-robot commented Jul 25, 2025

@umohnani8: This pull request references MCO-1527 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.20.0" version, but no target version was set.

In response to this:

Add an enhancement for supporting install-time Image Mode on OpenShift.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@umohnani8 umohnani8 marked this pull request as ready for review July 31, 2025 12:03
@openshift-ci openshift-ci bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jul 31, 2025
@openshift-ci openshift-ci bot requested review from tremes and zaneb July 31, 2025 12:07
@openshift-bot

Inactive enhancement proposals go stale after 28d of inactivity.

See https://github.com/openshift/enhancements#life-cycle for details.

Mark the proposal as fresh by commenting /remove-lifecycle stale.
Stale proposals rot after an additional 7d of inactivity and eventually close.
Exclude this proposal from closing by commenting /lifecycle frozen.

If this proposal is safe to close now please do so with /close.

/lifecycle stale

@openshift-ci openshift-ci bot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Aug 29, 2025
@umohnani8
Contributor Author

/remove-lifecycle stale

@openshift-ci openshift-ci bot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Aug 29, 2025
@openshift-bot

Inactive enhancement proposals go stale after 28d of inactivity.

See https://github.com/openshift/enhancements#life-cycle for details.

Mark the proposal as fresh by commenting /remove-lifecycle stale.
Stale proposals rot after an additional 7d of inactivity and eventually close.
Exclude this proposal from closing by commenting /lifecycle frozen.

If this proposal is safe to close now please do so with /close.

/lifecycle stale

@openshift-ci openshift-ci bot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Sep 27, 2025
@umohnani8
Contributor Author

/remove-lifecycle stale

@openshift-ci openshift-ci bot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Sep 29, 2025
Add an enhancement for supporting install-time Image Mode on OpenShift.

Signed-off-by: Urvashi <[email protected]>
@umohnani8
Contributor Author

/assign @jlebon

Contributor

openshift-ci bot commented Sep 29, 2025

@umohnani8: all tests passed!

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

Contributor

@yuqi-zhang yuqi-zhang left a comment

I feel like this implicitly needs the "eliminate reboots" epic to work. Should we link those somehow? If that's not needed, maybe we should clarify it in the workflow.

creation-date: 2025-07-25
last-updated: 2025-09-29
tracking-link: # link to the tracking ticket (for example: Jira Feature or Epic ticket) that corresponds to this enhancement
- https://issues.redhat.com/browse/MCO-1347
Contributor

nit: I think the intent here is to link to the implementation epic for the feature

tracking-link: # link to the tracking ticket (for example: Jira Feature or Epic ticket) that corresponds to this enhancement
- https://issues.redhat.com/browse/MCO-1347
see-also:
- "https://github.com/openshift/enhancements/pull/1515/files?short_path=9f0c5f1#diff-9f0c5f1adabad0dfbdb3c9a5b66e53d4fc6619274d7a4c260508d148de17f5c1"
Contributor

nit: since the OCL enhancement is merged, maybe we should link to that directly


We want to bring that support to install time so cluster admins can customize and configure their OS from the beginning instead of having to do it post-install.

OpenShift's current approach to OS management provides excellent consistency and supportability through RHCOS (Red Hat Enterprise Linux CoreOS), but it requires cluster admins to cede certain aspects of configurability to the platform. As workloads become more specialized, there is an increasing need for OS-level customization that can be applied from Day 0. An obvious example today is AI workloads, which require specific hardware drivers and configurations on nodes.
Contributor

nit: I think this is the only place we refer to "Day 0". Possibly better to be consistent and just call it install time support everywhere

### Goals

- Enable cluster admins to use Image Mode on OpenShift at install time
- User creates a customized OS container image to be used when a cluster is installed
Contributor

This line feels more like a user requirement in the workflow as opposed to a goal


- Enable cluster admins to use Image Mode on OpenShift at install time
- User creates a customized OS container image to be used when a cluster is installed
- Maintain OpenShift's single-click upgrade experience, preserving customizations across upgrades
Contributor

Perhaps we don't need to talk about upgrade since this shouldn't touch on that?

- Parse MOSC configurations from the installation directory
- Update the MCD first boot service to use the seeded pre-built custom OS container image

#### Node Deployment Process
Contributor

This is also the same as we have today right (for non layered pools)?

name: worker
annotations:
# Key annotation that triggers hybrid workflow
machineconfiguration.openshift.io/pre-built-image: "registry.example.com/custom-rhcos-worker:latest@sha256:abc123def456..."
Contributor

Following up from the question above: is this mandatory to be provided by the user? What happens if they provide a MOSC without it? Should we fail the install?

Contributor Author

Yes, this is mandatory and needs to be provided by the user. I plan to add some validations to check for this and fail if it is not provided.

Member

Isn't that the case where you want to have a day-1 MOSC defined so that pure OCL kicks in once the cluster is up? (I.e. that annotation is missing, but it's on purpose.)

Contributor

I guess it's fine to say "don't put that into the install manifests and do it purely as a day 2 operation" which I guess prevents any race conditions from our end

Member

Seems odd to impose limits on this though. (Also possibly technically a compatibility break assuming day-1 MOSCs work today?)

But I also understand wanting to try to catch users missing this. I think in practice, they're probably copying from our docs and they just need to fill in the templated bits? I.e. maybe we don't need to worry about this.
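
To make the annotation under discussion concrete, here is a minimal sketch of what a MachineOSConfig manifest carrying it might look like. Everything beyond the annotation shown in the diff above is an illustrative assumption (names and the spec field layout should be checked against the MachineOSConfig v1 API linked in the excerpt below):

```yaml
# Minimal sketch only -- spec field names are assumptions; consult the
# MachineOSConfig v1 API referenced in the enhancement for the real schema.
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineOSConfig
metadata:
  name: worker
  annotations:
    # Annotation from the diff above; signals that a pre-built image should be seeded
    machineconfiguration.openshift.io/pre-built-image: "registry.example.com/custom-rhcos-worker:latest@sha256:abc123def456..."
spec:
  machineConfigPool:
    name: worker   # pool this config applies to (assumed field name)
```

Per the workflow described later in the enhancement, a file like this would be dropped into the installer's `manifests/` directory before running `openshift-install create cluster`.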


We will use the same MachineOSConfig and MachineOSBuild CRD definitions as already used for post-install customizations with some new annotations. The API for this can be found [here](https://github.com/openshift/api/blob/master/machineconfiguration/v1).

To know more about the post-install image mode workflow, please refer to the enhancement [here](https://github.com/openshift/enhancements/pull/1515/files?short_path=9f0c5f1#diff-9f0c5f1adabad0dfbdb3c9a5b66e53d4fc6619274d7a4c260508d148de17f5c1).
Contributor

(should probably link to the main enhancement)

- Complete install time image mode on OpenShift installation workflow
- Multiple machine config pool scenarios

#### E2E Tests
Contributor

Testing will be pretty involved I think. Maybe we can expand on that a bit more here?


#### Story 3: Edge Computing Deployment

As a cluster admin deploying edge computing solutions, I want to include edge-specific drivers and monitoring agents in my OS image during installation, so that my edge nodes are immediately functional without requiring network connectivity for post-installation customization.
Contributor

Reading the rest of the enhancement, I don't think we're covering this use case (this also applies to the "Limited support for specialized hardware requirements during installation" point above).

This sounds more like the custom ISO or built-in registry support that the agent team is working on, and we don't explicitly cover this case without additional support outside of the cluster workflow (but perhaps this is also a pre-requisite for that to work end to end?)


### Non-Goals

- Replacing the existing ignition-based configuration system entirely

is this an outcome? just confirming

Contributor Author

This is a non-goal, so we are definitely not planning on doing this by adding install-time OCL support.


## Motivation

Image Mode on OpenShift became GA in OCP 4.18.21+ and has been well received as a post-install OS customization option.

maybe state explicitly this enhancement is OCP ≥ X.Y.Z only (since you reference 4.18.21+ for post-install)


The install time Image Mode on OpenShift workflow integrates with the existing OpenShift installation process in the following steps.

#### Step 1: Installation Configuration with Pre-Built Container Image

The proposal assumes users will “just provide” a digest image and a push secret. What happens if they don’t? I’d suggest calling out pre-flight validations (either in openshift-install or bootstrap MCO) that check:

  1. the registry is reachable
  2. the digest resolves correctly
  3. the secret actually works for both pull and push.

This would save admins from failing mid-bootstrap.

Contributor Author

Yes, we will document that these resources need to be added to the manifests directory before install. We also plan to add some validation checks during bootstrap to at least test that we can communicate with the registry and that the image is valid.

1. MCO controller's syncMachineOSConfigs() function reads MachineOSConfig files from `/etc/mcs/bootstrap/machine-os-configs/`
2. MCO creates MachineOSConfig objects in the Kubernetes API if they don't already exist
3. Build controller detects the `machineconfiguration.openshift.io/pre-built-image` annotation on the newly created MachineOSConfig
4. Build controller triggers the seeding workflow by calling `seedMachineOSConfigWithExistingImage()`

We might need a workflow if install-time seeding fails

Contributor Author

If install-time seeding is not working, the plan here is to just fail completely. For the initial implementation, it doesn't make sense to fall back onto a non-OCL installation.
In the future, when we have support to build the image during bootstrap, we can fall back to that if install-time seeding fails. But for the scope of this enhancement, we just fail and report the errors/failures in a detailed and clear way.


- Enable cluster admins to use Image Mode on OpenShift at install time
- User creates a customized OS container image to be used when a cluster is installed
- Maintain OpenShift's single-click upgrade experience, preserving customizations across upgrades

What if a seeded image was built against an older RHCOS (say 4.17), but the cluster is installing 4.18? Will we fail early, or will nodes come up mismatched?

- `PreBuiltImage` label set to "true"
- Proper metadata including MOSC name, DigestedImagePushSpec, and MCP reference
- Success condition with reason "PreBuiltImageSeeded"
6. MCO updates MachineOSConfig status with `CurrentImagePullSpec` pointing to the pre-built image

What happens a week later when the admin scales up new workers? Will they land on the pre-seeded image?
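
To make the seeding result described in the excerpt easier to picture, here is a sketch of the synthetic MachineOSBuild it produces. Only the `PreBuiltImage` label, the `PreBuiltImageSeeded` reason, and the image spec come from the excerpt; every other field name, value, and its placement is an illustrative assumption:

```yaml
# Sketch of the synthetic MachineOSBuild created by the seeding workflow.
# Field placement is assumed; only the label, reason, and image spec are from the excerpt.
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineOSBuild
metadata:
  name: worker-prebuilt            # illustrative name
  labels:
    PreBuiltImage: "true"          # marks this MOSB as seeded rather than built
spec:
  machineOSConfig:
    name: worker                   # MOSC reference (assumed field name)
  machineConfigPool:
    name: worker                   # MCP reference (assumed field name)
status:
  conditions:
  - type: Succeeded                # assumed condition type
    status: "True"
    reason: PreBuiltImageSeeded    # success reason from the excerpt
    message: Seeded from the pre-built image instead of running a build
  digestedImagePushSpec: registry.example.com/custom-rhcos-worker@sha256:abc123def456...   # assumed field name
```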

@umohnani8 umohnani8 changed the title MCO-1527: Add enhancement for install time Image Mode MCO-1891: Add enhancement for install time Image Mode Oct 1, 2025
@openshift-ci-robot

openshift-ci-robot commented Oct 1, 2025

@umohnani8: This pull request references MCO-1891 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.21.0" version, but no target version was set.

In response to this:

Add an enhancement for supporting install-time Image Mode on OpenShift.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

- Post-installation customization complexity and timing issues
- Limited support for specialized hardware requirements during installation

### User Stories
Member

Hmm, unless I'm misreading, I think this is missing an important user story, which is if you need drivers to even do the installation, notably storage and network drivers. (I.e. it's not just to optimize bootstrapping, but a hard requirement.)

Contributor

@yuqi-zhang yuqi-zhang Oct 7, 2025

To clarify, I think there's 2 parts where this kicks in:

  1. I need the driver to even boot the machine
  2. I need the driver before the container runtime/application stacks run

Are you talking about the former or the latter? If it's the former, doesn't that have to be built into the ISO?

Edit: I realize you covered that right below 🤦 ignore this comment

5. Cluster admin runs `openshift-install create cluster` which automatically:
- Creates ignition configs by base64 encoding all YAML files from `manifests/` directory

#### Step 2: Bootstrap Phase
Member

For the user story I suggested above, there is a step between step 1 and step 2, call it step 1b, which is to generate e.g. a custom live ISO from the custom container image.

Today, that can be done via https://github.com/coreos/custom-coreos-disk-images, but in the future, this would be done via bootc-image-builder (see coreos/fedora-coreos-tracker#1906).

If you don't strictly need drivers for bootstrapping, then yeah you can use the stock bootimages as implied here.

Member

What would be great in that case (the custom bootimages case) is if we can avoid the MCO pivot reboot entirely when the booted image is already on the same digest as the one pointed to by the "pre-built" annotation. Would the existing MCD first boot logic already know to skip that pivot?

Contributor Author

After some discussions with the team, we decided to focus this enhancement on supporting pre-built container images only, as the MCO currently doesn't handle anything related to ISOs, and integrating that would be a much more complicated process since other components handle the ISOs used at installation.
We plan to add this support in a future phase once we have the whole install-time story working with pre-built container images.
The current approach won't handle any customization needed to install the cluster, but the installation workflow will match the behavior of current non-OCL installs, where we boot only twice (the initial disk boot, and then the boot when the node joins the cluster).
I will add this to the non-goals section of this enhancement.

Member

After some discussions with the team, we decided to focus this enhancement on supporting pre-built container images only, as the MCO currently doesn't handle anything related to ISOs, and integrating that would be a much more complicated process since other components handle the ISOs used at installation.

There might be some confusion here. I don't think you need to worry about custom ISOs/bootimages directly in this flow. That's left to the user to build using tools external to the MCO if they need it.

And actually, I think it may be completely transparent to the MCO since the MCO today doesn't care what you boot from on first boot (right?).

Skipping unnecessary reboots is an optimization and can be done later. I think the only requirement though is that the system doesn't temporarily reboot into the vanilla non-customized node image (i.e. the literal rhel-coreos image in the payload). Because those wouldn't have the necessary drivers.

Contributor

We're covering the "don't go back to the stock image" case via a separate epic at this time. I think this enhancement is intended to cover just the install-time workflow, such that you can provide a pre-built config and have on-cluster layering take over post-install.

Member

OK that's fine. Probably worth spelling this out more in the enhancement. That said AIUI, the customers asking for install time support today want it exactly because of this. Might want to check with Mark.

name: worker
annotations:
# Key annotation that triggers hybrid workflow
machineconfiguration.openshift.io/pre-built-image: "registry.example.com/custom-rhcos-worker:latest@sha256:abc123def456..."
Member

Shouldn't this annotation get automatically cleaned up once the synthetic MOSB is created? It feels weird/could cause issues to have it hang around indefinitely.

Contributor Author

We can clean it up, but leaving it around helps keep a record of the fact that this was created at install time with this pre-built image, leading to a synthetic MOSB generation, etc. I am leaning towards keeping that historical context, but I'm open to cleaning it up as well.

Member

That's still reflected in the MOSB annotation at least.

The MOSC is a longer-lived CRD, so it feels funny to have an annotation with transient information live there... potentially forever. Especially since it's not just informational: we have code that checks for it, which means that if someone blindly copies the MOSC definition from one cluster to another, it'll kick in.

- Sets `OSImageURL = preBuiltImage` in rendered MachineConfigs for pools with pre-built images and MachineOSConfigs
- Adds tracking annotation to modified MachineConfigs
4. Bootstrap MCO creates ignition files with modified rendered MCs containing `OSImageURL` pointing to the pre-built container image
5. Bootstrap MCO writes MachineOSConfig manifests to `/etc/mcs/bootstrap/machine-os-configs/` for post-bootstrap API creation
Member

@yuqi-zhang How does the bridging between bootstrap and the final cluster happen for e.g. day-1 MachineConfigs? I would assume that same mechanism should work for MOSCs too.
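
For reference, here is a sketch of the rendered MachineConfig modification described in the excerpt above. `osImageURL` is an existing MachineConfig spec field; the annotation key shown is hypothetical, since the enhancement only says a "tracking annotation" is added:

```yaml
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  name: rendered-worker-<hash>     # rendered MC names are generated by the MCO
  annotations:
    # Hypothetical key -- the excerpt only states that a tracking annotation is added
    machineconfiguration.openshift.io/pre-built-image-tracking: "true"
spec:
  # Points first-boot nodes at the customized image instead of the payload's default OS image
  osImageURL: registry.example.com/custom-rhcos-worker@sha256:abc123def456...
```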
