CMP-4117: Expand ownership check for profile bundle controller#1100

Open
rhmdnd wants to merge 1 commit into ComplianceAsCode:master from rhmdnd:fix-profile-bundle-ownership
Conversation

@rhmdnd
Collaborator

@rhmdnd rhmdnd commented Feb 28, 2026

The controller previously only watched ProfileBundle objects. When the profileparser Deployment's pods changed state, the controller was never notified.

Adding Owns means any change to the owned Deployment triggers a reconciliation of the parent ProfileBundle, so the controller is responsive to pod lifecycle events.
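To make the Owns behavior concrete, here is a minimal, dependency-free sketch of the mapping an Owns-style watch performs: an event on a child object becomes a reconcile request for its controlling owner. The types `OwnerRef`, `Object`, and `Request` and the function `requestsForOwner` are simplified stand-ins invented for illustration, not the controller-runtime API, and the Deployment name is only inferred from the pod names shown in this conversation.

```go
package main

import "fmt"

// OwnerRef is a simplified stand-in for a Kubernetes owner reference.
type OwnerRef struct {
	Kind       string
	Name       string
	Controller bool
}

// Object is a simplified stand-in for any namespaced Kubernetes object.
type Object struct {
	Namespace string
	Name      string
	Owners    []OwnerRef
}

// Request identifies the parent object that should be reconciled.
type Request struct {
	Namespace string
	Name      string
}

// requestsForOwner models what an Owns-style watch does: an event on a child
// object is translated into a reconcile request for its controlling owner of
// the expected kind. Children without such an owner produce no request.
func requestsForOwner(child Object, ownerKind string) []Request {
	var reqs []Request
	for _, ref := range child.Owners {
		if ref.Controller && ref.Kind == ownerKind {
			reqs = append(reqs, Request{Namespace: child.Namespace, Name: ref.Name})
		}
	}
	return reqs
}

func main() {
	// Illustrative objects; the names are assumptions, not taken from the code.
	owned := Object{
		Namespace: "openshift-compliance",
		Name:      "ocp4-openshift-compliance-pp",
		Owners:    []OwnerRef{{Kind: "ProfileBundle", Name: "ocp4", Controller: true}},
	}
	orphan := Object{Namespace: "openshift-compliance", Name: "unrelated"}

	fmt.Println(requestsForOwner(owned, "ProfileBundle"))
	fmt.Println(requestsForOwner(orphan, "ProfileBundle"))
}
```

The second call shows the caveat: a Deployment that carries no controller owner reference to a ProfileBundle never enqueues anything, so the watch only helps if that reference is actually set on the profileparser Deployment.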

Also, once the controller found an existing pod with no startup error, it exited the reconciliation loop without requeuing, regardless of whether the ProfileBundle was still in the PENDING state. If the profileparser hadn't finished (or never ran due to a rollout delay), the controller would never check again.

This commit also updates the profile bundle controller to requeue every 10 seconds while the status is still DataStreamPending, so the controller keeps monitoring until the profileparser either succeeds (sets VALID) or fails (sets INVALID, or a pod startup error is detected).

This should improve the resilience of profile bundle parsing, especially in testing, where we delete deployments after modifying the profile bundle image to simulate operator updates.

Assisted-By: Opus 4.6

@openshift-ci

openshift-ci bot commented Feb 28, 2026

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: rhmdnd

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@rhmdnd
Collaborator Author

rhmdnd commented Feb 28, 2026

Porting @xiaojiey's comment from #1098

I tested a special scenario: setting the ProfileBundle to a nonexistent image. I can see the requeue working as expected. The only concern is the unconditional requeueing every 10 seconds while PENDING, for example in a CrashLoopBackOff scenario where the profileparser container crashes repeatedly.

$ kubectl patch profilebundle ocp4 -n openshift-compliance --type=merge \
    -p '{"spec":{"contentImage":"quay.io/nonexistent/invalid:latest"}}'
profilebundle.compliance.openshift.io/ocp4 patched
$ oc get pb -w
NAME     CONTENTIMAGE                                 CONTENTFILE         STATUS
ocp4     quay.io/nonexistent/invalid:latest           ssg-rhcos4-ds.xml   PENDING
rhcos4   ghcr.io/complianceascode/k8scontent:latest   ssg-rhcos4-ds.xml   VALID
$ oc get pod -w
NAME                                             READY   STATUS              RESTARTS      AGE
compliance-operator-69ccf667d-kknvb              1/1     Running             2 (22m ago)   22m
ocp4-openshift-compliance-pp-5489bf48bd-rrmnj    0/1     Init:ErrImagePull   0             29s
ocp4-openshift-compliance-pp-7489f9c4f8-pjtfg    1/1     Running             0             3m42s
rhcos4-openshift-compliance-pp-9d8c7f955-jtc64   1/1     Running             0             21m
ocp4-openshift-compliance-pp-5489bf48bd-rrmnj    0/1     Init:ImagePullBackOff   0             44s
ocp4-openshift-compliance-pp-5489bf48bd-rrmnj    0/1     Init:ErrImagePull       0             59s
ocp4-openshift-compliance-pp-5489bf48bd-rrmnj    0/1     Init:ImagePullBackOff   0             71s
ocp4-openshift-compliance-pp-5489bf48bd-rrmnj    0/1     Init:ErrImagePull       0             109s
ocp4-openshift-compliance-pp-5489bf48bd-rrmnj    0/1     Init:ImagePullBackOff   0             2m1s
ocp4-openshift-compliance-pp-5489bf48bd-rrmnj    0/1     Init:ErrImagePull       0             3m20s
$ oc logs pod/compliance-operator-69ccf667d-kknvb | grep requeueing | tail -n 10
{"level":"info","ts":"2026-02-28T12:02:56.892Z","logger":"profilebundlectrl","msg":"ProfileBundle still pending, requeueing to check status","Request.Namespace":"openshift-compliance","Request.Name":"ocp4"}
{"level":"info","ts":"2026-02-28T12:03:06.893Z","logger":"profilebundlectrl","msg":"ProfileBundle still pending, requeueing to check status","Request.Namespace":"openshift-compliance","Request.Name":"ocp4"}
{"level":"info","ts":"2026-02-28T12:03:16.894Z","logger":"profilebundlectrl","msg":"ProfileBundle still pending, requeueing to check status","Request.Namespace":"openshift-compliance","Request.Name":"ocp4"}
{"level":"info","ts":"2026-02-28T12:03:26.895Z","logger":"profilebundlectrl","msg":"ProfileBundle still pending, requeueing to check status","Request.Namespace":"openshift-compliance","Request.Name":"ocp4"}
{"level":"info","ts":"2026-02-28T12:03:36.896Z","logger":"profilebundlectrl","msg":"ProfileBundle still pending, requeueing to check status","Request.Namespace":"openshift-compliance","Request.Name":"ocp4"}
{"level":"info","ts":"2026-02-28T12:03:46.898Z","logger":"profilebundlectrl","msg":"ProfileBundle still pending, requeueing to check status","Request.Namespace":"openshift-compliance","Request.Name":"ocp4"}
{"level":"info","ts":"2026-02-28T12:03:56.899Z","logger":"profilebundlectrl","msg":"ProfileBundle still pending, requeueing to check status","Request.Namespace":"openshift-compliance","Request.Name":"ocp4"}
{"level":"info","ts":"2026-02-28T12:04:06.900Z","logger":"profilebundlectrl","msg":"ProfileBundle still pending, requeueing to check status","Request.Namespace":"openshift-compliance","Request.Name":"ocp4"}
{"level":"info","ts":"2026-02-28T12:04:16.902Z","logger":"profilebundlectrl","msg":"ProfileBundle still pending, requeueing to check status","Request.Namespace":"openshift-compliance","Request.Name":"ocp4"}
{"level":"info","ts":"2026-02-28T12:05:39.379Z","logger":"profilebundlectrl","msg":"ProfileBundle still pending, requeueing to check status","Request.Namespace":"openshift-compliance","Request.Name":"ocp4"}

Yeah - that's a good question. What's the most appropriate state for a ProfileBundle if the image is incorrect? I would think either PENDING or INVALID.

Without this PR, wouldn't the ProfileBundle be in PENDING state indefinitely since the profile parser container wouldn't be able to pull the content image? Is it essentially the same behavior but just more verbose about retrying every 10 seconds?

@rhmdnd
Collaborator Author

rhmdnd commented Feb 28, 2026

I rebased this to pull in #1093 - which was affecting the parallel tests.

@github-actions

🤖 To deploy this PR, run the following command:

make catalog-deploy CATALOG_IMG=ghcr.io/complianceascode/compliance-operator-catalog:1100-61357833d573cc3690f59f3aa5b8ca1a7938a02b

@rhmdnd
Collaborator Author

rhmdnd commented Feb 28, 2026

/retest-required

@rhmdnd
Collaborator Author

rhmdnd commented Feb 28, 2026

/retest-required

@rhmdnd
Collaborator Author

rhmdnd commented Feb 28, 2026

/retest-required

@rhmdnd
Collaborator Author

rhmdnd commented Mar 1, 2026

Images are failing to build in CI for some reason. Looks unrelated to this change, hence all the rechecks.

INFO[2026-03-01T00:00:32Z] Ran for 1h0m10s                              
ERRO[2026-03-01T00:00:32Z] Some steps failed:                           
ERRO[2026-03-01T00:00:32Z] 
  * could not run steps: step src failed: error occurred handling build src-arm64: build didn't start running within 1h0m0s (phase: Pending):

@rhmdnd
Collaborator Author

rhmdnd commented Mar 1, 2026

/retest-required

@rhmdnd
Collaborator Author

rhmdnd commented Mar 1, 2026

/retest-required

@rhmdnd
Collaborator Author

rhmdnd commented Mar 1, 2026

/retest-required

@rhmdnd
Collaborator Author

rhmdnd commented Mar 2, 2026

Got one green run on the serial tests - rechecking to see if this helps with the transient issues we've been seeing recently.

/test e2e-aws-serial
/test e2e-aws-serial-arm

@rhmdnd rhmdnd force-pushed the fix-profile-bundle-ownership branch from 6135783 to 56af613 Compare March 2, 2026 14:16
@rhmdnd
Collaborator Author

rhmdnd commented Mar 2, 2026

Updating the controller with better logging to trace what is happening with the pods when the profile bundle gets stuck in a pending state, despite the controller requeuing the request when it should.

@github-actions

github-actions bot commented Mar 2, 2026

🤖 To deploy this PR, run the following command:

make catalog-deploy CATALOG_IMG=ghcr.io/complianceascode/compliance-operator-catalog:1100-56af6139fb44aef8820d6f08bc519dadd4275055

The controller previously only watched ProfileBundle objects. When the
profileparser Deployment's pods changed state, the controller was never
notified.

Adding Owns means any change to the owned Deployment triggers a
reconciliation of the parent ProfileBundle, so the controller is
responsive to pod lifecycle events.

Also, once the controller found an existing pod with no startup error,
it exited the reconciliation loop without requeuing, regardless of
whether the ProfileBundle was still in the PENDING state. If the
profileparser hadn't finished (or never ran due to a rollout delay), the
controller would never check again.

This commit also updates the profile bundle controller to requeue every
10 seconds while the status is still DataStreamPending, so the
controller keeps monitoring until the profileparser either succeeds
(sets VALID) or fails (sets INVALID, or a pod startup error is
detected). In particular, if the controller detects that the init
containers for the profileparser have completed and the bundle is still
in the PENDING state, we're deadlocked, and it should annotate the
profileparser pod to rerun.
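
The deadlock condition in the paragraph above can be sketched as a pure check. This is an assumption-laden illustration, not the controller's code: `InitContainerState`, `initContainersCompleted`, and `needsRerun` are hypothetical names, the container names in `main` are made up, and the annotation step itself is left out because its key is not given here.

```go
package main

import "fmt"

// InitContainerState is a simplified view of an init container's status.
type InitContainerState struct {
	Name       string
	Terminated bool
	ExitCode   int
}

// initContainersCompleted reports whether every init container has terminated
// successfully. A pod with no init containers is treated as not completed,
// since nothing shows that parsing ever ran.
func initContainersCompleted(states []InitContainerState) bool {
	if len(states) == 0 {
		return false
	}
	for _, st := range states {
		if !st.Terminated || st.ExitCode != 0 {
			return false
		}
	}
	return true
}

// needsRerun flags the stuck case: the init containers are done, yet the
// bundle is still pending, so no further pod event will move it forward.
func needsRerun(bundlePending bool, states []InitContainerState) bool {
	return bundlePending && initContainersCompleted(states)
}

func main() {
	// Container names below are hypothetical.
	done := []InitContainerState{
		{Name: "content-container", Terminated: true, ExitCode: 0},
		{Name: "profileparser", Terminated: true, ExitCode: 0},
	}
	running := []InitContainerState{
		{Name: "content-container", Terminated: true, ExitCode: 0},
		{Name: "profileparser", Terminated: false},
	}
	fmt.Println(needsRerun(true, done))
	fmt.Println(needsRerun(true, running))
	fmt.Println(needsRerun(false, done))
}
```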

This should improve the resilience of profile bundle parsing, especially
in testing, where we delete deployments after modifying the profile
bundle image to simulate operator updates.

Assisted-By: Opus 4.6
@rhmdnd rhmdnd force-pushed the fix-profile-bundle-ownership branch from 56af613 to 918f62d Compare March 2, 2026 21:23
@github-actions

github-actions bot commented Mar 2, 2026

🤖 To deploy this PR, run the following command:

make catalog-deploy CATALOG_IMG=ghcr.io/complianceascode/compliance-operator-catalog:1100-918f62d3451f90a128b4a245da77c6dc80db979d

@rhmdnd rhmdnd changed the title from "Expand ownership check for profile bundle controller" to "CMP-4117: Expand ownership check for profile bundle controller" Mar 2, 2026
@openshift-ci-robot
Collaborator

@rhmdnd: This pull request references CMP-4117 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the bug to target the "4.22.0" version, but no target version was set.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@rhmdnd
Collaborator Author

rhmdnd commented Mar 3, 2026

Clean serial run - which is a good sign. Retesting to see if we can recreate the transient issue.

/test e2e-aws-serial-arm
/test e2e-aws-serial

@rhmdnd
Collaborator Author

rhmdnd commented Mar 3, 2026

Failed to get a cluster, didn't make it to the serial tests:

/test e2e-aws-serial-arm
/test e2e-aws-serial

@xiaojiey
Collaborator

xiaojiey commented Mar 3, 2026

/retest

@openshift-ci

openshift-ci bot commented Mar 3, 2026

@rhmdnd: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name          Commit    Details   Required   Rerun command
ci/prow/e2e-rosa   918f62d   link      true       /test e2e-rosa

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@rhmdnd
Collaborator Author

rhmdnd commented Mar 3, 2026

/test e2e-aws-serial-arm
/test e2e-aws-serial

	return ctrl.NewControllerManagedBy(mgr).
		Named("profilebundle-controller").
		For(&compliancev1alpha1.ProfileBundle{}).
		Owns(&appsv1.Deployment{}).

@Vincent056 Vincent056 Mar 3, 2026

Do we need this line here? For it to work, we would need to set the ProfileBundle as the owner of the pb Deployment. Could you try without this line? I think the logic at https://github.com/ComplianceAsCode/compliance-operator/pull/1100/changes#diff-60a126c27481c606ddae6ce2665cca69a035cd1bf59c98ee286752639ada0edaR280 is enough here.

@Vincent056

Quoting @xiaojiey's comment ported from #1098 (shown in full earlier in this conversation), regarding the unconditional requeue every 10 seconds while PENDING:

Maybe we can use exponential backoff here?

4 participants