Skip to content

Conversation

@jan-law
Copy link
Contributor

@jan-law jan-law commented Jul 17, 2025

When configuration policies detect desired vs actual object mismatches, the dry run update validation sometimes responds with a "no-op", aka "no changes required" due to apiserver webhooks or mutating admission controllers modifying requests. The policy is assumed to be compliant without indicating any potential mismatch to users.

Since the requests are modified at the apiserver, we do not have other means of validating the policy before making the dry run request. To improve user experience when troubleshooting this scenario, this PR adds the property DryRunNoOpOverride (edit: MatchesAfterDryRun) when the policy detects a mismatch AND the no-op dry run update overrode the policy assessment.

The RelatedObject Properties object will appear for policies in any remediation mode. Previously, all properties were hidden for all compliant objects in Inform mode.

ref: https://issues.redhat.com/browse/ACM-14577

Screenshot From 2025-07-28 14-01-02

@jan-law jan-law mentioned this pull request Jul 17, 2025
@jan-law jan-law force-pushed the notify-on-dryrun-noop branch 7 times, most recently from 571d7b4 to 21bf25b Compare July 21, 2025 18:07
@jan-law jan-law changed the title Add ObjectProperty to indicate compliant dry run overrode policy mismatch Add RelatedObject property when compliant dry run overrides policy mismatch Jul 23, 2025
@jan-law jan-law force-pushed the notify-on-dryrun-noop branch 3 times, most recently from 606467e to 6a59903 Compare July 24, 2025 14:03
@jan-law
Copy link
Contributor Author

jan-law commented Jul 29, 2025

Note: Creating a policy with this dry run behaviour always emits two events, a "violation" followed by an immediate "compliant" notification. Since we intend to modify the rate of policy history evaluation, I did not investigate the cause of these double events.

Screenshot From 2025-07-29 11-05-44

@jan-law jan-law marked this pull request as ready for review July 29, 2025 18:13
@yiraeChristineKim
Copy link
Contributor

/cc @JustinKuli

@openshift-ci openshift-ci bot requested a review from JustinKuli August 5, 2025 18:55
@yiraeChristineKim
Copy link
Contributor

What is no-op mean?

@yiraeChristineKim
Copy link
Contributor

Are you going to add e2e test?

}

r.setEvaluatedObject(obj.policy, obj.existingObj, !throwSpecViolation, "")
r.setEvaluatedObject(obj.policy, obj.existingObj, !throwSpecViolation, "", false)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you explain me "!throwSpecViolation", according to " // The spec didn't require a change but throwSpecViolation indicates the status didn't match. Handle" We set compliant when there is mismatch(changed by kube api, mutator) I am confused

Copy link
Contributor Author

@jan-law jan-law Aug 5, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

throwSpecViolation is initially true when handleKeys() notices a difference between the policy fields and the actual object fields.

Afterwards, the dry run validation handles edge cases where handleKeys found a difference between unset and empty fields. If the dry run Update did not update any fields, the policy is assumed to be compliant, and throwSpecViolation is inverted.

// There are situations where updateNeeded is wrong in either direction: an update might not be

In this case ACM-14577, the bad scc annotation in the policy is technically invalid, but the dry run validation can’t detect the error. During the dry run update request, an admission controller intercepts the request body and silently fixes the invalid annotation before the api server returns a response. This means the dry run update thinks the object matches the policy.

Copy link
Contributor

@yiraeChristineKim yiraeChristineKim Aug 5, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

!throwSpecViolation => noncompliant in setEvaluatedObject func right?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also can you add comment

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In ACM-14577, !throwSpecViolation evaluates to true, compliant

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

okie Can you add comment for this

// DryRunNoOpOverride indicates that a difference was detected between the policy and the
// object, but a dry run update reported no changes. The dry run result overrides the policy
// assessment, meaning the object is considered compliant despite the initial mismatch.
DryRunNoOpOverride *bool `json:"dryRunNoOpOverride,omitempty"`
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe MatchesAfterDryRun would be a better name? "Override" here makes me feel more like it's an option that the user could specify.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, this might not need to be a pointer.

@yiraeChristineKim
Copy link
Contributor

@JustinKuli What do you think about skipping the E2E test for this? We could use a Gatekeeper mutation to test it instead, but if you think that's too much, I'm okay with skipping the E2E test.

@JustinKuli
Copy link
Member

@yiraeChristineKim if creating an e2e test for this in a KinD cluster is too much of a burden, I would be ok with a test that just runs in our framework tests against an openshift cluster, since the openshift SCC annotation example would be relatively simple to test.

However, I'd want us to try to make a test here first. I think that k8s will default the type field in a Service to ClusterIP. So I think we can make a test that takes an existing Service in the cluster of that type, and copy all of its info into a mustonlyhave policy, except for the type. I would expect that to trigger this behavior, and for the new property to be marked in the related object (although I could be misunderstanding this exact case).

Or, @jan-law could find a better example for a test.

@jan-law jan-law force-pushed the notify-on-dryrun-noop branch 6 times, most recently from 5eca769 to 65d7d38 Compare August 12, 2025 21:07
ref: https://issues.redhat.com/browse/ACM-14577

Tests scenarios where the policy reports a mismatched object,
but the dry run validation reports a compliant object due to
the apiserver adding or modifying fields in the request body.

This commit also prevents timeouts in case40 by wrapping assertions with
Eventually.

Signed-off-by: Janelle Law <[email protected]>
@jan-law jan-law force-pushed the notify-on-dryrun-noop branch from 65d7d38 to f353b3a Compare August 12, 2025 21:42
@jan-law jan-law marked this pull request as ready for review August 13, 2025 15:16
@openshift-ci openshift-ci bot requested a review from JustinKuli August 13, 2025 15:16
Comment on lines 6 to 7
annotations:
test: test
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Question: Could you help me understand why the policy in case8 with metadataComplianceType: musthave is NonCompliant when both the policy and the service yaml do not contain any annotations:?

See diff below when I removed lines 6-7 annotations: test:test from the service.

 compliant: NonCompliant
  lastEvaluated: "2025-08-11T14:41:05Z"
  lastEvaluatedGeneration: 1
  relatedObjects:
  - compliant: NonCompliant
    object:
      apiVersion: v1
      kind: Service
      metadata:
        name: grc-policy-propagator-metrics
        namespace: managed
    properties:
      createdByPolicy: false
      diff: |
        --- managed/grc-policy-propagator-metrics : existing
        +++ managed/grc-policy-propagator-metrics : updated
        @@ -1,9 +1,8 @@
         apiVersion: v1
         kind: Service
         metadata:
        -  annotations: {}
           creationTimestamp: "2025-08-11T14:40:35Z"
           name: grc-policy-propagator-metrics
           namespace: managed
           resourceVersion: "2513"
           uid: ddf71564-cba9-41fa-809f-b742bd85ab50

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's very odd!

Maybe https://github.com/open-cluster-management-io/config-policy-controller/blob/main/controllers/configurationpolicy_controller.go#L3524 is to blame... it seems like that could make the existing object "look like" it has an empty annotations object there when it really doesn't. Maybe the SetAnnotations and SetLabels calls should only be done when there are annotations/labels.

I think we should remove the annotation from your test YAML, and make sure that this situation works as we'd expect. But I'm surprised that no other tests are hitting this odd behavior.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks to your test, I was able to narrow down what seems to be happening. When the service annotations are empty, the reflect.DeepEqual(dryRunUpdatedObj.Object, existingObjectCopy.Object) is coming back false because one of the objects has an empty annotations map, and the other has no annotations map. I don't exactly know why they are different, and especially why the same situation is not happening with the empty labels.

But adjusting this existing function seems like it could handle it (I've added the last 3 lines):

func removeFieldsForComparison(obj *unstructured.Unstructured) {
	unstructured.RemoveNestedField(obj.Object, "metadata", "managedFields")
	unstructured.RemoveNestedField(
		obj.Object, "metadata", "annotations", "kubectl.kubernetes.io/last-applied-configuration",
	)
	// The generation might actually bump but the API output might be the same.
	unstructured.RemoveNestedField(obj.Object, "metadata", "generation")

	if len(obj.GetAnnotations()) == 0 {
		unstructured.RemoveNestedField(obj.Object, "metadata", "annotations")
	}
}

Copy link
Member

@JustinKuli JustinKuli left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Excellent tests, and nice work adding that field! I'm suggesting changing many of the return cases in checkAndUpdateResource to default to "false" for the new field, but maybe I'm missing a reason for it to be true?

CreatedByPolicy reports whether the object was created by the configuration policy, which is
important when pruning is configured.
type: boolean
matchesAfterDryRun:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OperatorPolicy doesn't need this field, does it?

We can add an additional patch to the CRD to remove it, see deploy/crds/kustomize_operatorpolicy/remove-diff.json (and it will need to be referenced in deploy/crds/kustomize_operatorpolicy/kustomization.yaml)

@jan-law jan-law force-pushed the notify-on-dryrun-noop branch 2 times, most recently from 17decb0 to d28f3a2 Compare August 13, 2025 21:19
While testing https://issues.redhat.com/browse/ACM-14577 , some objects
reported missing annotations as an empty map while other objects omitted the
annotations key. Handle empty annotations before comparing objects.

Signed-off-by: Janelle Law <[email protected]>
@jan-law jan-law force-pushed the notify-on-dryrun-noop branch from d28f3a2 to 981f733 Compare August 13, 2025 22:02
Copy link
Member

@JustinKuli JustinKuli left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! Thanks for the diligent work here!

@openshift-ci openshift-ci bot added the lgtm label Aug 14, 2025
@openshift-ci
Copy link

openshift-ci bot commented Aug 14, 2025

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: jan-law, JustinKuli

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-merge-bot openshift-merge-bot bot merged commit d99383c into open-cluster-management-io:main Aug 14, 2025
15 checks passed
@jan-law jan-law deleted the notify-on-dryrun-noop branch August 14, 2025 17:07
JustinKuli added a commit to stolostron/config-policy-controller that referenced this pull request Oct 30, 2025
Previously, the policy status would incorrectly have a noncompliant
event followed immediately by a compliant event when a dryrun update
would not make any changes on that object. This would cause the policy
status to repeatedly update if the policy was reconciling often - for
example if any of its watched objects were updating.

Now only the correct "compliant" event should be emitted in that case.

NOTE: This is a backport, but not just a cherry-pick. This incorporates
some of the changes from these upstream PRs, with updates for tests:
- open-cluster-management-io/config-policy-controller#404
- open-cluster-management-io/config-policy-controller#377

Refs:
 - https://issues.redhat.com/browse/ACM-25742

Signed-off-by: Justin Kulikauskas <[email protected]>
JustinKuli added a commit to stolostron/config-policy-controller that referenced this pull request Nov 17, 2025
Previously, the policy status would incorrectly have a noncompliant
event followed immediately by a compliant event when a dryrun update
would not make any changes on that object. This would cause the policy
status to repeatedly update if the policy was reconciling often - for
example if any of its watched objects were updating.

Now only the correct "compliant" event should be emitted in that case.

NOTE: This is a backport, but not just a cherry-pick. This incorporates
some of the changes from these upstream PRs, with updates for tests:
- open-cluster-management-io/config-policy-controller#404
- open-cluster-management-io/config-policy-controller#377

Refs:
 - https://issues.redhat.com/browse/ACM-25742

Signed-off-by: Justin Kulikauskas <[email protected]>
JustinKuli added a commit to stolostron/config-policy-controller that referenced this pull request Nov 17, 2025
Previously, the policy status would incorrectly have a noncompliant
event followed immediately by a compliant event when a dryrun update
would not make any changes on that object. This would cause the policy
status to repeatedly update if the policy was reconciling often - for
example if any of its watched objects were updating.

Now only the correct "compliant" event should be emitted in that case.

NOTE: This is a backport, but not just a cherry-pick. This incorporates
some of the changes from these upstream PRs, with updates for tests:
- open-cluster-management-io/config-policy-controller#404
- open-cluster-management-io/config-policy-controller#377

Refs:
 - https://issues.redhat.com/browse/ACM-25742

Signed-off-by: Justin Kulikauskas <[email protected]>
JustinKuli added a commit to stolostron/config-policy-controller that referenced this pull request Nov 17, 2025
Previously, the policy status would incorrectly have a noncompliant
event followed immediately by a compliant event when a dryrun update
would not make any changes on that object. This would cause the policy
status to repeatedly update if the policy was reconciling often - for
example if any of its watched objects were updating.

Now only the correct "compliant" event should be emitted in that case.

NOTE: This is a backport, but not just a cherry-pick. This incorporates
some of the changes from these upstream PRs, with updates for tests:
- open-cluster-management-io/config-policy-controller#404
- open-cluster-management-io/config-policy-controller#377

Refs:
 - https://issues.redhat.com/browse/ACM-25742

Signed-off-by: Justin Kulikauskas <[email protected]>
openshift-merge-bot bot pushed a commit to stolostron/config-policy-controller that referenced this pull request Nov 19, 2025
Previously, the policy status would incorrectly have a noncompliant
event followed immediately by a compliant event when a dryrun update
would not make any changes on that object. This would cause the policy
status to repeatedly update if the policy was reconciling often - for
example if any of its watched objects were updating.

Now only the correct "compliant" event should be emitted in that case.

NOTE: This is a backport, but not just a cherry-pick. This incorporates
some of the changes from these upstream PRs, with updates for tests:
- open-cluster-management-io/config-policy-controller#404
- open-cluster-management-io/config-policy-controller#377

Refs:
 - https://issues.redhat.com/browse/ACM-25742

Signed-off-by: Justin Kulikauskas <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants