Aggregates: Use GenerationChangedPredicate and join errors #161

fwiesel · 2025-10-22T13:29:20Z

Prior to this change, we had to exit the reconcile function after each status update and continue in the reconcile run triggered by that update. Now, with the filter, we only retry on either an error or a RequeueAfter.

Joining the errors allows us to complete at least some of the changes, and not always abort on the first.

Return Kubernetes errors directly, and return a RequeueAfter on an error from OpenStack for now, until the logging situation has been cleared up.
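To illustrate why the filter removes the need to exit after every status update: for resources with a status subresource, the API server bumps `metadata.generation` on spec changes but not on status writes, so a generation-based filter drops the update events caused by the controller's own status updates. The sketch below models only that comparison in plain Go. The `object` type and the `generationChanged` function are hypothetical stand-ins, not the controller-runtime API.

```go
package main

import "fmt"

// object is a hypothetical stand-in for the metadata of a Kubernetes resource.
type object struct {
	Name       string
	Generation int64 // bumped on spec changes, not on status writes
}

// generationChanged models the check a generation-based update filter performs:
// an update event reaches the reconciler only if the generation differs.
func generationChanged(oldObj, newObj object) bool {
	return oldObj.Generation != newObj.Generation
}

func main() {
	before := object{Name: "hv-1", Generation: 3}
	statusOnly := object{Name: "hv-1", Generation: 3} // controller wrote a condition
	specChange := object{Name: "hv-1", Generation: 4} // someone edited the spec

	fmt.Println("status-only update reconciled:", generationChanged(before, statusOnly))
	fmt.Println("spec update reconciled:", generationChanged(before, specChange))
}
```

With such a filter in place, a status write no longer re-triggers the reconciler, so follow-up work has to be requested explicitly through an error or a RequeueAfter.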

notandy

I am fine with the predicate. But do we really need to aggregate the errors? Does it give us any benefit to apply some more aggregates, but still fail ultimately for the cost of a less informative condition message?

Do you have an example where this behavior is beneficial?

internal/controller/aggregates_controller.go

fwiesel · 2025-10-23T08:58:43Z

> Does it give us any benefit to apply some more aggregates, but still fail ultimately for the cost of a less informative condition message?
>
> Do you have an example where this behavior is beneficial?

Sure. As background, I stumbled over that pattern in the gardener project (i.e. here), but you can also find it in Kubernetes proper.

The current behavior is to first try to add all missing aggregates, then remove all superfluous ones, but to abort on the first error. So it will not execute the remaining changes, even though that is likely possible or maybe even necessary. Trying them all will eventually resolve possible inter-dependencies.

Assuming there are multiple underlying errors that require human intervention, the (human) operator will only see one error at a time, and will only be able to fix that one.
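A minimal sketch of the difference, using the standard library's `errors.Join` (Go 1.20 or newer). The `applyAll` helper, the `change` type, and the aggregate names are made up for illustration and are not the operator's code.

```go
package main

import (
	"errors"
	"fmt"
)

// change is a hypothetical unit of work, e.g. adding a host to one aggregate.
type change struct {
	name string
	run  func() error
}

var (
	errQuota    = errors.New("quota exceeded")
	errConflict = errors.New("conflict")
)

// applyAll runs every change, even if an earlier one failed, and returns all
// failures joined into a single error (nil if every change succeeded).
func applyAll(changes []change) error {
	var errs []error
	for _, c := range changes {
		if err := c.run(); err != nil {
			errs = append(errs, fmt.Errorf("%s: %w", c.name, err))
		}
	}
	return errors.Join(errs...)
}

func main() {
	applied := 0
	ok := func() error { applied++; return nil }

	err := applyAll([]change{
		{"add agg-a", func() error { return errQuota }},
		{"add agg-b", ok},
		{"remove agg-c", func() error { return errConflict }},
		{"remove agg-d", ok},
	})

	fmt.Println("changes applied despite failures:", applied)
	fmt.Println(err) // one line per underlying error
	fmt.Println(errors.Is(err, errQuota), errors.Is(err, errConflict))
}
```

The joined error still answers `errors.Is` for each underlying cause, and its message lists every failure, so an operator sees all of them in one pass.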

Admittedly, the inter-dependencies of host-aggregates are rather simple, so that is a rather theoretical scenario and more an explanation why I found that pattern appealing.

Concretely with host-aggregates we have the problem that one cannot add a host to two aggregates with different availability-zones.
With the PR, you can simply declare the new aggregate membership: it will first try to add the host to the aggregate with the new AZ (which raises an error), but also remove it from the aggregate with the old AZ. The error is returned, the reconciliation is retried, and then the host can be added to the aggregate with the new AZ (assuming there are no VMs on the host).
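The two-pass convergence described here can be sketched with an in-memory model. Everything in it (the `cloud` type, its methods, and the aggregate and host names) is a hypothetical stand-in for the OpenStack aggregate API, and the AZ check is reduced to the single rule mentioned above.

```go
package main

import (
	"errors"
	"fmt"
	"sort"
)

// cloud is a hypothetical in-memory stand-in for the host-aggregate API.
type cloud struct {
	az      map[string]string          // aggregate name -> availability zone ("" if none)
	members map[string]map[string]bool // aggregate name -> set of hosts
}

// addHost rejects a host that already belongs to an aggregate in another AZ.
func (c *cloud) addHost(agg, host string) error {
	for other, hosts := range c.members {
		if hosts[host] && c.az[other] != "" && c.az[agg] != "" && c.az[other] != c.az[agg] {
			return fmt.Errorf("add %s to %s: host is already in AZ %s", host, agg, c.az[other])
		}
	}
	if c.members[agg] == nil {
		c.members[agg] = map[string]bool{}
	}
	c.members[agg][host] = true
	return nil
}

// removeHost never fails in this simplified model.
func (c *cloud) removeHost(agg, host string) error {
	delete(c.members[agg], host)
	return nil
}

// reconcile adds the host to missing aggregates first, then removes it from
// superfluous ones. It keeps going after a failure and joins all errors.
func (c *cloud) reconcile(host string, desired []string) error {
	want := map[string]bool{}
	var errs []error
	for _, agg := range desired {
		want[agg] = true
		if !c.members[agg][host] {
			if err := c.addHost(agg, host); err != nil {
				errs = append(errs, err)
			}
		}
	}
	var current []string
	for agg, hosts := range c.members {
		if hosts[host] {
			current = append(current, agg)
		}
	}
	sort.Strings(current) // deterministic removal order
	for _, agg := range current {
		if !want[agg] {
			if err := c.removeHost(agg, host); err != nil {
				errs = append(errs, err)
			}
		}
	}
	return errors.Join(errs...)
}

func main() {
	c := &cloud{
		az:      map[string]string{"agg-old": "az-1", "agg-new": "az-2"},
		members: map[string]map[string]bool{"agg-old": {"host-1": true}},
	}
	fmt.Println("pass 1:", c.reconcile("host-1", []string{"agg-new"}))
	fmt.Println("pass 2:", c.reconcile("host-1", []string{"agg-new"}))
	fmt.Println("host-1 in agg-new:", c.members["agg-new"]["host-1"])
}
```

In this model the first pass reports the AZ conflict but still removes the host from the old aggregate, so the retry succeeds. With abort-on-first-error, the removal would never run and every retry would hit the same conflict.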

I suspect you will say: then let's simply remove the host from the aggregates first. But then we have a problem with aggregates used for tenant isolation, as there would be a period of time during which the tenant isolation is not active.

So, in summary, I think it follows a common practice, and solves this particular issue.

fwiesel · 2025-10-23T08:59:51Z

I have added a comment to explain that scenario.

notandy · 2025-10-23T11:54:44Z

internal/controller/aggregates_controller.go

+	if err := ac.trackError(ctx, hv, "failed updating aggregates", errs...); err != nil {
+		return ctrl.Result{}, err
+	}


It took me a while to understand what's happening here. Could you move the errs != nil check here so the flow is easier to follow?

Another thing: now we have a status update (with errors), an error log with a backtrace, and a bubbling-up error? I mean, we could just omit the bubbling-up error in this case.

fwiesel · 2025-10-23

> It took me a while to understand what's happening here. Could you move the errs != nil check here so the flow is easier to follow?

Sure, changed that.

> Another thing: now we have a status update (with errors), an error log with a backtrace, and a bubbling-up error? I mean, we could just omit the bubbling-up error in this case.

We need to bubble up the error to trigger the retry and the rate-limiter for that particular reconciliation.
In practice, the error won't get logged anymore unless we do so explicitly. Whether the stack trace is shown depends on the logging settings and can be configured. With the current debug logging, it will be shown on errors.
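For reference, a simplified model of how a controller work loop typically treats the possible outcomes of one reconcile call. It is a sketch of the general pattern, not controller-runtime's code. The `result` type, the `decide` function, and the returned action names are hypothetical.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// result is a hypothetical stand-in for ctrl.Result.
type result struct {
	RequeueAfter time.Duration
}

var errExample = errors.New("conflict")

// decide models what happens to a work item after one reconcile call.
func decide(res result, err error) string {
	switch {
	case err != nil:
		// A returned error takes precedence: the item is retried through
		// the rate limiter, so repeated failures back off.
		return "requeue rate-limited"
	case res.RequeueAfter > 0:
		// No error: the item is retried after the requested delay.
		return fmt.Sprintf("requeue after %s", res.RequeueAfter)
	default:
		// Success: nothing is queued until the next event passes the filter.
		return "done"
	}
}

func main() {
	fmt.Println(decide(result{}, errExample))                   // e.g. a Kubernetes API error
	fmt.Println(decide(result{RequeueAfter: time.Minute}, nil)) // e.g. an OpenStack error, for now
	fmt.Println(decide(result{}, nil))
}
```

Under this model, swallowing the error would land in the last branch, and with status-only updates filtered out nothing would trigger another attempt.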

github-actions · 2025-10-23T13:25:14Z

Merging this branch will decrease overall coverage

| Impacted Packages | Coverage Δ | 🤖 |
|---|---|---|
| github.com/cobaltcore-dev/openstack-hypervisor-operator/internal/controller | 33.16% (-0.10%) | 👎 |

Coverage by file

Changed files (no unit tests)

| Changed File | Coverage Δ | Total | Covered | Missed | 🤖 |
|---|---|---|---|---|---|
| github.com/cobaltcore-dev/openstack-hypervisor-operator/internal/controller/aggregates_controller.go | 63.04% (-4.43%) | 92 (+9) | 58 (+2) | 34 (+7) | 👎 |

Please note that the "Total", "Covered", and "Missed" counts above refer to code statements instead of lines of code. The value in brackets refers to the test coverage of that file in the old version of the code.

notandy

#166 is a proposal that's based on this PR, but does some things "cleaner".

In particular, I have a problem with trackError: if it is given errors, it unconditionally returns an error. So I moved some of the logic out into the caller to make it clear what is intended and why.

This PR modifies the code of #161 to improve the following points:

1. No need for an extra error log: instead of dropping reconcile errors, we format them nicely with the Encoder.
2. Functions (like the rewritten `setErrorCondition`) should not return the errors they've been invoked with, but only return errors if they themselves fail. Also, it's an unneeded round trip to return the same error that has been passed in by the caller.
3. Introduce `utils.LifecycleEnabledPredicate`, a predicate that filters events for hypervisors with LifecycleEnabled == True.
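Point 2 can be sketched as follows. This is an illustration of the convention only, not the code from #166: the `status` and `statusWriter` types, the signature of `setErrorCondition`, and the `reconcile` caller are assumptions made for the example.

```go
package main

import (
	"errors"
	"fmt"
)

// status is a hypothetical stand-in for the hypervisor's condition message.
type status struct {
	Message string
}

// statusWriter is a hypothetical persistence hook that may fail on its own.
type statusWriter func(status) error

var errBoom = errors.New("boom")

// setErrorCondition records cause in the status. It returns an error only if
// recording itself fails, never the cause it was called with.
func setErrorCondition(write statusWriter, msg string, cause error) error {
	return write(status{Message: fmt.Sprintf("%s: %v", msg, cause)})
}

// reconcile shows the caller deciding explicitly what to return.
func reconcile(write statusWriter, errs ...error) error {
	cause := errors.Join(errs...)
	if cause == nil {
		return nil
	}
	if err := setErrorCondition(write, "failed updating aggregates", cause); err != nil {
		// Recording failed too: report both problems.
		return errors.Join(cause, err)
	}
	return cause // returned by the caller, visibly, to trigger the retry
}

func main() {
	write := func(s status) error {
		fmt.Println("status:", s.Message)
		return nil
	}
	fmt.Println("returned:", reconcile(write, errBoom))
	fmt.Println("returned:", reconcile(write))
}
```

The design choice is that the `errs != nil` decision and the returned error both live in the caller, so a reader does not have to open the helper to see why the reconcile call fails.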

fwiesel requested a review from notandy October 22, 2025 13:29

notandy requested changes Oct 22, 2025

fwiesel force-pushed the GenerationChangedPredicateAggregates branch from 724a295 to 266cdc8 Compare October 23, 2025 09:00

fwiesel requested a review from notandy October 23, 2025 09:01

fwiesel force-pushed the GenerationChangedPredicateAggregates branch from 266cdc8 to b05358f Compare October 23, 2025 09:51

notandy requested changes Oct 23, 2025

fwiesel force-pushed the GenerationChangedPredicateAggregates branch from b05358f to c706846 Compare October 23, 2025 13:23

fwiesel requested a review from notandy October 23, 2025 13:24

notandy mentioned this pull request Oct 23, 2025

AggregateLoggingProposal #166

Merged

notandy requested changes Oct 23, 2025

fwiesel closed this Oct 24, 2025

fwiesel deleted the GenerationChangedPredicateAggregates branch October 24, 2025 07:17
