@fwiesel
Contributor

@fwiesel fwiesel commented Oct 22, 2025

Prior to this change, we had to exit the reconcile function after each status update and pick up from there in the reconcile that the update triggered. Now, with the filter, we only retry on an error or a RequeueAfter.

Joining the errors lets us complete at least some of the changes instead of always aborting on the first failure.

Return Kubernetes errors directly, and return a RequeueAfter on an error from OpenStack for now, until the logging situation has been cleared up.
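
To illustrate why the filter removes the reconcile round-trips after status updates, here is a minimal stdlib-only sketch. It assumes the filter is controller-runtime's `predicate.GenerationChangedPredicate` (suggested by the branch name, not confirmed in this thread); the `object` type and `generationChanged` function are hypothetical stand-ins, not code from the PR.

```go
package main

import "fmt"

// object is a hypothetical stand-in for a Kubernetes object; only the
// fields relevant to the filter are modelled.
type object struct {
	Generation int64  // bumped by the API server when the spec changes
	Status     string // status writes leave Generation untouched
}

// generationChanged models the idea behind a generation-changed predicate:
// an update event reaches the reconciler only if the generation differs.
func generationChanged(oldObj, newObj object) bool {
	return oldObj.Generation != newObj.Generation
}

func main() {
	before := object{Generation: 1, Status: "Pending"}
	statusOnly := object{Generation: 1, Status: "Ready"}
	specChange := object{Generation: 2, Status: "Ready"}

	fmt.Println("status-only update passes filter:", generationChanged(before, statusOnly))
	fmt.Println("spec change passes filter:", generationChanged(before, specChange))
}
```

With such a filter in place, the controller's own status writes no longer trigger a fresh reconcile, so a retry has to be requested explicitly through an error or a RequeueAfter.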

@fwiesel fwiesel requested a review from notandy October 22, 2025 13:29
Contributor

@notandy notandy left a comment

I am fine with the predicate. But do we really need to aggregate the errors? Does it give us any benefit to apply some more aggregates, but still fail ultimately for the cost of a less informative condition message?

Do you have an example where this behavior is beneficial?

@fwiesel
Contributor Author

fwiesel commented Oct 23, 2025

> Does it give us any benefit to apply some more aggregates, but still fail ultimately for the cost of a less informative condition message?
> Do you have an example where this behavior is beneficial?

Sure. As background, I stumbled over that pattern in the Gardener project (i.e. here), but you can also find it in Kubernetes proper.

The current behavior is to first add all missing aggregates and then remove all superfluous ones, but to abort on the first error. So the remaining changes are not executed, even though they are likely possible or maybe even necessary. Trying them all will eventually resolve possible inter-dependencies.

Assuming there are multiple underlying errors that require human intervention, the (human) operator will only see one error at a time, and will only be able to fix that one.
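
As a rough illustration of the join-instead-of-abort pattern, the following sketch uses the standard library's `errors.Join` (Go 1.20+). The `applyAll` and `demo` helpers and the error messages are hypothetical and only serve the example.

```go
package main

import (
	"errors"
	"fmt"
)

// applyAll runs every step, even after a failure, and returns the joined
// errors. errors.Join returns nil when there is nothing to join.
func applyAll(steps []func() error) error {
	var errs []error
	for _, step := range steps {
		if err := step(); err != nil {
			errs = append(errs, err)
		}
	}
	return errors.Join(errs...)
}

// demo runs three hypothetical steps, two of which fail, and reports how
// many succeeded together with the joined error.
func demo() (int, error) {
	applied := 0
	steps := []func() error{
		func() error { return errors.New("add to aggregate-a: conflict") },
		func() error { applied++; return nil },
		func() error { return errors.New("remove from aggregate-c: not found") },
	}
	err := applyAll(steps)
	return applied, err
}

func main() {
	applied, err := demo()
	fmt.Println("steps applied:", applied)
	fmt.Println("errors:")
	fmt.Println(err) // one message per line, so the operator sees all of them
}
```

The middle step still runs although the first one failed, and the returned error carries both failure messages at once.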

Admittedly, the inter-dependencies of host-aggregates are rather simple, so that is a rather theoretical scenario and more an explanation why I found that pattern appealing.

Concretely, with host aggregates we have the problem that one cannot add a host to two aggregates with different availability zones.
With this PR you can simply declare the new aggregate membership: it will first try to add the host to the aggregate in the new AZ (and raise an error), but also remove it from the aggregate with the old AZ. The error is returned, the reconcile is retried, and then the host can be added to the aggregate with the new AZ (assuming no VMs are on the host).
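
The two-pass behavior described here can be simulated with a small in-memory model. Everything in this sketch is an assumption made for illustration: the `cloud` type, its AZ rule, and the `reconcile` function are simplified stand-ins, not the operator's actual code or the OpenStack API.

```go
package main

import (
	"errors"
	"fmt"
)

// cloud is a hypothetical in-memory stand-in for the aggregate API.
type cloud struct {
	az      map[string]string // aggregate name -> availability zone
	members map[string]bool   // aggregates the host currently belongs to
}

// add fails if the host is already in an aggregate of another AZ.
func (c *cloud) add(agg string) error {
	for cur := range c.members {
		if c.az[cur] != c.az[agg] {
			return fmt.Errorf("cannot add host to %s: host is in AZ %s", agg, c.az[cur])
		}
	}
	c.members[agg] = true
	return nil
}

// remove takes the host out of the aggregate; it never fails in this model.
func (c *cloud) remove(agg string) error {
	delete(c.members, agg)
	return nil
}

// reconcile adds missing aggregates first, then removes superfluous ones,
// and keeps going on errors instead of aborting on the first.
func reconcile(c *cloud, desired []string) error {
	var errs []error
	want := map[string]bool{}
	for _, agg := range desired {
		want[agg] = true
		if !c.members[agg] {
			errs = append(errs, c.add(agg))
		}
	}
	for agg := range c.members {
		if !want[agg] {
			errs = append(errs, c.remove(agg))
		}
	}
	return errors.Join(errs...) // nil entries are dropped
}

func main() {
	c := &cloud{
		az:      map[string]string{"old": "az-1", "new": "az-2"},
		members: map[string]bool{"old": true},
	}
	fmt.Println("pass 1:", reconcile(c, []string{"new"}))
	fmt.Println("pass 2:", reconcile(c, []string{"new"}))
}
```

In this model the first pass fails on the add but still removes the old membership, so the retried second pass succeeds.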

I suspect you will say: then let's simply remove the host from the aggregates first. But then we have a problem with aggregates with tenant isolation, as there would be a period of time during which the tenant isolation is not active.

So, in summary, I think it follows a common practice, and solves this particular issue.

@fwiesel
Contributor Author

fwiesel commented Oct 23, 2025

I have added a comment to explain that scenario.

@fwiesel fwiesel force-pushed the GenerationChangedPredicateAggregates branch from 724a295 to 266cdc8 Compare October 23, 2025 09:00
@fwiesel fwiesel requested a review from notandy October 23, 2025 09:01
@fwiesel fwiesel force-pushed the GenerationChangedPredicateAggregates branch from 266cdc8 to b05358f Compare October 23, 2025 09:51
Comment on lines 111 to 113
    if err := ac.trackError(ctx, hv, "failed updating aggregates", errs...); err != nil {
        return ctrl.Result{}, err
    }
Contributor

It took me a while to understand what's happening here. Could you move the errs != nil check here so it's easier to follow the flow of thoughts?

Another thing: now we have a status update (with errors), an error log with a backtrace, and a bubbling-up error? I mean, we can just omit the bubbling-up error in this case.

Contributor Author

> It took me a while to understand what's happening here. Could you move the errs != nil check here so it's easier to follow the flow of thoughts?

Sure, changed that.

> Another thing: now we have a status update (with errors), an error log with a backtrace, and a bubbling-up error? I mean, we can just omit the bubbling-up error in this case.

We need to bubble up the error to trigger the retry and the rate limiter for that particular reconciliation.
In practice, the error won't get logged anymore unless we log it explicitly. Whether the stack trace is shown depends on the logging settings and can be configured. With the current debug logging, it is shown on errors.
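
To make the retry argument concrete: in controller-runtime, a non-nil error returned from Reconcile requeues the request through a rate limiter. The sketch below is a simplified, hypothetical model of per-item exponential backoff, similar in spirit to the failure rate limiter in client-go's workqueue; the `backoff` type and its parameters are assumptions, not the operator's configuration.

```go
package main

import (
	"fmt"
	"time"
)

// backoff is a simplified model of a per-item exponential rate limiter.
type backoff struct {
	base, max time.Duration
	failures  map[string]int // consecutive failures per reconcile key
}

// when records a failure for key and returns the delay before its retry:
// base doubled once per earlier failure, capped at max.
func (b *backoff) when(key string) time.Duration {
	n := b.failures[key]
	b.failures[key] = n + 1
	d := b.base
	for i := 0; i < n && d < b.max; i++ {
		d *= 2
	}
	if d > b.max {
		d = b.max
	}
	return d
}

// forget resets the failure count, as happens after a successful reconcile.
func (b *backoff) forget(key string) {
	delete(b.failures, key)
}

func main() {
	b := &backoff{
		base:     5 * time.Millisecond,
		max:      40 * time.Millisecond,
		failures: map[string]int{},
	}
	for i := 0; i < 5; i++ {
		fmt.Println("retry after", b.when("hypervisor-1"))
	}
	b.forget("hypervisor-1")
	fmt.Println("after success, retry after", b.when("hypervisor-1"))
}
```

Swallowing the error would skip this bookkeeping, so a persistently failing hypervisor would either not be retried at all or be retried at a fixed interval chosen by the caller.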

@fwiesel fwiesel force-pushed the GenerationChangedPredicateAggregates branch from b05358f to c706846 Compare October 23, 2025 13:23
@fwiesel fwiesel requested a review from notandy October 23, 2025 13:24
@github-actions

Merging this branch will decrease overall coverage

| Impacted Packages | Coverage Δ | 🤖 |
|---|---|---|
| github.com/cobaltcore-dev/openstack-hypervisor-operator/internal/controller | 33.16% (-0.10%) | 👎 |

Coverage by file

Changed files (no unit tests)

| Changed File | Coverage Δ | Total | Covered | Missed | 🤖 |
|---|---|---|---|---|---|
| github.com/cobaltcore-dev/openstack-hypervisor-operator/internal/controller/aggregates_controller.go | 63.04% (-4.43%) | 92 (+9) | 58 (+2) | 34 (+7) | 👎 |

Please note that the "Total", "Covered", and "Missed" counts above refer to code statements instead of lines of code. The value in brackets refers to the test coverage of that file in the old version of the code.

notandy added a commit that referenced this pull request Oct 23, 2025
This PR modifies the code of #161 to improve the following points:
1. No need for an extra error log: instead of dropping Reconcile
   errors, we format them nicely with the Encoder.
2. Functions (like the rewritten `setErrorCondition`) should not return the
   errors they've been invoked with, but only return errors if they
   fail. Also, it's an unneeded roundtrip to return the same error that
   has been passed by the caller.
3. Introduce `utils.LifecycleEnabledPredicate`, a predicate that will
   filter events for hypervisors with LifecycleEnabled == True.
@notandy notandy mentioned this pull request Oct 23, 2025
Contributor

@notandy notandy left a comment

#166 is a proposal that's based on this PR, but does some things "cleaner".

In particular, I have a problem with trackError: if it is given errors, it unconditionally returns an error. So I moved some stuff out into the caller to make it clear what's intended and why.
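
The separation argued for here might look roughly like the following sketch. It is not the code of #166: the `statusWriter` type, the signature given to `setErrorCondition`, and the `reconcile` and `demo` helpers are hypothetical and only illustrate a helper that reports its own failures while the caller decides what to bubble up.

```go
package main

import (
	"errors"
	"fmt"
)

// statusWriter is a hypothetical stand-in for the Kubernetes status client.
type statusWriter struct {
	message string
	fail    bool // simulate a failing status update
}

// setErrorCondition records cause in the status. It returns an error only
// if writing the status itself fails, never the cause it was handed.
func (w *statusWriter) setErrorCondition(cause error) error {
	if w.fail {
		return errors.New("status update failed")
	}
	w.message = cause.Error()
	return nil
}

// reconcile makes the control flow explicit in the caller: check for
// errors, record them, then return them so the request is retried.
func reconcile(w *statusWriter, errs ...error) error {
	joined := errors.Join(errs...)
	if joined == nil {
		return nil
	}
	if err := w.setErrorCondition(joined); err != nil {
		return errors.Join(joined, err)
	}
	return joined
}

// demo runs one failing reconcile and returns the recorded status message
// together with the error handed back to the controller.
func demo() (string, error) {
	w := &statusWriter{}
	err := reconcile(w, errors.New("failed updating aggregates"))
	return w.message, err
}

func main() {
	msg, err := demo()
	fmt.Println("condition message:", msg)
	fmt.Println("returned error:", err)
}
```

Here the nil check and the decision to return the error both live in `reconcile`, which is the readability point raised in the review.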

fwiesel pushed a commit that referenced this pull request Oct 24, 2025
@fwiesel fwiesel closed this Oct 24, 2025
@fwiesel fwiesel deleted the GenerationChangedPredicateAggregates branch October 24, 2025 07:17
3 participants