Skip to content

[Feature/Bug] Managed reconciler needs a mechanism to express "async operation in progress" without triggering ReconcileSuccess or Update() #941

@jacob-hudson

Description

@jacob-hudson

This issue is filed against crossplane/crossplane-runtime because the root constraint exists here, not in upjet or any provider.

The managed reconciler in pkg/reconciler/managed unconditionally calls status.MarkConditions(xpv1.ReconcileSuccess()) whenever Observe() returns ResourceUpToDate: true with no error. There is currently no way for an ExternalClient to signal "I am deliberately returning ResourceUpToDate: true to suppress Update(), but reconciliation has not actually succeeded — an async operation is in progress." The Synced condition is therefore always overwritten to True in this path, regardless of what the provider set before returning.

Upjet exploits ResourceUpToDate: true as the mechanism to suppress Update() during async operations, but pays the cost of a permanently false Synced=True signal. This is not a bug in upjet's logic — it is an absence of a contract in crossplane-runtime that would allow the two concerns (suppress Update, signal pending) to be expressed independently.

A corresponding issue has been filed at crossplane/upjet: [link — fill in once filed].


What happened?

When an upjet-based provider (e.g. upbound/provider-aws-kafka) is performing a long-running async operation — such as an MSK cluster broker instance type change taking 60+ minutes — the managed resource's Synced condition remains True for the entire duration.

The AsyncOperation=False (Reason: Ongoing) condition is correctly set by upjet throughout. But Synced continues to read True with Reason: ReconcileSuccess, actively misleading operators, alerting systems, and automation that polls Synced to determine whether a resource has converged.

Expected: Synced=False while an async operation is in progress, transitioning back to True when it completes.

Actual: Synced=True/ReconcileSuccess is actively written on every reconcile pass for the full duration of the operation.

Root cause

In pkg/reconciler/managed/reconciler.go, the reconcile loop unconditionally sets ReconcileSuccess when Observe() returns {ResourceExists: true, ResourceUpToDate: true} with no error:

// Pseudocode — see pkg/reconciler/managed/reconciler.go for the full implementation
observation, err := external.Observe(ctx, managed)
if err != nil {
    // ...error path sets ReconcileError...
}
if observation.ResourceUpToDate {
    status.MarkConditions(xpv1.ReconcileSuccess())
    return reconcile.Result{RequeueAfter: r.pollInterval}, nil
}
// ... else calls Update()

There is no branch for "ResourceUpToDate=true but not actually reconciled successfully." The signal is binary: either the resource is up to date (success) or it needs updating (call Update()).

Upjet's ASyncInProgress branch in pkg/controller/external.go is forced to return ResourceUpToDate: true to prevent the reconciler from calling Update() — which could trigger a redundant or harmful second async operation. The consequence is that crossplane-runtime then overwrites any pending/in-progress condition upjet attempted to set, with ReconcileSuccess.

Why this cannot be fixed in upjet alone

The two naive fixes are both broken:

Returning ResourceUpToDate: false: Causes the reconciler to call Update(), potentially queuing a second async operation against an already in-flight AWS change.

Setting ReconcileError before returning: Fails for two independent reasons. First, crossplane-runtime overwrites it with ReconcileSuccess unconditionally on the same pass (since ResourceUpToDate: true is returned). Second, ReconcileError triggers exponential backoff, which is the opposite of what is needed — frequent requeues are necessary so that async completion is detected promptly.

There is also no ReconcilePending constructor in crossplane-runtime v1.x. The only Synced-related constructors available are ReconcileSuccess, ReconcileError, and ReconcilePaused — none model "deferred pending an in-flight async operation."

Proposed solutions

Option 1: Add ReconcilePending condition constructor

Add a new constructor to crossplane-runtime:

// ReconcilePending returns a condition indicating reconciliation is deferred
// pending an in-flight async operation. Unlike ReconcileError, this does not
// trigger exponential backoff.
func ReconcilePending(msg string) xpv1.Condition {
    return xpv1.Condition{
        Type:    xpv1.TypeSynced,
        Status:  corev1.ConditionFalse,
        Reason:  "ReconcilePending",
        Message: msg,
    }
}

The managed reconciler would need a corresponding hook — for example, checking whether the provider set a ReconcilePending condition after Observe() returns but before the ResourceUpToDate branch overwrites it, and if so, honouring it without calling status.MarkConditions(ReconcileSuccess()).

This correctly models the semantic distinction: "reconciliation failed" vs "reconciliation is deferred pending an in-flight operation." However, it requires condition inspection inside the reconciler, which is slightly awkward — see Option 2 for a cleaner alternative.

Option 2 (preferred): Extend ExternalObservation with AsyncOperationInProgress bool

ExternalObservation is defined in crossplane-runtime. A new field would allow providers to express both signals simultaneously — suppress Update() and suppress ReconcileSuccess — without the reconciler needing to inspect conditions written as side effects during Observe():

type ExternalObservation struct {
    ResourceExists           bool
    ResourceUpToDate         bool
    AsyncOperationInProgress bool // suppress Update() AND suppress ReconcileSuccess
    // ...
}

The managed reconciler would handle this in the ResourceUpToDate branch:

if observation.ResourceUpToDate {
    if observation.AsyncOperationInProgress {
        status.MarkConditions(xpv1.ReconcilePending("Async operation in progress"))
        // r.pollInterval is illustrative; a shorter interval would be preferable
        // to detect async completion promptly
        return reconcile.Result{RequeueAfter: r.pollInterval}, nil
    }
    status.MarkConditions(xpv1.ReconcileSuccess())
    return reconcile.Result{RequeueAfter: r.pollInterval}, nil
}

Upjet would then set AsyncOperationInProgress: true in its ASyncInProgress branch instead of relying on ResourceUpToDate: true alone. This keeps semantics within the observation contract and avoids condition inspection logic in the reconciler.

Option 3 (upjet-only, partial): Dedicated async condition type

Accept that Synced semantics are constrained by crossplane-runtime and document the AsyncOperation=False (Reason: Ongoing) condition as the canonical signal for async-in-progress state. This is the least invasive option but leaves Synced incorrect and does not address the root issue. Not recommended as a resolution.

Operational impact

For operators running alerting on Synced=False persisting beyond a threshold, the current behaviour silently swallows the alert: Synced=True is written on every reconcile pass during a genuine in-progress mutation, so the alert condition never triggers. Any tooling or automation that polls Synced to determine whether a resource has converged will receive a false positive for the full duration of the async update.

Related

Metadata

Metadata

Assignees

No one assigned

    Labels

    bugSomething isn't working

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions