[Bug] Synced condition remains True during ongoing async operations #609

@jacob-hudson

What happened?

When an async operation is in progress (e.g. an MSK cluster broker instance type change taking 60+ minutes), the managed resource's Synced condition remains True for the entire duration. This gives a false signal that the resource is reconciled and up-to-date, when in reality a long-running mutation is actively in progress in AWS.

Expected behavior: Synced=False while an async operation is ongoing, so that operators and automation can correctly determine that the resource has not yet reached its desired state.

Actual behavior: Synced=True is preserved throughout the async operation. The only Synced=False transition that occurs is via unrelated error paths (e.g. credential expiry), not from the async operation itself.

This was observed with upbound/provider-aws-kafka during an MSK cluster broker instance type change (kafka.m5.large → kafka.t3.small). The operation took ~84 minutes. An unrelated AWS token expiry at ~54 minutes triggered a reconcile, which hit a credential error and set Synced=False via the error path. That was the only Synced=False transition that occurred, and it was entirely incidental; without it, Synced would have remained True for the full 84 minutes. The AsyncOperation=False (Reason: Ongoing) condition was correctly set throughout.

Environment

  • Crossplane: v1.15.2
  • Upjet: github.com/crossplane/upjet v1.5.2-0.20250612145416-771188e36da2
  • Provider: upbound/provider-aws-kafka v1.23.1
  • AWS region: eu-west-1

Root cause (confirmed in source)

In pkg/controller/external.go, the Observe() path handles res.ASyncInProgress as follows:

case res.ASyncInProgress:
    mg.SetConditions(resource.AsyncOperationOngoingCondition())
    return managed.ExternalObservation{
        ResourceExists:   true,
        ResourceUpToDate: true,
    }, nil

Returning ResourceUpToDate: true with no error causes crossplane-runtime's managed reconciler to unconditionally call status.MarkConditions(xpv1.ReconcileSuccess()) and schedule the next reconcile at the poll interval (typically 10 minutes for provider-aws) — meaning Synced=True is actively written on every reconcile pass for the full duration of the async operation. Upjet does not own the final Synced write.
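
To make the control flow concrete, here is a minimal, self-contained Go sketch of the decision described above. It is a simplified model, not the crossplane-runtime implementation: the local ExternalObservation struct mirrors only the two relevant fields, and Outcome, errObserve and afterObserve are hypothetical names introduced for illustration.

```go
package main

import "errors"

// ExternalObservation mirrors only the two fields of crossplane-runtime's
// managed.ExternalObservation that are relevant to this issue.
type ExternalObservation struct {
	ResourceExists   bool
	ResourceUpToDate bool
}

// Outcome is a hypothetical summary type used only for this illustration.
type Outcome struct {
	Action string // external call made after Observe: "none", "create" or "update"
	Synced bool   // Synced status written, assuming that call (if any) succeeds
}

// errObserve stands in for any error returned by Observe(), for example an
// expired credential.
var errObserve = errors.New("observe failed")

// afterObserve is a simplified model of what the managed reconciler does once
// Observe() has returned. It is an assumption-laden sketch, not the real code.
func afterObserve(obs ExternalObservation, err error) Outcome {
	switch {
	case err != nil:
		// Error path: this is the only way Synced became False in the report.
		return Outcome{Action: "none", Synced: false}
	case !obs.ResourceExists:
		return Outcome{Action: "create", Synced: true}
	case !obs.ResourceUpToDate:
		// Returning ResourceUpToDate: false would land here and call Update().
		return Outcome{Action: "update", Synced: true}
	default:
		// upjet's ASyncInProgress branch lands here: nothing to do,
		// so ReconcileSuccess (Synced=True) is written.
		return Outcome{Action: "none", Synced: true}
	}
}
```

Feeding the model the observation that upjet returns in the ASyncInProgress branch (ResourceExists: true, ResourceUpToDate: true, nil error) yields no external call and Synced=True, which matches the behaviour reported here.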

This is the counterpart to the fix made in upjet v1.3.0 (in response to crossplane-contrib/provider-upjet-aws#1164), which ensured Synced=False is correctly set when an async operation fails. The async in-progress case was never addressed.

The gap is also acknowledged internally in pkg/terraform: the Flush function is deprecated in favour of Clear, and its deprecation note explicitly refers to the need to implement proper Synced condition handling for asynchronous external clients. This suggests the gap has been known but not yet prioritised.

Operational impact

For operators running alerting on Synced=False persisting beyond a threshold, the current behavior actively suppresses alerts during a genuine in-progress mutation: Synced=True is written on every reconcile pass, meaning the alert never fires. The AsyncOperation=False (Reason: Ongoing) condition exists but is not part of standard Crossplane alerting conventions. Additionally, any tooling or automation that polls Synced to determine whether a resource has converged will receive a false positive for the full duration of the update.
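
Until Synced reflects in-flight operations, tooling that polls for convergence could check both conditions. The following Go sketch is one possible stopgap, assuming the input is the JSON array printed by kubectl with -o jsonpath='{.status.conditions}'. SyncedAndIdle and the local Condition struct are hypothetical names; only the condition types and the Ongoing reason come from this report.

```go
package main

import "encoding/json"

// Condition holds the subset of a Kubernetes status condition used below.
type Condition struct {
	Type   string `json:"type"`
	Status string `json:"status"`
	Reason string `json:"reason"`
}

// SyncedAndIdle reports whether Synced=True AND no async operation is
// ongoing. raw is expected to be a JSON array of conditions.
// This is a hypothetical helper, not part of upjet or crossplane-runtime.
func SyncedAndIdle(raw []byte) (bool, error) {
	var conds []Condition
	if err := json.Unmarshal(raw, &conds); err != nil {
		return false, err
	}
	synced, asyncOngoing := false, false
	for _, c := range conds {
		switch c.Type {
		case "Synced":
			synced = c.Status == "True"
		case "AsyncOperation":
			// AsyncOperation=False with Reason: Ongoing marks an in-flight change.
			asyncOngoing = c.Status == "False" && c.Reason == "Ongoing"
		}
	}
	return synced && !asyncOngoing, nil
}
```

This only works around the problem on the consumer side; alerting that keys on Synced alone is still suppressed.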

Why Ready should not change

Ready correctly reflects the health of the external resource — the MSK cluster is still up and serving traffic during the update, so Ready=True is accurate. Changing Ready would unnecessarily cascade into XR composition readiness gates and is not the correct signal here. Only Synced needs to change.

How to reproduce

  1. Use upbound/provider-aws-kafka with UseAsync: true for a resource that supports long-running mutations (cluster.kafka.aws.upbound.io)
  2. Apply a change that triggers a long async update — e.g. changing MSK broker instance type
  3. While the async operation is in progress, observe conditions:
     kubectl get cluster.kafka.aws.upbound.io <name> -o jsonpath='{.status.conditions}' | jq
  4. Note that AsyncOperation=False (Reason: Ongoing) is correctly set, but Synced remains True throughout, when it should be False

AWS-side confirmation that the operation is actively in progress while Synced=True:

aws kafka describe-cluster --cluster-arn "<arn>" --query 'ClusterInfo.State'
# Returns "UPDATING" while Synced=True

Discussion: possible resolutions

A complete fix requires changes at the crossplane-runtime layer. The sections below first explain why upjet-only fixes fall short, then list the options that have been considered:

Why simple fixes within upjet do not work

There are two tempting but broken approaches:

Setting xpv1.ReconcileError before returning: This fails for two independent reasons. First, because upjet returns ResourceUpToDate: true, crossplane-runtime's reconciler will call status.MarkConditions(xpv1.ReconcileSuccess()) immediately after Observe() returns, overwriting it unconditionally. Second, even if that were not the case, ReconcileError triggers exponential backoff — the opposite of what is needed, since frequent requeues are desirable during an async operation so that completion is detected promptly and Synced can be correctly set back to True.

Nor is there a ReconcilePending constructor in crossplane-runtime v1.x that could be used instead. The only Synced-related constructors are ReconcileSuccess, ReconcileError, and ReconcilePaused, none of which correctly models "deferred pending an in-flight operation."

Returning ResourceUpToDate: false: This would cause the managed reconciler to call Update(), potentially triggering a redundant or harmful second async operation against the already in-flight AWS change.

Option 1: Add ReconcilePending to crossplane-runtime

A new condition constructor that sets Synced=False with a non-error reason (e.g. Reason: ReconcilePending, Message: "Async operation in progress"), which the managed reconciler could honour without triggering backoff. This correctly models the semantic distinction between "reconciliation failed" and "reconciliation is deferred pending an in-flight operation." However it requires condition inspection inside the reconciler, which is slightly awkward — see Option 2 for a cleaner alternative. This requires a crossplane-runtime change as a prerequisite, followed by adoption in upjet.
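
A rough sketch of what such a constructor might look like follows. Everything here is hypothetical: ReconcilePending does not exist in crossplane-runtime, and the local Condition struct uses plain strings in place of the real xpv1 types so that the example compiles on its own.

```go
package main

// Condition mirrors the fields of crossplane-runtime's xpv1.Condition that
// matter for this sketch, using plain strings instead of the real types.
type Condition struct {
	Type    string
	Status  string
	Reason  string
	Message string
}

const (
	TypeSynced             = "Synced"
	ReasonReconcileError   = "ReconcileError"
	ReasonReconcilePending = "ReconcilePending" // hypothetical new reason
)

// ReconcilePending is the hypothetical constructor proposed in Option 1:
// Synced=False with a non-error reason.
func ReconcilePending(message string) Condition {
	return Condition{
		Type:    TypeSynced,
		Status:  "False",
		Reason:  ReasonReconcilePending,
		Message: message,
	}
}

// usesBackoff models the requirement stated above: only a genuine
// reconcile error should trigger exponential backoff, so a pending
// condition keeps requeueing at the normal rate.
func usesBackoff(c Condition) bool {
	return c.Type == TypeSynced && c.Reason == ReasonReconcileError
}
```

The point of the separate reason is that the reconciler can tell "failed" from "deferred" without changing how ReconcileError behaves today.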

Option 2 (preferred): Extend ExternalObservation in crossplane-runtime

ExternalObservation is defined in crossplane-runtime. Adding a new field (e.g. AsyncOperationInProgress bool) would allow the managed reconciler itself to set Synced=False without triggering Update(), keeping the semantics fully within the reconciler contract. This is cleaner than Option 1 — the signal lives in the return value rather than a side-effectful condition write, avoiding any need for the reconciler to inspect conditions set during Observe(). This also requires a crossplane-runtime change, with upjet then setting the new field in the ASyncInProgress branch.
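
As a sketch of how Option 2 could behave, the Go snippet below adds the proposed field to a local copy of the struct and models the reconciler's branch on it. AsyncOperationInProgress, Decision, decide and the "ReconcilePending" reason string are all hypothetical; creation and error paths are omitted for brevity.

```go
package main

// ExternalObservation is a local stand-in for crossplane-runtime's
// managed.ExternalObservation, extended with the field proposed in Option 2.
type ExternalObservation struct {
	ResourceExists   bool
	ResourceUpToDate bool
	// AsyncOperationInProgress is hypothetical; it does not exist today.
	AsyncOperationInProgress bool
}

// Decision is a hypothetical summary of what the reconciler would do.
type Decision struct {
	SyncedStatus string // status written to the Synced condition
	SyncedReason string // reason written to the Synced condition
	CallUpdate   bool   // whether Update() is invoked
}

// decide models the post-Observe branch under Option 2. The new field is
// checked first so that an in-flight operation never reaches Update().
func decide(obs ExternalObservation) Decision {
	switch {
	case obs.AsyncOperationInProgress:
		// "ReconcilePending" is a placeholder reason name.
		return Decision{SyncedStatus: "False", SyncedReason: "ReconcilePending", CallUpdate: false}
	case obs.ResourceExists && !obs.ResourceUpToDate:
		return Decision{SyncedStatus: "True", SyncedReason: "ReconcileSuccess", CallUpdate: true}
	default:
		return Decision{SyncedStatus: "True", SyncedReason: "ReconcileSuccess", CallUpdate: false}
	}
}
```

In this model upjet's ASyncInProgress branch would only need to set the new field; observations that leave it false behave exactly as they do today.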

Option 3: Introduce a dedicated upjet condition

Introduce a condition type separate from Synced as a documented signal for async-in-progress state, accepting that Synced semantics are constrained by crossplane-runtime. The AsyncOperation=False (Reason: Ongoing) condition already partially serves this role but is not part of standard Crossplane alerting conventions. This is the least invasive option but leaves the Synced signal incorrect and does not address the root issue.

This issue should be tracked alongside a corresponding crossplane-runtime issue to add ReconcilePending or extend ExternalObservation.
