topic: allow finalizer removal when broker is unreachable #1234

Merged
birdayz merged 1 commit into main from jb/fix-topic-finalizer-on-missing-credentials
Feb 2, 2026

Conversation

@birdayz birdayz commented Jan 22, 2026

What

Allow Topic CR finalizer removal when the Kafka broker is unreachable or credentials are missing during deletion.

Why

Topics were getting stuck in Terminating state indefinitely when:

  • Broker is unreachable (connection refused, DNS failure)
  • Credentials Secret was deleted before the Topic
  • Cloud secret doesn't exist

This blocks namespace deletion and requires manual intervention to remove finalizers.

Implementation details

Extend ignoreAllConnectionErrors() to detect network dial errors via net.OpError in the error chain. When deletion fails due to non-recoverable connection errors, allow the finalizer to be removed instead of retrying forever.

Error types now handled during deletion:

  • net.OpError - connection refused, DNS failures, timeouts
  • secrets.ErrSecretNotFound - cloud secret missing
  • K8s NotFound - credentials Secret missing
  • Terminal client errors - SASL auth failures
  • Invalid cluster reference errors
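
The deletion path described above can be sketched roughly as follows. This is a minimal, stdlib-only illustration, not the operator's code: `resource`, `finalize`, `finalizerName`, `isUnreachable` and the two sample errors are hypothetical names, and the real controller works on Kubernetes objects and uses ignoreAllConnectionErrors() as the classifier.

```go
package main

import (
	"errors"
	"fmt"
)

// finalizerName is a hypothetical finalizer key used only in this sketch.
const finalizerName = "example.com/cleanup"

// Hypothetical errors: one that makes cleanup impossible, one worth retrying.
var (
	errUnreachable = errors.New("broker unreachable")
	errTransient   = errors.New("temporary failure")
)

// resource is a hypothetical stand-in for a custom resource such as a Topic.
type resource struct {
	Deleting   bool
	Finalizers []string
}

// isUnreachable is a placeholder classifier for non-recoverable errors.
func isUnreachable(err error) bool {
	return errors.Is(err, errUnreachable)
}

// finalize runs cleanup for a resource that is being deleted. If cleanup
// fails with an error that skippable accepts, the finalizer is removed
// anyway; any other error is returned so the caller can retry.
func finalize(r *resource, cleanup func() error, skippable func(error) bool) error {
	if !r.Deleting {
		return nil
	}
	if err := cleanup(); err != nil && !skippable(err) {
		return fmt.Errorf("cleanup failed, will retry: %w", err)
	}
	kept := make([]string, 0, len(r.Finalizers))
	for _, f := range r.Finalizers {
		if f != finalizerName {
			kept = append(kept, f)
		}
	}
	r.Finalizers = kept
	return nil
}

func main() {
	stuck := &resource{Deleting: true, Finalizers: []string{finalizerName}}
	err := finalize(stuck, func() error { return errUnreachable }, isUnreachable)
	fmt.Println(err, len(stuck.Finalizers))

	retried := &resource{Deleting: true, Finalizers: []string{finalizerName}}
	err = finalize(retried, func() error { return errTransient }, isUnreachable)
	fmt.Println(err, len(retried.Finalizers))
}
```

Passing the classifier in as a function keeps the decision about which errors are safe to ignore separate from the finalizer bookkeeping, which is the part the review discussion is mostly about.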

Also adds operator.redpanda.com/allow-deletion annotation as an escape hatch to force deletion when other connectivity issues exist.

Unit tests added for isNetworkDialError() covering various error wrapping scenarios.
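
For readers unfamiliar with the pattern, a helper like isNetworkDialError() might look roughly like the sketch below, assuming it only needs to find a *net.OpError anywhere in the chain using the standard library's errors.As. The `dialError` and `wrap` helpers and `errOther` are hypothetical and exist only to reproduce the direct, single-wrapped and double-wrapped cases; the merged implementation may differ.

```go
package main

import (
	"errors"
	"fmt"
	"net"
)

// errOther is a hypothetical error that is unrelated to networking.
var errOther = errors.New("topic not found")

// isNetworkDialError reports whether a *net.OpError appears anywhere in
// err's chain. This is a sketch of the helper named in the description.
func isNetworkDialError(err error) bool {
	var opErr *net.OpError
	return errors.As(err, &opErr)
}

// dialError builds a sample *net.OpError similar to what a refused TCP
// connection produces.
func dialError() error {
	return &net.OpError{Op: "dial", Net: "tcp", Err: errors.New("connection refused")}
}

// wrap adds n layers of fmt.Errorf wrapping with %w, which keeps the
// original error reachable through errors.As.
func wrap(err error, n int) error {
	for i := 0; i < n; i++ {
		err = fmt.Errorf("layer %d: %w", i, err)
	}
	return err
}

func main() {
	for n := 0; n <= 2; n++ {
		fmt.Printf("wrapped %d times: %v\n", n, isNetworkDialError(wrap(dialError(), n)))
	}
	fmt.Printf("unrelated error: %v\n", isNetworkDialError(wrap(errOther, 1)))
	fmt.Printf("nil error: %v\n", isNetworkDialError(nil))
}
```

Note that errors.As only follows wrappers that expose Unwrap, so an error formatted with %v or %s instead of %w would not be detected by this sketch.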

References

Tested on GKE clusters - topics with unreachable brokers now delete successfully.

@birdayz birdayz marked this pull request as draft January 22, 2026 09:14
@birdayz birdayz force-pushed the jb/fix-topic-finalizer-on-missing-credentials branch from 350f9cf to f2ee4bd Compare January 22, 2026 09:39
@birdayz birdayz changed the title from "topic: allow finalizer removal when credentials unavailable" to "topic: add annotation to allow deletion with missing credentials" Jan 22, 2026
@birdayz birdayz marked this pull request as ready for review January 22, 2026 11:00
// We check both ignoreAllConnectionErrors (Kafka protocol/config errors)
// and K8s NotFound (missing secrets/configmaps) since the latter isn't
// covered by the general helper.
if !topic.ObjectMeta.DeletionTimestamp.IsZero() {
Contributor

@birdayz just wondering, is there a reason not to always do this (rather than gating it on an annotation and making it separate from ignoreAllConnectionErrors)?

I know that we could potentially leave orphaned topics if, say, a user deletes a secret that the client resolution code depends on while still keeping the cluster around. But since we're broken/wedged until they recreate the secret and we can connect to the cluster and clean up, maybe always skipping the cleanup phase when we can't establish a connection makes sense?

Looks like the addition of ignoreAllConnectionErrors also fixes the case where the cluster ref that the topic points to is invalid, which it seems we weren't handling previously.

Just wondering if we should just add an apierrors.IsNotFound(err) check to ignoreAllConnectionErrors, especially given its comment about usage here:

// If we have known errors where we're unable to actually establish
// a connection to the cluster due to say, invalid connection parameters
// we're going to just skip the cleanup phase since we likely won't be
// able to clean ourselves up anyway.

Contributor

Also, one more thing, given the use of external secrets in cloud, we may want to also do some error typing when attempting to expand those for establishing connections here:

if v.ExternalSecretRefSelector != nil {
	if expander == nil {
		return "", errors.New("attempted to expand an external secret without enabling external secrets in the operator")
	}
	return expander.Expand(ctx, v.ExternalSecretRefSelector.Name)
}

Specifically for when this is hit (basically the 404 for cloud secrets):

return "", errors.Newf("secret %s not found", name)

Contributor Author

You're right, the annotation was overly conservative. Updated to add isNotFoundInChain to ignoreAllConnectionErrors which handles both K8s NotFound errors (missing secrets/configmaps) and cloud secret not found. This applies to all resource controllers during deletion, not just topics.

Contributor Author

Added ErrSecretNotFound sentinel error to pkg/secrets/secrets.go and included it in the isNotFoundInChain check. Now both K8s NotFound and cloud secret not found errors are handled.
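
A rough, stdlib-only sketch of the sentinel approach, for illustration: the `expand` function and its map-backed store are hypothetical, the sentinel's message text is assumed, and the real isNotFoundInChain is described as also covering K8s NotFound errors, which are left out here to avoid depending on apimachinery.

```go
package main

import (
	"errors"
	"fmt"
)

// ErrSecretNotFound stands in for the sentinel added to pkg/secrets.
// The message text is an assumption.
var ErrSecretNotFound = errors.New("secret not found")

// expand is a hypothetical secret lookup. It wraps the sentinel with %w
// so callers can match it with errors.Is while still seeing the name.
func expand(store map[string]string, name string) (string, error) {
	value, ok := store[name]
	if !ok {
		return "", fmt.Errorf("expanding %q: %w", name, ErrSecretNotFound)
	}
	return value, nil
}

// isNotFoundInChain reports whether the sentinel appears in err's chain.
// Only the cloud-secret case is sketched; the K8s NotFound check is omitted.
func isNotFoundInChain(err error) bool {
	return errors.Is(err, ErrSecretNotFound)
}

func main() {
	store := map[string]string{"sasl-password": "hunter2"}

	_, err := expand(store, "missing-secret")
	fmt.Println(err, isNotFoundInChain(err))

	value, err := expand(store, "sasl-password")
	fmt.Println(value, err, isNotFoundInChain(err))
}
```

Compared with matching on the formatted message ("secret %s not found"), a sentinel keeps the check stable if the wording of the error changes.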

@birdayz birdayz changed the title from "topic: add annotation to allow deletion with missing credentials" to "resource: ignore NotFound errors during finalizer cleanup" Jan 30, 2026

birdayz commented Jan 30, 2026

@andrewstucki can you please re-review? I've tested by hand on a cloud cluster affected by the bug, and it works. We need to get this into cloud by mid next week at the latest. Thank you!

birdayz commented Jan 30, 2026

@andrewstucki Extended this PR to handle network dial errors (connection refused, DNS failures) during topic deletion. Tested on GKE cluster - topics with unreachable brokers now successfully get their finalizers removed instead of hanging in Terminating state.

The fix checks for:

  • net.OpError in error chain (via stdlib errors.As)

The string-based fallback is needed because franz-go wraps dial errors with fmt.Errorf and cockroachdb/errors.As couldn't traverse that chain properly.

@birdayz birdayz force-pushed the jb/fix-topic-finalizer-on-missing-credentials branch from 3855f06 to d432940 Compare January 30, 2026 12:32
@birdayz birdayz changed the title from "resource: ignore NotFound errors during finalizer cleanup" to "topic: allow finalizer removal when broker is unreachable" Jan 30, 2026
Contributor

@andrewstucki andrewstucki left a comment

General approach looks good to me, and our other resources (i.e. ShadowLinks, Users, etc.) will also benefit from this. I am slightly torn on how many errors we generally want to ignore, but I think this is OK.

EDIT:

Looks like you need to fix a linter/formatting issue in one of the new tests.

Topics were getting stuck in Terminating state when the Kafka broker
was unreachable (connection refused, DNS failure, etc.) or when
credentials were missing. The controller kept retrying forever,
blocking namespace deletion.

Fix by detecting non-recoverable connection errors during topic deletion
and allowing the finalizer to be removed. This handles:

- Missing credentials Secret or cloud secret (NotFound errors)
- Network dial errors (net.OpError - connection refused, DNS failures)
- Terminal client errors (SASL auth failures)
- Invalid cluster reference errors

Add isNetworkDialError() helper using errors.As to find net.OpError
in the error chain. Add unit tests covering various error wrapping
scenarios (direct, single-wrapped, double-wrapped).

Also adds operator.redpanda.com/allow-deletion annotation to force
topic deletion even when broker connectivity issues exist.

github-actions bot commented Feb 3, 2026

💚 All backports created successfully

Status Branch Result
release/v25.1.x
release/v25.2.x
release/v25.3.x

Note: Successful backport PRs will be merged automatically after passing CI.

Questions ?

Please refer to the Backport tool documentation and see the Github Action logs for details

@redpanda-data redpanda-data deleted a comment from github-actions bot Feb 3, 2026
@redpanda-data redpanda-data deleted a comment from github-actions bot Feb 3, 2026