topic: allow finalizer removal when broker is unreachable#1234
Conversation
350f9cf to
f2ee4bd
Compare
| // We check both ignoreAllConnectionErrors (Kafka protocol/config errors) | ||
| // and K8s NotFound (missing secrets/configmaps) since the latter isn't | ||
| // covered by the general helper. | ||
| if !topic.ObjectMeta.DeletionTimestamp.IsZero() { |
There was a problem hiding this comment.
@birdayz just wondering, is there a reason not to always do this (rather than gating it on an annotation and making it separate from ignoreAllConnectionErrors)?
I know that we could potentially leave orphaned topics if say a user deletes some secret that the client resolution code is depending on while still keeping the cluster around, but seeing as we're broken/wedged until they would recreate the secret and connect to the cluster/cleanup, it seems like maybe always skipping the cleanup phase when you can't establish a connection makes sense?
Looks like the addition of ignoreAllConnectionErrors also fixes the case where a cluster ref that the topic points to is invalid, which seems we weren't handling previously.
Just wondering if we should just add an apierrors.IsNotFound(err) to ignoreAllConnectionErrors, especially given it's comment about usage here:
There was a problem hiding this comment.
Also, one more thing, given the use of external secrets in cloud, we may want to also do some error typing when attempting to expand those for establishing connections here:
redpanda-operator/pkg/ir/staticconfig.go
Lines 312 to 317 in e5dd284
Specifically for if this is hit (basically the 404 of cloud secrets):
redpanda-operator/pkg/secrets/secrets.go
Line 100 in e5dd284
There was a problem hiding this comment.
You're right, the annotation was overly conservative. Updated to add isNotFoundInChain to ignoreAllConnectionErrors which handles both K8s NotFound errors (missing secrets/configmaps) and cloud secret not found. This applies to all resource controllers during deletion, not just topics.
There was a problem hiding this comment.
Added ErrSecretNotFound sentinel error to pkg/secrets/secrets.go and included it in the isNotFoundInChain check. Now both K8s NotFound and cloud secret not found errors are handled.
|
@andrewstucki can you please re-review? i've tested by hand on a cloud cluster affected by the bug, and it works. need to get this into cloud until mid next week latest. thank you! |
|
@andrewstucki Extended this PR to handle network dial errors (connection refused, DNS failures) during topic deletion. Tested on GKE cluster - topics with unreachable brokers now successfully get their finalizers removed instead of hanging in Terminating state. The fix checks for:
The string-based fallback is needed because franz-go wraps dial errors with fmt.Errorf and cockroachdb/errors.As couldn't traverse that chain properly. |
3855f06 to
d432940
Compare
There was a problem hiding this comment.
General approach looks good to me and our other resources (i.e. ShadowLinks, Users, etc.) will also benefit from this. Though I am slightly torn on how many errors we generally want to ignore, but think this is ok.
EDIT:
Looks like you need to fix a linter/formatting issue in one of the new tests.
Topics were getting stuck in Terminating state when the Kafka broker was unreachable (connection refused, DNS failure, etc.) or when credentials were missing. The controller kept retrying forever, blocking namespace deletion. Fix by detecting non-recoverable connection errors during topic deletion and allowing the finalizer to be removed. This handles: - Missing credentials Secret or cloud secret (NotFound errors) - Network dial errors (net.OpError - connection refused, DNS failures) - Terminal client errors (SASL auth failures) - Invalid cluster reference errors Add isNetworkDialError() helper using errors.As to find net.OpError in the error chain. Add unit tests covering various error wrapping scenarios (direct, single-wrapped, double-wrapped). Also adds operator.redpanda.com/allow-deletion annotation to force topic deletion even when broker connectivity issues exist.
d432940 to
0ec52fc
Compare
💚 All backports created successfully
Note: Successful backport PRs will be merged automatically after passing CI. Questions ?Please refer to the Backport tool documentation and see the Github Action logs for details |
What
Allow Topic CR finalizer removal when the Kafka broker is unreachable or credentials are missing during deletion.
Why
Topics were getting stuck in Terminating state indefinitely when:
This blocks namespace deletion and requires manual intervention to remove finalizers.
Implementation details
Extend
ignoreAllConnectionErrors()to detect network dial errors vianet.OpErrorin the error chain. When deletion fails due to non-recoverable connection errors, allow the finalizer to be removed instead of retrying forever.Error types now handled during deletion:
net.OpError- connection refused, DNS failures, timeoutssecrets.ErrSecretNotFound- cloud secret missingAlso adds
operator.redpanda.com/allow-deletionannotation as an escape hatch to force deletion when other connectivity issues exist.Unit tests added for
isNetworkDialError()covering various error wrapping scenarios.References
Tested on GKE clusters - topics with unreachable brokers now delete successfully.