@BobVanB commented Feb 20, 2025

Fix: Expose Soft Errors in Infoblox Webhook to Prevent PTR Lookup Issues

Issue: PTR Record Lookup Fails After Node Replacement

Description of the Problem

We encountered an issue where external-dns fails to handle PTR records correctly when nodes are replaced. The problem occurs in the following sequence:

  1. A node with fqdn1 and IP1 is added.
  2. external-dns creates the corresponding records in Infoblox.
  3. The node with fqdn1 and IP1 is removed.
  4. A new node with the same fqdn1, but IP2, is added.
  5. external-dns still sees fqdn1, but the PTR record is missing for IP2 in Infoblox.

If steps 3 and 4 both happen between two consecutive external-dns runs, the process breaks at step 5, and external-dns does not recover on later runs.

Root Cause

  • The Infoblox webhook first attempts an update before performing a delete.
  • If an error occurs during the update, the delete operation is never executed.
  • external-dns reports a soft error, but the Infoblox webhook silently ignores it.
  • This results in an incomplete PTR record state that prevents external-dns from proceeding correctly.
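
The ordering problem described above can be illustrated with a minimal, self-contained sketch. It is not the webhook's actual code: the `change` type and the `run` and `applyUpdateThenDelete` functions are hypothetical names, and the failing update is simulated. The sketch only shows how returning early on an update error leaves the delete step unexecuted.

```go
package main

import "fmt"

// change is a hypothetical stand-in for one record operation.
type change struct {
	name string
	fail bool // simulate a backend error for this operation
}

// run pretends to send one operation to the DNS backend and records
// it in log when it succeeds.
func run(kind string, c change, log *[]string) error {
	if c.fail {
		return fmt.Errorf("%s %s: backend rejected the request", kind, c.name)
	}
	*log = append(*log, kind+" "+c.name)
	return nil
}

// applyUpdateThenDelete mirrors the ordering described above: updates run
// first, and the first update error returns early, so no delete is attempted.
func applyUpdateThenDelete(updates, deletes []change) ([]string, error) {
	var done []string
	for _, u := range updates {
		if err := run("update", u, &done); err != nil {
			return done, err // the deletes below are skipped
		}
	}
	for _, d := range deletes {
		if err := run("delete", d, &done); err != nil {
			return done, err
		}
	}
	return done, nil
}

func main() {
	// Assumed scenario: the update for the new IP fails, so the stale
	// record for the old IP is never removed.
	updates := []change{{name: "PTR for IP2", fail: true}}
	deletes := []change{{name: "PTR for IP1"}}

	done, err := applyUpdateThenDelete(updates, deletes)
	fmt.Println("completed operations:", done)
	fmt.Println("error:", err)
}
```

In this sketch the list of completed operations stays empty whenever the first update fails, which matches the incomplete state described in the last bullet.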

Proposed Solution

This pull request modifies the Infoblox webhook to expose soft errors instead of ignoring them. By making these errors visible, administrators can more easily trace issues in Infoblox and manually correct them if needed.

Impact

  • Prevents silent failures in the Infoblox webhook.
  • Improves debugging when external-dns encounters missing PTR records.
  • Ensures that errors during the update phase do not block subsequent delete operations.

Workaround for Rancher

In Rancher-managed clusters, nodes may be created dynamically when a resource pool lacks sufficient workers. Rancher assigns a sequential number to new nodes, which increases the likelihood of reusing the same FQDN. To mitigate this issue, consider the following workarounds:

  1. Adjust the external-dns interval
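    Ensure that the external-dns sync interval (the --interval flag) is shorter than the time Rancher needs to create a new worker node, so that a run observes the removal of the old node before its replacement appears.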
  2. Use a new resource pool with different FQDNs
    Instead of reusing FQDNs, create a new resource pool and use a scale-down and add-worker strategy to ensure that each node receives a unique FQDN.

@BobVanB force-pushed the better_error_handling branch from db0fbfd to 5898b52 on February 21, 2025 10:14
@kuritka merged commit 323c425 into AbsaOSS:main on Apr 1, 2025
4 checks passed