fix: show errors in webhook #34

BobVanB · 2025-02-20T06:11:33Z

Fix: Expose Soft Errors in Infoblox Webhook to Prevent PTR Lookup Issues

Issue: PTR Record Lookup Fails After Node Replacement

Description of the Problem

We encountered an issue where external-dns fails to handle PTR records correctly when nodes are replaced. The problem occurs in the following sequence:

1. A node with fqdn1 and IP1 is added.
2. external-dns creates the corresponding records in Infoblox.
3. The node with fqdn1 and IP1 is removed.
4. A new node with the same fqdn1, but IP2, is added.
5. external-dns still sees fqdn1, but the PTR record is missing for IP2 in Infoblox.

If steps 3 and 4 happen between two external-dns runs, the process breaks at step 5. From that point on, external-dns is unable to recover properly.
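
The mismatch left behind by this sequence can be illustrated with a small sketch. It is not code from the webhook: the `Record` type and the `missingRecords` helper are hypothetical, and the record contents are taken only from the fqdn1/IP1/IP2 placeholders used above. It assumes the records created for the first node are an A record and its matching PTR record.

```go
package main

import "fmt"

// Record is a hypothetical, simplified stand-in for a DNS record in Infoblox.
type Record struct {
	Type   string // "A" or "PTR"
	Name   string // FQDN for an A record, IP for a PTR record
	Target string
}

// missingRecords returns every desired record that is absent from current.
func missingRecords(current, desired []Record) []Record {
	have := make(map[Record]bool, len(current))
	for _, r := range current {
		have[r] = true
	}
	var missing []Record
	for _, r := range desired {
		if !have[r] {
			missing = append(missing, r)
		}
	}
	return missing
}

func main() {
	// Assumed state in Infoblox after the first node was registered.
	current := []Record{
		{"A", "fqdn1", "IP1"},
		{"PTR", "IP1", "fqdn1"},
	}
	// Assumed state wanted once the replacement node exists.
	desired := []Record{
		{"A", "fqdn1", "IP2"},
		{"PTR", "IP2", "fqdn1"},
	}
	for _, r := range missingRecords(current, desired) {
		fmt.Printf("missing %s %s -> %s\n", r.Type, r.Name, r.Target)
	}
}
```

The name fqdn1 is present in both states, which is why the change looks like an update of an existing name rather than a fresh create, while the PTR record for IP2 has never existed.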

Root Cause

The Infoblox webhook first attempts an update before performing a delete.
If an error occurs during the update, the delete operation is never executed.
external-dns reports a soft error, but the Infoblox webhook silently ignores it.
This results in an incomplete PTR record state that prevents external-dns from proceeding correctly.

Proposed Solution

This pull request modifies the Infoblox webhook to expose soft errors instead of ignoring them. By making these errors visible, administrators can more easily trace issues in Infoblox and manually correct them if needed.

Impact

Prevents silent failures in the Infoblox webhook.
Improves debugging when external-dns encounters missing PTR records.
Ensures that errors during the update phase do not block subsequent delete operations.

Workaround for Rancher

In Rancher-managed clusters, nodes may be created dynamically when a resource pool lacks sufficient workers. Rancher assigns a sequential number to new nodes, which increases the likelihood of reusing the same FQDN. To mitigate this issue, consider the following workarounds:

1. Adjust the external-dns interval
   Ensure that the external-dns update interval is shorter than the time Rancher needs to create a new worker node.
2. Use a new resource pool with different FQDNs
   Instead of reusing FQDNs, create a new resource pool and use a scale-down and add-worker strategy to ensure that each node receives a unique FQDN.

BobVanB requested review from Buzzglo, TebogoTS, ampie, januarios, k0da and kuritka as code owners February 20, 2025 06:11

BobVanB force-pushed the better_error_handling branch from 6cb52df to db0fbfd Compare February 20, 2025 07:57

fix: show errors in webhook

5898b52

BobVanB force-pushed the better_error_handling branch from db0fbfd to 5898b52 Compare February 21, 2025 10:14

kuritka approved these changes Feb 21, 2025

kuritka merged commit 323c425 into AbsaOSS:main Apr 1, 2025
4 checks passed
