Replace bind with knot-resolver as the recursive resolver on the routers #1846

Draft
sysvinit wants to merge 13 commits into fc-25.05-dev from PL-134065-replace-bind-kresd

sysvinit commented Oct 1, 2025

PL-134065

@flyingcircusio/release-managers

Release process

  • Created changelog entry using ./changelog.sh

PR release workflow (internal)

  • PR has internal ticket
  • internal issue ID (PL-…) part of branch name
  • internal issue ID mentioned in PR description text
  • ticket is on Platform agile board
  • ticket state set to Pull request ready
  • if ticket is more urgent than within the next few days, directly contact a member of the Platform team
  • set urgency and risk labels
  • ensure the merge bot has determined a merge date
  • ensure all checks are green
  • get a review from a colleague

Design notes

  • Provide a feature toggle if the change might need to be adjusted/reverted quickly depending on context. Consider whether the default should be on or off. Example: rate limiting.
  • All customer-facing features and (NixOS) options need to be discoverable from documentation. Add or update relevant documentation such that hosted and guided customers can understand it as well.

Security implications

sysvinit marked this pull request as draft October 1, 2025 17:24
"ndots:1"
"timeout:1"
"attempts:6"
"timeout:5" # in correspondence with rfc8767
Member

Ah, I remember: this isn't quite in correspondence. RFC 8767 suggests 2 as a common option, but we can't use 2 because kresd isn't able to trigger its stale-serving timeout within 1.8 seconds due to its timer granularity.

Member Author

Fixed in b4a1705.

view:addr('127.0.0.0/8', policy.all(policy.PASS))
view:addr('::1/128', policy.all(policy.PASS))

${lib.concatMapStringsSep "\n" (
Member

I was wondering whether this will cause unnecessary hard reloads. If it does, I was considering putting this into the firewalling code instead.

Member Author

Well, the short answer is yes, this will cause a restart (not a reload) if the list changes. It's possible that kresd might restart fast enough for this to not be a problem, though this is indeed something which has bitten us with bind in the past.

As for the firewalling... if we put the list of allowed IP ranges in the firewall, then changing the allowed IPs will cause a firewall reload, which on the routers usually causes the BGP sessions to go down and cause a failover due to the BFD getting interrupted. I presume by "put this into firewalling" you mean restrict access to DNS to only "downstream" interfaces, and e.g. dropping incoming DNS from the transfer/uplink interfaces?

Member

multi instance (4?) + execstop with jittered pause?

-- ensure that the hosts file can be reloaded by sighup at runtime.

local function load_private_hosts()
hints.add_hosts('/etc/nixos/rfc1918-hosts')
Member

does this remove outdated hosts?

Member Author

Nope; fixed in 8b08068.
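
One way to drop outdated entries on reload, sketched under the assumption that the RFC1918 hosts file is the only source of static hints on this resolver: `hints.config(path)` clears the configured hints before loading the file, whereas `hints.add_hosts(path)` only adds to what is already present. This is an illustration and not necessarily what 8b08068 does.

```lua
-- Sketch only; assumes the hints module is already loaded and that no
-- other static hints (e.g. from /etc/hosts) need to survive a reload.
local function load_private_hosts()
    -- hints.config(path) first clears every configured hint (root hints
    -- are left alone) and then loads the hosts-like file, so names that
    -- were removed from the file stop resolving after the reload.
    hints.config('/etc/nixos/rfc1918-hosts')
end

load_private_hosts()
```

If other hint sources are in use, they would have to be re-added after the call, since the clear step is not limited to entries that came from this file.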

-- long enough to trigger the stale serving behaviour. we match the
-- rfc for the stale serving timeout though so we get the benefits
-- if the granularity changes in the future.
serve_stale.timeout = 1800
Member

reminder to me: set this to 3 s
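
A sketch of what that change could look like, assuming `serve_stale.timeout` is interpreted in milliseconds (as the current value of 1800 for 1.8 s suggests) and using kresd's predefined `sec` constant:

```lua
-- Assumption: the timeout is in milliseconds; `sec` is kresd's
-- predefined constant for one second expressed in that unit.
serve_stale.timeout = 3 * sec
```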

sysvinit and others added 7 commits October 10, 2025 17:07
Add helper script which generates a hosts(5) file for assignments for
private RFC1918 addresses.

PL-134065
Add Cloudflare and Google to the list in addition to Quad9.

PL-134065

Co-authored-by: Christian Theune <ct@flyingcircus.io>
PL-134065

Co-authored-by: Christian Theune <ct@flyingcircus.io>
This removes authoritative DNS for gocept.net from the routers and
replaces bind with kresd as the site resolver.

PL-134065
kresd doesn't natively support runtime reloads, however the Lua
scripting is flexible enough to permit implementing this ourselves.

PL-134065
ctheune force-pushed the PL-134065-replace-bind-kresd branch from b4a1705 to e0786b6 October 10, 2025 15:07