OCPBUGS-66104: Fine tune CoreDNS pod configuration to improve performance by sadasu · Pull Request #5695 · openshift/machine-config-operator

sadasu · 2026-02-24T22:56:22Z

When userProvisionedDNS was enabled, we found that after a successful cluster install we were seeing several i/o timeouts on Day-2, specifically for UDP requests. After experimenting with the forward plugin's options prefer_udp, max_concurrent, and force_tcp, we settled on force_tcp.

Also removed the explicit bufsize setting so that it takes its default value of 1232 bytes, which no longer limits packet sizes to 512 bytes.

- What I did

- How to verify it

- Description for the changelog

openshift-ci-robot · 2026-02-24T22:56:28Z

@sadasu: This pull request references Jira Issue OCPBUGS-66104, which is valid.

3 validation(s) were run on this bug

bug is open, matching expected state (open)
bug target version (4.22.0) matches configured target version for branch (4.22.0)
bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, POST)

Requesting review from QA contact:
/cc @gpei

The bug has been updated to refer to the pull request using the external bug tracker.

Details

In response to this:

Make updates to the Cloud platform CoreDNS Corefile to increase the buffer size from 512 to the default of 1232.
And prefer UDP for contacting the platform's upstream DNS servers to reduce load on them.

- What I did

- How to verify it

- Description for the changelog

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

gpei · 2026-02-25T02:21:48Z

/testwith openshift/installer/master/e2e-aws-custom-dns-techpreview

sadasu · 2026-02-25T18:35:38Z

/retest-required

sadasu · 2026-02-26T16:32:41Z

/testwith openshift/installer/master/e2e-aws-custom-dns-techpreview

sadasu · 2026-02-26T20:37:18Z

/testwith e2e-aws-custom-dns-techpreview

gpei · 2026-02-27T00:40:26Z

@sadasu It seems the /testwith <installer_pre_submit_job> command can't work without an installer PR (https://docs.ci.openshift.org/how-tos/multi-pr-presubmit-testing/#testwith-command): the specified test must be defined in a repo to which one of the included PRs belongs.
So I'm running the aws/azure custom-dns e2e job tests in openshift/release#73998 separately, but both installations failed. I will continue investigating today.

coderabbitai · 2026-03-10T14:13:28Z

Important

Review skipped

Auto reviews are limited based on label configuration.

🚫 Review skipped — only excluded labels are configured. (1)

do-not-merge/work-in-progress

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: fb53b760-5f77-4f0c-8afb-17c2f9f23135

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

@coderabbitai resume to resume automatic reviews.
@coderabbitai review to trigger a single review.

Walkthrough

CoreDNS Corefile updated: removed the bufsize 512 directive and added force_tcp to the forward stanza; no other structural changes.

Changes

| Cohort / File(s) | Summary |
|---|---|
| CoreDNS Configuration `templates/common/cloud-platform-alt-dns/files/coredns-corefile.yaml` | Removed `bufsize 512`; added `force_tcp` to the `forward . { ... }` stanza (changed upstream transport behavior). |

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

🚥 Pre-merge checks | ✅ 5

✅ Passed checks (5 passed)

| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title references a Jira bug (OCPBUGS-66104) and mentions CoreDNS pod performance tuning, which aligns with the actual changes to the Corefile configuration for improving DNS performance. |
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check. |
| Stable And Deterministic Test Names | ✅ Passed | PR modifies only CoreDNS YAML configuration files with no Ginkgo test declarations added or changed. |
| Test Structure And Quality | ✅ Passed | The custom check for Ginkgo test code quality is not applicable to this PR. The PR exclusively modifies a CoreDNS configuration file and contains no test code. |

openshift-ci-robot · 2026-03-10T14:14:05Z

@sadasu: This pull request references Jira Issue OCPBUGS-66104, which is valid.

3 validation(s) were run on this bug

bug is open, matching expected state (open)
bug target version (4.22.0) matches configured target version for branch (4.22.0)
bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, POST)

No GitHub users were found matching the public email listed for the QA contact in Jira (gpei@redhat.com), skipping review request.

Details

In response to this:

Make updates to the Cloud platform CoreDNS Corefile to increase the buffer size from 512 to the default of 1232 (by removing the configuration so that the default value can take effect).
And prefer UDP for contacting the platform's upstream DNS servers to reduce load on them.

- What I did

- How to verify it

- Description for the changelog

Summary by CodeRabbit

Release Notes

Chores

Modified DNS configuration to enforce TCP protocol for upstream DNS forwarding

Updated DNS server list to include an additional resolver for improved redundancy

coderabbitai

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@templates/common/cloud-platform-alt-dns/files/coredns-corefile.yaml`:
- Around line 8-10: In the CoreDNS Corefile inside the forward block (the stanza
starting with "forward . {{`{{- range $upstream := .DNSUpstreams}}
{{$upstream}}{{- end}}`}} {"), remove the force_tcp directive and either replace
it with prefer_udp or omit the line entirely so upstream queries will prefer UDP
and only fall back to TCP on truncation; update the forward block accordingly to
use prefer_udp if you want explicit behavior.

In
`@templates/common/cloud-platform-alt-dns/files/usr-local-bin-update-dns-server.yaml`:
- Line 17: The script hard-codes a public DNS (8.8.8.8) in the servers variable
assignment (servers=$(ip --json route get 8.8.8.8 | jq -r
".[0].prefsrc"),$1,8.8.8.8) which can break private/disconnected installs and
leak queries; change the logic in the servers assignment to stop appending
8.8.8.8 and instead use a configurable fallback or none: read a fallback from an
environment/config variable (e.g., FALLBACK_DNS or platform-provided upstreams)
and only append it when set and allowed, or simply build servers from the local
preferred source and $1 without the hard-coded public resolver.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: eea4363e-a62c-4f35-8d64-147cc309dd13

📥 Commits

Reviewing files that changed from the base of the PR and between 7a8a698 and 7281d8e.

📒 Files selected for processing (2)

templates/common/cloud-platform-alt-dns/files/coredns-corefile.yaml
templates/common/cloud-platform-alt-dns/files/usr-local-bin-update-dns-server.yaml

sadasu · 2026-03-16T15:18:29Z

/retest-required

jinyunma · 2026-03-17T02:35:55Z

With this change, openshift-e2e test cases running on the Azure and AWS custom-dns jobs look better now, and there is no impact on GCP.

AWS: https://qe-private-deck-ci.apps.ci.l2s4.p1.openshiftapps.com/view/gs/qe-private-deck/pr-logs/pull/openshift_release/66584/rehearse-66584-periodic-ci-openshift-verification-tests-main-installer-rehearse-4.22-installer-rehearse-aws/2031914494308913152

Azure: https://qe-private-deck-ci.apps.ci.l2s4.p1.openshiftapps.com/view/gs/qe-private-deck/pr-logs/pull/openshift_release/66584/rehearse-66584-periodic-ci-openshift-verification-tests-main-installer-rehearse-4.22-installer-rehearse-azure/2031914494376022016

GCP: https://qe-private-deck-ci.apps.ci.l2s4.p1.openshiftapps.com/view/gs/qe-private-deck/pr-logs/pull/openshift_release/66584/rehearse-66584-periodic-ci-openshift-verification-tests-main-installer-rehearse-4.22-installer-rehearse-gcp/2032010953523990528

/verified by jima

openshift-ci-robot · 2026-03-17T02:36:07Z

@jinyunma: This PR has been marked as verified by jima.

patrickdillon · 2026-03-17T17:24:31Z

templates/common/cloud-platform-alt-dns/files/coredns-corefile.yaml

        health :18080
        forward . {{`{{- range $upstream := .DNSUpstreams}} {{$upstream}}{{- end}}`}} {
            policy sequential
+            force_tcp


@sadasu Your PR description says "prefer udp", is that out of date and you ended up moving to force_tcp? Or should this be prefer_udp?

prefer_udp makes more sense to me (and Gemini mentioned it is recommended on AWS EKS!), but I haven't been following this PR closely.

sadasu · 2026-03-19

Yes, prefer_udp was one of the first options we tried because it is the preferred solution under high load. Testing revealed that the upstream servers reduced their load by not responding to UDP requests.
We also tried setting max_concurrent to a value below 1000, which is the default, as a way to work around i/o timeouts caused by UDP port exhaustion. However, our testing showed that UDP port exhaustion was not the cause of the i/o timeouts.
force_tcp gave us the best results.
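
For readers comparing the options discussed in this thread, the following is a minimal sketch of how the relevant part of the Corefile might look after the change. It is not the actual template: the upstream addresses are hypothetical placeholders (the real file fills them in from .DNSUpstreams), the max_concurrent value is purely illustrative, and directives not quoted in the diff above are omitted rather than guessed.

```text
. {
    # "bufsize 512" is no longer set here; per the PR description,
    # the default of 1232 bytes is meant to take effect instead.
    health :18080
    # 10.0.0.2 and 10.0.0.3 are hypothetical upstream addresses.
    forward . 10.0.0.2 10.0.0.3 {
        policy sequential
        # Option kept by this PR: always use TCP toward the upstreams.
        force_tcp
        # Alternatives that were tried and not kept (shown commented out):
        # prefer_udp
        # max_concurrent 500
    }
}
```

In the forward plugin, prefer_udp tries UDP first and retries over TCP only when the response is truncated, whereas force_tcp uses TCP for every upstream query. If both are present, force_tcp takes precedence, so only one of them is shown active here.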

openshift-ci-robot · 2026-03-19T14:18:32Z

@sadasu: This pull request references Jira Issue OCPBUGS-66104, which is valid.

3 validation(s) were run on this bug

bug is open, matching expected state (open)
bug target version (4.22.0) matches configured target version for branch (4.22.0)
bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, POST)

No GitHub users were found matching the public email listed for the QA contact in Jira (gpei@redhat.com), skipping review request.

Details

In response to this:

When userProvisionedDNS was enabled, we found that after a successful cluster install we were seeing several i/o timeouts on Day-2, specifically for UDP requests. After experimenting with bufsize and the forward plugin's options prefer_udp, max_concurrent, and force_tcp, we settled on force_tcp.

- What I did

- How to verify it

- Description for the changelog

Summary by CodeRabbit

Configuration Updates

Removed a restrictive DNS UDP buffer setting to allow improved handling of larger responses.

Enabled forced TCP for DNS forwarding to improve reliability when responses are truncated.

patrickdillon · 2026-03-19T14:19:49Z

/lgtm

Update the `forward` plugin in the Cloud platform CoreDNS Corefile to use `force_tcp` when making DNS requests. This was found to reduce the i/o timeouts experienced by UDP DNS requests to the cloud upstream servers. In addition, change `bufsize` from 512 to the default of 1232, allowing packet sizes larger than 512 bytes.

sadasu · 2026-03-19T14:31:18Z

/verified by jima

Lost the label when I updated the commit message to more accurately represent the code changes. Please see original comment #5695 (comment)

openshift-ci-robot · 2026-03-19T14:31:32Z

@sadasu: This PR has been marked as verified by jima.

isabella-janssen

/lgtm

openshift-ci · 2026-03-19T15:10:39Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: isabella-janssen, patrickdillon, sadasu

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details

Needs approval from an approver in each of these files:

~~OWNERS~~ [isabella-janssen]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

sadasu · 2026-03-19T20:33:44Z

/retest-required

sadasu · 2026-03-20T04:33:42Z

/retest-required

openshift-ci · 2026-03-20T07:30:28Z

@sadasu: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
|---|---|---|---|---|
| ci/prow/e2e-gcp-op-ocl | `76df85f` | link | false | `/test e2e-gcp-op-ocl` |

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

openshift-ci-robot · 2026-03-20T07:33:11Z

@sadasu: Jira Issue Verification Checks: Jira Issue OCPBUGS-66104
✔️ This pull request was pre-merge verified.
✔️ All associated pull requests have merged.
✔️ All associated, merged pull requests were pre-merge verified.

Jira Issue OCPBUGS-66104 has been moved to the MODIFIED state and will move to the VERIFIED state when the change is available in an accepted nightly payload. 🕓

openshift-ci-robot added jira/severity-critical Referenced Jira bug's severity is critical for the branch this PR is targeting. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. labels Feb 24, 2026

openshift-ci-robot added the jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. label Feb 24, 2026

openshift-ci bot requested review from cheesesashimi, gpei and isabella-janssen February 24, 2026 22:56

sadasu force-pushed the cloud-custom-dns branch from 19dccd4 to 6463d12 Compare February 24, 2026 22:57

sadasu mentioned this pull request Feb 27, 2026

DNM: Test only- To test different solutions for Azure, AWS custom-dns performance errors openshift/installer#10342

Open

sadasu force-pushed the cloud-custom-dns branch 2 times, most recently from 80cbba4 to efcf510 Compare March 3, 2026 15:07

sadasu force-pushed the cloud-custom-dns branch from efcf510 to 7281d8e Compare March 10, 2026 14:13

sadasu force-pushed the cloud-custom-dns branch from 7281d8e to 1d6d745 Compare March 10, 2026 14:14

coderabbitai bot reviewed Mar 10, 2026

templates/common/cloud-platform-alt-dns/files/coredns-corefile.yaml (resolved)

templates/common/cloud-platform-alt-dns/files/usr-local-bin-update-dns-server.yaml (outdated, resolved)

sadasu force-pushed the cloud-custom-dns branch 2 times, most recently from 86e4155 to 8ddb17b Compare March 11, 2026 21:04

openshift-ci-robot added the verified Signifies that the PR passed pre-merge verification criteria label Mar 17, 2026

patrickdillon reviewed Mar 17, 2026

openshift-ci bot assigned patrickdillon Mar 19, 2026

openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Mar 19, 2026

sadasu force-pushed the cloud-custom-dns branch from 8ddb17b to 987b061 Compare March 19, 2026 14:25

openshift-ci-robot removed the verified Signifies that the PR passed pre-merge verification criteria label Mar 19, 2026

openshift-ci bot removed the lgtm Indicates that a PR is ready to be merged. label Mar 19, 2026

sadasu force-pushed the cloud-custom-dns branch from 987b061 to 76df85f Compare March 19, 2026 14:28

openshift-ci-robot added the verified Signifies that the PR passed pre-merge verification criteria label Mar 19, 2026

isabella-janssen reviewed Mar 19, 2026

openshift-ci bot assigned isabella-janssen Mar 19, 2026

openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Mar 19, 2026

openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Mar 19, 2026

openshift-merge-bot bot merged commit 15c41cf into openshift:main Mar 20, 2026
17 of 18 checks passed
