OCPBUGS-66104: Fine tune CoreDNS pod configuration to improve performance#5695

Merged
openshift-merge-bot[bot] merged 1 commit into openshift:main from sadasu:cloud-custom-dns on Mar 20, 2026

Conversation

@sadasu
Contributor

@sadasu sadasu commented Feb 24, 2026

With userProvisionedDNS enabled, we found that after a successful cluster install we were seeing several i/o timeouts on Day-2, specifically for UDP requests. After experimenting with the forward plugin's options prefer_udp, max_concurrent and force_tcp, we settled on force_tcp.

We also allowed bufsize to take its default value of 1232 bytes, so packet sizes are no longer limited to 512 bytes.

- What I did

- How to verify it

- Description for the changelog

@openshift-ci-robot openshift-ci-robot added jira/severity-critical Referenced Jira bug's severity is critical for the branch this PR is targeting. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. labels Feb 24, 2026
@openshift-ci-robot
Contributor

@sadasu: This pull request references Jira Issue OCPBUGS-66104, which is valid.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (4.22.0) matches configured target version for branch (4.22.0)
  • bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, POST)

Requesting review from QA contact:
/cc @gpei

The bug has been updated to refer to the pull request using the external bug tracker.

Details

In response to this:

Make updates to the Cloud platform CoreDNS Corefile to increase the buffer size from 512 to the default of 1232.
And prefer UDP for contacting the platform's upstream DNS servers to reduce load on them.

- What I did

- How to verify it

- Description for the changelog

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci-robot openshift-ci-robot added the jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. label Feb 24, 2026
@gpei

gpei commented Feb 25, 2026

/testwith openshift/installer/master/e2e-aws-custom-dns-techpreview

@sadasu
Contributor Author

sadasu commented Feb 25, 2026

/retest-required

@sadasu
Contributor Author

sadasu commented Feb 26, 2026

/testwith openshift/installer/master/e2e-aws-custom-dns-techpreview

@sadasu
Contributor Author

sadasu commented Feb 26, 2026

/testwith e2e-aws-custom-dns-techpreview

@gpei

gpei commented Feb 27, 2026

@sadasu It seems the /testwith <installer_pre_submit_job> command can't work without an installer PR (https://docs.ci.openshift.org/how-tos/multi-pr-presubmit-testing/#testwith-command): the specified test must be defined in a repo to which one of the included PRs belongs.
So I'm running the aws/azure custom-dns e2e job tests separately in openshift/release#73998, but both installations failed. I will continue investigating today.

@coderabbitai

coderabbitai bot commented Mar 10, 2026

Walkthrough

CoreDNS Corefile updated: removed the bufsize 512 directive and added force_tcp to the forward stanza; no other structural changes.

Changes

Cohort / File(s) | Summary
CoreDNS Configuration: templates/common/cloud-platform-alt-dns/files/coredns-corefile.yaml | Removed bufsize 512; added force_tcp to the forward . { ... } stanza (changed upstream transport behavior).
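
As a rough illustration of the change summarized above, the two versions of the Corefile might look like the sketch below. Only `health :18080`, `policy sequential`, `bufsize 512` and `force_tcp` come from this PR; the `.:53` server block, the position of the `bufsize` line, and the upstream address `192.0.2.53` are assumptions made for the example (the real template fills the upstreams from `.DNSUpstreams`), and any other plugins in the actual file are omitted. The two blocks are alternative versions, not one file.

```
# Before (sketch): EDNS0 UDP buffer size capped at 512 bytes
.:53 {
    bufsize 512
    health :18080
    forward . 192.0.2.53 {
        policy sequential
    }
}

# After (sketch): bufsize 512 removed; upstream queries always go over TCP
.:53 {
    health :18080
    forward . 192.0.2.53 {
        policy sequential
        force_tcp
    }
}
```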

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
Check name | Status | Explanation
Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled.
Title check | ✅ Passed | The title references a Jira bug (OCPBUGS-66104) and mentions CoreDNS pod performance tuning, which aligns with the actual changes to the Corefile configuration for improving DNS performance.
Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.
Stable And Deterministic Test Names | ✅ Passed | PR modifies only CoreDNS YAML configuration files with no Ginkgo test declarations added or changed.
Test Structure And Quality | ✅ Passed | The custom check for Ginkgo test code quality is not applicable to this PR. The PR exclusively modifies a CoreDNS configuration file and contains no test code.

@openshift-ci-robot
Contributor

@sadasu: This pull request references Jira Issue OCPBUGS-66104, which is valid.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (4.22.0) matches configured target version for branch (4.22.0)
  • bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, POST)

No GitHub users were found matching the public email listed for the QA contact in Jira (gpei@redhat.com), skipping review request.

Details

In response to this:

Make updates to the Cloud platform CoreDNS Corefile to increase the buffer size from 512 to the default of 1232 (by removing the configuration so the default value could take effect).
And prefer UDP for contacting the platform's upstream DNS servers to reduce load on them.

- What I did

- How to verify it

- Description for the changelog

Summary by CodeRabbit

Release Notes

  • Chores
  • Modified DNS configuration to enforce TCP protocol for upstream DNS forwarding
  • Updated DNS server list to include an additional resolver for improved redundancy

@sadasu sadasu force-pushed the cloud-custom-dns branch from 7281d8e to 1d6d745 Compare March 10, 2026 14:14

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@templates/common/cloud-platform-alt-dns/files/coredns-corefile.yaml`:
- Around line 8-10: In the CoreDNS Corefile inside the forward block (the stanza
starting with "forward . {{`{{- range $upstream := .DNSUpstreams}}
{{$upstream}}{{- end}}`}} {"), remove the force_tcp directive and either replace
it with prefer_udp or omit the line entirely so upstream queries will prefer UDP
and only fall back to TCP on truncation; update the forward block accordingly to
use prefer_udp if you want explicit behavior.

In
`@templates/common/cloud-platform-alt-dns/files/usr-local-bin-update-dns-server.yaml`:
- Line 17: The script hard-codes a public DNS (8.8.8.8) in the servers variable
assignment (servers=$(ip --json route get 8.8.8.8 | jq -r
".[0].prefsrc"),$1,8.8.8.8) which can break private/disconnected installs and
leak queries; change the logic in the servers assignment to stop appending
8.8.8.8 and instead use a configurable fallback or none: read a fallback from an
environment/config variable (e.g., FALLBACK_DNS or platform-provided upstreams)
and only append it when set and allowed, or simply build servers from the local
preferred source and $1 without the hard-coded public resolver.

ℹ️ Review info

📥 Commits

Reviewing files that changed from the base of the PR and between 7a8a698 and 7281d8e.

📒 Files selected for processing (2)
  • templates/common/cloud-platform-alt-dns/files/coredns-corefile.yaml
  • templates/common/cloud-platform-alt-dns/files/usr-local-bin-update-dns-server.yaml

@sadasu sadasu force-pushed the cloud-custom-dns branch 2 times, most recently from 86e4155 to 8ddb17b Compare March 11, 2026 21:04
@sadasu
Contributor Author

sadasu commented Mar 16, 2026

/retest-required

@openshift-ci-robot openshift-ci-robot added the verified Signifies that the PR passed pre-merge verification criteria label Mar 17, 2026
@openshift-ci-robot
Contributor

@jinyunma: This PR has been marked as verified by jima.

Details

In response to this:

With this change, the openshift-e2e test cases running on the Azure and AWS custom-dns jobs look better now.

AWS: https://qe-private-deck-ci.apps.ci.l2s4.p1.openshiftapps.com/view/gs/qe-private-deck/pr-logs/pull/openshift_release/66584/rehearse-66584-periodic-ci-openshift-verification-tests-main-installer-rehearse-4.22-installer-rehearse-aws/2031914494308913152

Azure: https://qe-private-deck-ci.apps.ci.l2s4.p1.openshiftapps.com/view/gs/qe-private-deck/pr-logs/pull/openshift_release/66584/rehearse-66584-periodic-ci-openshift-verification-tests-main-installer-rehearse-4.22-installer-rehearse-azure/2031914494376022016

/verified by jima

health :18080
forward . {{`{{- range $upstream := .DNSUpstreams}} {{$upstream}}{{- end}}`}} {
    policy sequential
    force_tcp
Contributor

@sadasu Your PR description says "prefer udp". Is that out of date because you ended up moving to force_tcp, or should this be prefer_udp?

prefer_udp makes more sense to me (and Gemini mentioned it is recommended on AWS EKS!), but I haven't been following this PR closely.

Contributor Author

Yes, prefer_udp was one of the first options we tried because it was the preferred solution under high load. Testing revealed that the upstream servers were reducing their load by not responding to UDP requests.
We then tried setting max_concurrent to a value below 1000, which is the default value, as a way to circumvent i/o timeouts due to UDP port exhaustion. But during our testing we found that UDP port exhaustion was not the cause of our i/o timeouts.
force_tcp gave us the best results.
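
For reference, the three forward options discussed in this reply could be written roughly as follows. These are illustrative alternatives, not the exact configurations that were tested: the upstream address `192.0.2.53` and the `max_concurrent` value `500` are placeholders, and each stanza would be used on its own inside a server block.

```
# Option 1: try UDP first, retry over TCP only if the reply is truncated
forward . 192.0.2.53 {
    policy sequential
    prefer_udp
}

# Option 2: cap the number of concurrent upstream queries (500 is a placeholder)
forward . 192.0.2.53 {
    policy sequential
    max_concurrent 500
}

# Option 3 (the one this PR settled on): always use TCP to reach the upstream
forward . 192.0.2.53 {
    policy sequential
    force_tcp
}
```

Per the CoreDNS forward plugin documentation, if both force_tcp and prefer_udp are specified, force_tcp takes precedence.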

@openshift-ci-robot
Contributor

@sadasu: This pull request references Jira Issue OCPBUGS-66104, which is valid.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (4.22.0) matches configured target version for branch (4.22.0)
  • bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, POST)

No GitHub users were found matching the public email listed for the QA contact in Jira (gpei@redhat.com), skipping review request.

Details

In response to this:

With userProvisionedDNS enabled, we found that after a successful cluster install we were seeing several i/o timeouts on Day-2, specifically for UDP requests. After experimenting with bufsize and the forward plugin's options prefer_udp, max_concurrent and force_tcp, we settled on force_tcp.

- What I did

- How to verify it

- Description for the changelog

Summary by CodeRabbit

  • Configuration Updates
  • Removed a restrictive DNS UDP buffer setting to allow improved handling of larger responses.
  • Enabled forced TCP for DNS forwarding to improve reliability when responses are truncated.

@patrickdillon
Contributor

/lgtm

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Mar 19, 2026
@sadasu sadasu force-pushed the cloud-custom-dns branch from 8ddb17b to 987b061 Compare March 19, 2026 14:25
@openshift-ci-robot openshift-ci-robot removed the verified Signifies that the PR passed pre-merge verification criteria label Mar 19, 2026
@openshift-ci openshift-ci bot removed the lgtm Indicates that a PR is ready to be merged. label Mar 19, 2026
Make updates to the `forward` plugin in the Cloud platform CoreDNS
Corefile to `force_tcp` while making DNS requests. This has been
found to reduce i/o timeouts experienced by UDP DNS requests made
to the Cloud Upstream servers.

In addition changed `bufsize` from 512 to the default 1232 allowing
for packet sizes to be larger than 512 bytes.
@sadasu sadasu force-pushed the cloud-custom-dns branch from 987b061 to 76df85f Compare March 19, 2026 14:28
@sadasu
Contributor Author

sadasu commented Mar 19, 2026

/verified by jima

Lost the label when I updated the commit message to more accurately represent the code changes. Please see original comment #5695 (comment)

@openshift-ci-robot openshift-ci-robot added the verified Signifies that the PR passed pre-merge verification criteria label Mar 19, 2026
@openshift-ci-robot
Contributor

@sadasu: This PR has been marked as verified by jima.

Member

@isabella-janssen isabella-janssen left a comment

/lgtm

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Mar 19, 2026
@openshift-ci
Contributor

openshift-ci bot commented Mar 19, 2026

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: isabella-janssen, patrickdillon, sadasu

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Mar 19, 2026
@sadasu
Contributor Author

sadasu commented Mar 19, 2026

/retest-required

@sadasu
Contributor Author

sadasu commented Mar 20, 2026

/retest-required

@openshift-ci
Contributor

openshift-ci bot commented Mar 20, 2026

@sadasu: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name | Commit | Details | Required | Rerun command
ci/prow/e2e-gcp-op-ocl | 76df85f | link | false | /test e2e-gcp-op-ocl

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@openshift-merge-bot openshift-merge-bot bot merged commit 15c41cf into openshift:main Mar 20, 2026
17 of 18 checks passed
@openshift-ci-robot
Contributor

@sadasu: Jira Issue Verification Checks: Jira Issue OCPBUGS-66104
✔️ This pull request was pre-merge verified.
✔️ All associated pull requests have merged.
✔️ All associated, merged pull requests were pre-merge verified.

Jira Issue OCPBUGS-66104 has been moved to the MODIFIED state and will move to the VERIFIED state when the change is available in an accepted nightly payload. 🕓

Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. jira/severity-critical Referenced Jira bug's severity is critical for the branch this PR is targeting. jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. lgtm Indicates that a PR is ready to be merged. verified Signifies that the PR passed pre-merge verification criteria

6 participants