OCPBUGS-66104: Fine tune CoreDNS pod configuration to improve performance#5695
Conversation
|
@sadasu: This pull request references Jira Issue OCPBUGS-66104, which is valid. 3 validation(s) were run on this bug
Requesting review from QA contact: The bug has been updated to refer to the pull request using the external bug tracker. DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
19dccd4 to
6463d12
Compare
|
/testwith openshift/installer/master/e2e-aws-custom-dns-techpreview |
|
/retest-required |
|
/testwith openshift/installer/master/e2e-aws-custom-dns-techpreview |
|
/testwith e2e-aws-custom-dns-techpreview |
|
@sadasu seems the command |
80cbba4 to
efcf510
Compare
efcf510 to
7281d8e
Compare
|
Important Review skippedAuto reviews are limited based on label configuration. 🚫 Review skipped — only excluded labels are configured. (1)
Please check the settings in the CodeRabbit UI or the ⚙️ Run configurationConfiguration used: Organization UI Review profile: CHILL Plan: Pro Run ID: You can disable this status message by setting the Use the checkbox below for a quick retry:
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
WalkthroughCoreDNS Corefile updated: removed the Changes
Estimated code review effort🎯 2 (Simple) | ⏱️ ~10 minutes 🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
📝 Coding Plan
Comment |
|
@sadasu: This pull request references Jira Issue OCPBUGS-66104, which is valid. 3 validation(s) were run on this bug
No GitHub users were found matching the public email listed for the QA contact in Jira (gpei@redhat.com), skipping review request. DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
7281d8e to
1d6d745
Compare
There was a problem hiding this comment.
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@templates/common/cloud-platform-alt-dns/files/coredns-corefile.yaml`:
- Around line 8-10: In the CoreDNS Corefile inside the forward block (the stanza
starting with "forward . {{`{{- range $upstream := .DNSUpstreams}}
{{$upstream}}{{- end}}`}} {"), remove the force_tcp directive and either replace
it with prefer_udp or omit the line entirely so upstream queries will prefer UDP
and only fall back to TCP on truncation; update the forward block accordingly to
use prefer_udp if you want explicit behavior.
In
`@templates/common/cloud-platform-alt-dns/files/usr-local-bin-update-dns-server.yaml`:
- Line 17: The script hard-codes a public DNS (8.8.8.8) in the servers variable
assignment (servers=$(ip --json route get 8.8.8.8 | jq -r
".[0].prefsrc"),$1,8.8.8.8) which can break private/disconnected installs and
leak queries; change the logic in the servers assignment to stop appending
8.8.8.8 and instead use a configurable fallback or none: read a fallback from an
environment/config variable (e.g., FALLBACK_DNS or platform-provided upstreams)
and only append it when set and allowed, or simply build servers from the local
preferred source and $1 without the hard-coded public resolver.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: eea4363e-a62c-4f35-8d64-147cc309dd13
📒 Files selected for processing (2)
templates/common/cloud-platform-alt-dns/files/coredns-corefile.yamltemplates/common/cloud-platform-alt-dns/files/usr-local-bin-update-dns-server.yaml
templates/common/cloud-platform-alt-dns/files/usr-local-bin-update-dns-server.yaml
Outdated
Show resolved
Hide resolved
86e4155 to
8ddb17b
Compare
|
/retest-required |
|
With this change, openshfit-e2e test cases running on Azure and AWS custom-dns jobs look better now, and there is no impact on GCP. /verified by jima |
|
@jinyunma: This PR has been marked as verified by DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
| health :18080 | ||
| forward . {{`{{- range $upstream := .DNSUpstreams}} {{$upstream}}{{- end}}`}} { | ||
| policy sequential | ||
| force_tcp |
There was a problem hiding this comment.
@sadasu Your PR description says "prefer udp", is that out of date and you ended up moving to force_tcp? Or should this be prefer_udp?
prefer_udp makes more sense to me (and gemini mentioned it is recommended on aws eks!) but I haven't been following this pr closely
There was a problem hiding this comment.
Yes, prefer_udp was one of the first options we tried because that was the preferred solution during high load conditions. Testing revealed that the upstream servers reduced load by not responding to UDP requests.
We tried setting the max_concurrent to a value < 1000 which is the default value as a way to circumvent i/o timeouts due to UDP port exhaustion. But, during our testing we found that udp port exhaustion was not the cause of our i/o timeouts.
force_tcp was providing us with the best results.
|
@sadasu: This pull request references Jira Issue OCPBUGS-66104, which is valid. 3 validation(s) were run on this bug
No GitHub users were found matching the public email listed for the QA contact in Jira (gpei@redhat.com), skipping review request. DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
|
/lgtm |
8ddb17b to
987b061
Compare
Make updates to the `forward` plugin in the Cloud platform CoreDNS Corefile to `force_tcp` while making DNS requests. This has been found to reduce i/o timeouts experienced by UDP DNS requests made to the Cloud Upstream servers. In addition changed `bufsize` from 512 to the default 1232 allowing for packet sizes to be larger than 512 bytes.
987b061 to
76df85f
Compare
|
/verified by jima Lost the label when I updated the commit message to more accurately represent the code changes. Please see original comment #5695 (comment) |
|
@sadasu: This PR has been marked as verified by DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
|
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: isabella-janssen, patrickdillon, sadasu The full list of commands accepted by this bot can be found here. The pull request process is described here DetailsNeeds approval from an approver in each of these files:
Approvers can indicate their approval by writing |
|
/retest-required |
1 similar comment
|
/retest-required |
|
@sadasu: The following test failed, say
Full PR test history. Your PR dashboard. DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here. |
15c41cf
into
openshift:main
|
@sadasu: Jira Issue Verification Checks: Jira Issue OCPBUGS-66104 Jira Issue OCPBUGS-66104 has been moved to the MODIFIED state and will move to the VERIFIED state when the change is available in an accepted nightly payload. 🕓 DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
When
userProvisionedDNSwas enabled we found that after a successful cluster install, we were seeing severali/o timeoutsspecifically for UDP requests on Day-2. After experimentation with theforwardplugins's optionsprefer_udp,max_concurrentandforce_tcp, we settle on usingforce_tcp.Also, allowed
bufsizeto be set to its default value of 1232 bytes thus not limiting packet sizes to be 512 bytes.- What I did
- How to verify it
- Description for the changelog