-
Notifications
You must be signed in to change notification settings - Fork 204
dns-discovery-netty: Add support for DNS backup requests #3369
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
dns-discovery-netty: Add support for DNS backup requests #3369
Conversation
...k-dns-discovery-netty/src/main/java/io/servicetalk/dns/discovery/netty/DefaultDnsClient.java
Outdated
Show resolved
Hide resolved
...k-dns-discovery-netty/src/main/java/io/servicetalk/dns/discovery/netty/DefaultDnsClient.java
Outdated
Show resolved
Hide resolved
...k-dns-discovery-netty/src/main/java/io/servicetalk/dns/discovery/netty/DefaultDnsClient.java
Outdated
Show resolved
Hide resolved
...k-dns-discovery-netty/src/main/java/io/servicetalk/dns/discovery/netty/DefaultDnsClient.java
Outdated
Show resolved
Hide resolved
bfcab9e to
7bb39c7
Compare
#### Motivation We want to explore whether we can reduce some tail latencies that seem to be rooted in DNS request timeouts. DNS is most often used over UDP, which is an inherently lossy protocol transport. When a packet is lost we have to wait for timeouts and retries, which can cause tail latencies. #### Modifications Instead of waiting for a full timeout, introduce a backup request which will fire after a duration. If the first request is slow because of packet loss or other unfortunate latency then we may get a faster result from the second. This is an experimental feature for now, but we can later enhance it with adaptive backup request deadlines and token buckets to make sure we're good citizens to the DNS servers.
b96eb2a to
f611635
Compare
|
Note that this is a simplified version of #2918. |
idelpivnitskiy
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice approach and abstractions! Most of my comments are minor suggestions, except the one on absolute timer value vs adaptive. I feel like testing might be hard and risky with an absolute value and we will too quickly come to the need in adaptive timer
...k-dns-discovery-netty/src/main/java/io/servicetalk/dns/discovery/netty/DefaultDnsClient.java
Outdated
Show resolved
Hide resolved
...tty/src/main/java/io/servicetalk/dns/discovery/netty/DefaultDnsServiceDiscovererBuilder.java
Outdated
Show resolved
Hide resolved
|
|
||
| // Backup request static configuration: values > 0 mean allow a backup request with fixed delay, disabled otherwise. | ||
| private static final String DNS_BACKUP_REQUEST_DELAY_MS_PROPERTY = | ||
| "io.servicetalk.dns.discovery.netty.experimental.dnsBackupRequestDelayMs"; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
experimental in the name signals that it's a temporary property, but if the plan is to enable it by default and remove later, then finding the right integer value that fits everyone will be a tricky task. If we plan to give users ability to configure this value and the property is only to control default value, then we may need to add builder API and consider renaming the property to something permanent like io.servicetalk.dns.discovery.netty.defaultDnsBackupRequestDelayMs
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
On the topic of finding the right value: in addition to finding perfect initial value, users (and upstream DNS servers) may be affected if some network (or DNS) conditions change.
WDYT if instead of specific ms value we will ask users to provide a percentile? Then we can track latencies internally and send backup requests only for those that exceed certain percentile, like p999, p99 or p95. Of course, it increases our internal complexity but on the other hand adjusts to a deployment environment and lets users control what percentage of their requests to retry. This may help avoid situations when an accidental spike in latency triggers a storm of backup requests that can kill upstream DNS server
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We talked offline, and decided the fixed number approach is fine so long as we strictly control the experiment. That lets us test some values real fast in a controlled way and that can inform us how to set a dynamic value (eg, what percentile to pick by default) later after we sort out some details of how to compute percentiles. One caveat is that we need to make the parsing of the property 'hot' instead of once on startup, and with the value of 0 skipping it altogether for the life of the process so we don't introduce changes for apps we're not testing with.
...k-dns-discovery-netty/src/main/java/io/servicetalk/dns/discovery/netty/DefaultDnsClient.java
Outdated
Show resolved
Hide resolved
...k-dns-discovery-netty/src/main/java/io/servicetalk/dns/discovery/netty/DefaultDnsClient.java
Outdated
Show resolved
Hide resolved
...s-discovery-netty/src/test/java/io/servicetalk/dns/discovery/netty/DefaultDnsClientTest.java
Outdated
Show resolved
Hide resolved
...s-discovery-netty/src/test/java/io/servicetalk/dns/discovery/netty/DefaultDnsClientTest.java
Outdated
Show resolved
Hide resolved
| verify(backupResolver, times(1)).resolveAll("foo"); | ||
| List<InetAddress> result = new ArrayList<>(); | ||
| backupPromise.setSuccess(result); | ||
| assertEquals(result, resolveFuture.get()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Few ideas for considerations:
- using "same" instead of "equals"
- checking
primaryPromiseis cancelled - parametrizing the test to ensure it works the same regardless of which promise completes first
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Took 1 and 3, but I was deliberate about not cancelling the losing promise. Primarily, it's more noise and allocations to trigger cancellation of the other and it's also not that expensive to let the query finish.
Motivation
We want to explore whether we can reduce some tail latencies that
seem to be rooted in DNS request timeouts. DNS is most often used
over UDP, which is an inherently lossy protocol transport. When
a packet is lost we have to wait for timeouts and retries, which
can cause tail latencies.
Modifications
Instead of waiting for a full timeout, introduce a backup request
which will fire after a duration. If the first request is slow
because of packet loss or other unfortunate latency then we may
get a faster result from the second.
This is an experimental feature for now, but we can later enhance
it with adaptive backup request deadlines and token buckets to
make sure we're good citizens to the DNS servers.