
Conversation

@bryce-anderson (Contributor) commented Dec 4, 2025

#### Motivation

We want to explore whether we can reduce some of the tail latencies that appear to be rooted in DNS request timeouts. DNS is most often carried over UDP, which is an inherently lossy transport. When a packet is lost we have to wait for timeouts and retries, which inflates tail latencies.

#### Modifications

Instead of waiting for a full timeout, introduce a backup request that fires after a fixed delay. If the first request is slow because of packet loss or other unfortunate latency, we may get a faster result from the second.

This is an experimental feature for now, but it can later be enhanced with adaptive backup-request deadlines and token buckets to make sure we remain good citizens toward the DNS servers.
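
To make the shape of the change concrete, below is a minimal, hedged sketch of the backup-request pattern described above, written against plain `CompletableFuture` rather than the Netty/ServiceTalk resolver types the PR actually touches; all class and method names here are illustrative.

```java
import java.net.InetAddress;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.function.Function;

// Illustrative only: issue the primary lookup, schedule a backup lookup after
// a fixed delay, and complete the caller's future with whichever finishes first.
final class BackupRequestSketch {

    private static final ScheduledExecutorService TIMER =
            Executors.newSingleThreadScheduledExecutor();

    static CompletableFuture<List<InetAddress>> resolveWithBackup(
            Function<String, CompletableFuture<List<InetAddress>>> resolver,
            String name, long backupDelayMs) {
        CompletableFuture<List<InetAddress>> result = new CompletableFuture<>();

        // Fire the primary lookup immediately.
        resolver.apply(name).whenComplete((addrs, cause) -> finish(result, addrs, cause));

        // If nothing has completed by the deadline, fire a second (backup) lookup.
        // The losing lookup is not cancelled; it is simply left to finish on its own.
        TIMER.schedule(() -> {
            if (!result.isDone()) {
                resolver.apply(name).whenComplete((addrs, cause) -> finish(result, addrs, cause));
            }
        }, backupDelayMs, TimeUnit.MILLISECONDS);

        return result;
    }

    private static <T> void finish(CompletableFuture<T> target, T value, Throwable cause) {
        // First completion (success or failure) wins; later completions are no-ops.
        if (cause != null) {
            target.completeExceptionally(cause);
        } else {
            target.complete(value);
        }
    }
}
```

A real implementation would likely want to treat an early failure of the primary differently (for example, fall through to the backup rather than failing immediately) and to bound the extra load with a token bucket, as the description notes.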

@bryce-anderson force-pushed the bl_anderson/backup-dns-request-alt branch from bfcab9e to 7bb39c7 on December 18, 2025 at 17:26
@bryce-anderson force-pushed the bl_anderson/backup-dns-request-alt branch from b96eb2a to f611635 on December 18, 2025 at 19:11
@bryce-anderson changed the title from "WIP: backup DNS requests to reduce tail latencies associated with lost DNS packets" to "dns-discovery-netty: Add support for DNS backup requests" on Dec 18, 2025
@bryce-anderson marked this pull request as ready for review on December 18, 2025 at 19:12

@bryce-anderson (Contributor, PR author) commented:

Note that this is a simplified version of #2918.

@idelpivnitskiy (Member) left a comment:

Nice approach and abstractions! Most of my comments are minor suggestions, except the one on absolute timer value vs. adaptive. I feel like testing might be hard and risky with an absolute value, and we will quickly find ourselves needing an adaptive timer.


// Backup request static configuration: values > 0 mean allow a backup request with fixed delay, disabled otherwise.
private static final String DNS_BACKUP_REQUEST_DELAY_MS_PROPERTY =
"io.servicetalk.dns.discovery.netty.experimental.dnsBackupRequestDelayMs";

Member commented:

experimental in the name signals that it's a temporary property, but if the plan is to enable it by default and remove the property later, then finding the right integer value that fits everyone will be tricky. If we plan to give users the ability to configure this value and the property only controls the default, then we may need to add a builder API and consider renaming the property to something permanent like io.servicetalk.dns.discovery.netty.defaultDnsBackupRequestDelayMs
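
To illustrate that suggestion, a rough and entirely hypothetical shape such a builder hook could take; the interface and method names below are not part of ServiceTalk's current API:

```java
import java.time.Duration;

// Hypothetical builder hook: the (renamed) system property would only supply
// the default, while users could override the delay programmatically.
interface BackupRequestConfigBuilder {
    // Delay after which a backup DNS request is fired; Duration.ZERO disables it.
    BackupRequestConfigBuilder backupRequestDelay(Duration delay);
}
```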

Member commented:

On the topic of finding the right value: in addition to finding a perfect initial value, users (and upstream DNS servers) may be affected if network (or DNS) conditions change.

WDYT if, instead of a specific ms value, we ask users to provide a percentile? Then we can track latencies internally and send backup requests only for those that exceed a certain percentile, like p999, p99, or p95. Of course, it increases our internal complexity, but on the other hand it adjusts to the deployment environment and lets users control what percentage of their requests to retry. This may help avoid situations where an accidental spike in latency triggers a storm of backup requests that can kill the upstream DNS server.
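
As a rough illustration of the percentile idea (not code from this PR), a bounded reservoir of recent lookup latencies could be kept and the backup delay derived from a configured percentile:

```java
import java.util.Arrays;
import java.util.concurrent.ThreadLocalRandom;

// Illustrative latency tracker: records recent DNS lookup latencies in a
// bounded reservoir and approximates a percentile to use as the backup delay.
final class LatencyPercentileTracker {
    private final long[] samples;
    private int count; // total samples observed so far

    LatencyPercentileTracker(int reservoirSize) {
        this.samples = new long[reservoirSize];
    }

    // Record one lookup latency; reservoir sampling keeps memory bounded.
    synchronized void record(long latencyMs) {
        if (count < samples.length) {
            samples[count] = latencyMs;
        } else {
            int idx = ThreadLocalRandom.current().nextInt(count + 1);
            if (idx < samples.length) {
                samples[idx] = latencyMs;
            }
        }
        count++;
    }

    // Approximate the given percentile (e.g. 0.99 for p99) over the reservoir.
    synchronized long percentileMs(double p) {
        int n = Math.min(count, samples.length);
        if (n == 0) {
            return Long.MAX_VALUE; // no data yet: effectively never fire a backup
        }
        long[] sorted = Arrays.copyOf(samples, n);
        Arrays.sort(sorted);
        int idx = (int) Math.ceil(p * n) - 1;
        return sorted[Math.max(0, Math.min(idx, n - 1))];
    }
}
```

A token bucket on top of something like this (as the PR description already hints) would bound the backup-request rate during latency spikes.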

Contributor (PR author) replied:

We talked offline and decided the fixed-number approach is fine so long as we strictly control the experiment. That lets us test some values quickly in a controlled way, which can inform how to set a dynamic value (e.g., what percentile to pick by default) later, after we sort out some details of how to compute percentiles. One caveat is that we need to make the parsing of the property 'hot' instead of once on startup, with a value of 0 skipping it altogether for the life of the process, so we don't introduce changes for apps we're not testing with.
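
A small sketch of what that 'hot' parsing could look like; the property name is the one from the diff above, but the helper itself is purely illustrative:

```java
// Illustrative only: if the delay is 0 at startup the feature stays disabled
// for the life of the process; otherwise the value is re-read on every
// resolution so experiments can adjust it without a restart.
final class BackupDelaySketch {
    private static final String PROP =
            "io.servicetalk.dns.discovery.netty.experimental.dnsBackupRequestDelayMs";
    private static final boolean ENABLED = readDelayMs() > 0; // decided once, at class init

    // Returns the current backup delay in ms, or 0 if backup requests are disabled.
    static long currentDelayMs() {
        return ENABLED ? readDelayMs() : 0;
    }

    private static long readDelayMs() {
        try {
            return Math.max(0, Long.parseLong(System.getProperty(PROP, "0")));
        } catch (NumberFormatException e) {
            return 0; // unparsable value: treat as disabled
        }
    }
}
```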

verify(backupResolver, times(1)).resolveAll("foo");
List<InetAddress> result = new ArrayList<>();
backupPromise.setSuccess(result);
assertEquals(result, resolveFuture.get());

Member commented:

A few ideas for consideration:

  1. using "same" instead of "equals"
  2. checking primaryPromise is cancelled
  3. parametrizing the test to ensure it works the same regardless of which promise completes first

Contributor (PR author) replied:

Took 1 and 3, but I was deliberate about not cancelling the losing promise. Primarily, triggering cancellation of the other promise adds noise and allocations, and it's also not that expensive to let the query finish.
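
For illustration, a parametrized test along the lines discussed above could look like the following, written against the `BackupRequestSketch` helper sketched earlier in this thread rather than the PR's real resolver and test classes:

```java
import static org.junit.jupiter.api.Assertions.assertSame;

import java.net.InetAddress;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import org.junit.jupiter.params.ParameterizedTest;
import org.junit.jupiter.params.provider.ValueSource;

// Illustrative test: whichever lookup completes first should win, and the
// losing lookup is simply left to finish on its own (not cancelled).
class BackupRequestSketchTest {

    @ParameterizedTest
    @ValueSource(booleans = {true, false})
    void firstCompletionWins(boolean primaryWins) throws Exception {
        CompletableFuture<List<InetAddress>> primary = new CompletableFuture<>();
        CompletableFuture<List<InetAddress>> backup = new CompletableFuture<>();
        // The fake resolver returns `primary` on the first call and `backup` on the second.
        Iterator<CompletableFuture<List<InetAddress>>> lookups = List.of(primary, backup).iterator();

        CompletableFuture<List<InetAddress>> result =
                BackupRequestSketch.resolveWithBackup(name -> lookups.next(), "foo", 1);

        Thread.sleep(50); // crude: give the 1ms backup timer time to fire
        List<InetAddress> addresses = new ArrayList<>();
        (primaryWins ? primary : backup).complete(addresses);

        assertSame(addresses, result.get()); // "same", per suggestion 1 above
    }
}
```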
