@olicesx olicesx commented Jun 23, 2025

Background

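This PR optimizes DNS handling performance in DAE. It addresses DNS performance bottlenecks under high concurrency by introducing lockless concurrency mechanisms. The changes replace the traditional mutex-protected data structures with sync.Map and introduce dedicated DNS components for better resource management.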

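Failure scenarios addressed:

  • High CPU usage caused by mutex contention in the DNS resolution path
  • Performance degradation under concurrent UDP traffic load
  • Memory bloat caused by duplicate DNS processing without state management
  • Poor scalability when handling thousands of concurrent connections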

Checklist

Full Changelogs

1. DNS Foundation Components (56fb759)

  • feat(dns,udp): add lockless concurrency foundation components
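  • Add DnsHandlingStateManager, which uses sync.Map to deduplicate DNS requests and resolve the duplicate-request problem
  • Add UdpHealthMonitor for lockless connection health tracking, improving the stability of UDP DNS queries
  • Add DnsForwarderManager with reference counting and lifecycle management, optimizing DNS forwarder resource usage
  • All components use sync.Map to eliminate lock contention in DNS processing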
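
As an illustration of the deduplication idea only, here is a minimal sketch of tracking in-flight DNS questions with sync.Map. The names `dedupKey`, `InFlight`, `TryBegin`, and `End` are hypothetical and are not taken from the PR's DnsHandlingStateManager; the sketch assumes a question is identified by its name and type.

```go
package main

import "sync"

// dedupKey is a hypothetical identifier for one in-flight DNS question.
type dedupKey struct {
	QName string
	QType uint16
}

// InFlight tracks questions that are currently being handled.
// It is an illustrative stand-in, not the PR's actual type.
type InFlight struct {
	m sync.Map // dedupKey -> struct{}
}

// TryBegin reports whether the caller is the first to handle key.
// LoadOrStore is atomic, so among concurrent callers exactly one gets true.
func (f *InFlight) TryBegin(key dedupKey) bool {
	_, loaded := f.m.LoadOrStore(key, struct{}{})
	return !loaded
}

// End marks key as finished so that a later identical question is handled again.
func (f *InFlight) End(key dedupKey) {
	f.m.Delete(key)
}
```

A caller would check `TryBegin` before doing the work and call `End` (typically via `defer`) once the response has been sent; callers that get `false` can drop or park the duplicate.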

2. DNS Core Optimizations (af2e2c6)

  • perf(dns,udp): replace mutex+map with sync.Map in core components
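  • In the DNS component, replace upstream2IndexMu + upstream2Index map with sync.Map, directly optimizing DNS upstream lookup performance
  • In AnyfromPool, replace RWMutex + map with sync.Map, improving UDP DNS connection pool performance
  • Eliminate lock contention bottlenecks in DNS upstream resolution and UDP pool access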
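
The mutex + map to sync.Map substitution described above might look roughly like the following sketch. The type `upstreamIndex` and its string key are assumptions made for illustration; the real upstream2Index key and value types in dae may differ.

```go
package main

import "sync"

// upstreamIndex is a hypothetical replacement for a map guarded by a mutex.
// Keys are upstream identifiers, values are their indexes.
type upstreamIndex struct {
	m sync.Map // string -> int
}

// Lookup returns the index stored for name without taking an explicit lock.
func (u *upstreamIndex) Lookup(name string) (int, bool) {
	v, ok := u.m.Load(name)
	if !ok {
		return 0, false
	}
	return v.(int), true
}

// Set records or overwrites the index for name.
func (u *upstreamIndex) Set(name string, idx int) {
	u.m.Store(name, idx)
}
```

Note that the Go documentation describes sync.Map as optimized for keys that are written once and read many times, or for goroutines that work on disjoint key sets; for other access patterns a plain map with a RWMutex can perform as well or better, so the gain is worth confirming with a benchmark.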

3. UDP DNS Processing Enhancement (d96dc26)

  • perf(udp): enhance UDP processing with lockless concurrency
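  • Enhance the UDP endpoint pool with sync.Map and health monitoring integration, improving DNS query connection quality
  • Optimize the UDP task pool with increased capacity and lockless state management, improving concurrent DNS query handling
  • Add health monitoring integration for DNS connection tracking and timeout handling
  • Improve DNS task scheduling, error handling, and resource cleanup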
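
For readers unfamiliar with the pattern, lockless health tracking can be sketched with sync.Map plus atomic counters as below. `HealthMonitor`, `connHealth`, and the consecutive-failure threshold are assumptions for illustration; the PR's UdpHealthMonitor may track different signals.

```go
package main

import (
	"sync"
	"sync/atomic"
)

// connHealth holds per-connection state (hypothetical).
type connHealth struct {
	failures atomic.Int32 // consecutive failures since the last success
}

// HealthMonitor is an illustrative tracker keyed by a connection identifier.
type HealthMonitor struct {
	conns       sync.Map // string -> *connHealth
	maxFailures int32    // a connection is unhealthy at or above this count
}

// entry returns the state for key, creating it on first use.
func (h *HealthMonitor) entry(key string) *connHealth {
	if v, ok := h.conns.Load(key); ok {
		return v.(*connHealth)
	}
	v, _ := h.conns.LoadOrStore(key, &connHealth{})
	return v.(*connHealth)
}

// RecordFailure counts one more consecutive failure for key.
func (h *HealthMonitor) RecordFailure(key string) { h.entry(key).failures.Add(1) }

// RecordSuccess resets the consecutive failure count for key.
func (h *HealthMonitor) RecordSuccess(key string) { h.entry(key).failures.Store(0) }

// Healthy reports whether key is below the failure threshold.
// Unknown connections are treated as healthy.
func (h *HealthMonitor) Healthy(key string) bool {
	v, ok := h.conns.Load(key)
	if !ok {
		return true
	}
	return v.(*connHealth).failures.Load() < h.maxFailures
}
```

A pool could consult `Healthy` before reusing an endpoint and evict entries that stay unhealthy; the eviction policy is left out of this sketch.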

4. DNS System Integration (6ff101c)

  • feat(control): integrate lockless concurrency optimizations
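  • Integrate all new DNS optimization components in the DNS control layer
  • Wire up the optimization components for system-wide DNS processing coordination
  • Improve DNS error handling, resource cleanup, and connection management
  • Simplify DNS control logic while improving DNS performance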

Issue Reference

Test Result

kix added 4 commits June 23, 2025 23:52
- Add DnsHandlingStateManager for DNS request deduplication using sync.Map
- Add UdpHealthMonitor for connection health tracking with lockless operations
- Add DnsForwarderManager with reference counting and lifecycle management

These foundational components provide the building blocks for lockless
concurrency optimizations across the DNS and UDP processing pipelines.
All components use sync.Map to eliminate lock contention in high-concurrency
scenarios while maintaining thread safety.

Related to daeuniverse#589, daeuniverse#767 - addressing performance bottlenecks in high-concurrency
DNS/UDP processing scenarios

- Replace upstream2IndexMu + upstream2Index map with sync.Map in DNS component
- Replace RWMutex + map with sync.Map in AnyfromPool for UDP connections
- Eliminate lock contention in DNS upstream resolution and UDP pool access
- Improve concurrent access performance in hot paths

These optimizations target the most frequently accessed data structures
in DNS and UDP processing, providing significant performance improvements
in high-concurrency scenarios.

Helps address daeuniverse#589 - performance issues in DNS resolution under high load
Helps address daeuniverse#767 - UDP connection pool contention

- Enhance UDP endpoint pool with sync.Map and health monitoring integration
- Optimize UDP task pool with increased capacity and lockless state management
- Add health monitoring integration for connection tracking and timeout handling
- Improve task scheduling, error handling, and resource cleanup

This optimization significantly improves UDP processing performance by
integrating health monitoring and eliminating lock contention in task
and endpoint management.

Addresses daeuniverse#589 - high-concurrency UDP performance issues
Helps resolve daeuniverse#767 - UDP task pool bottlenecks under load

- Integrate DnsHandlingState, DnsForwarderManager, and UdpHealthMonitor in DNS control layer
- Wire up optimized components in control plane for system-wide coordination
- Improve error handling, resource cleanup, and connection management
- Simplify DNS control logic while enhancing performance

This final integration brings together all lockless optimization components
to provide cohesive performance improvements across the entire DNS and UDP
processing pipeline.

Closes daeuniverse#589 - resolves high-concurrency DNS/UDP performance bottlenecks
Closes daeuniverse#767 - eliminates lock contention in connection management

The lockless design using sync.Map significantly reduces mutex contention
and improves throughput in high-traffic scenarios, addressing the core
performance issues reported in both issues.
@olicesx olicesx requested a review from a team as a code owner June 23, 2025 16:25
kix added 2 commits June 24, 2025 11:20
- Add DnsServerConfig struct to config/config.go for DNS server configuration
- Extend Dns config struct with Server field for DNS server settings
- Add dialArgument struct to control_plane.go for DNS dialing decisions
- Implement DNS server startup logic in control plane with graceful error handling
- Refactor dns_control.go to support both transparent proxy and DNS server modes
- Add dual-mode udpRequest struct supporting both proxy and server contexts
- Implement DNS forwarder manager with connection pooling and lifecycle management
- Add DNS server configuration example to example.dae with detailed comments
- Optimize DNS cache using sync.Map to reduce lock contention
- Ensure DNS server startup failure does not block main DAE functionality

This enables DAE to function as a complete DNS solution while maintaining
backward compatibility and transparent proxy capabilities.
@LostAttractor
Contributor

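This PR is fairly rough. I am trying to redesign the modules that the PR points out as problematic,
so there is probably no plan to merge it for now?
If you are having DNS problems, I recommend trying this PR.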

@olicesx olicesx force-pushed the lockless-concurrency-optimization branch from 0c9baa4 to b1fdf6c Compare June 25, 2025 09:50
- Improve AnyfromPool concurrent control with exponential backoff retry
- Add timeout protection to prevent infinite waiting in high concurrency
- Enhance error handling in sendPkt to prevent crashes on binding failures
- Replace fatal errors with warning logs for DNS response sending failures
- Maintain service availability when UDP port conflicts occur

Fixes the occasional crashes with 'bind: address already in use' errors
that occurred during high concurrent DNS requests on port 53.
@olicesx olicesx force-pushed the lockless-concurrency-optimization branch 5 times, most recently from 65cc5ef to d3f5890 Compare July 16, 2025 08:54
@ppdragon16
Contributor

ppdragon16 commented Jul 17, 2025

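I applied this PR as a patch. On ImmortalWrt, dae sees a large number of locally originated DNS requests resolving ...ip6.arpa, and dae's CPU usage stays at 30%~50%, although it remains usable (it recovered after service dae restart).

The log looks like this: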
time="Jul 17 02:44:25" level=info msg="....:46858 <-> 123.123.123.123:53" _qname=f.e.f.8.3.1....1.7.0.2.8.8.0.4.2.ip6.arpa. dialer=direct dscp=0 mac="00:00:00:00:00:00" network="udp4(DNS)" outbound=direct pid=1830 pname=rpcd policy=fixed qtype=PTR

I'm not sure whether it will reproduce again.

@ppdragon16
Contributor

Reproduced:

Screenshot 2025-07-17 at 10 55 07

@ppdragon16
Contributor

I added must_direct for rpcd and am trying again:

routing {
    pname(dnsmasq, uwsgi, rpcd) -> must_direct
...

@ppdragon16
Contributor

It reproduced again; it seems unrelated to rpcd.

@ppdragon16
Contributor

After dropping this PR and switching to main, the problem no longer reproduces.

@olicesx
Author

olicesx commented Jul 17, 2025

I probably broke it yesterday; give me some time to fix it.

@olicesx olicesx force-pushed the lockless-concurrency-optimization branch from d3f5890 to ee6c5b4 Compare July 17, 2025 07:27
@tomaegg
Contributor

tomaegg commented Aug 15, 2025

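After dropping this PR and switching to main, the problem no longer reproduces.

I don't know why you can't reproduce it on main; I still see this on v1.0.0.

These reverse lookups originate from ImmortalWrt. Some are IPv4 PTR queries that reverse-resolve LAN hostnames, and the rest are IPv6 PTR queries for the LAN. What I don't understand is why they don't show up for you on main. Perhaps it depends on how your rules are written.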

@olicesx olicesx closed this Jan 14, 2026

Development

Successfully merging this pull request may close these issues.

[Bug Report] DNS may hang
[Bug Report] A timeout on an individual upstream DNS server slows down the other DNS servers

4 participants