Skip to content

compute-domain-daemon: skip SIGUSR1 when DNS mapping update only removes nodes#919

Open
AkshatDudeja77 wants to merge 3 commits intoNVIDIA:mainfrom
AkshatDudeja77:fix-skip-sigusr1-on-node-removal
Open

compute-domain-daemon: skip SIGUSR1 when DNS mapping update only removes nodes#919
AkshatDudeja77 wants to merge 3 commits intoNVIDIA:mainfrom
AkshatDudeja77:fix-skip-sigusr1-on-node-removal

Conversation

@AkshatDudeja77
Copy link

Fixes #913

The IMEXDaemonUpdateLoopWithDNSNames currently sends SIGUSR1 to the IMEX
daemon whenever the DNS/IP mapping changes.

However, when the update only removes nodes (and does not add new ones),
forcing the daemon to re-resolve and reconnect is unnecessary.

This change tracks the previous mapping size and skips sending SIGUSR1 when
the updated mapping strictly removes nodes.

Before:
SIGUSR1 triggered whenever the mapping changed.

After:
SIGUSR1 triggered only when nodes are added or mappings otherwise change,
but not when nodes are only removed.

This implements the TODO in the code suggesting that reconnects can be
skipped when the new IP set only removes addresses compared to the old set.

Signed-off-by: AkshatDudeja77 <akshat.dudeja77@gmail.com>
@copy-pr-bot
Copy link

copy-pr-bot bot commented Mar 6, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@AkshatDudeja77 AkshatDudeja77 force-pushed the fix-skip-sigusr1-on-node-removal branch from a620c5d to f8ad4c0 Compare March 6, 2026 07:20
@jgehrcke
Copy link
Collaborator

jgehrcke commented Mar 7, 2026

when the update only removes nodes (and does not add new ones)

That's a great goal. But how do we detect this situation with sets being our central data structures?

Write two sets down, on a piece of paper. This could look like this:

Set A:  item1, item2, item3
Set B:  item1, item2, item4

What do you need to do in order to find out if set A has an item removed compared to set B?

Do you need to look at more than just set size?

I genuinely wonder what your goals and incentives are. This is I think your sixth pull request of this style against this repository, and I believe we need to think about providing candid feedback soon :). These are wild times!

Copy link
Collaborator

@jgehrcke jgehrcke left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

logically incorrect, as expressed in regular comment

@jgehrcke
Copy link
Collaborator

jgehrcke commented Mar 7, 2026

For context, I still find it quite likely that this is AI-generated noise. Last time I looked into this, I noted:
#863 (comment)

@AkshatDudeja77
Copy link
Author

AkshatDudeja77 commented Mar 8, 2026

@jgehrcke thank you for your point about comparing sets rather than just sizes, I finally do understand what you're saying.

My intention with the current change was only to avoid sending SIGUSR1 when the update strictly removes nodes (i.e. when the mapping shrinks). In cases where
the size stays the same but the elements differ (like old set: A,B,C and new set: A,B,D), we would still send SIGUSR1 because a new peer appeared.

For transparency: I used AI to draft the initial idea here, but I’ve reviewed and adjusted the code and tests myself.

…ves nodes (fixes NVIDIA#913)

Signed-off-by: AkshatDudeja77 <akshat.dudeja77@gmail.com>
@AkshatDudeja77 AkshatDudeja77 force-pushed the fix-skip-sigusr1-on-node-removal branch from f8ad4c0 to a7ddddf Compare March 8, 2026 19:05
@AkshatDudeja77 AkshatDudeja77 requested a review from jgehrcke March 8, 2026 19:18
@rajatchopra
Copy link

@AkshatDudeja77 Can you write some unit testcases around it

Signed-off-by: AkshatDudeja77 <akshat.dudeja77@gmail.com>
@AkshatDudeja77 AkshatDudeja77 force-pushed the fix-skip-sigusr1-on-node-removal branch from 13461a6 to e40c7ed Compare March 24, 2026 06:21
@AkshatDudeja77
Copy link
Author

@AkshatDudeja77 Can you write some unit testcases around it

@rajatchopra Refactored the SIGUSR1 decision into a helper and added unit tests covering no-change, pure-removal, additions, replacements, remove+add, and fresh-process cases.

This now explicitly checks for newly added peers rather than relying on size or subset assumptions.

Please let me know if there are additional edge cases you'd like covered

// output didn't work.
func StartDebugSignalHandlers() {
go func() {
if runtime.GOOS == "windows" {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@AkshatDudeja77 As commented previously by @jgehrcke @rajatchopra please remove this change.

// addresses compared to the old set (then we don't need to force
// the daemon to re-resolve & re-connect).
if !updated || fresh {
if !shouldSendSIGUSR1(oldIPs, dnsNameManager.ipToDNSName, updated, fresh) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please retain rest of the comment here other than the TODO part and update based on the current logic.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

Status: Backlog

Development

Successfully merging this pull request may close these issues.

Avoid sending SIGUSR1 when ComputeDomain update only removes nodes

4 participants