@tstamler
Contributor

What?

Add a heartbeat mechanism to NIXL's etcd metadata so that stale metadata is invalidated if an agent dies.
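
To make the intent concrete, here is a minimal sketch of lease-based invalidation. It is a toy in-memory model, not the etcd client API and not the code in this PR: `ToyLeaseStore` and every method on it are hypothetical names, and time is an integer tick so the behaviour is deterministic. The point is only that a key attached to a lease survives while heartbeats keep arriving and disappears once they stop.

```cpp
#include <cstdint>
#include <iterator>
#include <map>
#include <string>

// Toy stand-in for a lease-based key-value store (hypothetical, NOT the etcd API).
class ToyLeaseStore {
    struct Lease { uint64_t deadline; uint64_t ttl; };
    struct Entry { std::string value; uint64_t lease_id; };

public:
    // Grant a lease that expires ttl ticks after `now` unless it is refreshed.
    uint64_t grantLease(uint64_t now, uint64_t ttl) {
        uint64_t id = next_id_++;
        leases_[id] = Lease{now + ttl, ttl};
        return id;
    }

    // Attach a key to a lease; the key lives only as long as the lease does.
    void put(const std::string &key, const std::string &value, uint64_t lease_id) {
        keys_[key] = Entry{value, lease_id};
    }

    // Heartbeat: push the deadline out by one TTL. Returns false if the lease is gone.
    bool keepAlive(uint64_t lease_id, uint64_t now) {
        auto it = leases_.find(lease_id);
        if (it == leases_.end()) return false;
        it->second.deadline = now + it->second.ttl;
        return true;
    }

    // Drop every lease whose deadline has passed, plus all keys attached to it.
    void expire(uint64_t now) {
        for (auto it = leases_.begin(); it != leases_.end();) {
            if (it->second.deadline > now) { ++it; continue; }
            const uint64_t dead = it->first;
            for (auto k = keys_.begin(); k != keys_.end();) {
                k = (k->second.lease_id == dead) ? keys_.erase(k) : std::next(k);
            }
            it = leases_.erase(it);
        }
    }

    bool contains(const std::string &key) const { return keys_.count(key) != 0; }

private:
    std::map<uint64_t, Lease> leases_;
    std::map<std::string, Entry> keys_;
    uint64_t next_id_ = 1;
};
```

In this model an agent that dies simply stops calling `keepAlive`, and its metadata vanishes at the next `expire` pass after the deadline.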

@copy-pr-bot

copy-pr-bot bot commented Nov 10, 2025

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@tstamler tstamler marked this pull request as ready for review November 12, 2025 22:24
@tstamler tstamler requested a review from a team as a code owner November 12, 2025 22:24
@aranadive
Contributor

/ok to test c2065fd

@aranadive
Contributor

/build

}

void
startHeartbeatThread(uint64_t lease_id) {
Contributor

I think this needs to be a queue of lease IDs. If an agent can store multiple metadata entries in etcd, we will have multiple lease IDs. We should add all of them to a queue (like the commqueue), and the heartbeat thread should just loop over them.
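
A rough sketch of what this suggestion could look like, under stated assumptions: `LeaseHeartbeat` is a hypothetical class, and the actual keep-alive call is injected as a callback because the real etcd client call is not shown in this thread. A set is used instead of a literal queue so that a lease can be removed when its metadata is withdrawn.

```cpp
#include <chrono>
#include <condition_variable>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <mutex>
#include <thread>
#include <unordered_set>
#include <utility>
#include <vector>

// Hypothetical helper: one thread keeps every registered lease alive.
class LeaseHeartbeat {
public:
    using KeepAliveFn = std::function<void(uint64_t)>;

    LeaseHeartbeat(KeepAliveFn fn, std::chrono::milliseconds period)
        : keep_alive_(std::move(fn)), period_(period) {}

    ~LeaseHeartbeat() { stop(); }

    void addLease(uint64_t id) {
        std::lock_guard<std::mutex> lock(mutex_);
        leases_.insert(id);
    }

    void removeLease(uint64_t id) {
        std::lock_guard<std::mutex> lock(mutex_);
        leases_.erase(id);
    }

    // One heartbeat pass: refresh every registered lease, return how many.
    std::size_t refreshAll() {
        std::vector<uint64_t> snapshot;
        {
            std::lock_guard<std::mutex> lock(mutex_);
            snapshot.assign(leases_.begin(), leases_.end());
        }
        // Call out without holding the lock so a slow keep-alive cannot block addLease.
        for (uint64_t id : snapshot) keep_alive_(id);
        return snapshot.size();
    }

    void start() {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            if (running_) return;
            running_ = true;
        }
        worker_ = std::thread([this] { run(); });
    }

    void stop() {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            if (!running_) return;
            running_ = false;
        }
        cv_.notify_all();
        if (worker_.joinable()) worker_.join();
    }

private:
    void run() {
        std::unique_lock<std::mutex> lock(mutex_);
        while (running_) {
            lock.unlock();
            refreshAll();
            lock.lock();
            // Sleep for one period, but wake immediately if stop() is called.
            cv_.wait_for(lock, period_, [this] { return !running_; });
        }
    }

    KeepAliveFn keep_alive_;
    std::chrono::milliseconds period_;
    std::mutex mutex_;
    std::condition_variable cv_;
    std::unordered_set<uint64_t> leases_;
    std::thread worker_;
    bool running_ = false;
};
```

With this shape, `startHeartbeatThread(uint64_t lease_id)` would reduce to `addLease(lease_id)` plus a one-time `start()`. The period would need to be comfortably shorter than the lease TTL.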

Contributor

Another way would be to add a special, well-known key that marks an agent as active, and create a lease_id only for that key. The agent keeps this key alive, and the other agents' watchers check only this key. If the key is deleted, the agent is gone, so all MD for it should be deleted.
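
The liveness-key alternative could be sketched roughly as below. Everything here is an assumption made for illustration: the key layout (`/nixl/agents/<agent>/alive`), `livenessKey`, and `RemoteMetadataCache` are hypothetical and do not reflect the real NIXL key scheme. The sketch shows only the watcher-side rule: a delete event on an agent's liveness key drops every cached metadata entry for that agent, while deletes on other keys are ignored.

```cpp
#include <cstddef>
#include <map>
#include <string>
#include <utility>

// Hypothetical key layout, chosen only for this example.
const std::string kAgentPrefix = "/nixl/agents/";
const std::string kAliveSuffix = "/alive";

inline std::string livenessKey(const std::string &agent) {
    return kAgentPrefix + agent + kAliveSuffix;
}

// Hypothetical local cache of remote metadata, keyed by (agent, label).
class RemoteMetadataCache {
public:
    void store(const std::string &agent, const std::string &label, const std::string &blob) {
        cache_[std::make_pair(agent, label)] = blob;
    }

    bool has(const std::string &agent, const std::string &label) const {
        return cache_.count(std::make_pair(agent, label)) != 0;
    }

    std::size_t size() const { return cache_.size(); }

    // Watcher callback for a delete event. Returns the number of entries dropped.
    std::size_t onKeyDeleted(const std::string &key) {
        // Only liveness keys trigger invalidation.
        if (key.size() <= kAgentPrefix.size() + kAliveSuffix.size()) return 0;
        if (key.compare(0, kAgentPrefix.size(), kAgentPrefix) != 0) return 0;
        if (key.compare(key.size() - kAliveSuffix.size(), kAliveSuffix.size(), kAliveSuffix) != 0) return 0;

        const std::string agent = key.substr(
            kAgentPrefix.size(), key.size() - kAgentPrefix.size() - kAliveSuffix.size());

        // Entries for one agent are contiguous in the ordered map.
        std::size_t dropped = 0;
        auto it = cache_.lower_bound(std::make_pair(agent, std::string()));
        while (it != cache_.end() && it->first.first == agent) {
            it = cache_.erase(it);
            ++dropped;
        }
        return dropped;
    }

private:
    std::map<std::pair<std::string, std::string>, std::string> cache_;
};
```

The trade-off relative to a lease per metadata entry is that only one lease needs a heartbeat, but invalidation becomes all-or-nothing per agent.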

  NIXL_DEBUG << "Successfully loaded metadata for agent: " << remote_agent;

- etcdClient->setupAgentWatcher(remote_agent);
+ etcdClient->setupAgentWatcher(remote_agent, metadata_label);
Contributor

If a remote agent sends multiple metadata entries, this watch will be installed only for the first MD, since we allow only one watch per agent. Could there be a case in this flow where we need to invalidate only a specific MD?
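
To illustrate the concern, the two bookkeeping policies can be contrasted in a small sketch. Both classes are hypothetical models of watch registration and are not taken from the NIXL code; they show only why keying watches by agent alone silently skips later labels, while keying by (agent, label) would allow a single MD to be invalidated.

```cpp
#include <cstddef>
#include <set>
#include <string>
#include <utility>

// Models "one watch per agent": a second label for the same agent is ignored.
class PerAgentWatches {
public:
    // Returns true only if a new watch was actually installed.
    bool install(const std::string &agent, const std::string &label) {
        (void)label;  // the label plays no part in deduplication here
        return agents_.insert(agent).second;
    }

    std::size_t count() const { return agents_.size(); }

private:
    std::set<std::string> agents_;
};

// Alternative: one watch per (agent, label), so each MD can be dropped on its own.
class PerLabelWatches {
public:
    bool install(const std::string &agent, const std::string &label) {
        return watches_.insert(std::make_pair(agent, label)).second;
    }

    bool remove(const std::string &agent, const std::string &label) {
        return watches_.erase(std::make_pair(agent, label)) == 1;
    }

    std::size_t count() const { return watches_.size(); }

private:
    std::set<std::pair<std::string, std::string>> watches_;
};
```

Whether the finer granularity is worth the extra watches depends on the answer to the question above.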
