ETCD: initial heartbeat thread #1010
👋 Hi tstamler! Thank you for contributing to ai-dynamo/nixl. Your PR reviewers will review your contribution then trigger the CI to test your changes. 🚀
/ok to test c2065fd
/build
}

void
startHeartbeatThread(uint64_t lease_id) {
I think this needs to be a queue of lease IDs. If an agent can store multiple metadata entries in etcd, we will have multiple lease IDs. We should add all of them to a queue (like the commqueue), and the heartbeat thread should just loop over all of them.
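A minimal sketch of the suggestion above, assuming a single thread that refreshes every registered lease on a fixed interval. `LeaseHeartbeat`, `addLease`, `removeLease`, and the `RefreshFn` callback are hypothetical names for illustration only; the callback stands in for whatever etcd keepalive call the client actually uses, and none of this is the PR's implementation.

```cpp
#include <cassert>
#include <chrono>
#include <condition_variable>
#include <cstdint>
#include <functional>
#include <mutex>
#include <set>
#include <thread>
#include <utility>
#include <vector>

// Hypothetical helper (not part of NIXL): one thread refreshes all registered leases.
class LeaseHeartbeat {
public:
    // Stand-in for the etcd keepalive call; returns false if the lease no longer exists.
    using RefreshFn = std::function<bool(uint64_t)>;

    LeaseHeartbeat(RefreshFn refresh, std::chrono::milliseconds interval)
        : refresh_(std::move(refresh)), interval_(interval), worker_([this] { run(); }) {}

    ~LeaseHeartbeat() {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            stop_ = true;
        }
        cv_.notify_all();
        worker_.join();
    }

    void addLease(uint64_t lease_id) {
        std::lock_guard<std::mutex> lock(mutex_);
        leases_.insert(lease_id);
    }

    void removeLease(uint64_t lease_id) {
        std::lock_guard<std::mutex> lock(mutex_);
        leases_.erase(lease_id);
    }

private:
    void run() {
        std::unique_lock<std::mutex> lock(mutex_);
        while (!stop_) {
            // Snapshot the lease set so refresh calls run without holding the lock.
            std::vector<uint64_t> snapshot(leases_.begin(), leases_.end());
            lock.unlock();
            std::vector<uint64_t> expired;
            for (uint64_t id : snapshot) {
                if (!refresh_(id)) expired.push_back(id);
            }
            lock.lock();
            // Stop tracking leases that could not be refreshed.
            for (uint64_t id : expired) leases_.erase(id);
            // Sleep for one interval, waking early on shutdown.
            cv_.wait_for(lock, interval_, [this] { return stop_; });
        }
    }

    RefreshFn refresh_;
    std::chrono::milliseconds interval_;
    std::mutex mutex_;
    std::condition_variable cv_;
    std::set<uint64_t> leases_;
    bool stop_ = false;
    std::thread worker_; // declared last so it starts after the other members are initialized
};
```

With this shape, each call that stores metadata under a new lease would register that lease with `addLease`, and the interval would need to be comfortably shorter than the lease TTL.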
Another way would be to add a special well-known key that marks an agent as active and create a lease_id only for that key. The agent keeps this key alive, and the other agents' watchers only check this key. If the key is deleted, the agent is gone, so delete all metadata (MD) for it.
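The alternative above can be sketched as follows, assuming each agent owns one liveness key and the watcher reacts to its deletion by dropping every cached metadata entry for that agent. The key layout (`<agent>/alive`, `<agent>/metadata/<label>`), the `MetadataCache` type, and the function names are assumptions made for this example, not the layout NIXL uses; a plain `std::map` stands in for the locally cached remote metadata.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>

// Stand-in for the locally cached remote metadata, keyed by full etcd-style key.
using MetadataCache = std::map<std::string, std::string>;

// Hypothetical key layout: one liveness key per agent, one metadata key per label.
std::string livenessKey(const std::string &agent) { return agent + "/alive"; }
std::string metadataPrefix(const std::string &agent) { return agent + "/metadata/"; }

// Watcher callback sketch: if the deleted key is the agent's liveness key,
// erase every cached metadata entry for that agent. Returns the number erased.
std::size_t onKeyDeleted(MetadataCache &cache, const std::string &agent,
                         const std::string &deleted_key) {
    if (deleted_key != livenessKey(agent)) return 0;

    const std::string prefix = metadataPrefix(agent);
    std::size_t removed = 0;
    // Keys sharing a prefix are contiguous in an ordered map, starting at lower_bound.
    auto it = cache.lower_bound(prefix);
    while (it != cache.end() && it->first.compare(0, prefix.size(), prefix) == 0) {
        it = cache.erase(it);
        ++removed;
    }
    return removed;
}
```

This keeps one lease and one watch per agent regardless of how many metadata entries it publishes, at the cost of not being able to invalidate a single entry on its own.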
src/core/nixl_listener.cpp
Outdated
NIXL_DEBUG << "Successfully loaded metadata for agent: " << remote_agent;

etcdClient->setupAgentWatcher(remote_agent);
etcdClient->setupAgentWatcher(remote_agent, metadata_label);
If a remote agent sends multiple metadata entries, this watch will only be installed for the first MD, since we only allow one watch per agent. Would there be a case where we need to invalidate only a specific MD in this flow?
What?
Add a heartbeat mechanism to NIXL ETCD metadata to invalidate stale metadata if an agent dies.