Search before asking
KubeRay Component
Others
What happened + What you expected to happen
- Observation: The `NODE_DEFINITION_EVENT` and the first `NODE_LIFECYCLE_EVENT` of the restarted worker (triggered by OOMKilled) are flushed to the old session.
- Expected behavior: The new worker's `NODE_*_EVENT` should be flushed to the new session.
Because a cluster session corresponds to the lifecycle of one cluster instance, the collector must flush events to the correct session so that post-mortem analysis is accurate.
Reproduction script
The events can be produced with the following steps:
- Deploy a Ray cluster
- Submit a Ray job to the cluster
- Trigger `OOMKilled` on both the worker and head
Then, we can analyze the flushed events.
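For the analysis step, a minimal sketch of one way to see which session each node's events landed in. It assumes the flushed events have already been loaded into dictionaries; the field names `session`, `node_id`, and `event_type`, the helper `group_events_by_session`, and the sample records are all hypothetical and are not the collector's actual schema or output.

```python
from collections import defaultdict


def group_events_by_session(events):
    """Map session name -> node id -> list of event types, in input order.

    `events` is an iterable of dicts with the hypothetical keys
    "session", "node_id", and "event_type".
    """
    grouped = defaultdict(lambda: defaultdict(list))
    for event in events:
        grouped[event["session"]][event["node_id"]].append(event["event_type"])
    # Convert nested defaultdicts to plain dicts for easier printing/comparison.
    return {session: dict(nodes) for session, nodes in grouped.items()}


# Made-up records that mimic the reported symptom: the restarted worker's
# definition event and first lifecycle event appear under the old session.
events = [
    {"session": "session_old", "node_id": "worker-restarted",
     "event_type": "NODE_DEFINITION_EVENT"},
    {"session": "session_old", "node_id": "worker-restarted",
     "event_type": "NODE_LIFECYCLE_EVENT"},
    {"session": "session_new", "node_id": "worker-restarted",
     "event_type": "NODE_LIFECYCLE_EVENT"},
]

grouped = group_events_by_session(events)
for session, nodes in grouped.items():
    print(session, nodes)
```

With the expected behavior, the restarted worker's `NODE_DEFINITION_EVENT` would appear only under the new session in this grouping.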
Anything else
node_event_new_sess.md
node_event_old_sess.md
Are you willing to submit a PR?