[Bug] [history server] [collector] NODE_*_EVENT of restarted workers are flushed to the old session #4442

@JiangJiaWei1103

Search before asking

  • I searched the issues and found no similar issues.

KubeRay Component

Others

What happened + What you expected to happen

  • Observation: After a worker restarts (the restart was triggered by OOMKilled), its NODE_DEFINITION_EVENT and first NODE_LIFECYCLE_EVENT are flushed to the old session.
  • Expected behavior: The restarted worker's NODE_*_EVENT records should be flushed to the new session.

Because a cluster session corresponds to the lifecycle of a single cluster instance, the collector must flush each event to the session it belongs to. Otherwise, post-mortem analysis of a given session will be inaccurate.

Reproduction script

The events can be reproduced as follows:

  1. Deploy a Ray cluster
  2. Submit a Ray job to the cluster
  3. Trigger OOMKilled on both the worker and head

Then, we can analyze the flushed events.
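
As a rough aid for that analysis, the sketch below flags events that landed in a session other than the one active when they were emitted. It is only an illustration under stated assumptions: the `FlushedEvent` fields, the `session_starts` mapping, and the rule "an event belongs to the latest session that started at or before its timestamp" are all hypothetical and do not reflect the collector's actual data model or API.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class FlushedEvent:
    """Hypothetical, simplified view of one flushed event (not the real schema)."""
    event_type: str   # e.g. "NODE_DEFINITION_EVENT"
    node_id: str      # ID of the node that emitted the event
    timestamp: float  # emission time, in seconds
    session: str      # session the collector actually flushed the event to


def find_misrouted(
    events: List[FlushedEvent],
    session_starts: Dict[str, float],
) -> List[Tuple[FlushedEvent, str]]:
    """Return (event, expected_session) pairs for events in the wrong session.

    Assumption: an event belongs to the most recent session whose start
    time is at or before the event's timestamp.
    """
    ordered = sorted(session_starts.items(), key=lambda kv: kv[1])
    misrouted = []
    for ev in events:
        expected = None
        for name, start in ordered:
            if start <= ev.timestamp:
                expected = name  # keep the latest session that already started
        if expected is not None and ev.session != expected:
            misrouted.append((ev, expected))
    return misrouted


if __name__ == "__main__":
    # Made-up timeline: the old session starts at t=100, the new one at t=200.
    starts = {"session_old": 100.0, "session_new": 200.0}
    events = [
        FlushedEvent("NODE_LIFECYCLE_EVENT", "worker-a", 150.0, "session_old"),
        FlushedEvent("NODE_DEFINITION_EVENT", "worker-b", 210.0, "session_old"),
        FlushedEvent("NODE_LIFECYCLE_EVENT", "worker-b", 211.0, "session_old"),
        FlushedEvent("NODE_LIFECYCLE_EVENT", "worker-b", 230.0, "session_new"),
    ]
    for ev, expected in find_misrouted(events, starts):
        print(f"{ev.event_type} of {ev.node_id} is in {ev.session}, expected {expected}")
```

In this made-up timeline, the two `worker-b` events emitted just after the new session starts are the ones reported, which mirrors the pattern described in the observation above.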

Anything else

node_event_new_sess.md
node_event_old_sess.md

Are you willing to submit a PR?

  • Yes I am willing to submit a PR!

Metadata

Labels

bug: Something isn't working
