Replies: 1 comment
-
This is a textbook distributed connection/healthcheck failure, exactly the sort of cross-process bug that keeps popping up in vLLM + Ray setups. (In our issue map it is classified under "distributed infra: stale connection pool / cluster node health desync".) Often the Ray cluster passes the built-in healthcheck but still fails when vLLM tries to schedule or allocate resources. Quick things to check: socket state between the pods, firewall rules, and subtle config drift between nodes.
If you want the step-by-step diagnosis checklist or a full breakdown of these connection issues, let me know and I'll share the reference.
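To make that concrete, here is a minimal sketch of what such checks could look like from the vLLM engine pod, assuming a KubeRay-style head service; the service name, ports, and the `ray list nodes` state-API call are illustrative assumptions, not details from the original setup.

```bash
# Minimal diagnostic sketch; run from the vLLM engine pod.
# The service name and ports below are placeholders, not taken from this setup.
RAY_HEAD=raycluster-head-svc.default.svc.cluster.local

# Per-node health: a node can answer the GCS healthcheck yet be marked DEAD or
# report zero resources, which is the scheduling/healthcheck desync described above.
ray list nodes --address="$RAY_HEAD:6379"

# Firewall / socket reachability for ports beyond the GCS port, e.g. the
# dashboard (8265) and Ray client server (10001), which a basic healthcheck ignores.
for port in 8265 10001; do
  (echo > "/dev/tcp/$RAY_HEAD/$port") >/dev/null 2>&1 \
    && echo "port $port reachable" || echo "port $port blocked"
done
```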
-
I've been attempting to connect a vLLM engine (as part of KubeAI) to a Ray cluster (deployed by KubeRay) and have not had much success. For some reason it is unable to generate the file node_ip_address.json. I can confirm that if I run `ray status` in the vLLM engine pod, I see exactly the same output as in the Ray cluster head pod, so vLLM is able to communicate with Ray. These are the logs from vLLM. Executing a health check from the vLLM engine pod returns an exit code of 0, which means the Ray cluster's health is allegedly OK.
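For reference, the checks described above look roughly like this; the head address is a placeholder, and the exact health-check invocation is an assumption on my part (Ray ships a `ray health-check` CLI command that KubeRay-style probes commonly use).

```bash
# Placeholder head address; substitute the KubeRay head service and GCS port actually used.
RAY_HEAD=raycluster-head-svc.default.svc.cluster.local:6379

# Should print the same node/resource summary as running `ray status` on the head pod.
ray status --address="$RAY_HEAD"

# An exit code of 0 here only confirms the GCS endpoint is reachable; it says nothing
# about whether vLLM can create its local session files (e.g. node_ip_address.json).
ray health-check --address="$RAY_HEAD"
echo "health check exit code: $?"
```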
Has anyone seen the same behaviour before but successfully connected vLLM to an external Ray cluster?
Engine Config:
Versions:
Platform:
Stack Trace: