Description
What happened + What you expected to happen
When the Ray cluster manager/autoscaler runs commands (frequently docker inspect) via ssh to inspect remote state, it can break when the remote shell's startup scripts emit output (e.g. MOTD or quota commands). In the case of docker inspect, Ray expects JSON and fails to parse it when other output is present.
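To illustrate the failure mode, here is a minimal sketch of how login-shell noise breaks a strict JSON parse, and how a more tolerant parser could skip leading noise. This is not Ray's actual parsing code: `extract_json` is a hypothetical helper, and the MOTD text and container fields in `noisy` are made up for the example. Note the heuristic would be fooled by noise that itself happens to contain valid JSON, such as a banner line starting with `[1]`.

```python
import json


def extract_json(output: str):
    """Return the first JSON array or object found in `output`, skipping leading noise."""
    decoder = json.JSONDecoder()
    for i, ch in enumerate(output):
        if ch in "[{":
            try:
                # raw_decode parses one JSON document starting at index i
                # and ignores whatever follows it.
                value, _end = decoder.raw_decode(output, i)
                return value
            except json.JSONDecodeError:
                # Not JSON (e.g. a "[warn]" prefix); keep scanning.
                continue
    raise ValueError("no JSON document found in output")


# Hypothetical stdout: MOTD and quota lines followed by `docker inspect`-style JSON.
noisy = (
    "Welcome to Ubuntu 24.04\n"
    "Disk quota: 12% used\n"
    '[{"Id": "abc123", "State": {"Running": true}}]\n'
)

try:
    json.loads(noisy)
except json.JSONDecodeError as exc:
    print("strict parse failed:", exc.msg)

print(extract_json(noisy)[0]["State"]["Running"])
```

Avoiding the noise at the source (non-login shells, no pseudo-TTY) is still the more robust fix; tolerant parsing only hides the symptom.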
Related issues:
- test_autoscaling_policy.py prints out huge pile of JsonErrors #13433
- [docker][on-premise clusters][local] work node can't start successfully #20711
- [Bug] Launch Ray cluster on GCP via macOS hit the issue of cannot change locale #22535
- https://discuss.ray.io/t/exception-when-ray-up/11756/6
In the case of the ssh docker cluster I created, it was possible to work around this by appending --use-normal-shells, possibly in conjunction with placing a .hushlogin file on the remotes.
$ ray up cluster.yaml --use-normal-shells
However, this workaround did not work for the internal logic used by the autoscaler (which also needs to ssh into remote workers to inspect their state and bring up ray workers).
Patching command_runner and subprocess_output_util from the cluster setup_commands produced a successful workaround.
setup_commands:
  # By default Ray uses login shells (bash --login -i) and pseudo-TTYs (ssh -tt)
  # when SSHing into nodes. This causes .bash_profile output and terminal
  # progress bars to pollute stdout, breaking Ray's JSON parsing of command
  # output. These patches switch the autoscaler to plain non-interactive shells.
  # Two flags must be patched together: use_login_shells (controls -tt and
  # bash --login) and _allow_interactive (controls whether stdin is a pipe or
  # inherited from the parent). If only use_login_shells is patched, the Popen
  # path tries to close p.stdin which is None, causing SSH readiness checks to
  # never succeed.
  - |
    python3 -c "
    import pathlib;
    import ray.autoscaler._private.command_runner as cr;
    import ray.autoscaler._private.subprocess_output_util as su;
    p = pathlib.Path(cr.__file__);
    p.write_text(p.read_text().replace('\"use_login_shells\": True', '\"use_login_shells\": False'));
    p = pathlib.Path(su.__file__);
    p.write_text(p.read_text().replace('_allow_interactive = True', '_allow_interactive = False'))"
AFAICT this brittle behaviour is caused by the use of bash --login. bash --noprofile might avoid this.
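One weakness of the inline patch above is that str.replace silently does nothing if the target string is absent, for example after a Ray upgrade changes the source. The following is a sketch of the same patch as a small script that reports a missing pattern instead. `patch_text` and `patch_ray` are hypothetical names; the module paths and replacement strings are the ones from the setup_commands snippet and are assumed to still match the installed Ray version.

```python
import pathlib


def patch_text(path: pathlib.Path, old: str, new: str) -> bool:
    """Replace `old` with `new` in the file; return True if the file ends up patched."""
    text = path.read_text()
    if old not in text:
        # Either already patched (new is present) or the pattern is missing.
        return new in text
    path.write_text(text.replace(old, new))
    return True


def patch_ray() -> None:
    # Imported here so patch_text can be used and tested without Ray installed.
    import ray.autoscaler._private.command_runner as cr
    import ray.autoscaler._private.subprocess_output_util as su

    patches = [
        (cr.__file__, '"use_login_shells": True', '"use_login_shells": False'),
        (su.__file__, "_allow_interactive = True", "_allow_interactive = False"),
    ]
    for filename, old, new in patches:
        if not patch_text(pathlib.Path(filename), old, new):
            raise SystemExit(f"pattern {old!r} not found in {filename}")
```

Calling `patch_ray()` from a setup command would then fail loudly when either flag cannot be found, rather than leaving the node half-patched.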
Versions / Dependencies
OS: Ubuntu 24.04
ray 2.54
Reproduction script
N/A: this would involve setting up a representative cluster.
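Short of a full cluster, a rough way to check whether a given shell invocation pollutes stdout is to run a no-op command through it and see if anything comes back. The sketch below is an assumption about how one might diagnose this, not part of Ray; `captured_stdout` and `is_clean` are hypothetical helpers, and the two demo commands use the local Python interpreter to stand in for a noisy and a quiet remote shell.

```python
import subprocess
import sys


def captured_stdout(cmd: list[str]) -> str:
    """Run `cmd` with stdin closed and return whatever it wrote to stdout."""
    result = subprocess.run(
        cmd,
        stdin=subprocess.DEVNULL,
        stdout=subprocess.PIPE,
        stderr=subprocess.DEVNULL,
        text=True,
        check=False,
    )
    return result.stdout


def is_clean(cmd: list[str]) -> bool:
    """True if the command produced no stdout at all."""
    return captured_stdout(cmd) == ""


# Stand-ins for a shell whose profile prints a banner and one that stays silent.
noisy_cmd = [sys.executable, "-c", "print('Welcome to worker-1')"]
quiet_cmd = [sys.executable, "-c", "pass"]

print(is_clean(noisy_cmd))  # False
print(is_clean(quiet_cmd))  # True
```

Against a real node, the same check could compare `["ssh", "user@worker", "bash --login -c true"]` with `["ssh", "user@worker", "bash --noprofile -c true"]`, where `user@worker` is a placeholder host. Any output from the first but not the second would point at the profile scripts as the source of the noise.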
Issue Severity
Medium: It is a significant difficulty but I can work around it.