
[Clusters] ssh inspection logic is brittle in the presence of profile script output #62045

@sjperkins

What happened + What you expected to happen

When the Ray cluster manager/autoscaler runs commands (frequently docker inspect) over SSH to inspect remote state, it can break when shell startup scripts on the remote host emit output (e.g. an MOTD or quota commands). In the case of docker inspect, Ray expects the command's stdout to be pure JSON, and parsing fails when any other output is mixed in.
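To illustrate the failure mode, here is a minimal sketch of how profile output breaks a strict JSON parse, along with one possible tolerant fallback. This is not Ray's actual parsing code: the function name parse_inspect_output and the sample output are hypothetical, chosen only for illustration.

```python
import json


def parse_inspect_output(raw: str):
    """Parse JSON from command output that may have banner text before it.

    Hypothetical helper: tries a strict parse first, then falls back to
    decoding from each '[' or '{' until one yields a valid JSON document.
    """
    try:
        return json.loads(raw)
    except json.JSONDecodeError:
        pass

    decoder = json.JSONDecoder()
    for index, char in enumerate(raw):
        if char in "[{":
            try:
                # raw_decode ignores any trailing text after the document.
                value, _end = decoder.raw_decode(raw, index)
                return value
            except json.JSONDecodeError:
                continue
    raise ValueError("no JSON document found in command output")


# Simulated stdout: MOTD/quota lines followed by docker-inspect-style JSON.
polluted = (
    "Welcome to node-1\n"
    "Disk quota: 42% used\n"
    '[{"Id": "abc123", "State": {"Running": true}}]\n'
)

try:
    json.loads(polluted)
    strict_parse_ok = True
except json.JSONDecodeError:
    strict_parse_ok = False

containers = parse_inspect_output(polluted)
```

A fallback like this is still a heuristic: banner text that itself happens to contain valid JSON (for example a line such as "[1]") would be picked up first, so suppressing the extra output at the shell level remains the more robust fix.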

Related issues:

For the SSH Docker cluster I created, it was possible to work around this by passing --use-normal-shells to ray up, possibly in conjunction with placing a .hushlogin file on the remote hosts.

$ ray up cluster.yaml --use-normal-shells

However, this workaround did not help with the autoscaler's internal logic, which also needs to SSH into remote workers to inspect their state and bring up Ray workers.

Patching command_runner and subprocess_output_util from the cluster's setup_commands produced a successful workaround:

setup_commands:
    # By default Ray uses login shells (bash --login -i) and pseudo-TTYs (ssh -tt)
    # when SSHing into nodes. This causes .bash_profile output and terminal
    # progress bars to pollute stdout, breaking Ray's JSON parsing of command
    # output. These patches switch the autoscaler to plain non-interactive shells.
    # Two flags must be patched together: use_login_shells (controls -tt and
    # bash --login) and _allow_interactive (controls whether stdin is a pipe or
    # inherited from the parent). If only use_login_shells is patched, the Popen
    # path tries to close p.stdin which is None, causing SSH readiness checks to
    # never succeed.
    - |
        python3 -c "
        import pathlib;
        import ray.autoscaler._private.command_runner as cr;
        import ray.autoscaler._private.subprocess_output_util as su;
        p = pathlib.Path(cr.__file__);
        p.write_text(p.read_text().replace('\"use_login_shells\": True', '\"use_login_shells\": False'));
        p = pathlib.Path(su.__file__);
        p.write_text(p.read_text().replace('_allow_interactive = True', '_allow_interactive = False'))"
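The comment in the setup_commands block above mentions that p.stdin can be None. This follows from standard subprocess behaviour: Popen only creates a stdin pipe object when stdin=subprocess.PIPE is passed. The sketch below demonstrates that with a trivial child process; close_stdin_safely is a hypothetical helper showing a defensive guard, not code taken from Ray.

```python
import subprocess
import sys

# A trivial child process that exits immediately.
cmd = [sys.executable, "-c", "pass"]

# stdin is inherited from the parent, so Popen creates no pipe object.
inherited = subprocess.Popen(cmd)
inherited.wait()
inherited_stdin = inherited.stdin  # None

# stdin is an explicit pipe, so Popen exposes a writable file object.
piped = subprocess.Popen(cmd, stdin=subprocess.PIPE)
piped_has_stdin = piped.stdin is not None
piped.stdin.close()
piped.wait()


def close_stdin_safely(proc: subprocess.Popen) -> None:
    """Close the child's stdin only if a pipe was actually created."""
    if proc.stdin is not None:
        proc.stdin.close()


# Safe in both cases, unlike an unconditional proc.stdin.close().
close_stdin_safely(inherited)
```

An unconditional inherited.stdin.close() would raise AttributeError here, which is consistent with the note that the two flags need to be patched together.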

AFAICT this brittle behaviour is caused by the use of bash --login. bash --noprofile might avoid this.
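As a rough sketch of the difference, the helper below builds the SSH argument list both ways. build_ssh_command and its parameters are hypothetical and only mirror the flags named in this issue (ssh -tt, bash --login -i, bash --noprofile); this is not how Ray constructs its commands.

```python
import shlex


def build_ssh_command(host, remote_cmd, login_shell=True, tty=True):
    """Return an ssh argv list that runs remote_cmd on host (hypothetical helper).

    login_shell=True mimics the default described in this issue: an
    interactive login shell, which sources profile scripts.
    login_shell=False asks bash to skip profile and rc files.
    """
    argv = ["ssh"]
    if tty:
        argv.append("-tt")  # force pseudo-TTY allocation
    argv.append(host)
    if login_shell:
        shell = "bash --login -i -c"
    else:
        shell = "bash --noprofile --norc -c"
    # Quote the command so the remote shell receives it as one argument.
    argv.append(f"{shell} {shlex.quote(remote_cmd)}")
    return argv


noisy = build_ssh_command("worker-1", "docker inspect ray_container")
quiet = build_ssh_command(
    "worker-1", "docker inspect ray_container", login_shell=False, tty=False
)
```

Whether the quiet variant is sufficient in practice would need testing on a representative cluster, since some setups rely on profile scripts to put docker or ray on PATH.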

Versions / Dependencies

OS: Ubuntu 24.04
ray 2.54

Reproduction script

N/A: reproducing this would involve setting up a representative cluster.

Issue Severity

Medium: It is a significant difficulty but I can work around it.

Labels: bug, community-backlog, core, stability, triage
