Ray job is skipped instead of running when I use colocate mode #1151

@whq-pku1

Description

ray job submit --address="http://127.0.0.1:8265" \

When I run the on-policy distillation script, I hit a very strange error. If I start a teacher sglang server first and then submit the job with `ray job submit`, the job does not actually run and appears to be skipped. Although its status is RUNNING, it produces no output.

Job 'raysubmit_insRv575QEYZUFgt' submitted successfully

Tailing logs until the job exits (disable with --no-wait):
2025-12-19 09:41:34,661	INFO job_manager.py:568 -- Runtime env is setting up.
Running entrypoint for job raysubmit_insRv575QEYZUFgt: python3 train.py --actor-num-nodes 1 --actor-num-gpus-per-node 8 --rollout-num-gpus 1 --colocate --swiglu --num-layers 28 --hidden-size 1024 --ffn-hidden-size 3072 --...
AMEM [INFO] amem_nccl.cpp:x_init:685 groupID:0 pid:314637 build:Nov 27 2025 09:15:06 NCCL plugin loaded. pause func:Off offload_free_tag:-1 cuMemEnabled:1
/usr/local/lib/python3.12/dist-packages/torch/cuda/__init__.py:63: FutureWarning: The pynvml package is deprecated. Please install nvidia-ml-py instead. If you did not install pynvml directly, please report this to the maintainers of the package that installed pynvml for you.
  import pynvml  # type: ignore[import]
Status for job 'raysubmit_insRv575QEYZUFgt': RUNNING
Status message: Job is currently running.

The GPUs are not used even after I wait for 1 hour. However, if I do not start the teacher sglang server, the submitted job runs and outputs logs. Does anyone know the reason?
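
One possible way to narrow this down, assuming (unconfirmed) that the job is stalled because the teacher sglang server already holds memory on GPUs the training job also needs: list which GPUs are occupied before submitting. The sketch below parses the output of the standard `nvidia-smi --query-gpu` CSV query. The helper names (`query_nvidia_smi`, `parse_gpu_memory`, `busy_gpus`) and the 1024 MiB threshold are illustrative choices, not part of Ray, sglang, or the training script.

```python
import subprocess

# nvidia-smi query that prints one "index, used MiB, total MiB" line per GPU.
NVIDIA_SMI_CMD = [
    "nvidia-smi",
    "--query-gpu=index,memory.used,memory.total",
    "--format=csv,noheader,nounits",
]


def query_nvidia_smi():
    """Run nvidia-smi and return its raw CSV output (requires an NVIDIA driver)."""
    result = subprocess.run(
        NVIDIA_SMI_CMD, capture_output=True, text=True, check=True
    )
    return result.stdout


def parse_gpu_memory(csv_text):
    """Turn the CSV text into a list of (index, used_mib, total_mib) tuples."""
    rows = []
    for line in csv_text.strip().splitlines():
        if not line.strip():
            continue
        index, used, total = (int(field.strip()) for field in line.split(","))
        rows.append((index, used, total))
    return rows


def busy_gpus(rows, threshold_mib=1024):
    """Return the indices of GPUs whose used memory exceeds the threshold.

    The threshold is a hypothetical cutoff for "something is already
    running here"; adjust it for your setup.
    """
    return [index for index, used, _total in rows if used > threshold_mib]
```

Calling `busy_gpus(parse_gpu_memory(query_nvidia_smi()))` on the node, once with the teacher server running and once without, would show whether the server occupies any of the GPUs the job expects. This only reports physical memory use; Ray's own view of its logical resources can be inspected separately with `ray status`.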
