Ray job is skipped instead of running when I use colocate mode

https://github.com/THUDM/slime/blob/b4399e8c42fae9d30de427ab56e4fbab9979bea6/examples/on_policy_distillation/run-qwen3-8B-opd.sh#L149

When I run the on-policy distillation script, I hit a very strange problem. If a teacher sglang server is already running when I submit the job with ray job submit, the job does not actually run and appears to be skipped. Although its status is RUNNING, it produces no further output.
```
Job 'raysubmit_insRv575QEYZUFgt' submitted successfully

Tailing logs until the job exits (disable with --no-wait):
2025-12-19 09:41:34,661	INFO job_manager.py:568 -- Runtime env is setting up.
Running entrypoint for job raysubmit_insRv575QEYZUFgt: python3 train.py --actor-num-nodes 1 --actor-num-gpus-per-node 8 --rollout-num-gpus 1 --colocate --swiglu --num-layers 28 --hidden-size 1024 --ffn-hidden-size 3072 --...
AMEM [INFO] amem_nccl.cpp:x_init:685 groupID:0 pid:314637 build:Nov 27 2025 09:15:06 NCCL plugin loaded. pause func:Off offload_free_tag:-1 cuMemEnabled:1
/usr/local/lib/python3.12/dist-packages/torch/cuda/__init__.py:63: FutureWarning: The pynvml package is deprecated. Please install nvidia-ml-py instead. If you did not install pynvml directly, please report this to the maintainers of the package that installed pynvml for you.
  import pynvml  # type: ignore[import]
Status for job 'raysubmit_insRv575QEYZUFgt': RUNNING
Status message: Job is currently running.
```
The GPUs remain unused even after I wait for 1 hour. However, if I do not start the teacher sglang server, the same submitted job runs and prints logs. Does anyone know the reason?
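
One thing that may help narrow this down is checking whether Ray still sees enough free GPUs for the job at the moment it hangs. The snippet below is only a diagnostic sketch, not a confirmed explanation of the hang: `gpu_shortfall` is a hypothetical helper, and the requested count of 8 is an assumption taken from `--actor-num-nodes 1 --actor-num-gpus-per-node 8 --colocate` in the entrypoint above. It relies on `ray.cluster_resources()` and `ray.available_resources()`, which report Ray's logical resources and do not reflect GPU memory held by processes started outside Ray, such as a separately launched teacher server.

```python
"""Diagnostic sketch: does the Ray cluster have enough free GPUs for the job?"""


def gpu_shortfall(requested: float, available: dict) -> float:
    """Return how many GPUs are missing for the request (0.0 if it fits).

    `available` is a resource dict such as the one returned by
    ray.available_resources(); a missing "GPU" key counts as zero GPUs.
    """
    free = float(available.get("GPU", 0.0))
    return max(0.0, float(requested) - free)


if __name__ == "__main__":
    # Imported here so the helper above can be used without Ray installed.
    import ray

    # Attach to the cluster that was already started with `ray start`.
    ray.init(address="auto")

    total = ray.cluster_resources()
    free = ray.available_resources()

    # Assumption: 1 node x 8 actor GPUs, with rollout colocated on the same GPUs.
    requested_gpus = 8

    print("GPUs registered with Ray:", total.get("GPU", 0.0))
    print("GPUs currently free in Ray:", free.get("GPU", 0.0))
    print("Shortfall for this job:", gpu_shortfall(requested_gpus, free))
```

Running this once with the teacher server up and once without it, before submitting the job, shows whether the two setups differ in what Ray has available. If the shortfall is zero in both cases, comparing per-GPU memory use in `nvidia-smi` between the two setups would be the next thing to look at.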

ray job skip instead of run when I use colocate mode #1151
