Open
Description
ray job submit --address="http://127.0.0.1:8265" \
When I run the on-policy distillation script, I hit a very strange error. If I start a teacher SGLang server first and then submit the job with ray job submit, the job does not actually run and appears to be skipped. Although its status is RUNNING, nothing is printed.
Job 'raysubmit_insRv575QEYZUFgt' submitted successfully
Tailing logs until the job exits (disable with --no-wait):
2025-12-19 09:41:34,661 INFO job_manager.py:568 -- Runtime env is setting up.
Running entrypoint for job raysubmit_insRv575QEYZUFgt: python3 train.py --actor-num-nodes 1 --actor-num-gpus-per-node 8 --rollout-num-gpus 1 --colocate --swiglu --num-layers 28 --hidden-size 1024 --ffn-hidden-size 3072 --...
AMEM [INFO] amem_nccl.cpp:x_init:685 groupID:0 pid:314637 build:Nov 27 2025 09:15:06 NCCL plugin loaded. pause func:Off offload_free_tag:-1 cuMemEnabled:1
/usr/local/lib/python3.12/dist-packages/torch/cuda/__init__.py:63: FutureWarning: The pynvml package is deprecated. Please install nvidia-ml-py instead. If you did not install pynvml directly, please report this to the maintainers of the package that installed pynvml for you.
import pynvml # type: ignore[import]
Status for job 'raysubmit_insRv575QEYZUFgt': RUNNING
Status message: Job is currently running.
The GPUs stay unused even after I wait for 1 hour. However, if I do not start the teacher SGLang server, the submitted job runs and prints logs. Does anyone know the reason?
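One thing that could be checked while the job sits in RUNNING with no output is whether the Ray cluster still has enough free GPUs for the job. The sketch below is only a diagnostic aid, not a confirmed cause: it assumes it is run on the head node of the same cluster, and it assumes the job needs 8 GPUs (taken from --actor-num-gpus-per-node 8 with --colocate). The helper names gpu_shortfall and report_cluster_gpus are made up for illustration.

```python
def gpu_shortfall(required_gpus, available):
    """Return how many GPUs are missing given Ray's available-resource dict."""
    free = float(available.get("GPU", 0.0))
    return max(0.0, float(required_gpus) - free)


def report_cluster_gpus(required_gpus=8):
    """Connect to the running Ray cluster and print its GPU accounting.

    required_gpus=8 is an assumption based on the training flags above.
    """
    import ray  # imported here so gpu_shortfall can be used without Ray installed

    ray.init(address="auto")  # attach to the already running cluster
    total = ray.cluster_resources()
    free = ray.available_resources()
    print("total GPUs:", total.get("GPU", 0.0))
    print("free GPUs: ", free.get("GPU", 0.0))
    missing = gpu_shortfall(required_gpus, free)
    if missing > 0:
        print(f"short by {missing} GPU(s); actors may stay pending")
    else:
        print("enough free GPUs according to Ray")


if __name__ == "__main__":
    report_cluster_gpus()
```

Note that Ray tracks GPUs logically, so a process started outside Ray (such as a standalone SGLang server) would not show up in these numbers; comparing them with nvidia-smi output may show whether the teacher server already holds memory on a GPU that the job also expects to use.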