Description
Hello, I am working with a setup that uses three V100-32GB GPUs.
I first started the PhOS daemon in one shell:
root@gpu2:~/scripts/build_scripts# pos_cli --start --target daemon
From the output, it appears that the cricket-rpc-server is started correctly on the GPU as expected:
Next, I ran the ResNet training script in another shell:
ResNet appears to run correctly (the training loop is deliberately stopped early by the code once it reaches 64 batch iterations). However, when I run pos_cli --dump --dir /root/ckpt --pid 41228 in a third shell while train.py is still running (41228 is the PID of python train.py), I get the following error:
Additionally, there is no output from the PhOS Daemon:
Could you please help me understand what might be causing this issue and how to resolve it? Any assistance would be greatly appreciated!
Thank you very much!
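As a sketch of the dump step described above: the PID can be looked up with pgrep and checked for liveness before the dump command is built, which rules out a stale or mistyped PID. The helper names find_train_pid and build_dump_cmd are hypothetical and not part of PhOS; only the pos_cli flags shown in the command above are used, and the script prints the command rather than running it.

```shell
#!/usr/bin/env bash
# Sketch only: find the training process and build the dump command from the issue.
# find_train_pid and build_dump_cmd are hypothetical helpers, not PhOS tooling.

# Print the PID of the oldest process whose full command line matches the pattern.
find_train_pid() {
    pgrep -o -f "$1"
}

# Print the dump command for a given checkpoint directory and PID.
build_dump_cmd() {
    printf 'pos_cli --dump --dir %s --pid %s\n' "$1" "$2"
}

# Assumption: training was launched as "python train.py".
pid="$(find_train_pid 'python train.py' || true)"

# kill -0 sends no signal; it only tests that the process still exists.
if [ -n "$pid" ] && kill -0 "$pid" 2>/dev/null; then
    build_dump_cmd /root/ckpt "$pid"
else
    echo "train.py is not running; nothing to dump" >&2
fi
```

Because training stops after 64 iterations, the liveness check matters: if train.py has already exited by the time the dump is issued, the PID no longer refers to a live process.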




