Failure with pos_cli --dump Command When Running ResNet Training #19

@LiuMicheal

Hello, I am working with a setup that uses three V100-32GB GPUs.

I first started the PhOS daemon in one shell:

root@gpu2:~/scripts/build_scripts# pos_cli --start --target daemon

[Image: screenshot of the daemon startup output]

From the output, it appears that cricket-rpc-server started correctly on the GPU, as expected:

[Image: screenshot showing cricket-rpc-server running on the GPU]

Next, I ran the ResNet training script in another shell:

[Image: screenshot of the train.py run]

ResNet appears to run correctly (the code deliberately stops training early once 64 batch iterations have completed). However, when I run pos_cli --dump --dir /root/ckpt --pid 41228 (41228 is the PID of python train.py) in a third shell while train.py is still running, I get the following error:

[Image: screenshot of the error output]

Additionally, the PhOS daemon prints no output when the dump is issued:

[Image: screenshot of the PhOS daemon shell]
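
Because train.py exits early after 64 iterations, one thing that can be ruled out before dumping is a stale PID. The following is a minimal sketch of such a check, not part of PhOS: process_exists and build_dump_cmd are hypothetical helper names, and the pos_cli flags are copied from the command in this report rather than taken from PhOS documentation.

```python
import os
import shlex


def process_exists(pid):
    """Return True if a process with this PID currently exists (POSIX only)."""
    try:
        os.kill(pid, 0)  # signal 0 only performs the existence/permission check
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # the process exists but is owned by another user
    return True


def build_dump_cmd(pid, ckpt_dir):
    """Build the dump command with the same flags used in this report."""
    return ["pos_cli", "--dump", "--dir", ckpt_dir, "--pid", str(pid)]


if __name__ == "__main__":
    pid = 41228  # PID of `python train.py` from this report
    cmd = build_dump_cmd(pid, "/root/ckpt")
    if process_exists(pid):
        print("target is alive, would run:", shlex.join(cmd))
    else:
        print("no process with PID", pid, "- training may already have exited")
```

If the check reports that the process is gone, the dump failure would be explained by timing rather than by the daemon; if the process is alive, the problem lies elsewhere.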

Could you please help me understand what might be causing this issue and how to resolve it? Any assistance would be greatly appreciated!

Thank you very much!
