trainer.fit() stuck and cannot interrupt kernel #5947
Unanswered
ifsheldon
asked this question in
Lightning Trainer API: Trainer, LightningModule, LightningDataModule
Replies: 1 comment 5 replies
-
You mention Jupyter Lab; did you run this in a cell?
-
Hi! I am moving from "old" PyTorch to pytorch-lightning, but when I tried some trivial training with existing models, I found that trainer.fit() gets stuck even before the GPUs start running.
By "stuck" I mean I waited for 5 minutes and nothing seemed to be running: I checked with `htop` and `nvidia-smi`, and the CPUs and GPUs were idle.
My code is just a one-pager, as below.
I used JupyterLab to run the code, and I requested 32 cores, 512 GB of memory and 4 V100 GPUs on a shared cluster. But while the trainer was stuck, none of the GPUs were running and no processes were shown in `nvidia-smi`. I also could not interrupt the kernel, so the only thing I could do was restart it.
I have read the tutorials, and the code looks fine to me, but I am not sure whether it is actually correct. Did I miss something?
Thank you!
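When a call hangs and the kernel cannot be interrupted, it helps to know where Python is blocked. Below is a minimal sketch using only the standard library's `faulthandler` module. The helper name `run_with_hang_dump`, its parameters, and the log file name are made up for illustration and are not part of pytorch-lightning. The dump goes to a file because notebook output streams may not expose a real file descriptor.

```python
import faulthandler


def run_with_hang_dump(fn, timeout_s=300.0, log_path="hang_traceback.log"):
    """Call fn(); if it is still running after timeout_s seconds, write the
    Python stack of every thread to log_path (repeated every timeout_s)."""
    # faulthandler writes straight to the file descriptor, so the dump is
    # on disk even if the kernel has to be restarted afterwards.
    with open(log_path, "w") as log_file:
        faulthandler.dump_traceback_later(timeout_s, repeat=True, file=log_file)
        try:
            return fn()
        finally:
            # Stop the watchdog once fn() returns or raises.
            faulthandler.cancel_dump_traceback_later()
```

Assuming a `trainer` and `model` as in the question, something like `run_with_hang_dump(lambda: trainer.fit(model), timeout_s=60)` would leave a stack trace in `hang_traceback.log` if the call is still blocked after a minute. This only shows Python-level frames, so it narrows down where the hang is rather than explaining it.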