Replies: 2 comments 1 reply
-
It seems that there are some problems with your hardware.

torch-scatter 2.1.0
torch-sparse 0.6.16
pyg-lib 0.1.0+pt111cu113
torch 1.11.0
torch-geometric 2.3.0 (/root/share/pytorch_geometric)
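You can double-check these versions directly on the affected machine with the minimal sketch below (the import names are the standard ones for the pip packages listed above; `getattr` is used in case a `pyg-lib` build does not expose a version string):

```python
# Minimal sketch: print the installed versions of the main packages and
# basic CUDA visibility on the affected machine.
import torch
import torch_geometric
import torch_scatter
import torch_sparse
import pyg_lib

print("torch:          ", torch.__version__)
print("torch-geometric:", torch_geometric.__version__)
print("torch-scatter:  ", torch_scatter.__version__)
print("torch-sparse:   ", torch_sparse.__version__)
print("pyg-lib:        ", getattr(pyg_lib, "__version__", "unknown"))
print("CUDA available: ", torch.cuda.is_available())
print("GPU count:      ", torch.cuda.device_count())
```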
-
Cross-checked with the company that assembled my workstation: on another set of GPUs (A4000 * 4) my code ran normally. On my own machine the code also ran normally with 2 GPUs (A500 * 2), but failed when using 3 GPUs. I also checked many similar cases on the NVIDIA Developer forum and talked with the company, and they concluded it seems to be a main-board power issue 😵 Sorry for discussing my own HW issue here; at first I thought my modified code and arguments were the cause. I'll close this discussion, thanks!
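(In case it helps anyone who lands here: a minimal sketch of one way to restrict a run to a subset of the installed GPUs for this kind of 2-vs-3 GPU comparison. The device indices below are placeholders, not necessarily the ones I used.)

```python
import os

# Expose only two of the installed GPUs to this process. This must be set
# before CUDA is initialized, i.e. before the first torch.cuda call.
os.environ["CUDA_VISIBLE_DEVICES"] = "0,1"  # placeholder indices

import torch

print(torch.cuda.device_count())  # should now report 2
```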
-
Hello, and also many thanks for the excellent work :)
I fixed some lines of distributed_sampling.py and tested it in my workstation environment. (My fixed code is here.)
My workstation environment and main packages are as below.
Similar to PyG's official example, I used the Reddit dataset, together with `SAGEConv`, `NeighborLoader`, and `DistributedDataParallel`. In each experiment I used a 3-layer `SAGEConv` model with `num_neighbors` set to `[15,10,5]`, `[25,15,5]`, or `[35,20,5]`, with `hidden_channels=128` and `batch_size=1024`.
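Roughly, my setup looks like the sketch below. This is only an approximation based on PyG's official multi-GPU example with my parameters filled in, not my exact modified script; the dataset path, port, learning rate, and number of epochs are placeholders.

```python
import os

import torch
import torch.distributed as dist
import torch.multiprocessing as mp
import torch.nn.functional as F
from torch.nn.parallel import DistributedDataParallel

from torch_geometric.datasets import Reddit
from torch_geometric.loader import NeighborLoader
from torch_geometric.nn import SAGEConv


class SAGE(torch.nn.Module):
    """3-layer GraphSAGE model with hidden_channels=128."""
    def __init__(self, in_channels, hidden_channels, out_channels):
        super().__init__()
        self.convs = torch.nn.ModuleList([
            SAGEConv(in_channels, hidden_channels),
            SAGEConv(hidden_channels, hidden_channels),
            SAGEConv(hidden_channels, out_channels),
        ])

    def forward(self, x, edge_index):
        for i, conv in enumerate(self.convs):
            x = conv(x, edge_index)
            if i < len(self.convs) - 1:
                x = F.relu(x)
        return x


def run(rank, world_size, dataset):
    os.environ['MASTER_ADDR'] = 'localhost'  # placeholder single-node setup
    os.environ['MASTER_PORT'] = '12355'      # placeholder port
    dist.init_process_group('nccl', rank=rank, world_size=world_size)

    data = dataset[0]

    # Split the training nodes across processes, one shard per GPU.
    train_idx = data.train_mask.nonzero(as_tuple=False).view(-1)
    train_idx = train_idx.split(train_idx.size(0) // world_size)[rank]

    loader = NeighborLoader(
        data, input_nodes=train_idx,
        num_neighbors=[15, 10, 5],  # also tried [25, 15, 5] and [35, 20, 5]
        batch_size=1024, shuffle=True)

    torch.cuda.set_device(rank)
    model = SAGE(dataset.num_features, 128, dataset.num_classes).to(rank)
    model = DistributedDataParallel(model, device_ids=[rank])
    optimizer = torch.optim.Adam(model.parameters(), lr=0.01)  # placeholder lr

    for epoch in range(1, 11):  # placeholder number of epochs
        model.train()
        for batch in loader:
            batch = batch.to(rank)
            optimizer.zero_grad()
            out = model(batch.x, batch.edge_index)[:batch.batch_size]
            loss = F.cross_entropy(out, batch.y[:batch.batch_size])
            loss.backward()
            optimizer.step()

    dist.destroy_process_group()


if __name__ == '__main__':
    dataset = Reddit('data/Reddit')  # placeholder path
    world_size = torch.cuda.device_count()
    mp.spawn(run, args=(world_size, dataset), nprocs=world_size, join=True)
```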
When I was running with `num_neighbors=[35,20,5]`, one of my GPUs broke, and the `nvidia-smi` command stopped working. After rebooting my workstation, the device seemed to have recovered, since `nvidia-smi` worked normally again. But when I ran the experiment with `num_neighbors=[15,10,5]`, the GPU broke again (before training started, not during training) and the same situation as above appeared. I ran `nvidia-bug-report.sh` and checked the error, which is shown below. After some web searching, I found it pointed to HW issues, from here and here.
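As a side note, since the failure now happens even before training starts, a very small sanity check can be run first to see whether each device responds at all (a sketch, not part of my actual script):

```python
import torch

# Allocate a small tensor and run a matmul on every visible GPU to check
# whether each device responds before launching the full distributed job.
print("CUDA available:", torch.cuda.is_available())
for i in range(torch.cuda.device_count()):
    props = torch.cuda.get_device_properties(i)
    x = torch.randn(1024, 1024, device=f"cuda:{i}")
    y = (x @ x).sum().item()
    print(f"cuda:{i}: {props.name}, "
          f"{props.total_memory / 1024**3:.1f} GiB total, matmul ok ({y:.2f})")
```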
I sent my workstation to the company that assembled it, and they said it looks like my code did not take the GPU specs into account. I think it is my workstation's main-board issue (especially a lack of power), but the company said there seemed to be no HW issues when they tested with `gpu-burn`.
So, my main question is: is this a hardware problem, or can my modified code and arguments cause this kind of GPU failure? For reference, with `num_neighbors=[30,20,10]`, `hidden_channels=256`, and `batch_size=1024`, the experiment exited normally.
Thanks a lot in advance for any help! :)