Hello, I wanted to set up training across 8 GPUs on a single machine. I followed the documentation details here and looked at the CIFAR example. Is there a core list of steps required to appropriately convert an Ignite training script to support this?

What I tried:

Doing all of this, I get the following TypeError. Not sure what's causing this, so any ideas are appreciated.
Hi @aksg87 , looks like you are using the `spawn` method to run distributed training, and the error is coming from the PyTorch DataLoader being unable to pickle the `SwigPyObject`. Make sure the DataLoader can pickle the objects it is given (a `SwigPyObject` cannot be pickled).

Also, please try the `launch` method to run distributed training; it is faster than the `spawn` method:

```bash
# spawn method
python train.py --args (training args)

# launch method
python -m torch.distributed.launch --nproc_per_node 8 --use_env train.py --backend nccl --args (training args)
```

If you use `launch`, calling the training loop will become:

```python
if __name__ == "__main__":
    with idist.Parallel(backend=backend) as parallel:  # no need for `nproc_per_node` as it is handled by `torch.distributed.launch`
        parallel.run(training, config)
```
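For comparison, here is a minimal sketch of what the `spawn` variant of the same entry point could look like. The `training` function and `config` dict below are placeholders standing in for the ones from the snippet above; with `spawn`, `nproc_per_node` is passed to `idist.Parallel` directly because there is no external launcher to provide it:

```python
import ignite.distributed as idist


def training(local_rank, config):
    # placeholder body; the real training loop from the snippet above goes here
    print(f"process {idist.get_rank()} / {idist.get_world_size()} started")


config = {}

if __name__ == "__main__":
    # spawn method: idist.Parallel creates the worker processes itself,
    # so the number of processes per node is given here instead of to a launcher
    with idist.Parallel(backend="nccl", nproc_per_node=8) as parallel:
        parallel.run(training, config)
```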
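On the pickling error itself: a `SwigPyObject` usually comes from a SWIG-wrapped handle (for example, an image reader) stored as an attribute of the dataset. A common workaround, shown here as a sketch with a hypothetical `open_reader` standing in for whatever library produces that handle, is to keep only picklable attributes on the dataset and open the handle lazily inside each worker process:

```python
import torch
from torch.utils.data import Dataset, DataLoader


def open_reader(path):
    # hypothetical stand-in for the SWIG-wrapped reader that produced the SwigPyObject;
    # it just returns the path here so the sketch runs end to end
    return path


class LazyReaderDataset(Dataset):
    """Keeps only picklable attributes; readers are opened lazily in each process."""

    def __init__(self, paths):
        self.paths = list(paths)  # plain strings pickle fine
        self._readers = {}        # filled on demand, never sent through pickle

    def __getstate__(self):
        # drop any open handles before the dataset is copied to spawned workers
        state = self.__dict__.copy()
        state["_readers"] = {}
        return state

    def __len__(self):
        return len(self.paths)

    def __getitem__(self, idx):
        path = self.paths[idx]
        if path not in self._readers:
            self._readers[path] = open_reader(path)
        # replace with real decoding through self._readers[path]
        return torch.zeros(3, 224, 224), 0


loader = DataLoader(LazyReaderDataset(["a.nii", "b.nii"]), batch_size=2, num_workers=2)
```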
@aksg87 does your baseline work now with the above?