Description
Hi, I have installed the code with Python 3.8, PyTorch 1.8.0, and CUDA 11, and the Debug Dataloader and Debug Model parts of debug_relationformer.ipynb run fine.
However, when I run train.py with "nohup python3 train.py --config configs/scene_2d.yaml --cuda_visible_device 0 1 2 --exp_name VGtest1 --nproc_per_node 3 --b 16 &> log/Muti.out &", I get the following error:
*** Config file
configs/scene_2d.yaml
Experiment Name : VGtest1
Batch size : 16
Running Distributed: True ; GPU: 0 ; RANK: 0
Number of parameters : 92944451
ERROR:ignite.engine.engine.RelationformerTrainer:Current run is terminating due to exception: Input type (torch.cuda.FloatTensor) and weight type (torch.FloatTensor) should be the same
ERROR:ignite.engine.engine.RelationformerTrainer:Current run is terminating due to exception: Input type (torch.cuda.FloatTensor) and weight type (torch.FloatTensor) should be the same
ERROR:ignite.engine.engine.RelationformerTrainer:Current run is terminating due to exception: Input type (torch.cuda.FloatTensor) and weight type (torch.FloatTensor) should be the same
ERROR:ignite.engine.engine.RelationformerTrainer:Engine run is terminating due to exception: Input type (torch.cuda.FloatTensor) and weight type (torch.FloatTensor) should be the same
ERROR:ignite.engine.engine.RelationformerTrainer:Engine run is terminating due to exception: Input type (torch.cuda.FloatTensor) and weight type (torch.FloatTensor) should be the same
ERROR:ignite.engine.engine.RelationformerTrainer:Engine run is terminating due to exception: Input type (torch.cuda.FloatTensor) and weight type (torch.FloatTensor) should be the same
Traceback (most recent call last):
File "train.py", line 292, in
parallel.run(main, args)
File "/data/anaconda3/envs/ymf_rel38/lib/python3.8/site-packages/ignite/distributed/launcher.py", line 275, in run
idist.spawn(self.backend, func, args=args, kwargs_dict=kwargs, **self._spawn_params)
File "/data/anaconda3/envs/ymf_rel38/lib/python3.8/site-packages/ignite/distributed/utils.py", line 323, in spawn
comp_model_cls.spawn(
File "/data/anaconda3/envs/ymf_rel38/lib/python3.8/site-packages/ignite/distributed/comp_models/native.py", line 304, in spawn
start_processes(
File "/data/anaconda3/envs/ymf_rel38/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 188, in start_processes
while not context.join():
File "/data/anaconda3/envs/ymf_rel38/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 150, in join
raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException:
-- Process 0 terminated with the following error:
Traceback (most recent call last):
File "/data/anaconda3/envs/ymf_rel38/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 59, in _wrap
fn(i, *args)
File "/data/anaconda3/envs/ymf_rel38/lib/python3.8/site-packages/ignite/distributed/comp_models/native.py", line 272, in _dist_worker_task_fn
fn(local_rank, *args, **kw_dict)
File "/home/ymf/dockerFile/relationformer/train.py", line 282, in main
trainer.run()
File "/data/anaconda3/envs/ymf_rel38/lib/python3.8/site-packages/monai/engines/trainer.py", line 56, in run
super().run()
File "/data/anaconda3/envs/ymf_rel38/lib/python3.8/site-packages/monai/engines/workflow.py", line 250, in run
super().run(data=self.data_loader, max_epochs=self.state.max_epochs)
File "/data/anaconda3/envs/ymf_rel38/lib/python3.8/site-packages/ignite/engine/engine.py", line 702, in run
return self._internal_run()
File "/data/anaconda3/envs/ymf_rel38/lib/python3.8/site-packages/ignite/engine/engine.py", line 775, in _internal_run
self._handle_exception(e)
File "/data/anaconda3/envs/ymf_rel38/lib/python3.8/site-packages/ignite/engine/engine.py", line 469, in _handle_exception
raise e
File "/data/anaconda3/envs/ymf_rel38/lib/python3.8/site-packages/ignite/engine/engine.py", line 745, in _internal_run
time_taken = self._run_once_on_dataset()
File "/data/anaconda3/envs/ymf_rel38/lib/python3.8/site-packages/ignite/engine/engine.py", line 850, in _run_once_on_dataset
self._handle_exception(e)
File "/data/anaconda3/envs/ymf_rel38/lib/python3.8/site-packages/ignite/engine/engine.py", line 469, in _handle_exception
raise e
File "/data/anaconda3/envs/ymf_rel38/lib/python3.8/site-packages/ignite/engine/engine.py", line 833, in _run_once_on_dataset
self.state.output = self._process_function(self, self.state.batch)
File "/home/ymf/dockerFile/relationformer/trainer.py", line 40, in _iteration
h, out = self.network(images)
File "/data/anaconda3/envs/ymf_rel38/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
result = self.forward(*input, **kwargs)
File "/data/anaconda3/envs/ymf_rel38/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 711, in forward
output = self.module(*inputs, **kwargs)
File "/data/anaconda3/envs/ymf_rel38/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
result = self.forward(*input, **kwargs)
File "/home/ymf/dockerFile/relationformer/models/relationformer_2D.py", line 108, in forward
features, pos = self.backbone(samples)
File "/data/anaconda3/envs/ymf_rel38/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
result = self.forward(*input, **kwargs)
File "/home/ymf/dockerFile/relationformer/models/deformable_detr_backbone.py", line 117, in forward
xs = self[0](tensor_list)
File "/data/anaconda3/envs/ymf_rel38/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
result = self.forward(*input, **kwargs)
File "/home/ymf/dockerFile/relationformer/models/deformable_detr_backbone.py", line 84, in forward
xs = self.body(tensor_list.tensors)
File "/data/anaconda3/envs/ymf_rel38/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
result = self.forward(*input, **kwargs)
File "/data/anaconda3/envs/ymf_rel38/lib/python3.8/site-packages/torchvision/models/_utils.py", line 63, in forward
x = module(x)
File "/data/anaconda3/envs/ymf_rel38/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
result = self.forward(*input, **kwargs)
File "/data/anaconda3/envs/ymf_rel38/lib/python3.8/site-packages/torch/nn/modules/conv.py", line 399, in forward
return self._conv_forward(input, self.weight, self.bias)
File "/data/anaconda3/envs/ymf_rel38/lib/python3.8/site-packages/torch/nn/modules/conv.py", line 395, in _conv_forward
return F.conv2d(input, weight, bias, self.stride,
RuntimeError: Input type (torch.cuda.FloatTensor) and weight type (torch.FloatTensor) should be the same
Do you have any clue about this error and how to fix it? Thanks!
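In case it helps pin this down, here is a hypothetical minimal sketch (not the repo's code) that reproduces the same message, together with the .to(device) call I would expect to run before the model is wrapped in DistributedDataParallel; the variable names (model, x, local_rank) are just placeholders for illustration:

import torch
import torch.nn as nn

device = torch.device("cuda:0")

model = nn.Conv2d(3, 8, kernel_size=3)         # weights stay on the CPU
x = torch.randn(1, 3, 64, 64, device=device)   # input lives on the GPU

try:
    model(x)
except RuntimeError as e:
    # Input type (torch.cuda.FloatTensor) and weight type (torch.FloatTensor) should be the same
    print(e)

# What I would expect to happen before wrapping in DDP
# (sketch only -- I am not sure where relationformer builds/moves the model):
model = model.to(device)                        # move the weights to the same GPU first
# model = nn.parallel.DistributedDataParallel(model, device_ids=[local_rank])
out = model(x)                                  # now both tensors are torch.cuda.FloatTensor
print(out.shape)

So my guess is that in the multi-GPU path the network is never moved to the local GPU before DDP wraps it, but I may be misreading the code.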