
Could you open-source the implementation code of train_one_epoch_ddp? Data parallel #15

@1759780295

When I implement data parallelism myself, I get the following error:
Traceback (most recent call last):
  File "/home/user01/hzz/MLIC++/train.py", line 158, in <module>
    main()
  File "/home/user01/hzz/MLIC++/train.py", line 119, in main
    current_step = train_one_epoch(
  File "/home/user01/hzz/MLIC++/utils/training.py", line 18, in train_one_epoch
    out_net = model(d)
  File "/home/user01/anaconda3/envs/tcm310/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/user01/anaconda3/envs/tcm310/lib/python3.10/site-packages/torch/nn/parallel/data_parallel.py", line 168, in forward
    outputs = self.parallel_apply(replicas, inputs, kwargs)
  File "/home/user01/anaconda3/envs/tcm310/lib/python3.10/site-packages/torch/nn/parallel/data_parallel.py", line 178, in parallel_apply
    return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
  File "/home/user01/anaconda3/envs/tcm310/lib/python3.10/site-packages/torch/nn/parallel/parallel_apply.py", line 86, in parallel_apply
    output.reraise()
  File "/home/user01/anaconda3/envs/tcm310/lib/python3.10/site-packages/torch/_utils.py", line 461, in reraise
    raise exception
StopIteration: Caught StopIteration in replica 0 on device 0.
Original Traceback (most recent call last):
  File "/home/user01/anaconda3/envs/tcm310/lib/python3.10/site-packages/torch/nn/parallel/parallel_apply.py", line 61, in _worker
    output = module(*input, **kwargs)
  File "/home/user01/anaconda3/envs/tcm310/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/user01/hzz/MLIC++/models/mlicpp.py", line 93, in forward
    self.update_resolutions(x.size(2) // 16, x.size(3) // 16)
  File "/home/user01/hzz/MLIC++/models/mlicpp.py", line 191, in update_resolutions
    self.local_context[i].update_resolution(H, W, next(self.parameters()).device, mask=None)
StopIteration
Could the author please open-source the train_one_epoch_ddp implementation?
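
For context, the traceback points at `next(self.parameters()).device` inside `update_resolutions`. A likely cause: since PyTorch 1.5, the per-GPU replicas created by `nn.DataParallel` no longer expose their weights through `.parameters()`, so the iterator is empty and `next(...)` raises `StopIteration`. One possible workaround, sketched below, is to take the device from the input tensor instead. `TinyModel` and `LocalContext` here are hypothetical stand-ins, not the actual MLIC++ classes, and the `device` argument added to `update_resolutions` is my own assumption about how the change could be wired in.

```python
import torch
import torch.nn as nn


class LocalContext(nn.Module):
    """Hypothetical stand-in for a submodule that caches a resolution-dependent mask."""

    def __init__(self):
        super().__init__()
        self.mask = None

    def update_resolution(self, H, W, device, mask=None):
        # Build the cached tensor on whatever device the caller passes in.
        self.mask = torch.ones(H, W, device=device) if mask is None else mask.to(device)


class TinyModel(nn.Module):
    """Hypothetical model that mirrors the call pattern seen in the traceback."""

    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(3, 8, kernel_size=3, padding=1)
        self.local_context = nn.ModuleList([LocalContext() for _ in range(2)])

    def update_resolutions(self, H, W, device):
        # Assumption: the device is passed in explicitly instead of being
        # looked up with next(self.parameters()).device.
        for ctx in self.local_context:
            ctx.update_resolution(H, W, device, mask=None)

    def forward(self, x):
        # x.device is valid inside a DataParallel replica, because the input
        # chunk has already been scattered to that replica's GPU.
        self.update_resolutions(x.size(2) // 16, x.size(3) // 16, x.device)
        return self.conv(x)


model = TinyModel()
# On a multi-GPU machine the model could be wrapped like this (not run here):
#   model = nn.DataParallel(model.cuda())
x = torch.randn(2, 3, 64, 64)
out = model(x)
```

This only addresses the `StopIteration` under `nn.DataParallel`. It is not the `train_one_epoch_ddp` asked for above, which would be built on `torch.nn.parallel.DistributedDataParallel` with one process per GPU.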
