dataloader error #8

@niutransWZY

Description

When I ran moby_main.py for training, host memory kept growing until the process crashed. What is the cause, and how can I fix it?

The error is:

```
Traceback (most recent call last):
  File "moby_main.py", line 236, in <module>
    main(config)
  File "moby_main.py", line 121, in main
    train_one_epoch(config, model, data_loader_train, optimizer, epoch, lr_scheduler)
  File "moby_main.py", line 151, in train_one_epoch
    scaled_loss.backward()
  File "/root/anaconda3/envs/transformer-ssl/lib/python3.7/site-packages/torch/tensor.py", line 221, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph)
  File "/root/anaconda3/envs/transformer-ssl/lib/python3.7/site-packages/torch/autograd/__init__.py", line 132, in backward
    allow_unreachable=True)  # allow_unreachable flag
  File "/root/anaconda3/envs/transformer-ssl/lib/python3.7/site-packages/torch/utils/data/_utils/signal_handling.py", line 66, in handler
    _error_if_any_worker_fails()
RuntimeError: DataLoader worker (pid 2605) is killed by signal: Killed.
```
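"Killed by signal: Killed" usually means the Linux OOM killer terminated a DataLoader worker after host RAM filled up. Beyond lowering `num_workers`, one common cause of steadily growing per-worker memory is storing dataset metadata (e.g. a long list of file paths) as plain Python objects: forked workers gradually fault in private copies of those pages because reference-count updates dirty the shared copy-on-write memory. A minimal, hypothetical sketch of one mitigation, packing the strings into a single immutable blob plus an offset table (`ImageListDataset` is an illustrative name, not from this repo):

```python
import array

class ImageListDataset:
    """Illustrative dataset that keeps its path list in compact,
    read-only storage so forked DataLoader workers can share it
    without copy-on-write memory growth."""

    def __init__(self, paths):
        # Concatenate all paths into one bytes blob...
        self._blob = b"".join(p.encode("utf-8") for p in paths)
        # ...and record where each path starts/ends ('q' = signed 64-bit).
        offsets = array.array("q", [0])
        for p in paths:
            offsets.append(offsets[-1] + len(p.encode("utf-8")))
        self._offsets = offsets

    def __len__(self):
        return len(self._offsets) - 1

    def __getitem__(self, idx):
        # Slice the blob instead of holding one Python str per sample.
        start, end = self._offsets[idx], self._offsets[idx + 1]
        return self._blob[start:end].decode("utf-8")

ds = ImageListDataset(["a.jpg", "bb.png", "ccc.jpeg"])
print(len(ds), ds[1])  # → 3 bb.png
```

In a real training script this class would also decode the image in `__getitem__`; the point of the sketch is only that the per-sample metadata lives in two flat buffers rather than thousands of individually refcounted Python objects.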
