OutOfMemoryError #71

@shih-huai

Hello, when I train the code on my single RTX 4090, I get an out-of-memory (OOM) error. Even with the batch size set to 2, the problem persists. Does anyone know how to solve it, or is there something I forgot to do?

Thanks to anyone who reads this issue. I can run Rectifiedflow successfully, but I can't get this code to work. It is frustrating, so I am opening this issue.

Traceback (most recent call last):
  File "./main.py", line 68, in <module>
    app.run(main)
  File "/home/user/anaconda3/envs/rectflow/lib/python3.8/site-packages/absl/app.py", line 308, in run
    _run_main(main, args)
  File "/home/user/anaconda3/envs/rectflow/lib/python3.8/site-packages/absl/app.py", line 254, in _run_main
    sys.exit(main(argv))
  File "./main.py", line 59, in main
    run_lib.train(FLAGS.config, FLAGS.workdir)
  File "/media/user/2tb/score_sde_pytorch/run_lib.py", line 131, in train
    loss = train_step_fn(state, batch)
  File "/media/user/2tb/score_sde_pytorch/losses.py", line 195, in step_fn
    loss = loss_fn(model, batch)
  File "/media/user/2tb/score_sde_pytorch/losses.py", line 118, in loss_fn
    score = model_fn(perturbed_data, labels)
  File "/media/user/2tb/score_sde_pytorch/models/utils.py", line 124, in model_fn
    return model(x, labels)
  File "/home/user/anaconda3/envs/rectflow/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/user/anaconda3/envs/rectflow/lib/python3.8/site-packages/torch/nn/parallel/data_parallel.py", line 169, in forward
    return self.module(*inputs[0], **kwargs[0])
  File "/home/user/anaconda3/envs/rectflow/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/media/user/2tb/score_sde_pytorch/models/ncsnpp.py", line 275, in forward
    h = modules[m_idx](hs[-1], temb)
  File "/home/user/anaconda3/envs/rectflow/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/media/user/2tb/score_sde_pytorch/models/layerspp.py", line 265, in forward
    h = self.Dropout_0(h)
  File "/home/user/anaconda3/envs/rectflow/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/user/anaconda3/envs/rectflow/lib/python3.8/site-packages/torch/nn/modules/dropout.py", line 59, in forward
    return F.dropout(input, self.p, self.training, self.inplace)
  File "/home/user/anaconda3/envs/rectflow/lib/python3.8/site-packages/torch/nn/functional.py", line 1252, in dropout
    return _VF.dropout_(input, p, training) if inplace else _VF.dropout(input, p, training)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 64.00 MiB (GPU 0; 23.64 GiB total capacity; 1.86 GiB already allocated; 62.00 MiB free; 1.87 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
2024-12-15 16:50:22.245628: W tensorflow/core/kernels/data/cache_dataset_ops.cc:856] The calling iterator did not fully read the dataset being cached. In order to avoid unexpected truncation of the dataset, the partially cached contents of the dataset  will be discarded. This can happen if you have an input pipeline similar to `dataset.cache().take(k).repeat()`. You should use `dataset.take(k).cache().repeat()` instead.
[ble: exit 1][ble: elapsed 3.615s (CPU 452.1%)] python ./main.py --config ./configs/ve/cifar10_ncsnpp.py --eval_folder eval --mode train --workdir ./logs
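The numbers in the error message are worth a closer look. The following is a minimal sketch, using only the Python standard library, that parses the figures quoted in the traceback above and works out how much GPU memory is neither reserved by PyTorch nor free. The helper names (`parse_oom`, `to_mib`) are hypothetical and exist only for this illustration.

```python
import re

# Numeric part of the error message, copied from the traceback above.
OOM_MESSAGE = (
    "CUDA out of memory. Tried to allocate 64.00 MiB "
    "(GPU 0; 23.64 GiB total capacity; 1.86 GiB already allocated; "
    "62.00 MiB free; 1.87 GiB reserved in total by PyTorch)"
)

UNIT_TO_MIB = {"MiB": 1.0, "GiB": 1024.0}


def to_mib(value, unit):
    """Convert a '<number> <unit>' pair from the message into MiB."""
    return float(value) * UNIT_TO_MIB[unit]


def parse_oom(message):
    """Extract the memory figures from a PyTorch CUDA OOM message (hypothetical helper)."""
    patterns = {
        "requested": r"Tried to allocate ([\d.]+) (MiB|GiB)",
        "total": r"([\d.]+) (MiB|GiB) total capacity",
        "allocated": r"([\d.]+) (MiB|GiB) already allocated",
        "free": r"([\d.]+) (MiB|GiB) free",
        "reserved": r"([\d.]+) (MiB|GiB) reserved",
    }
    return {
        name: to_mib(*re.search(pattern, message).groups())
        for name, pattern in patterns.items()
    }


stats = parse_oom(OOM_MESSAGE)

# Memory that is neither reserved by PyTorch nor reported as free.
held_elsewhere_mib = stats["total"] - stats["reserved"] - stats["free"]
print(f"Held outside PyTorch: {held_elsewhere_mib / 1024:.2f} GiB")
```

By this arithmetic, PyTorch has reserved under 2 GiB of the card, yet only 62 MiB is free, so most of the memory appears to be held by something other than the PyTorch allocator. That would explain why reducing the batch size makes no difference. One possible cause, which is an assumption and not confirmed anywhere in this issue, is the TensorFlow input pipeline that shows up in the log: TensorFlow reserves most of the GPU memory by default once it initializes a GPU. Running `nvidia-smi` while the job starts would show which processes hold the memory.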
