
OOM error on NVIDIA A100 #24

@iamysk


The smallest model (simplefold_100M) cannot generate even 5 samples at a time on an NVIDIA A100-SXM4-40GB:

simplefold --simplefold_model simplefold_100M  \
            --num_steps 500 --tau 0.01 --nsample_per_protein 5  \
            --plddt --fasta_path test.fasta --output_dir testout  \
            --backend torch

[OOM ERROR]

File "ml-simplefold/src/simplefold/model/torch/layers.py", line 142, in forward
    return self.w2(F.silu(self.w1(x)) * self.w3(x))
torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 36.00 MiB. GPU 0 has a total capacity of 39.39 GiB of which 30.31 MiB is free. Including non-PyTorch memory, this process has 39.36 GiB memory in use. Of the allocated memory 36.93 GiB is allocated by PyTorch, and 1.94 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

test.fasta contains only one protein, and inference works fine with nsample_per_protein=1.
It's interesting that the mlx backend can handle a higher nsample_per_protein on an M2 Pro machine (16 GB RAM). I can confirm that no other processes are running on the GPU.

Am I missing something obvious? Kindly share any suggestions.
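For what it's worth, the OOM message above suggests one thing to try before anything else: with 1.94 GiB reserved by PyTorch but unallocated, the allocator may be fragmenting, and the error text itself recommends `PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True`. A sketch of the re-run (unverified on this model; the guard just skips the command if `simplefold` is not on PATH):

```shell
# Suggested by the allocator's own OOM message: expandable segments can
# reduce fragmentation (here ~1.94 GiB was reserved but unallocated).
export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True

# Re-run the same command; guarded so the snippet degrades gracefully
# when simplefold is not installed.
if command -v simplefold >/dev/null 2>&1; then
    simplefold --simplefold_model simplefold_100M \
        --num_steps 500 --tau 0.01 --nsample_per_protein 5 \
        --plddt --fasta_path test.fasta --output_dir testout \
        --backend torch
fi
```

If fragmentation is not the cause, falling back to running the command five times with --nsample_per_protein 1 (which already works) would keep peak memory at the single-sample level.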

Metadata


Labels

question: Further information is requested
