
OOM error on NVIDIA A100 #24

@iamysk


The smallest model (simplefold_100M) cannot generate even 5 samples at a time on an NVIDIA A100-SXM4-40GB:

simplefold --simplefold_model simplefold_100M  \
            --num_steps 500 --tau 0.01 --nsample_per_protein 5  \
            --plddt --fasta_path test.fasta --output_dir testout  \
            --backend torch

[OOM ERROR]

File "ml-simplefold/src/simplefold/model/torch/layers.py", line 142, in forward
    return self.w2(F.silu(self.w1(x)) * self.w3(x))
torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 36.00 MiB. GPU 0 has a total capacity of 39.39 GiB of which 30.31 MiB is free. Including non-PyTorch memory, this process has 39.36 GiB memory in use. Of the allocated memory 36.93 GiB is allocated by PyTorch, and 1.94 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

test.fasta contains only one protein, and inference works fine with nsample_per_protein=1.
It's interesting that the mlx backend can handle a higher nsample_per_protein on an M2 Pro machine (16 GB RAM). I can confirm that no other processes are running on the GPU.

Am I missing something obvious? Kindly share any suggestions.
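For what it's worth, the OOM message above suggests one thing to try before anything else: with 1.94 GiB reserved by PyTorch but unallocated, the allocator may be fragmenting, and the error text itself recommends `PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True`. A sketch of the re-run (unverified on this model; the guard just skips the command if `simplefold` is not on PATH):

```shell
# Suggested by the allocator's own OOM message: expandable segments can
# reduce fragmentation (here ~1.94 GiB was reserved but unallocated).
export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True

# Re-run the same command; guarded so the snippet degrades gracefully
# when simplefold is not installed.
if command -v simplefold >/dev/null 2>&1; then
    simplefold --simplefold_model simplefold_100M \
        --num_steps 500 --tau 0.01 --nsample_per_protein 5 \
        --plddt --fasta_path test.fasta --output_dir testout \
        --backend torch
fi
```

If fragmentation is not the cause, falling back to running the command five times with --nsample_per_protein 1 (which already works) would keep peak memory at the single-sample level.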

Metadata


Labels

question: Further information is requested
