
[Bug] Memory Leak in Training CTC_1b arch #64

@AKA-Abdol

Description

Environment

  • OS: Ubuntu 24.04
  • Python: 3.12
  • PyTorch: 2.8.0
  • CUDA: 12.6
  • GPU: H100 SXM

Bug Description

I am trying to train the omni_CTC_1b model on about 1k hours of data in a single language: roughly 1100 hours of training data and 200 hours of dev data.
I planned 500_000 training steps, validating on the dev split every 25_000 steps.
I prepared the data using the data workflow provided in the repo, changing some parameters for higher data throughput and faster pre-processing.

The issue: when I use the cache for fragment loading, the disk space the model uses during training explodes. Each cache Arrow file created for fragment loading is around 100 MB, and the usage never shrinks; it only grows during training.
I also tried training without the fragment-loading cache, and the same thing happened with RAM: memory usage just kept increasing.
I expected the open file descriptors to be released after the dev-split validation at every 25_000 steps, but they are not.

I believe this is a memory leak during training. I can see the fairseq2 garbage-collector logs, but there is no reduction in RAM/disk (cache) usage during training; it only increases.
I changed my training YAML configuration to match the ctc_recommended YAML file, but the problem persists, and after about 90k steps the training script fails for lack of free memory (at around 290 GB of RAM used). I also tried moving the cache directory to a disk with around 100 GB free, and the problem still occurs.
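For reference, this is roughly how I track the growth over training (a minimal sketch; the cache path and logging interval below are placeholders, not my exact setup):

```python
# Sketch: log process RAM high-water mark and cache-directory size during
# training. Uses only the stdlib; assumes a Linux host (ru_maxrss is in KB).
import os
import resource


def rss_mb() -> float:
    """Peak resident set size of this process, in MB."""
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / 1024


def dir_size_mb(path: str) -> float:
    """Total size of all files under `path`, in MB."""
    total = 0
    for root, _, files in os.walk(path):
        for name in files:
            try:
                total += os.path.getsize(os.path.join(root, name))
            except OSError:
                pass  # a cache file may be deleted mid-walk
    return total / (1024 * 1024)


# Hypothetical usage inside the training loop (names are placeholders):
# if step % 25_000 == 0:
#     print(f"step={step} rss={rss_mb():.0f}MB cache={dir_size_mb(cache_dir):.0f}MB")
```

In my runs both numbers only ever go up between validation points.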

My last idea is that my pre-processed Parquet files are bigger than they should be: each one is around 500 MB. Could that be causing the problem?
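To check whether the shard sizes themselves are the outlier, I audit them like this (a quick stdlib-only sketch; the data directory path is a placeholder):

```python
# Sketch: list pre-processed Parquet shard sizes, largest first, to see
# whether ~500 MB is typical or an outlier across the dataset.
from pathlib import Path


def shard_sizes_mb(data_dir: str, pattern: str = "*.parquet") -> list[float]:
    """Sizes (MB) of all Parquet shards under `data_dir`, sorted descending."""
    sizes = [p.stat().st_size / (1024 * 1024) for p in Path(data_dir).rglob(pattern)]
    return sorted(sizes, reverse=True)


# Hypothetical usage (path is a placeholder):
# sizes = shard_sizes_mb("/path/to/preprocessed")
# print(f"n={len(sizes)} max={sizes[0]:.0f}MB median={sizes[len(sizes) // 2]:.0f}MB")
```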

This completely blocks my training progress and doesn't let me reach even 100k steps.

The error, when training with the fragment-loading cache enabled, is:
OSError: No space left on device

Severity

High: Blocks my work

Labels

bug (Something isn't working)