Description
Environment
- OS: Ubuntu 24.04
- Python: 3.12
- PyTorch: 2.8.0
- CUDA: 12.6
- GPU: H100 SXM
Bug Description
I am trying to train the omni_CTC_1b model on about 1k hours of data in a single language.
I have about 1100 hours of training data and 200 hours of dev data.
I have planned 500_000 training steps, validating the model on the dev split every 25_000 steps.
I prepared the data using the data workflow provided in the repo, changing some parameters for higher data throughput and faster pre-processing.
The issue is that when I enable the cache for fragment loading, the disk space the model uses during training explodes: every cache Arrow file created for fragment loading is around 100 MB, and the usage never shrinks at any point during training, it only grows.
I also tried training without the fragment-loading cache, and the same thing happened with RAM: memory usage just kept increasing.
I expected memory to be released (e.g. the open file descriptors closed) after the dev-split validation every 25_000 steps, but it is not.
This looks like a memory leak during training. I can see the fairseq2 garbage-collector logs, but RAM/disk (cache) usage never decreases; it only grows.
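To quantify the growth independently of the fairseq2 logs, I track peak RSS and the total size of the cache's Arrow files from inside the training loop. This is a minimal stdlib sketch; the `.arrow` extension and flat cache layout are assumptions based on what I see in my cache directory.

```python
import os
import resource


def peak_rss_kb() -> int:
    """Peak resident set size of this process (Linux reports it in KB)."""
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss


def cache_size_bytes(cache_dir: str) -> int:
    """Total size of all .arrow files under cache_dir (assumed cache layout)."""
    total = 0
    for root, _dirs, files in os.walk(cache_dir):
        for name in files:
            if name.endswith(".arrow"):
                total += os.path.getsize(os.path.join(root, name))
    return total


# Called every N steps in my training loop, e.g.:
# print(f"step {step}: peak RSS {peak_rss_kb()} KB, "
#       f"cache {cache_size_bytes('/path/to/cache') / 1e6:.0f} MB")
```

Logging these two numbers every validation interval is how I confirmed that neither ever goes down.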
I changed my training configuration YAML to match the ctc_recommended YAML, but the problem persists, and after about 90k steps the training script fails for lack of free memory (using about 290 GB of RAM). I also tried pointing the cache directory at a disk with around 100 GB free, and the problem still occurs.
My last idea is that my pre-processed parquet files are bigger than they should be (each is around 500 MB). Could that be causing the problem?
This completely blocks my training progress; I cannot even reach 100k steps.
The error looks like the following when training uses the fragment-loading cache:
OSError: No space left on device
Severity
High: Blocks my work