Training on Custom dataset fails #641
-
Describe the bug
Expected behavior
The config file was changed by following the instructions on GitHub.
Hardware and Software Configuration
Replies: 7 comments
-
Can you run again with …
-
Batch size is not the issue here then. If you look at this line https://github.com/openvinotoolkit/anomalib/blob/c1f51a6ccdb7cb26cd201a846f8049ac11b4e5cc/anomalib/models/padim/lightning_model.py#L78 you can see that it stores the embeddings for the entire training data with …
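For context, here is a minimal sketch of what that pattern looks like; the class and method names are simplified stand-ins based on the linked line, not the exact anomalib code. Each training step appends the batch's embeddings to a list kept on the CPU, so memory usage grows linearly with the size of the training set.

```python
import torch

# Simplified sketch of how PaDiM-style training accumulates embeddings
# (names are approximations of the linked lightning_model.py code).
class PadimMemorySketch:
    def __init__(self) -> None:
        self.embeddings: list[torch.Tensor] = []

    def training_step(self, batch_embedding: torch.Tensor) -> None:
        # Every batch's embedding tensor is kept in CPU memory for the
        # whole epoch, so the footprint grows with the dataset size.
        self.embeddings.append(batch_embedding.cpu())

    def on_fit_end(self) -> torch.Tensor:
        # All stored embeddings are concatenated before the Gaussian
        # parameters are estimated, which is where RAM can run out.
        return torch.vstack(self.embeddings)
```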
-
@pmudgal-Intel As you know, unlike traditional deep learning model training, training with the Anomalib PaDiM model happens in just one epoch (for PaDiM it is really fitting a latent space rather than training). It will try to fit your whole dataset into memory, so if your dataset grows beyond the total RAM you have, the training crashes.
Hope this helps.
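To make that concrete, here is a rough back-of-envelope estimate of the memory needed for the stored embeddings. The image count, patch grid, and embedding dimension below are illustrative assumptions, not anomalib defaults; they depend on the backbone, input resolution, and number of selected feature channels.

```python
# Rough back-of-envelope for why PaDiM training can exhaust RAM.
# All numbers below are illustrative assumptions.
num_images = 5000            # size of the custom training set
patches_per_image = 64 * 64  # spatial positions in the embedding map (assumed)
embedding_dim = 100          # selected feature channels (assumed)
bytes_per_float = 4          # float32

total_bytes = num_images * patches_per_image * embedding_dim * bytes_per_float
print(f"~{total_bytes / 1024**3:.1f} GiB just for the stored embeddings")
# ~7.6 GiB for these numbers; doubling the dataset doubles the requirement.
```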
-
Thanks for the info; it is good to know that. Do the other models available in Anomalib work differently? That is, can they run for an arbitrary number of epochs? (PatchCore, CFlow, GANomaly, ...)
-
Hi, it is similar in my case. GANomaly, Reverse Distillation, FastFlow, and STFPM work best for me with that large a number of images in a dataset, and they can run with the default number of epochs given in the config file. I also modified the default resizing option, from 256 down to 128 and as low as 64, which may hurt performance in the end.
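For reference, one way to apply those changes is to edit the model's config programmatically with OmegaConf. Treat this as a sketch: the file path and key names follow typical anomalib 0.x configs and may differ in your version.

```python
from omegaconf import OmegaConf

# Hypothetical example: lower the input resolution (and, for models that
# train over multiple epochs, set the epoch count) in an anomalib config.
cfg = OmegaConf.load("anomalib/models/stfpm/config.yaml")  # assumed path
cfg.dataset.image_size = 128   # or 64; smaller inputs cut memory but may hurt accuracy
cfg.trainer.max_epochs = 100   # ignored by single-epoch models such as PaDiM
OmegaConf.save(cfg, "stfpm_custom.yaml")
```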
-
Training on the custom dataset works after trimming the dataset.
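For anyone else hitting the same limit, a hypothetical helper like the one below can do the trimming by copying a random subset of the normal training images into a smaller folder that fits in RAM. The folder layout and file extension are assumptions; adapt them to your own dataset structure.

```python
import random
import shutil
from pathlib import Path

def trim_dataset(src: str, dst: str, keep: int, seed: int = 0) -> None:
    """Copy a random subset of `keep` images from src to dst."""
    images = sorted(Path(src).glob("*.png"))  # extension is an assumption
    random.seed(seed)
    subset = random.sample(images, min(keep, len(images)))
    Path(dst).mkdir(parents=True, exist_ok=True)
    for image in subset:
        shutil.copy2(image, Path(dst) / image.name)

# Example usage with an assumed folder layout:
trim_dataset("datasets/custom/train/good", "datasets/custom_trimmed/train/good", keep=1000)
```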