PADIM cannot fit large datasets into its memory bank. DefaultCPUAllocator: can't allocate memory #1123

yahav6893 · 2023-05-15T13:52:13Z

Describe the bug

Hi,
when trying to train on a custom dataset, around halfway through the epoch I get this CPU memory error:

RuntimeError: [enforce fail at alloc_cpu.cpp:75] err == 0. DefaultCPUAllocator: can't allocate memory: you tried to allocate 35896320000 bytes. Error code 12 (Cannot allocate memory)

Decreasing the batch size only delays the error: memory allocation keeps growing throughout the training epoch and is never released.
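The size in the error message can be factored to see why batch size does not matter. The sketch below is only one decomposition that happens to fit the number exactly. It assumes float32 embeddings, 100 feature channels kept for resnet18, and a 120x120 feature map (a 480-pixel input downsampled 4x). None of these values are stated in this post, so treat them as assumptions.

```python
# Factor the failed allocation reported in the RuntimeError.
# Assumptions (not confirmed by the post): float32 embeddings,
# 100 channels kept for resnet18, 120x120 feature map (480 // 4).
REQUESTED_BYTES = 35_896_320_000  # taken from the error message

BYTES_PER_FLOAT32 = 4
CHANNELS = 100          # assumed
FEATURE_MAP_SIDE = 120  # assumed
TRAIN_BATCH_SIZE = 16   # from the datamodule initialization

bytes_per_image = CHANNELS * FEATURE_MAP_SIDE**2 * BYTES_PER_FLOAT32
num_images, remainder = divmod(REQUESTED_BYTES, bytes_per_image)

# How many batches of 16 would produce that many images?
full_batches, last_batch = divmod(num_images, TRAIN_BATCH_SIZE)

print(f"{bytes_per_image / 1e6:.2f} MB of embeddings per image")
print(f"{num_images} images (remainder {remainder} bytes)")
print(f"{full_batches} full batches + {last_batch} images in the last batch")
print(f"{REQUESTED_BYTES / 2**30:.1f} GiB requested in one allocation")
```

Under these assumptions the total depends only on the number of training images, so a smaller batch size changes how quickly the memory fills, not how much is needed. The request is also made while the per-batch embeddings are still held in a list, so peak usage during the stacking step is likely close to twice the reported size.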

Now I don't want to decrease the dataset size. If anything I want to increase it.

I would appreciate your help.

Thanks in advance,
Yahav

It's worth noting that I create a custom dataset at runtime and pass it to the Trainer.

This is the datamodule initialization:

folder_datamodule = Folder(
    root=folder_dataset_root,
    normal_dir="anomalib_ds/clean",
    abnormal_dir="anomalib_ds/obs",
    mask_dir=folder_dataset_root + "/anomalib_ds/mask",
    task="segmentation",
    image_size=IMG_SIZE,
    normalization=InputNormalizationMethod.NONE,  # don't apply normalization, as we want to visualize the images
    eval_batch_size=16,
    train_batch_size=16,
)

Dataset

Folder

Model

PADiM

Steps to reproduce the behavior

Run training on a sufficiently large custom dataset.

OS information

OS: Ubuntu 18.04.6
Python version: 3.8.16
Anomalib version: 1.0.0
CUDA/cuDNN version: 11.7
GPU models and configuration: GeForce RTX 3090
Any other relevant information: I'm using a custom dataset

Expected behavior

Memory consumption rises at the beginning of every step and decreases at the end of it.

Screenshots

No response

Pip/GitHub

GitHub

What version/branch did you use?

No response

Configuration YAML

dataset:
  name: mvtec
  format: mvtec
  path: ./datasets/MVTec
  category: bottle
  task: segmentation
  train_batch_size: 4
  eval_batch_size: 4
  num_workers: 8
  image_size: 480 # dimensions to which images are resized (mandatory)
  center_crop: null # dimensions to which images are center-cropped after resizing (optional)
  normalization: imagenet # data distribution to which the images will be normalized: [none, imagenet]
  transform_config:
    train: null
    eval: null
  test_split_mode: from_dir # options: [from_dir, synthetic]
  test_split_ratio: 0.2 # fraction of train images held out testing (usage depends on test_split_mode)
  val_split_mode: same_as_test # options: [same_as_test, from_test, synthetic]
  val_split_ratio: 0.5 # fraction of train/test images held out for validation (usage depends on val_split_mode)
  tiling:
    apply: false
    tile_size: null
    stride: null
    remove_border_count: 0
    use_random_tiling: False
    random_tile_count: 16

model:
  name: padim
  backbone: resnet18
  pre_trained: true
  layers:
    - layer1
    - layer2
    - layer3
  normalization_method: min_max # options: [none, min_max, cdf]

metrics:
  image:
    - F1Score
  pixel:
    - F1Score
  threshold:
    method: adaptive #options: [adaptive, manual]
    manual_image: null
    manual_pixel: null

visualization:
  show_images: False # show images on the screen
  save_images: True # save images to the file system
  log_images: True # log images to the available loggers (if any)
  image_save_path: 'some/path/to/data' # path to which images will be saved
  mode: full # options: ["full", "simple"]

project:
  seed: 42
  path: ./results
  unique_dir: true

logging:
  logger: [tensorboard] # options: [comet, tensorboard, wandb, csv] or combinations.
  log_graph: false # Logs the model graph to respective logger.

optimization:
  export_mode: torch # options: torch, onnx, openvino

# PL Trainer Args. Don't add extra parameter here.
trainer:
  enable_checkpointing: true
  default_root_dir: null
  gradient_clip_val: 0
  gradient_clip_algorithm: norm
  num_nodes: 1
  devices: 1
  enable_progress_bar: true
  overfit_batches: 0.0
  track_grad_norm: -1
  check_val_every_n_epoch: 1 # Don't validate before extracting features.
  fast_dev_run: false
  accumulate_grad_batches: 1
  max_epochs: 2
  min_epochs: null
  max_steps: -1
  min_steps: null
  max_time: null
  limit_train_batches: 1.0
  limit_val_batches: 1.0
  limit_test_batches: 1.0
  limit_predict_batches: 1.0
  val_check_interval: 1.0 # Don't validate before extracting features.
  log_every_n_steps: 50
  accelerator: "gpu" # <"cpu", "gpu", "tpu", "ipu", "hpu", "auto">
  strategy: null
  sync_batchnorm: false
  precision: 32
  enable_model_summary: true
  num_sanity_val_steps: 0
  profiler: null
  benchmark: false
  deterministic: false
  reload_dataloaders_every_n_epochs: 0
  auto_lr_find: false
  replace_sampler_ddp: true
  detect_anomaly: false
  auto_scale_batch_size: false
  plugins: null
  move_metrics_to_cpu: false
  multiple_trainloader_mode: max_size_cycle

Logs

create ds from folder...
setup folder ds...
getting model and callbacks...
/home/niart/anaconda3/envs/anomalib_env/lib/python3.8/site-packages/torchmetrics/utilities/prints.py:36: UserWarning: Metric `PrecisionRecallCurve` will save all targets and predictions in buffer. For large datasets this may lead to large memory footprint.
  warnings.warn(*args, **kwargs)
FeatureExtractor is deprecated. Use TimmFeatureExtractor instead. Both FeatureExtractor and TimmFeatureExtractor will be removed in a future release.
create Trainer...
fitting Trainer...
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
/home/niart/anaconda3/envs/anomalib_env/lib/python3.8/site-packages/pytorch_lightning/trainer/connectors/logger_connector/logger_connector.py:67: UserWarning: Starting from v1.9.0, `tensorboardX` has been removed as a dependency of the `pytorch_lightning` package, due to potential conflicts with other packages in the ML ecosystem. For this reason, `logger=True` will use `CSVLogger` as the default logger, unless the `tensorboard` or `tensorboardX` packages are found. Please `pip install lightning[extra]` or one of them to enable TensorBoard support by default
  warning_cache.warn(
`Trainer(limit_train_batches=1.0)` was configured so 100% of the batches per epoch will be used..
`Trainer(limit_val_batches=1.0)` was configured so 100% of the batches will be used..
`Trainer(limit_test_batches=1.0)` was configured so 100% of the batches will be used..
`Trainer(limit_predict_batches=1.0)` was configured so 100% of the batches will be used..
`Trainer(val_check_interval=1.0)` was configured so validation will run at the end of the training epoch..
You are using a CUDA device ('NVIDIA GeForce RTX 3090') that has Tensor Cores. To properly utilize them, you should set `torch.set_float32_matmul_precision('medium' | 'high')` which will trade-off precision for performance. For more details, read https://pytorch.org/docs/stable/generated/torch.set_float32_matmul_precision.html#torch.set_float32_matmul_precision
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0,1]
/home/niart/anaconda3/envs/anomalib_env/lib/python3.8/site-packages/pytorch_lightning/core/optimizer.py:183: UserWarning: `LightningModule.configure_optimizers` returned `None`, this fit will run with no optimizer
  rank_zero_warn(

  | Name                  | Type                     | Params
-------------------------------------------------------------------
0 | image_threshold       | AnomalyScoreThreshold    | 0     
1 | pixel_threshold       | AnomalyScoreThreshold    | 0     
2 | model                 | PadimModel               | 2.8 M 
3 | image_metrics         | AnomalibMetricCollection | 0     
4 | pixel_metrics         | AnomalibMetricCollection | 0     
5 | normalization_metrics | MinMax                   | 0     
-------------------------------------------------------------------
2.8 M     Trainable params
0         Non-trainable params
2.8 M     Total params
11.131    Total estimated model params size (MB)
Epoch 0:   0%|          | 1/683 [00:01<15:43,  1.38s/it, loss=nan, v_num=0]/home/niart/anaconda3/envs/anomalib_env/lib/python3.8/site-packages/pytorch_lightning/loops/optimization/optimizer_loop.py:138: UserWarning: `training_step` returned `None`. If this was on purpose, ignore this warning...
  self.warning_cache.warn("`training_step` returned `None`. If this was on purpose, ignore this warning...")
Epoch 0:  57%|█████▋    | 390/683 [00:32<00:24, 11.97it/s, loss=nan, v_num=0]
Validation: 0it [00:00, ?it/s]Traceback (most recent call last):
  File "/home/niart/Documents/yahav/anomalib/run_anomalib.py", line 113, in <module>
    trainer.fit(model=model, datamodule=folder_datamodule)
  File "/home/niart/anaconda3/envs/anomalib_env/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 608, in fit
    call._call_and_handle_interrupt(
  File "/home/niart/anaconda3/envs/anomalib_env/lib/python3.8/site-packages/pytorch_lightning/trainer/call.py", line 38, in _call_and_handle_interrupt
    return trainer_fn(*args, **kwargs)
  File "/home/niart/anaconda3/envs/anomalib_env/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 650, in _fit_impl
    self._run(model, ckpt_path=self.ckpt_path)
  File "/home/niart/anaconda3/envs/anomalib_env/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1112, in _run
    results = self._run_stage()
  File "/home/niart/anaconda3/envs/anomalib_env/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1191, in _run_stage
    self._run_train()
  File "/home/niart/anaconda3/envs/anomalib_env/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1214, in _run_train
    self.fit_loop.run()
  File "/home/niart/anaconda3/envs/anomalib_env/lib/python3.8/site-packages/pytorch_lightning/loops/loop.py", line 199, in run
    self.advance(*args, **kwargs)
  File "/home/niart/anaconda3/envs/anomalib_env/lib/python3.8/site-packages/pytorch_lightning/loops/fit_loop.py", line 267, in advance
    self._outputs = self.epoch_loop.run(self._data_fetcher)
  File "/home/niart/anaconda3/envs/anomalib_env/lib/python3.8/site-packages/pytorch_lightning/loops/loop.py", line 200, in run
    self.on_advance_end()
  File "/home/niart/anaconda3/envs/anomalib_env/lib/python3.8/site-packages/pytorch_lightning/loops/epoch/training_epoch_loop.py", line 250, in on_advance_end
    self._run_validation()
  File "/home/niart/anaconda3/envs/anomalib_env/lib/python3.8/site-packages/pytorch_lightning/loops/epoch/training_epoch_loop.py", line 308, in _run_validation
    self.val_loop.run()
  File "/home/niart/anaconda3/envs/anomalib_env/lib/python3.8/site-packages/pytorch_lightning/loops/loop.py", line 194, in run
    self.on_run_start(*args, **kwargs)
  File "/home/niart/anaconda3/envs/anomalib_env/lib/python3.8/site-packages/pytorch_lightning/loops/dataloader/evaluation_loop.py", line 132, in on_run_start
    self._on_evaluation_start()
  File "/home/niart/anaconda3/envs/anomalib_env/lib/python3.8/site-packages/pytorch_lightning/loops/dataloader/evaluation_loop.py", line 246, in _on_evaluation_start
    self.trainer._call_lightning_module_hook(hook_name, *args, **kwargs)
  File "/home/niart/anaconda3/envs/anomalib_env/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1356, in _call_lightning_module_hook
    output = fn(*args, **kwargs)
  File "/home/niart/Documents/yahav/anomalib/src/anomalib/models/padim/lightning_model.py", line 92, in on_validation_start
    embeddings = torch.vstack(self.embeddings)
RuntimeError: [enforce fail at alloc_cpu.cpp:75] err == 0. DefaultCPUAllocator: can't allocate memory: you tried to allocate 35896320000 bytes. Error code 12 (Cannot allocate memory)
Epoch 0:  57%|█████▋    | 390/683 [00:33<00:24, 11.76it/s, loss=nan, v_num=0]

                              
Process finished with exit code 1

Code of Conduct

I agree to follow this project's Code of Conduct

samet-akcay · 2023-06-09T19:50:15Z

samet-akcay
Jun 9, 2023
Maintainer

@yahav6893, have you tried other methods? My guess is that memory-based algorithms such as PaDiM and PatchCore might fail on a large dataset. These models try to fit the features of the entire training set into a memory bank, which for large datasets grows beyond what CPU memory can hold.
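For context, a Gaussian's mean and covariance can in general be updated one sample at a time, without keeping every sample. The sketch below shows that idea for a single patch position in plain Python. It is not how anomalib implements PaDiM (the traceback shows all embeddings being stacked), and `RunningGaussian` is a hypothetical name used only for illustration.

```python
from typing import List, Sequence


class RunningGaussian:
    """Running mean and covariance for one patch position (Welford-style)."""

    def __init__(self, dim: int) -> None:
        self.n = 0
        self.mean = [0.0] * dim
        # Accumulated outer products of deviations; covariance = m2 / (n - 1).
        self.m2 = [[0.0] * dim for _ in range(dim)]

    def update(self, x: Sequence[float]) -> None:
        """Fold one embedding vector into the statistics, then discard it."""
        self.n += 1
        delta_old = [xi - mi for xi, mi in zip(x, self.mean)]
        self.mean = [mi + di / self.n for mi, di in zip(self.mean, delta_old)]
        delta_new = [xi - mi for xi, mi in zip(x, self.mean)]
        for i, di in enumerate(delta_old):
            row = self.m2[i]
            for j, dj in enumerate(delta_new):
                row[j] += di * dj

    def covariance(self) -> List[List[float]]:
        """Return the unbiased sample covariance matrix."""
        if self.n < 2:
            raise ValueError("need at least two samples")
        return [[v / (self.n - 1) for v in row] for row in self.m2]


# Toy data standing in for embeddings of one patch position.
samples = [[1.0, 2.0], [2.0, 4.5], [3.0, 5.5], [4.0, 8.0]]
stats = RunningGaussian(dim=2)
for sample in samples:
    stats.update(sample)

cov = stats.covariance()
print(stats.mean, cov)
```

With this kind of update, memory scales with the feature dimension and the number of patch positions rather than with the number of training images. Whether it can be fitted into the existing training loop is a separate question that this sketch does not answer.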


samet-akcay · 2023-06-09T19:51:40Z

samet-akcay
Jun 9, 2023
Maintainer

Since this is probably an algorithm-specific issue, there is unfortunately nothing anomalib can do. I will therefore convert this to a Q&A in Discussions. Feel free to convert it back to an issue if you see similar behaviour with other algorithms that are not memory-bank-based.
