EfficientAD: RuntimeError: quantile() input tensor is too large. #1189

Palettenbrett · 2023-07-18T13:27:32Z

Hey,

I encountered an error while trying to train EfficientAD on my dataset.

**The error is as follows:**
**RuntimeError: quantile() input tensor is too large.**

The script appears to crash right after it finishes calculating the validation dataset quantiles.
The dataset contains 1000 normal images for training and 250 images each for normal and abnormal testing.
If I reduce the number of training images to 200 and the test images to 100 each, this doesn't happen.
Is this a VRAM issue? It seems EfficientAD uses all my VRAM when calculating the teacher channel mean.

PC Specs:
GTX 1080 8gb
i7 8700k
32gb DDR4 RAM
OS: Windows 10 Build 19045

I trained only for one epoch to check if the model works.
Here is the complete error log:

Training: 0it [00:00, ?it/s]2023-07-18 14:58:10,131 - anomalib.models.efficientad.lightning_model - INFO - Calculate teacher channel mean and std
Calculate teacher channel mean: 100%|██████████████████████████████████████████████| 1000/1000 [01:14<00:00, 13.51it/s]
Calculate teacher channel std: 100%|████████████████████████████████████████████| 1000/1000 [00:00<00:00, 11235.60it/s]
Epoch 0:   0%|                                                                                | 0/1032 [00:00<?, ?it/s]C:\Users\PaulR\.conda\envs\anomalib_env\lib\site-packages\pytorch_lightning\core\module.py:493: UserWarning: You called `self.log('train_st', ..., logger=True)` but have no logger configured. You can enable one by doing `Trainer(logger=ALogger(...))`
  rank_zero_warn(
C:\Users\PaulR\.conda\envs\anomalib_env\lib\site-packages\pytorch_lightning\core\module.py:493: UserWarning: You called `self.log('train_ae', ..., logger=True)` but have no logger configured. You can enable one by doing `Trainer(logger=ALogger(...))`
  rank_zero_warn(
C:\Users\PaulR\.conda\envs\anomalib_env\lib\site-packages\pytorch_lightning\core\module.py:493: UserWarning: You called `self.log('train_stae', ..., logger=True)` but have no logger configured. You can enable one by doing `Trainer(logger=ALogger(...))`
  rank_zero_warn(
C:\Users\PaulR\.conda\envs\anomalib_env\lib\site-packages\pytorch_lightning\core\module.py:493: UserWarning: You called `self.log('train_loss', ..., logger=True)` but have no logger configured. You can enable one by doing `Trainer(logger=ALogger(...))`
  rank_zero_warn(
Epoch 0:  88%|██████████████████████████████████████████████████████████████████████          | 903/1032 [03:14<00:27,  Epoch 0:  88%|▉| 903/1032 [03:14<00:27,  4.64it/s, loss=7.12, train_st_step=5.800, train_ae_step=0.478, train_stae_step Epoch 0:  97%|▉| 1000/1032 [032023-07-18 15:02:53,201 - anomalib.models.efficientad.lightning_model - INFO - Calculate Validation Dataset Quantiles/s]
Calculate Validation Dataset Quantiles: 100%|██████████████████████████████████████| 1000/1000 [01:28<00:00, 11.35it/s]
Traceback (most recent call last):
  File "tools/train.py", line 79, in <module>
    train(args)
  File "tools/train.py", line 64, in train
    trainer.fit(model=model, datamodule=datamodule)
  File "C:\Users\PaulR\.conda\envs\anomalib_env\lib\site-packages\pytorch_lightning\trainer\trainer.py", line 608, in fit
    call._call_and_handle_interrupt(
  File "C:\Users\PaulR\.conda\envs\anomalib_env\lib\site-packages\pytorch_lightning\trainer\call.py", line 38, in _call_and_handle_interrupt
    return trainer_fn(*args, **kwargs)
  File "C:\Users\PaulR\.conda\envs\anomalib_env\lib\site-packages\pytorch_lightning\trainer\trainer.py", line 650, in _fit_impl
    self._run(model, ckpt_path=self.ckpt_path)
  File "C:\Users\PaulR\.conda\envs\anomalib_env\lib\site-packages\pytorch_lightning\trainer\trainer.py", line 1112, in _run
    results = self._run_stage()
  File "C:\Users\PaulR\.conda\envs\anomalib_env\lib\site-packages\pytorch_lightning\trainer\trainer.py", line 1191, in _run_stage
    self._run_train()
  File "C:\Users\PaulR\.conda\envs\anomalib_env\lib\site-packages\pytorch_lightning\trainer\trainer.py", line 1214, in _run_train
    self.fit_loop.run()
  File "C:\Users\PaulR\.conda\envs\anomalib_env\lib\site-packages\pytorch_lightning\loops\loop.py", line 199, in run
    self.advance(*args, **kwargs)
  File "C:\Users\PaulR\.conda\envs\anomalib_env\lib\site-packages\pytorch_lightning\loops\fit_loop.py", line 267, in advance
    self._outputs = self.epoch_loop.run(self._data_fetcher)
  File "C:\Users\PaulR\.conda\envs\anomalib_env\lib\site-packages\pytorch_lightning\loops\loop.py", line 200, in run
    self.on_advance_end()
  File "C:\Users\PaulR\.conda\envs\anomalib_env\lib\site-packages\pytorch_lightning\loops\epoch\training_epoch_loop.py", line 250, in on_advance_end
    self._run_validation()
  File "C:\Users\PaulR\.conda\envs\anomalib_env\lib\site-packages\pytorch_lightning\loops\epoch\training_epoch_loop.py", line 308, in _run_validation
    self.val_loop.run()
  File "C:\Users\PaulR\.conda\envs\anomalib_env\lib\site-packages\pytorch_lightning\loops\loop.py", line 194, in run
    self.on_run_start(*args, **kwargs)
  File "C:\Users\PaulR\.conda\envs\anomalib_env\lib\site-packages\pytorch_lightning\loops\dataloader\evaluation_loop.py", line 132, in on_run_start
    self._on_evaluation_start()
  File "C:\Users\PaulR\.conda\envs\anomalib_env\lib\site-packages\pytorch_lightning\loops\dataloader\evaluation_loop.py", line 246, in _on_evaluation_start
    self.trainer._call_lightning_module_hook(hook_name, *args, **kwargs)
  File "C:\Users\PaulR\.conda\envs\anomalib_env\lib\site-packages\pytorch_lightning\trainer\trainer.py", line 1356, in _call_lightning_module_hook
    output = fn(*args, **kwargs)
  File "C:\Users\PaulR\.conda\envs\anomalib_env\lib\site-packages\anomalib\models\efficientad\lightning_model.py", line 231, in on_validation_start
    map_norm_quantiles = self.map_norm_quantiles(self.trainer.datamodule.train_dataloader())
  File "C:\Users\PaulR\.conda\envs\anomalib_env\lib\site-packages\torch\utils\_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "C:\Users\PaulR\.conda\envs\anomalib_env\lib\site-packages\anomalib\models\efficientad\lightning_model.py", line 175, in map_norm_quantiles
    qa_st = torch.quantile(maps_st, q=0.9).to(self.device)
RuntimeError: quantile() input tensor is too large
Epoch 0:  97%|█████████▋| 1000/1032 [04:58<00:09,  3.35it/s, loss=6.18, train_st_step=5.600, train_ae_step=0.432, train_stae_step=0.0497, train_loss_step=6.080]

The config I used is as follows:

dataset:
  name: NanoBlades
  format: folder
  path: ./datasets/NanoBlades
  task: classification
  normal_dir: IO
  abnormal_dir: NIO_Test
  normal_test_dir: IO_Test
  train_batch_size: 1
  eval_batch_size: 16
  num_workers: 10
  image_size: 256 # dimensions to which images are resized (mandatory)
  center_crop: null # dimensions to which images are center-cropped after resizing (optional)
  normalization: none # data distribution to which the images will be normalized: [none, imagenet]
  mask_dir: null
  extensions: null
  transform_config:
    train: null
    eval: null
  test_split_mode: from_dir # options: [from_dir, synthetic]
  test_split_ratio: 0.2 # fraction of train images held out testing (usage depends on test_split_mode)
  val_split_mode: same_as_test # options: [same_as_test, from_test, synthetic]
  val_split_ratio: 0.5 # fraction of train/test images held out for validation (usage depends on val_split_mode)

model:
  name: efficientad
  teacher_out_channels: 384
  model_size: small # options: [small, medium]
  lr: 0.0001
  weight_decay: 0.00001
  padding: true
  # generic params
  normalization_method: min_max # options: [null, min_max, cdf]

metrics:
  image:
    - F1Score
  pixel: null
  threshold:
    method: adaptive #options: [adaptive, manual]
    manual_image: null
    manual_pixel: null

visualization:
  show_images: False # show images on the screen
  save_images: True # save images to the file system
  log_images: False # log images to the available loggers (if any)
  image_save_path: null # path to which images will be saved
  mode: full # options: ["full", "simple"]

project:
  seed: 42
  path: ./results/efficientad_NB_v2

logging:
  logger: [] # options: [comet, tensorboard, wandb, csv] or combinations.
  log_graph: false # Logs the model graph to respective logger.

optimization:
  export_mode: # options: torch, onnx, openvino
# PL Trainer Args. Don't add extra parameter here.
trainer:
  enable_checkpointing: true
  default_root_dir: null
  gradient_clip_val: 0
  gradient_clip_algorithm: norm
  num_nodes: 1
  devices: 1
  enable_progress_bar: true
  overfit_batches: 0.0
  track_grad_norm: -1
  check_val_every_n_epoch: 1
  fast_dev_run: false
  accumulate_grad_batches: 1
  max_epochs: 1
  min_epochs: null
  max_steps: -1
  min_steps: null
  max_time: null
  limit_train_batches: 1.0
  limit_val_batches: 1.0
  limit_test_batches: 1.0
  limit_predict_batches: 1.0
  val_check_interval: 1.0
  log_every_n_steps: 50
  accelerator: gpu # <"cpu", "gpu", "tpu", "ipu", "hpu", "auto">
  strategy: null
  sync_batchnorm: false
  precision: 32
  enable_model_summary: true
  num_sanity_val_steps: 0
  profiler: null
  benchmark: false
  deterministic: false
  reload_dataloaders_every_n_epochs: 0
  auto_lr_find: false
  replace_sampler_ddp: true
  detect_anomaly: false
  auto_scale_batch_size: false
  plugins: null
  move_metrics_to_cpu: true # changed
  multiple_trainloader_mode: max_size_cycle

blaz-r · 2023-07-18T20:47:30Z

Hello. This is an issue with the way quantile works. Check PR #1182 to see how to change it so that it works even for larger data.

Judging by the line number in your error, you also don't have the latest version, where I believe this issue is already solved. (The PR above solves it for the torch model, while the fix for the lightning model was already merged, so you should update your code.)

Also set all batch sizes to 1.
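
For readers who want to see the idea behind such a fix, here is a minimal sketch. It is not the code from PR #1182. The helper names `subsample_for_quantile` and `safe_quantile` are made up for this illustration, and the limit of 2**24 elements is an assumption that should be checked against your PyTorch version. The sketch randomly subsamples the flattened tensor before calling `torch.quantile`, so the result is an estimate of the quantile rather than the exact value.

```python
import torch


def subsample_for_quantile(values: torch.Tensor, max_elems: int = 2**24) -> torch.Tensor:
    """Flatten `values` and keep at most `max_elems` randomly chosen elements.

    Hypothetical helper; `max_elems` defaults to an assumed torch.quantile limit.
    """
    flat = values.flatten()
    if flat.numel() > max_elems:
        # Random subset without replacement: the quantile becomes an estimate.
        idx = torch.randperm(flat.numel(), device=flat.device)[:max_elems]
        flat = flat[idx]
    return flat


def safe_quantile(values: torch.Tensor, q: float, max_elems: int = 2**24) -> torch.Tensor:
    """Quantile over all elements, subsampling first if the tensor is too large."""
    return torch.quantile(subsample_for_quantile(values, max_elems), q)


if __name__ == "__main__":
    # Assumption: one 256x256 anomaly map is collected per training image.
    elems_per_map = 256 * 256
    print("200 images: ", 200 * elems_per_map)
    print("1000 images:", 1000 * elems_per_map)
    print("assumed cap:", 2**24)

    # Small demo with a reduced cap so it runs quickly.
    maps = torch.rand(20_000)
    print(safe_quantile(maps, q=0.9, max_elems=5_000))
```

Under the assumption of one 256x256 map per image, 200 images give 13,107,200 values and 1000 images give 65,536,000, while 2**24 is 16,777,216. That would explain why the smaller dataset trains fine and the larger one crashes, and it suggests an input-size limit rather than a VRAM problem.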

6 replies

Palettenbrett Jul 19, 2023
Author

Thank you for your answer and patience.

I downloaded the latest version (v0.6.0), which was published on June 15, and tried using a batch size of 1 for training and testing.
I also tried different forks from @alexriedel1, in particular his main branch and his efficientad_quantile branch.
But none of this seems to resolve the problem.
As it stands, I still get the RuntimeError no matter which version I use.
What else could cause this?
I would be glad if someone could shed some light on this.

blaz-r Jul 19, 2023

Can you send the error message? In your original question you can see that the modified quantile is not used: line 175, in map_norm_quantiles, qa_st = torch.quantile(maps_st, q=0.9).to(self.device). This is not the current code.

This could happen if you installed anomalib as a package, so try uninstalling it and doing a manual install from git.

Palettenbrett Jul 19, 2023
Author

Thank you for your reply.
I installed anomalib by downloading the .zip file.
You might be right that I'm still not using the current code, as the error is still in line 175.
Here is the error message I get when I run the script.

Training: 0it [00:00, ?it/s]2023-07-19 12:25:46,510 - anomalib.models.efficientad.lightning_model - INFO - Calculate teacher channel mean and std
Calculate teacher channel mean: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1000/1000 [00:16<00:00, 60.91it/s]
Calculate teacher channel std: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1000/1000 [00:00<00:00, 1803.65it/s]
Epoch 0: 0%| | 0/1500 [00:00<?, ?it/s]C:\Users\PaulR\.conda\envs\anomalib_env\lib\site-packages\pytorch_lightning\core\module.py:493: UserWarning: You called `self.log('train_st', ..., logger=True)` but have no logger configured. You can enable one by doing `Trainer(logger=ALogger(...))`
  rank_zero_warn(
C:\Users\PaulR\.conda\envs\anomalib_env\lib\site-packages\pytorch_lightning\core\module.py:493: UserWarning: You called `self.log('train_ae', ..., logger=True)` but have no logger configured. You can enable one by doing `Trainer(logger=ALogger(...))`
  rank_zero_warn(
C:\Users\PaulR\.conda\envs\anomalib_env\lib\site-packages\pytorch_lightning\core\module.py:493: UserWarning: You called `self.log('train_stae', ..., logger=True)` but have no logger configured. You can enable one by doing `Trainer(logger=ALogger(...))`
  rank_zero_warn(
C:\Users\PaulR\.conda\envs\anomalib_env\lib\site-packages\pytorch_lightning\core\module.py:493: UserWarning: You called `self.log('train_loss', ..., logger=True)` but have no logger configured. You can enable one by doing `Trainer(logger=ALogger(...))`
  rank_zero_warn(
Epoch 0: 67%|████████████████2023-07-19 12:29:19,618 - anomalib.models.efficientad.lightning_model - INFO - Calculate Validation Dataset Quantilesst_step=5.600, train_ae_step=0.413, train_stae_step=0.0438, train_loss_step=6.060]
Calculate Validation Dataset Quantiles: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1000/1000 [00:22<00:00, 44.18it/s]
Traceback (most recent call last):
  File "tools/train.py", line 79, in <module>
    train(args)
  File "tools/train.py", line 64, in train
    trainer.fit(model=model, datamodule=datamodule)
  File "C:\Users\PaulR\.conda\envs\anomalib_env\lib\site-packages\pytorch_lightning\trainer\trainer.py", line 608, in fit
    call._call_and_handle_interrupt(
  File "C:\Users\PaulR\.conda\envs\anomalib_env\lib\site-packages\pytorch_lightning\trainer\call.py", line 38, in _call_and_handle_interrupt
    return trainer_fn(*args, **kwargs)
  File "C:\Users\PaulR\.conda\envs\anomalib_env\lib\site-packages\pytorch_lightning\trainer\trainer.py", line 650, in _fit_impl
    self._run(model, ckpt_path=self.ckpt_path)
  File "C:\Users\PaulR\.conda\envs\anomalib_env\lib\site-packages\pytorch_lightning\trainer\trainer.py", line 1112, in _run
    results = self._run_stage()
  File "C:\Users\PaulR\.conda\envs\anomalib_env\lib\site-packages\pytorch_lightning\trainer\trainer.py", line 1191, in _run_stage
    self._run_train()
  File "C:\Users\PaulR\.conda\envs\anomalib_env\lib\site-packages\pytorch_lightning\trainer\trainer.py", line 1214, in _run_train
    self.fit_loop.run()
  File "C:\Users\PaulR\.conda\envs\anomalib_env\lib\site-packages\pytorch_lightning\loops\loop.py", line 199, in run
    self.advance(*args, **kwargs)
  File "C:\Users\PaulR\.conda\envs\anomalib_env\lib\site-packages\pytorch_lightning\loops\fit_loop.py", line 267, in advance
    self._outputs = self.epoch_loop.run(self._data_fetcher)
  File "C:\Users\PaulR\.conda\envs\anomalib_env\lib\site-packages\pytorch_lightning\loops\loop.py", line 200, in run
    self.on_advance_end()
  File "C:\Users\PaulR\.conda\envs\anomalib_env\lib\site-packages\pytorch_lightning\loops\epoch\training_epoch_loop.py", line 250, in on_advance_end
    self._run_validation()
  File "C:\Users\PaulR\.conda\envs\anomalib_env\lib\site-packages\pytorch_lightning\loops\epoch\training_epoch_loop.py", line 308, in _run_validation
    self.val_loop.run()
  File "C:\Users\PaulR\.conda\envs\anomalib_env\lib\site-packages\pytorch_lightning\loops\loop.py", line 194, in run
    self.on_run_start(*args, **kwargs)
  File "C:\Users\PaulR\.conda\envs\anomalib_env\lib\site-packages\pytorch_lightning\loops\dataloader\evaluation_loop.py", line 132, in on_run_start
    self._on_evaluation_start()
  File "C:\Users\PaulR\.conda\envs\anomalib_env\lib\site-packages\pytorch_lightning\loops\dataloader\evaluation_loop.py", line 246, in _on_evaluation_start
    self.trainer._call_lightning_module_hook(hook_name, *args, **kwargs)
  File "C:\Users\PaulR\.conda\envs\anomalib_env\lib\site-packages\pytorch_lightning\trainer\trainer.py", line 1356, in _call_lightning_module_hook
    output = fn(*args, **kwargs)
  File "C:\Users\PaulR\.conda\envs\anomalib_env\lib\site-packages\anomalib\models\efficientad\lightning_model.py", line 231, in on_validation_start
    map_norm_quantiles = self.map_norm_quantiles(self.trainer.datamodule.train_dataloader())
  File "C:\Users\PaulR\.conda\envs\anomalib_env\lib\site-packages\torch\utils\_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "C:\Users\PaulR\.conda\envs\anomalib_env\lib\site-packages\anomalib\models\efficientad\lightning_model.py", line 175, in map_norm_quantiles
    qa_st = torch.quantile(maps_st, q=0.9).to(self.device)
RuntimeError: quantile() input tensor is too large
Epoch 0: 67%|██████▋ | 1000/1500 [03:39<01:49, 4.55it/s, loss=6.19, train_st_step=5.600, train_ae_step=0.413, train_stae_step=0.0438, train_loss_step=6.060]

alexriedel1 Jul 19, 2023

You are using neither this branch, https://github.com/alexriedel1/anomalib/blob/efficientad_quantile, nor the current main branch of anomalib. Please create a new environment and install from source as described here: https://github.com/openvinotoolkit/anomalib/tree/main#local-install

Answer selected by Palettenbrett

blaz-r Jul 19, 2023

As @alexriedel1 said, and judging by your error message, you are not using the right version. This can happen if you install anomalib as a package or from a zip file, so follow the instructions above and make a fresh install from the git branch instead of the zip.
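
A quick way to check which copy of a package Python actually imports is to look at the file path it resolves to, which is the same clue the traceback gives with its site-packages paths. The sketch below uses only the standard library. The helper names `module_origin` and `looks_like_packaged_install` are made up for this illustration, and the site-packages check is a rough heuristic, not an official anomalib diagnostic.

```python
import importlib.util
from pathlib import Path
from typing import Optional


def module_origin(name: str) -> Optional[Path]:
    """Return the file Python would load for top-level module `name`, or None."""
    spec = importlib.util.find_spec(name)
    if spec is None or spec.origin is None:
        return None
    return Path(spec.origin)


def looks_like_packaged_install(path: Path) -> bool:
    """Heuristic: True if the path sits under a site-packages or dist-packages directory."""
    return any(part in ("site-packages", "dist-packages") for part in path.parts)


if __name__ == "__main__":
    origin = module_origin("anomalib")
    if origin is None:
        print("anomalib is not importable in this environment")
    else:
        kind = "packaged copy" if looks_like_packaged_install(origin) else "source checkout"
        print(f"anomalib is imported from {origin} ({kind})")
```

If the reported path points into site-packages while you expect to be running a cloned branch, an older copy is probably shadowing your checkout, and a fresh environment with an install from source is the simplest way out.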

Palettenbrett Jul 19, 2023
Author

Thank you so much.
I did a complete reinstall in a fresh conda environment using git clone.
Now it works like a charm.
Thank you again for your help and patience.
