EfficientAD: RuntimeError: quantile() input tensor is too large. #1189
Hey, I encountered an error while trying to train EfficientAD on my dataset. The error is as follows:
**RuntimeError: quantile() input tensor is too large.**
The crash appears to happen right after the script finishes calculating the validation dataset quantiles.
The dataset contains 1000 normal images for training and 250 images each for normal/abnormal testing.
If I reduce the number of training images to 200 and the testing images to 100 each, this doesn't happen.
Is this a VRAM issue? EfficientAD seems to use all my VRAM when calculating the teacher channel mean.
PC Specs:
GTX 1080 8 GB
i7-8700K
32 GB DDR4 RAM
OS: Windows 10 Build 19045
I trained only for one epoch to check if the model works.
Here is the complete error log:
Training: 0it [00:00, ?it/s]
2023-07-18 14:58:10,131 - anomalib.models.efficientad.lightning_model - INFO - Calculate teacher channel mean and std
Calculate teacher channel mean: 100%|██████████████████████████████████████████████| 1000/1000 [01:14<00:00, 13.51it/s]
Calculate teacher channel std: 100%|████████████████████████████████████████████| 1000/1000 [00:00<00:00, 11235.60it/s]
Epoch 0: 0%| | 0/1032 [00:00<?, ?it/s]
C:\Users\PaulR\.conda\envs\anomalib_env\lib\site-packages\pytorch_lightning\core\module.py:493: UserWarning: You called `self.log('train_st', ..., logger=True)` but have no logger configured. You can enable one by doing `Trainer(logger=ALogger(...))`
rank_zero_warn(
C:\Users\PaulR\.conda\envs\anomalib_env\lib\site-packages\pytorch_lightning\core\module.py:493: UserWarning: You called `self.log('train_ae', ..., logger=True)` but have no logger configured. You can enable one by doing `Trainer(logger=ALogger(...))`
rank_zero_warn(
C:\Users\PaulR\.conda\envs\anomalib_env\lib\site-packages\pytorch_lightning\core\module.py:493: UserWarning: You called `self.log('train_stae', ..., logger=True)` but have no logger configured. You can enable one by doing `Trainer(logger=ALogger(...))`
rank_zero_warn(
C:\Users\PaulR\.conda\envs\anomalib_env\lib\site-packages\pytorch_lightning\core\module.py:493: UserWarning: You called `self.log('train_loss', ..., logger=True)` but have no logger configured. You can enable one by doing `Trainer(logger=ALogger(...))`
rank_zero_warn(
Epoch 0: 88%|██████████████████████████████████████████████████████████████████████ | 903/1032 [03:14<00:27,
Epoch 0: 88%|▉| 903/1032 [03:14<00:27, 4.64it/s, loss=7.12, train_st_step=5.800, train_ae_step=0.478, train_stae_step
Epoch 0: 97%|▉| 1000/1032 [03
2023-07-18 15:02:53,201 - anomalib.models.efficientad.lightning_model - INFO - Calculate Validation Dataset Quantiles
Calculate Validation Dataset Quantiles: 100%|██████████████████████████████████████| 1000/1000 [01:28<00:00, 11.35it/s]
Traceback (most recent call last):
File "tools/train.py", line 79, in <module>
train(args)
File "tools/train.py", line 64, in train
trainer.fit(model=model, datamodule=datamodule)
File "C:\Users\PaulR\.conda\envs\anomalib_env\lib\site-packages\pytorch_lightning\trainer\trainer.py", line 608, in fit
call._call_and_handle_interrupt(
File "C:\Users\PaulR\.conda\envs\anomalib_env\lib\site-packages\pytorch_lightning\trainer\call.py", line 38, in _call_and_handle_interrupt
return trainer_fn(*args, **kwargs)
File "C:\Users\PaulR\.conda\envs\anomalib_env\lib\site-packages\pytorch_lightning\trainer\trainer.py", line 650, in _fit_impl
self._run(model, ckpt_path=self.ckpt_path)
File "C:\Users\PaulR\.conda\envs\anomalib_env\lib\site-packages\pytorch_lightning\trainer\trainer.py", line 1112, in _run
results = self._run_stage()
File "C:\Users\PaulR\.conda\envs\anomalib_env\lib\site-packages\pytorch_lightning\trainer\trainer.py", line 1191, in _run_stage
self._run_train()
File "C:\Users\PaulR\.conda\envs\anomalib_env\lib\site-packages\pytorch_lightning\trainer\trainer.py", line 1214, in _run_train
self.fit_loop.run()
File "C:\Users\PaulR\.conda\envs\anomalib_env\lib\site-packages\pytorch_lightning\loops\loop.py", line 199, in run
self.advance(*args, **kwargs)
File "C:\Users\PaulR\.conda\envs\anomalib_env\lib\site-packages\pytorch_lightning\loops\fit_loop.py", line 267, in advance
self._outputs = self.epoch_loop.run(self._data_fetcher)
File "C:\Users\PaulR\.conda\envs\anomalib_env\lib\site-packages\pytorch_lightning\loops\loop.py", line 200, in run
self.on_advance_end()
File "C:\Users\PaulR\.conda\envs\anomalib_env\lib\site-packages\pytorch_lightning\loops\epoch\training_epoch_loop.py", line 250, in on_advance_end
self._run_validation()
File "C:\Users\PaulR\.conda\envs\anomalib_env\lib\site-packages\pytorch_lightning\loops\epoch\training_epoch_loop.py", line 308, in _run_validation
self.val_loop.run()
File "C:\Users\PaulR\.conda\envs\anomalib_env\lib\site-packages\pytorch_lightning\loops\loop.py", line 194, in run
self.on_run_start(*args, **kwargs)
File "C:\Users\PaulR\.conda\envs\anomalib_env\lib\site-packages\pytorch_lightning\loops\dataloader\evaluation_loop.py", line 132, in on_run_start
self._on_evaluation_start()
File "C:\Users\PaulR\.conda\envs\anomalib_env\lib\site-packages\pytorch_lightning\loops\dataloader\evaluation_loop.py", line 246, in _on_evaluation_start
self.trainer._call_lightning_module_hook(hook_name, *args, **kwargs)
File "C:\Users\PaulR\.conda\envs\anomalib_env\lib\site-packages\pytorch_lightning\trainer\trainer.py", line 1356, in _call_lightning_module_hook
output = fn(*args, **kwargs)
File "C:\Users\PaulR\.conda\envs\anomalib_env\lib\site-packages\anomalib\models\efficientad\lightning_model.py", line 231, in on_validation_start
map_norm_quantiles = self.map_norm_quantiles(self.trainer.datamodule.train_dataloader())
File "C:\Users\PaulR\.conda\envs\anomalib_env\lib\site-packages\torch\utils\_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "C:\Users\PaulR\.conda\envs\anomalib_env\lib\site-packages\anomalib\models\efficientad\lightning_model.py", line 175, in map_norm_quantiles
qa_st = torch.quantile(maps_st, q=0.9).to(self.device)
RuntimeError: quantile() input tensor is too large
Epoch 0: 97%|█████████▋| 1000/1032 [04:58<00:09, 3.35it/s, loss=6.18, train_st_step=5.600, train_ae_step=0.432, train_stae_step=0.0497, train_loss_step=6.080]
The config I used is as follows:
dataset:
  name: NanoBlades
  format: folder
  path: ./datasets/NanoBlades
  task: classification
  normal_dir: IO
  abnormal_dir: NIO_Test
  normal_test_dir: IO_Test
  train_batch_size: 1
  eval_batch_size: 16
  num_workers: 10
  image_size: 256 # dimensions to which images are resized (mandatory)
  center_crop: null # dimensions to which images are center-cropped after resizing (optional)
  normalization: none # data distribution to which the images will be normalized: [none, imagenet]
  mask_dir: null
  extensions: null
  transform_config:
    train: null
    eval: null
  test_split_mode: from_dir # options: [from_dir, synthetic]
  test_split_ratio: 0.2 # fraction of train images held out testing (usage depends on test_split_mode)
  val_split_mode: same_as_test # options: [same_as_test, from_test, synthetic]
  val_split_ratio: 0.5 # fraction of train/test images held out for validation (usage depends on val_split_mode)
model:
  name: efficientad
  teacher_out_channels: 384
  model_size: small # options: [small, medium]
  lr: 0.0001
  weight_decay: 0.00001
  padding: true
  # generic params
  normalization_method: min_max # options: [null, min_max, cdf]
metrics:
  image:
    - F1Score
  pixel: null
  threshold:
    method: adaptive #options: [adaptive, manual]
    manual_image: null
    manual_pixel: null
visualization:
  show_images: False # show images on the screen
  save_images: True # save images to the file system
  log_images: False # log images to the available loggers (if any)
  image_save_path: null # path to which images will be saved
  mode: full # options: ["full", "simple"]
project:
  seed: 42
  path: ./results/efficientad_NB_v2
logging:
  logger: [] # options: [comet, tensorboard, wandb, csv] or combinations.
  log_graph: false # Logs the model graph to respective logger.
optimization:
  export_mode: # options: torch, onnx, openvino
# PL Trainer Args. Don't add extra parameter here.
trainer:
  enable_checkpointing: true
  default_root_dir: null
  gradient_clip_val: 0
  gradient_clip_algorithm: norm
  num_nodes: 1
  devices: 1
  enable_progress_bar: true
  overfit_batches: 0.0
  track_grad_norm: -1
  check_val_every_n_epoch: 1
  fast_dev_run: false
  accumulate_grad_batches: 1
  max_epochs: 1
  min_epochs: null
  max_steps: -1
  min_steps: null
  max_time: null
  limit_train_batches: 1.0
  limit_val_batches: 1.0
  limit_test_batches: 1.0
  limit_predict_batches: 1.0
  val_check_interval: 1.0
  log_every_n_steps: 50
  accelerator: gpu # <"cpu", "gpu", "tpu", "ipu", "hpu", "auto">
  strategy: null
  sync_batchnorm: false
  precision: 32
  enable_model_summary: true
  num_sanity_val_steps: 0
  profiler: null
  benchmark: false
  deterministic: false
  reload_dataloaders_every_n_epochs: 0
  auto_lr_find: false
  replace_sampler_ddp: true
  detect_anomaly: false
  auto_scale_batch_size: false
  plugins: null
  move_metrics_to_cpu: true # changed
  multiple_trainloader_mode: max_size_cycle
Answered by
alexriedel1
Jul 19, 2023
Hello. This is an issue with the way torch.quantile works. Check PR #1182 to see how to change it so that it works for larger data. Judging by the line number in your error, you are also not on the latest version, where I believe this issue is already solved. (The PR above fixes it for the torch model; the fix for the lightning model was already merged, so you should update your code.) Also, set all batch sizes to 1.
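A rough element count is consistent with the behaviour reported in the question. The sketch below rests on two assumptions that are not stated in this thread: that each training image contributes one 256x256 anomaly map to the tensor passed to torch.quantile (matching `image_size: 256` in the config), and that torch.quantile rejects inputs with more than 2**24 elements. The names `QUANTILE_MAX_ELEMS`, `map_elements` and `fits_quantile` are hypothetical helpers used only for illustration.

```python
# Assumed size limit of torch.quantile (2**24 elements); this value is an
# assumption based on the PyTorch source, not something stated in the thread.
QUANTILE_MAX_ELEMS = 2**24


def map_elements(num_images: int, height: int = 256, width: int = 256) -> int:
    """Total number of values if every image yields one height x width map."""
    return num_images * height * width


def fits_quantile(num_images: int, height: int = 256, width: int = 256) -> bool:
    """True if the stacked maps stay within the assumed quantile limit."""
    return map_elements(num_images, height, width) <= QUANTILE_MAX_ELEMS


# Compare the two training set sizes mentioned in the question.
for n in (200, 1000):
    print(n, map_elements(n), fits_quantile(n))
```

Under these assumptions, 1000 training images give 65,536,000 values, well above the limit, while 200 images give 13,107,200, which fits. That would point to a tensor size limit rather than a VRAM problem.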
You are not using this branch https://github.com/alexriedel1/anomalib/blob/efficientad_quantile or the current main branch of anomalib. Please make a new environment and install from source as described here https://github.com/openvinotoolkit/anomalib/tree/main#local-install
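If updating is not immediately possible, one generic way to stay under the size limit is to compute the quantile on a random subset of the values. The following is a minimal sketch of that idea and not necessarily what the linked branch or PR #1182 does; `subsample_for_quantile`, `approx_quantile` and the 2**24 limit are assumptions for illustration, and the result only approximates the exact quantile.

```python
import torch

# Assumed size limit of torch.quantile; treat this value as an assumption.
QUANTILE_MAX_ELEMS = 2**24


def subsample_for_quantile(
    tensor: torch.Tensor, max_elems: int = QUANTILE_MAX_ELEMS
) -> torch.Tensor:
    """Flatten `tensor` and keep at most `max_elems` randomly chosen values."""
    flat = tensor.flatten()
    if flat.numel() > max_elems:
        # Random permutation of indices, truncated to the allowed size.
        idx = torch.randperm(flat.numel(), device=flat.device)[:max_elems]
        flat = flat[idx]
    return flat


def approx_quantile(
    tensor: torch.Tensor, q: float, max_elems: int = QUANTILE_MAX_ELEMS
) -> torch.Tensor:
    """Quantile of a (possibly subsampled) flattened float tensor."""
    return torch.quantile(subsample_for_quantile(tensor, max_elems), q)
```

In the failing line from the traceback, this would correspond to something like `approx_quantile(maps_st, 0.9)` in place of `torch.quantile(maps_st, q=0.9)`. For tensors already under the limit the result is identical to the exact quantile.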