OOM error with custom dataset - python systematically crashes after a couple of epochs #581
-
Context

dataset:
  name: brooks
  format: folder
  path: <removed_in_purpose>
  normal_dir: <removed_in_purpose> # name of the folder containing normal images.
  abnormal_dir: <removed_in_purpose> # name of the folder containing abnormal images.
  normal_test_dir: null # name of the folder containing normal test images.
  task: segmentation # classification or segmentation
  mask: <removed_in_purpose> # optional
  extensions: null
  split_ratio: 0.1 # ratio of the normal images that will be used to create a test split
  image_size: [512, 512] # [256, 256] # [115, 194] # [1149, 1940]
  train_batch_size: 1
  test_batch_size: 1
  num_workers: 4
  transform_config:
    train: null
    val: null
  create_validation_set: true
  tiling:
    apply: false
    tile_size: null
    stride: null
    remove_border_count: 0
    use_random_tiling: False
    random_tile_count: 16

model:
  name: padim
  backbone: resnet18
  pre_trained: true
  layers:
    - layer1
    - layer2
    - layer3
  normalization_method: min_max # options: [none, min_max, cdf]

metrics:
  image:
    - F1Score
    - AUROC
  pixel:
    - F1Score
    - AUROC
  threshold:
    image_default: 3
    pixel_default: 3
    adaptive: true

visualization:
  show_images: False # show images on the screen
  save_images: True # save images to the file system
  log_images: True # log images to the available loggers (if any)
  image_save_path: null # path to which images will be saved
  mode: full # options: ["full", "simple"]

project:
  seed: 42
  path: <removed_in_purpose>

logging:
  logger: [] # options: [tensorboard, wandb, csv] or combinations.
  log_graph: false # Logs the model graph to respective logger.

#optimization:
#  openvino:
#    apply: false

# PL Trainer Args. Don't add extra parameter here.
trainer:
  accelerator: auto # <"cpu", "gpu", "tpu", "ipu", "hpu", "auto">
  accumulate_grad_batches: 1
  amp_backend: native
  auto_lr_find: false
  auto_scale_batch_size: false
  auto_select_gpus: false
  benchmark: false
  check_val_every_n_epoch: 1 # Don't validate before extracting features.
  default_root_dir: null
  detect_anomaly: false
  deterministic: false
  devices: 1
  enable_checkpointing: true
  enable_model_summary: true
  enable_progress_bar: true
  fast_dev_run: false
  gpus: null # Set automatically
  gradient_clip_val: 0
  ipus: null
  limit_predict_batches: 1.0
  limit_test_batches: 1.0
  limit_train_batches: 1.0
  limit_val_batches: 1.0
  log_every_n_steps: 50
  max_epochs: 4
  max_steps: -1
  max_time: null
  min_epochs: null
  min_steps: null
  move_metrics_to_cpu: false
  multiple_trainloader_mode: max_size_cycle
  num_nodes: 1
  num_processes: null
  num_sanity_val_steps: 0
  overfit_batches: 0.0
  plugins: null
  precision: 32
  profiler: null
  reload_dataloaders_every_n_epochs: 0
  replace_sampler_ddp: true
  sync_batchnorm: false
  tpu_cores: null
  track_grad_norm: -1
  val_check_interval: 1.0 # Don't validate before extracting features.

Describe the bug
As mentioned, I tried different batch sizes (as low as 1), numbers of epochs, and image input sizes; these are some of the tests I tried:
I have tested this in three different environments, with the same results: an 80-core Xeon CPU machine with 96GB of memory and no GPU; an AWS g5.xlarge instance with 16GB of RAM and a 24GB GPU (NVIDIA A10G); and Google Colab. In all of them I get essentially the same behaviour: the code simply crashes after a couple of epochs. If I monitor the RAM/GPU usage, I can see that the process is killed once a certain maximum usage is reached. In summary: the only meaningful results start when I train the model with an input size > 256 and for more than 1 epoch. For an image input size of 100px, I can train for only 10 epochs before it crashes. So effectively, I cannot train the model to the accuracy I would expect.

Expected behavior
Screenshots

Hardware and Software Configuration
My conda env config:

And pip freeze (inside the conda environment used):

Additional comments
-
PaDim isn't "trained"; it extracts image features at training time and stores them. Once the features of every training image have been extracted, at test time the test image features are compared against the stored training features. Thus you don't have to "train" PaDim for more than one epoch (except perhaps if you use random image augmentations). Your accuracy will not rise with more epochs! Try different algorithms (e.g. PatchCore), different extraction backbones (e.g. Wide ResNet-50), or better training data.
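To make that concrete, here is a minimal sketch of the fit-once / compare-at-test-time idea. It is not anomalib's actual code: it uses a plain torchvision resnet18 and a simple nearest-feature distance instead of PaDim's per-patch Gaussian statistics, and the function names are illustrative only.

import torch
import torchvision.models as models

# Pretrained backbone used only as a frozen feature extractor
# (newer torchvision prefers the weights= argument over pretrained=True).
backbone = models.resnet18(pretrained=True).eval()
feature_extractor = torch.nn.Sequential(*list(backbone.children())[:-2])  # drop avgpool + fc

@torch.no_grad()
def extract_features(images: torch.Tensor) -> torch.Tensor:
    # (N, 3, H, W) -> (N, C, H', W') feature maps from the backbone
    return feature_extractor(images)

@torch.no_grad()
def fit_memory_bank(train_images: torch.Tensor) -> torch.Tensor:
    # "Training" is a single pass that stores features: no gradients, no weight updates,
    # which is why running extra epochs cannot improve accuracy.
    return extract_features(train_images)

@torch.no_grad()
def anomaly_score(test_image: torch.Tensor, memory_bank: torch.Tensor) -> torch.Tensor:
    # Compare test features against the stored training features
    # (here: distance to the nearest stored feature vector).
    test_feats = extract_features(test_image.unsqueeze(0)).flatten(2).squeeze(0).T        # (P, C)
    train_feats = memory_bank.flatten(2).permute(0, 2, 1).reshape(-1, memory_bank.shape[1])  # (M, C)
    dists = torch.cdist(test_feats, train_feats)   # (P, M) pairwise distances
    return dists.min(dim=1).values.max()           # image-level anomaly score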
-
@alexriedel1 Thank you for your reply! I was not aware of this. In any case, I tried patchcore with wide_resnet_50 and also with resnet18, and still got the same result: the process gets killed before the first epoch even finishes. I tried this on a machine with a GPU and on another one without a GPU (but with 96GB of RAM), and still got the same result. On the server without a GPU, no warning or error message is displayed; the process simply gets killed at around 43% of the first epoch (I assume that's the point at which some maximum memory threshold is reached). However, when running this on the machine WITH a GPU, there is an interesting warning that pops up right before the process gets killed. As you can see in the log I am attaching below (the one for the machine with GPU, using patchcore and a resnet18 backbone), there is a mention of a CUDA OOM runtime error, as well as of an environment variable named "PYTORCH_CUDA_ALLOC_CONF" and a setting named "max_split_size_mb". Searching for "PYTORCH_CUDA_ALLOC_CONF" online, I found a couple of places (pytorch/pytorch#16417) where they mentioned this could be solved by either:
Any ideas?? Thank you very much!!
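For reference, the setting mentioned in that error message can be applied through the environment variable before the process starts. A minimal sketch (the value 128 is only an illustrative starting point, and this mitigates allocator fragmentation rather than adding memory):

import os

# Cap the size of blocks the CUDA caching allocator will split, to reduce fragmentation.
# Set this before PyTorch initializes its CUDA allocator (simplest: before importing torch).
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128"

import torch  # imported after the variable is set, on purpose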
-
The error is probably GPU OOM in both cases, and there is not much you can do about it besides increasing your GPU VRAM or reducing the training set size. The first is a bit more difficult, so you should start by reducing the training set size. Also try decreasing the image size to (256, 256).
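One way to reduce the training set size with a folder-format dataset is simply to copy a random subset of the normal images into a separate directory and point normal_dir at it. A rough sketch (the paths, file extension, and subset size of 200 are placeholders):

import random
import shutil
from pathlib import Path

src = Path("path/to/normal_dir")          # placeholder: your current normal_dir
dst = Path("path/to/normal_dir_subset")   # point normal_dir at this folder instead
dst.mkdir(parents=True, exist_ok=True)

random.seed(42)
images = sorted(src.glob("*.png"))        # adjust the extension to your data
for img in random.sample(images, k=min(200, len(images))):  # e.g. keep 200 images
    shutil.copy2(img, dst / img.name)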
-
@manuelblancovalentin, as @alexriedel1 pointed out, padim and patchcore are not memory efficient. If you get an OOM even within a single epoch, you could try @alexriedel1's suggestion, or alternatively train the DRAEM+SSPCAB model. The authors claim SOTA results on video anomaly detection here, which would be more suitable for your use case.
-
I'll convert this to a discussion; feel free to continue from there.
-
I am facing the same issue with ganomaly and fastpatch.
-
Hi, I am also facing this issue. My dataset is very small, 900 images in total with a file size of 0.5 MB each, and my cluster has plenty of memory (a high-end A100). Only the padim model seems to work. With many of the others, I get