Strange behavior of PL Trainer with Conv1D Autoencoder for timeseries #13385
Unanswered · milan-marinov-usu asked this question in Lightning Trainer API: Trainer, LightningModule, LightningDataModule
Hello everyone,
I am having some headaches using the PL Trainer with a Conv1D deep autoencoder in PyTorch Lightning, which I use for anomaly prediction on timeseries data. It is a vanilla autoencoder (AE), not a variational AE, and is similar to https://pytorch-lightning.readthedocs.io/en/stable/notebooks/course_UvA-DL/08-deep-autoencoders.html, except that it works on 1D timeseries data rather than 2D image data. With my plain PyTorch training function it works very well: the loss decreases quickly and consistently, and the model learns a good reconstruction of the original timeseries.
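For context, the model is structured roughly like the sketch below. The layer sizes, kernel sizes, and the MSE reconstruction loss are placeholders rather than my exact architecture; the point is that it is a plain `Conv1d`/`ConvTranspose1d` encoder-decoder:

```python
import torch
from torch import nn
import pytorch_lightning as pl


class Conv1dAutoencoder(pl.LightningModule):
    """Sketch of the vanilla Conv1d autoencoder (placeholder layer sizes)."""

    def __init__(self, n_channels: int = 1, lr: float = 1e-3):
        super().__init__()
        self.save_hyperparameters()
        self.encoder = nn.Sequential(
            nn.Conv1d(n_channels, 16, kernel_size=7, stride=2, padding=3),
            nn.ReLU(),
            nn.Conv1d(16, 32, kernel_size=7, stride=2, padding=3),
            nn.ReLU(),
        )
        self.decoder = nn.Sequential(
            nn.ConvTranspose1d(32, 16, kernel_size=7, stride=2, padding=3, output_padding=1),
            nn.ReLU(),
            nn.ConvTranspose1d(16, n_channels, kernel_size=7, stride=2, padding=3, output_padding=1),
        )

    def forward(self, x):
        # x: (batch, channels, time) -> reconstruction with the same shape
        return self.decoder(self.encoder(x))

    def training_step(self, batch, batch_idx):
        x = batch  # the dataloader yields plain (batch, channels, time) tensors here
        loss = nn.functional.mse_loss(self(x), x)
        self.log("train_loss", loss)
        return loss

    def validation_step(self, batch, batch_idx):
        x = batch
        self.log("val_loss", nn.functional.mse_loss(self(x), x))

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=self.hparams.lr)
```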
However, I had serious trouble trying to train the same NN (actually, it is even the same class) with PyTorch Lightning.
During my initial attempts, the loss exploded (it grew beyond the largest number a float can represent) after a few epochs, and the training always crashed at the same epoch and batch, no matter which hyperparameters I changed. The problem occurred when my dataloaders were defined inside my autoencoder `pl.LightningModule` via `train_dataloader()` and `val_dataloader()`, but it did NOT occur when I removed those methods and passed the dataloaders directly to the `Trainer.fit()` method.
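To make the two setups concrete, here is a sketch of both variants; the random placeholder data, batch size, and Trainer arguments are for illustration only:

```python
import torch
from torch.utils.data import DataLoader
import pytorch_lightning as pl

# Random placeholder windows of shape (n_windows, channels, time), for illustration only
train_windows = torch.randn(1000, 1, 100)
val_windows = torch.randn(200, 1, 100)


# Variant 1: dataloaders defined inside the LightningModule
class AEWithLoaders(Conv1dAutoencoder):  # Conv1dAutoencoder as sketched above
    def train_dataloader(self):
        return DataLoader(train_windows, batch_size=100, shuffle=True)

    def val_dataloader(self):
        return DataLoader(val_windows, batch_size=100)


trainer = pl.Trainer(max_epochs=20, gpus=1)
trainer.fit(AEWithLoaders())  # the Trainer asks the module for its dataloaders

# Variant 2: no dataloader methods on the module, loaders passed to Trainer.fit() instead
model = Conv1dAutoencoder()
trainer = pl.Trainer(max_epochs=20, gpus=1)
trainer.fit(
    model,
    train_dataloaders=DataLoader(train_windows, batch_size=100, shuffle=True),
    val_dataloaders=DataLoader(val_windows, batch_size=100),
)
```

In both variants the model and the data are identical; the only difference is where the `DataLoader` objects are created.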
There were also some indications that PL has some serious problems with thread pools: when I used the `pin_memory=True` parameter in the train and val dataloader constructors, an error message about thread pools was printed from a C++ module after every training batch.
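That is, the dataloader constructors in question looked roughly like this (again a sketch, reusing the placeholder data from above):

```python
# Same loaders as above, but with memory pinning enabled in the constructors
train_loader = DataLoader(train_windows, batch_size=100, shuffle=True, pin_memory=True)
val_loader = DataLoader(val_windows, batch_size=100, pin_memory=True)
```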
Because of that, I updated my environment: Python from 3.7 to 3.10, PyTorch from 1.9 to 1.11, and PL from 1.6.3 to 1.6.4. I also updated the CUDA drivers and CUDA toolkit of my Google Compute Engine GPU instance to versions compatible with the updated PyTorch version.

After the update, the C++ error no longer appeared, the loss no longer exploded, and the training did not crash, but the training stagnated: the loss hovered around 9e+3, which is a huge value. Usually it gets down to somewhere between 35 and 50 for a batch of 100 timeseries, depending on hyperparameters etc.
Strangely enough, the training was now broken both when I used the `train_dataloader()`/`val_dataloader()` methods of my `LightningModule` AND when I removed those methods and passed my dataloaders "manually" to the `Trainer.fit()` method.

After that, I suspected that there might be some problem with the PL Trainer's automatic optimization, so I implemented manual optimization in my `LightningModule`, analogous to the corresponding lines of my plain PyTorch training function.
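A simplified sketch of that manual-optimization training step (the MSE loss and the class name are placeholders, not my exact code):

```python
from torch import nn


class ManualOptAutoencoder(Conv1dAutoencoder):  # Conv1dAutoencoder as sketched above
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        # Tell Lightning not to call backward()/optimizer.step() itself
        self.automatic_optimization = False

    def training_step(self, batch, batch_idx):
        opt = self.optimizers()  # the optimizer returned by configure_optimizers()
        x = batch
        loss = nn.functional.mse_loss(self(x), x)

        opt.zero_grad()
        self.manual_backward(loss)  # instead of loss.backward()
        opt.step()

        self.log("train_loss", loss)
        return loss
```

With `automatic_optimization = False`, Lightning no longer calls `backward()`, `optimizer.step()`, or `optimizer.zero_grad()` on my behalf, so these calls in `training_step()` mirror the plain PyTorch loop.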
Now, with this manual optimization, the training works fine, with performance similar to my plain PyTorch training function. However, the LR Finder does not work anymore (see the error messages below), and I suspect that other things might be broken too.
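For reference, the LR Finder invocation I am referring to is along these lines (a sketch; the Trainer arguments are placeholders and a source of training data is assumed):

```python
model = ManualOptAutoencoder()  # the manual-optimization module from the sketch above
trainer = pl.Trainer(max_epochs=20, gpus=1)

# Explicitly via the tuner (this is the call that now fails for me)
lr_finder = trainer.tuner.lr_find(model)
print(lr_finder.suggestion())

# Or via the auto_lr_find flag and Trainer.tune()
trainer = pl.Trainer(auto_lr_find=True, max_epochs=20, gpus=1)
trainer.tune(model)
```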
Motivated by this situation, I have some questions about PyTorch Lightning:
Based on these experiences, I also posted a reply with some solution ideas to a question from January this year (#11598).
I am trying hard to make PL work for my engineering research projects and would be very grateful for any solution ideas and replies.