Strange behavior of PL Trainer with Conv1D Autoencoder for timeseries #13385
Unanswered · milan-marinov-usu asked this question in Lightning Trainer API: Trainer, LightningModule, LightningDataModule
Hello everyone,
I am having some headaches using the PL Trainer with a Conv1D deep autoencoder in PyTorch Lightning, which I use for anomaly prediction on timeseries data. It is a vanilla autoencoder (AE), not a variational AE, and is similar to https://pytorch-lightning.readthedocs.io/en/stable/notebooks/course_UvA-DL/08-deep-autoencoders.html, except that it works on 1D timeseries data rather than 2D image data. With my plain PyTorch training function it works very well: the loss decreases quickly and consistently, and the model learns a good reconstruction of the original timeseries.
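For context, the model is structured roughly like the sketch below. The layer sizes, kernel sizes, and the MSE reconstruction loss are placeholders rather than my exact architecture; the point is that it is a plain `Conv1d`/`ConvTranspose1d` encoder-decoder:

```python
import torch
from torch import nn
import pytorch_lightning as pl


class Conv1dAutoencoder(pl.LightningModule):
    """Sketch of the vanilla Conv1d autoencoder (placeholder layer sizes)."""

    def __init__(self, n_channels: int = 1, lr: float = 1e-3):
        super().__init__()
        self.save_hyperparameters()
        self.encoder = nn.Sequential(
            nn.Conv1d(n_channels, 16, kernel_size=7, stride=2, padding=3),
            nn.ReLU(),
            nn.Conv1d(16, 32, kernel_size=7, stride=2, padding=3),
            nn.ReLU(),
        )
        self.decoder = nn.Sequential(
            nn.ConvTranspose1d(32, 16, kernel_size=7, stride=2, padding=3, output_padding=1),
            nn.ReLU(),
            nn.ConvTranspose1d(16, n_channels, kernel_size=7, stride=2, padding=3, output_padding=1),
        )

    def forward(self, x):
        # x: (batch, channels, time) -> reconstruction with the same shape
        return self.decoder(self.encoder(x))

    def training_step(self, batch, batch_idx):
        x = batch  # the dataloader yields plain (batch, channels, time) tensors here
        loss = nn.functional.mse_loss(self(x), x)
        self.log("train_loss", loss)
        return loss

    def validation_step(self, batch, batch_idx):
        x = batch
        self.log("val_loss", nn.functional.mse_loss(self(x), x))

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=self.hparams.lr)
```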
However, I had serious trouble trying to train the same NN (actually, it is even the same class) with PyTorch Lightning.
During my initial attempts, the loss exploded (it grew beyond the largest number a float can represent) after a few epochs, and the training always crashed at the same epoch and batch, no matter which hyperparameters I changed. The problem occurred when my dataloaders were defined inside my autoencoder `pl.LightningModule` via `train_dataloader()` and `val_dataloader()`, but it did NOT occur when I removed those methods and passed the dataloaders directly to the `Trainer.fit()` method.
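To make the two setups concrete, here is a sketch of both variants; the random placeholder data, batch size, and Trainer arguments are for illustration only:

```python
import torch
from torch.utils.data import DataLoader
import pytorch_lightning as pl

# Random placeholder windows of shape (n_windows, channels, time), for illustration only
train_windows = torch.randn(1000, 1, 100)
val_windows = torch.randn(200, 1, 100)


# Variant 1: dataloaders defined inside the LightningModule
class AEWithLoaders(Conv1dAutoencoder):  # Conv1dAutoencoder as sketched above
    def train_dataloader(self):
        return DataLoader(train_windows, batch_size=100, shuffle=True)

    def val_dataloader(self):
        return DataLoader(val_windows, batch_size=100)


trainer = pl.Trainer(max_epochs=20, gpus=1)
trainer.fit(AEWithLoaders())  # the Trainer asks the module for its dataloaders

# Variant 2: no dataloader methods on the module, loaders passed to Trainer.fit() instead
model = Conv1dAutoencoder()
trainer = pl.Trainer(max_epochs=20, gpus=1)
trainer.fit(
    model,
    train_dataloaders=DataLoader(train_windows, batch_size=100, shuffle=True),
    val_dataloaders=DataLoader(val_windows, batch_size=100),
)
```

In both variants the model and the data are identical; the only difference is where the `DataLoader` objects are created.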
There were also some indications that PL has some serious problems with thread pools: when I used the `pin_memory=True` parameter in the train and val dataloader constructors, an error message about thread pools was printed from a C++ module after every training batch.
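That is, the dataloader constructors in question looked roughly like this (again a sketch, reusing the placeholder data from above):

```python
# Same loaders as above, but with memory pinning enabled in the constructors
train_loader = DataLoader(train_windows, batch_size=100, shuffle=True, pin_memory=True)
val_loader = DataLoader(val_windows, batch_size=100, pin_memory=True)
```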
Because of that, I updated my environment: Python from 3.7 to 3.10, PyTorch from 1.9 to 1.11, and PL from 1.6.3 to 1.6.4. I also updated the CUDA drivers and CUDA toolkit of my Google Compute Engine GPU instance to versions compatible with the updated PyTorch version.

After the update, the C++ error no longer appeared, the loss no longer exploded, and the training did not crash, but the training stagnated: the loss hovered around 9e+3, which is a huge value. Usually it gets down to somewhere between 35 and 50 for a batch of 100 timeseries, depending on hyperparameters etc.
Strangely enough, the training was now broken both when I used the `train_dataloader()`/`val_dataloader()` methods of my `LightningModule` AND when I removed those methods and passed my dataloaders "manually" to the `Trainer.fit()` method.

After that, I suspected that there might be some problem with the PL Trainer's automatic optimization, so I implemented manual optimization in my `LightningModule`, analogous to the corresponding lines of my plain PyTorch training function.
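A simplified sketch of that manual-optimization training step (the MSE loss and the class name are placeholders, not my exact code):

```python
from torch import nn


class ManualOptAutoencoder(Conv1dAutoencoder):  # Conv1dAutoencoder as sketched above
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        # Tell Lightning not to call backward()/optimizer.step() itself
        self.automatic_optimization = False

    def training_step(self, batch, batch_idx):
        opt = self.optimizers()  # the optimizer returned by configure_optimizers()
        x = batch
        loss = nn.functional.mse_loss(self(x), x)

        opt.zero_grad()
        self.manual_backward(loss)  # instead of loss.backward()
        opt.step()

        self.log("train_loss", loss)
        return loss
```

With `automatic_optimization = False`, Lightning no longer calls `backward()`, `optimizer.step()`, or `optimizer.zero_grad()` on my behalf, so these calls in `training_step()` mirror the plain PyTorch loop.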
Now, with this manual optimization, the training works fine, with performance similar to my plain PyTorch training function. However, the LR Finder does not work anymore (see the error messages below), and I suspect that other things might be broken too.
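For reference, the LR Finder invocation I am referring to is along these lines (a sketch; the Trainer arguments are placeholders and a source of training data is assumed):

```python
model = ManualOptAutoencoder()  # the manual-optimization module from the sketch above
trainer = pl.Trainer(max_epochs=20, gpus=1)

# Explicitly via the tuner (this is the call that now fails for me)
lr_finder = trainer.tuner.lr_find(model)
print(lr_finder.suggestion())

# Or via the auto_lr_find flag and Trainer.tune()
trainer = pl.Trainer(auto_lr_find=True, max_epochs=20, gpus=1)
trainer.tune(model)
```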
Motivated by this situation, I have some questions about PyTorch Lightning:
Based on these experiences, I also posted a reply with some solution ideas to a question from January this year (#11598).
I am trying hard to make PL work for my engineering research projects and would be very grateful for any solution ideas and replies.