Loss logged as second GLOBAL TensorBoard run, leading to crash due to too many files #11160
Unanswered
jonbraunstat asked this question in Lightning Trainer API: Trainer, LightningModule, LightningDataModule
Hi everyone,
I'm using a TensorBoard logger like this:
tb_logger = pl_loggers.TensorBoardLogger(args.dir)
trainer = pl.Trainer(...., logger=tb_logger, check_val_every_n_epoch=20, log_every_n_steps=500, num_sanity_val_steps=0)
In my training_epoch_end and validation_epoch_end hooks I log everything like this:
tensorboard = self.logger.experiment
tensorboard.add_scalar('Acc', acc, self.current_epoch)
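For context, my validation_epoch_end looks roughly like the following (a simplified sketch; the metric name and the way I average the per-batch values are illustrative):

def validation_epoch_end(self, outputs):
    # Average the per-batch accuracy collected from validation_step (illustrative).
    acc = torch.stack([o['acc'] for o in outputs]).mean()
    # Write directly to the underlying SummaryWriter, keyed by the current epoch.
    tensorboard = self.logger.experiment
    tensorboard.add_scalar('Acc', acc, self.current_epoch)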
However, when I monitor the training run, TensorBoard actually shows me two runs. One is called "default/version_0" and contains all my scalars, histograms, etc., exactly as intended.
The other run, called "GLOBAL", logs a scalar called "nll_loss_output_0", even though I'm merely calling
loss = torch.nn.functional.cross_entropy(logits, targets)
My local TensorBoard tells me there are too many files in the GLOBAL folder (2800+), SageMaker's TensorBoard monitor raises an InternalServerError, and the whole training run fails about a third of the way in.
In training_step, I am returning:
log = {
    'train_loss': loss.detach(),
    'acc1': acc1,
    'acc10': acc10,
}
return {'loss': loss, 'log': log}
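For completeness, the full training_step is roughly the following (a simplified sketch; the batch unpacking and topk_accuracy are stand-ins for my actual code):

def training_step(self, batch, batch_idx):
    images, targets = batch
    logits = self(images)
    # This is the only loss I compute anywhere in the module.
    loss = torch.nn.functional.cross_entropy(logits, targets)
    # topk_accuracy is a placeholder for my own top-1 / top-10 accuracy helper.
    acc1, acc10 = topk_accuracy(logits, targets)
    log = {'train_loss': loss.detach(), 'acc1': acc1, 'acc10': acc10}
    return {'loss': loss, 'log': log}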
Is log a keyword there or something? I'm not calling anything "nll_loss_output_0" anywhere, so could someone please advise on how to get rid of the GLOBAL run altogether? Setting log_every_n_steps to 2000000 or so probably wouldn't fix it, since I seem to be getting multiple files per training_step.
Or is there any logging implicit for learning rate schedulers?
def configure_optimizers(self):
    optimizer = torch.optim.SGD(self.parameters(), lr=self.lr)
    lr_schedule = torch.optim.lr_scheduler.CosineAnnealingLR(
        optimizer, self.trainer.max_epochs)
    self.optimizer_ = optimizer
    return [optimizer], [lr_schedule]
I'm also using a checkpoint callback:
checkpoint_callback = ModelCheckpoint(
    save_top_k=-1,
    dirpath=os.path.join(args.output_dir, 'checkpoints/'),
    filename='checkpoint{epoch:04d}',
    auto_insert_metric_name=False,
    every_n_epochs=20,
    save_on_train_epoch_end=True)
It would be great if anyone could help me out!
All the best,
Jonas