Loss logged as second GLOBAL TensorBoard run, leading to crash due to too many files #11160
Unanswered
jonbraunstat asked this question in Lightning Trainer API: Trainer, LightningModule, LightningDataModule
Hi everyone,
I'm using a TensorBoard logger like this:
tb_logger = pl_loggers.TensorBoardLogger(args.dir)
trainer = pl.Trainer(...., logger=tb_logger, check_val_every_n_epoch=20, log_every_n_steps=500, num_sanity_val_steps=0)
In my training_epoch_end and validation_epoch_end hooks I log everything like this:
tensorboard = self.logger.experiment
tensorboard.add_scalar('Acc', acc, self.current_epoch)
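For context, my validation_epoch_end looks roughly like the following (a simplified sketch; the metric name and the way I average the per-batch values are illustrative):

def validation_epoch_end(self, outputs):
    # Average the per-batch accuracy collected from validation_step (illustrative).
    acc = torch.stack([o['acc'] for o in outputs]).mean()
    # Write directly to the underlying SummaryWriter, keyed by the current epoch.
    tensorboard = self.logger.experiment
    tensorboard.add_scalar('Acc', acc, self.current_epoch)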
However, when I monitor the training run, TensorBoard actually shows me two runs. One is called "default/version_0" and contains all my scalars, histograms, etc., exactly as intended.
The other run, called "GLOBAL", logs a scalar called "nll_loss_output_0", even though I'm merely calling
loss = torch.nn.functional.cross_entropy(logits, targets)
My local TensorBoard tells me there are too many files in the GLOBAL folder (2800+), SageMaker's TensorBoard monitor raises an InternalServerError, and the whole training run fails about a third of the way in.
In training_step, I am returning:
log = {
    'train_loss': loss.detach(),
    'acc1': acc1,
    'acc10': acc10,
}
return {'loss': loss, 'log': log}
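For completeness, the full training_step is roughly the following (a simplified sketch; the batch unpacking and topk_accuracy are stand-ins for my actual code):

def training_step(self, batch, batch_idx):
    images, targets = batch
    logits = self(images)
    # This is the only loss I compute anywhere in the module.
    loss = torch.nn.functional.cross_entropy(logits, targets)
    # topk_accuracy is a placeholder for my own top-1 / top-10 accuracy helper.
    acc1, acc10 = topk_accuracy(logits, targets)
    log = {'train_loss': loss.detach(), 'acc1': acc1, 'acc10': acc10}
    return {'loss': loss, 'log': log}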
Is log a keyword there or something? I'm not calling anything "nll_loss_output_0" anywhere, so could someone please advise on how to get rid of the GLOBAL run altogether? Setting log_every_n_steps to 2000000 or so probably wouldn't fix it, since I seem to be getting multiple files per training_step.
Or is there any logging implicit for learning rate schedulers?
def configure_optimizers(self):
    optimizer = torch.optim.SGD(self.parameters(), lr=self.lr)
    lr_schedule = torch.optim.lr_scheduler.CosineAnnealingLR(
        optimizer, self.trainer.max_epochs)
    self.optimizer_ = optimizer
    return [optimizer], [lr_schedule]
I'm also using a checkpoint callback:
checkpoint_callback = ModelCheckpoint(
    save_top_k=-1,
    dirpath=os.path.join(args.output_dir, 'checkpoints/'),
    filename='checkpoint{epoch:04d}',
    auto_insert_metric_name=False,
    every_n_epochs=20,
    save_on_train_epoch_end=True)
It would be great if anyone could help me out!
All the best,
Jonas