Skip to content

Tensorboard Logger is flushed on every step #20551

@leoleoasd

Description

@leoleoasd

Bug description

I noticed a significantly degraded performance with tensorboard logger on S3.
I printede the call stack of the tensorboard logger's flush call, and found that, on every call to log_metrics, tensorboard's flush will be called.

What version are you seeing the problem on?

v2.4

How to reproduce the bug

    logger = TensorBoardLogger("s3-mountpoint", max_queue=1000, flush_secs=20)
    trainer = L.Trainer(
        num_nodes=num_nodes,
        devices=local_world_size,
        accelerator="cuda",
        max_epochs=1,
        precision="bf16-true",
        strategy="fsdp",
        log_every_n_steps=1,
        enable_checkpointing=False,
        default_root_dir="mountpoint",
        logger=logger,
    )

Error messages and logs

    trainer.fit(lit_model, data)
  File "/root/miniforge3/envs/lightning/lib/python3.11/site-packages/lightning/pytorch/trainer/trainer.py", line 538, in fit
    call._call_and_handle_interrupt(
  File "/root/miniforge3/envs/lightning/lib/python3.11/site-packages/lightning/pytorch/trainer/call.py", line 46, in _call_and_handle_interrupt
    return trainer.strategy.launcher.launch(trainer_fn, *args, trainer=trainer, **kwargs)
  File "/root/miniforge3/envs/lightning/lib/python3.11/site-packages/lightning/pytorch/strategies/launchers/subprocess_script.py", line 105, in launch
    return function(*args, **kwargs)
  File "/root/miniforge3/envs/lightning/lib/python3.11/site-packages/lightning/pytorch/trainer/trainer.py", line 574, in _fit_impl
    self._run(model, ckpt_path=ckpt_path)
  File "/root/miniforge3/envs/lightning/lib/python3.11/site-packages/lightning/pytorch/trainer/trainer.py", line 981, in _run
    results = self._run_stage()
  File "/root/miniforge3/envs/lightning/lib/python3.11/site-packages/lightning/pytorch/trainer/trainer.py", line 1025, in _run_stage
    self.fit_loop.run()
  File "/root/miniforge3/envs/lightning/lib/python3.11/site-packages/lightning/pytorch/loops/fit_loop.py", line 205, in run
    self.advance()
  File "/root/miniforge3/envs/lightning/lib/python3.11/site-packages/lightning/pytorch/loops/fit_loop.py", line 363, in advance
    self.epoch_loop.run(self._data_fetcher)
  File "/root/miniforge3/envs/lightning/lib/python3.11/site-packages/lightning/pytorch/loops/training_epoch_loop.py", line 140, in run
    self.advance(data_fetcher)
  File "/root/miniforge3/envs/lightning/lib/python3.11/site-packages/lightning/pytorch/loops/training_epoch_loop.py", line 278, in advance
    trainer._logger_connector.update_train_step_metrics()
  File "/root/miniforge3/envs/lightning/lib/python3.11/site-packages/lightning/pytorch/trainer/connectors/logger_connector/logger_connector.py", line 163, in update_train_step_metrics
    self.log_metrics(self.metrics["log"])
  File "/root/miniforge3/envs/lightning/lib/python3.11/site-packages/lightning/pytorch/trainer/connectors/logger_connector/logger_connector.py", line 118, in log_metrics
    logger.save()
  File "/root/miniforge3/envs/lightning/lib/python3.11/site-packages/lightning_utilities/core/rank_zero.py", line 42, in wrapped_fn
    return fn(*args, **kwargs)
  File "/root/miniforge3/envs/lightning/lib/python3.11/site-packages/lightning/pytorch/loggers/tensorboard.py", line 210, in save
    super().save()
  File "/root/miniforge3/envs/lightning/lib/python3.11/site-packages/lightning_utilities/core/rank_zero.py", line 42, in wrapped_fn
    return fn(*args, **kwargs)
  File "/root/miniforge3/envs/lightning/lib/python3.11/site-packages/lightning/fabric/loggers/tensorboard.py", line 290, in save
    self.experiment.flush()
  File "/root/miniforge3/envs/lightning/lib/python3.11/site-packages/torch/utils/tensorboard/writer.py", line 1194, in flush
    writer.flush()
  File "/root/miniforge3/envs/lightning/lib/python3.11/site-packages/torch/utils/tensorboard/writer.py", line 153, in flush
    self.event_writer.flush()
  File "/root/miniforge3/envs/lightning/lib/python3.11/site-packages/tensorboard/summary/writer/event_file_writer.py", line 127, in flush
    self._async_writer.flush()
  File "/root/miniforge3/envs/lightning/lib/python3.11/site-packages/tensorboard/summary/writer/event_file_writer.py", line 185, in flush
    traceback.print_stack()

Environment

Current environment
#- PyTorch Lightning Version (e.g., 2.5.0): 2.4.0
#- PyTorch Version (e.g., 2.5):
#- Python version (e.g., 3.12):
#- OS (e.g., Linux): Linux
#- CUDA/cuDNN version:
#- GPU models and configuration:
#- How you installed Lightning(`conda`, `pip`, source):

cc @lantiga

Metadata

Metadata

Assignees

No one assigned

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions