Bug description
I noticed significantly degraded performance with the TensorBoard logger when writing to S3.
I printed the call stack of the TensorBoard logger's flush call and found that TensorBoard's flush is called on every call to log_metrics.
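One way to capture a call stack like this is to wrap the writer's `flush` method so that each call prints the stack and increments a counter. The sketch below uses a stand-in `DummyWriter` class and a hypothetical helper `instrument_flush`; neither is part of Lightning or TensorBoard, and the wrapper would need to be applied to the real event writer to reproduce the trace in this report.

```python
import io
import traceback


class DummyWriter:
    """Stand-in for an event writer; only flush() matters here."""

    def flush(self):
        pass


def instrument_flush(obj, stream):
    """Wrap obj.flush so each call prints the call stack and is counted.

    Hypothetical helper for illustration; returns a dict holding the count.
    """
    stats = {"calls": 0}
    original = obj.flush

    def wrapped(*args, **kwargs):
        stats["calls"] += 1
        traceback.print_stack(file=stream)  # record who triggered the flush
        return original(*args, **kwargs)

    obj.flush = wrapped
    return stats


def log_metrics(writer):
    # Mimics the reported behaviour: every logging call ends in a flush.
    writer.flush()


writer = DummyWriter()
trace_output = io.StringIO()
stats = instrument_flush(writer, trace_output)
for _ in range(3):
    log_metrics(writer)
print(stats["calls"])
```

Comparing the counter with the number of logging calls shows whether flushes are batched or issued once per call.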
What version are you seeing the problem on?
v2.4
How to reproduce the bug
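import lightning as L
from lightning.pytorch.loggers import TensorBoardLogger

# num_nodes and local_world_size are defined elsewhere in the script (not shown)
logger = TensorBoardLogger("s3-mountpoint", max_queue=1000, flush_secs=20)
trainer = L.Trainer(
    num_nodes=num_nodes,
    devices=local_world_size,
    accelerator="cuda",
    max_epochs=1,
    precision="bf16-true",
    strategy="fsdp",
    log_every_n_steps=1,
    enable_checkpointing=False,
    default_root_dir="mountpoint",
    logger=logger,
)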
Error messages and logs
    trainer.fit(lit_model, data)
  File "/root/miniforge3/envs/lightning/lib/python3.11/site-packages/lightning/pytorch/trainer/trainer.py", line 538, in fit
    call._call_and_handle_interrupt(
  File "/root/miniforge3/envs/lightning/lib/python3.11/site-packages/lightning/pytorch/trainer/call.py", line 46, in _call_and_handle_interrupt
    return trainer.strategy.launcher.launch(trainer_fn, *args, trainer=trainer, **kwargs)
  File "/root/miniforge3/envs/lightning/lib/python3.11/site-packages/lightning/pytorch/strategies/launchers/subprocess_script.py", line 105, in launch
    return function(*args, **kwargs)
  File "/root/miniforge3/envs/lightning/lib/python3.11/site-packages/lightning/pytorch/trainer/trainer.py", line 574, in _fit_impl
    self._run(model, ckpt_path=ckpt_path)
  File "/root/miniforge3/envs/lightning/lib/python3.11/site-packages/lightning/pytorch/trainer/trainer.py", line 981, in _run
    results = self._run_stage()
  File "/root/miniforge3/envs/lightning/lib/python3.11/site-packages/lightning/pytorch/trainer/trainer.py", line 1025, in _run_stage
    self.fit_loop.run()
  File "/root/miniforge3/envs/lightning/lib/python3.11/site-packages/lightning/pytorch/loops/fit_loop.py", line 205, in run
    self.advance()
  File "/root/miniforge3/envs/lightning/lib/python3.11/site-packages/lightning/pytorch/loops/fit_loop.py", line 363, in advance
    self.epoch_loop.run(self._data_fetcher)
  File "/root/miniforge3/envs/lightning/lib/python3.11/site-packages/lightning/pytorch/loops/training_epoch_loop.py", line 140, in run
    self.advance(data_fetcher)
  File "/root/miniforge3/envs/lightning/lib/python3.11/site-packages/lightning/pytorch/loops/training_epoch_loop.py", line 278, in advance
    trainer._logger_connector.update_train_step_metrics()
  File "/root/miniforge3/envs/lightning/lib/python3.11/site-packages/lightning/pytorch/trainer/connectors/logger_connector/logger_connector.py", line 163, in update_train_step_metrics
    self.log_metrics(self.metrics["log"])
  File "/root/miniforge3/envs/lightning/lib/python3.11/site-packages/lightning/pytorch/trainer/connectors/logger_connector/logger_connector.py", line 118, in log_metrics
    logger.save()
  File "/root/miniforge3/envs/lightning/lib/python3.11/site-packages/lightning_utilities/core/rank_zero.py", line 42, in wrapped_fn
    return fn(*args, **kwargs)
  File "/root/miniforge3/envs/lightning/lib/python3.11/site-packages/lightning/pytorch/loggers/tensorboard.py", line 210, in save
    super().save()
  File "/root/miniforge3/envs/lightning/lib/python3.11/site-packages/lightning_utilities/core/rank_zero.py", line 42, in wrapped_fn
    return fn(*args, **kwargs)
  File "/root/miniforge3/envs/lightning/lib/python3.11/site-packages/lightning/fabric/loggers/tensorboard.py", line 290, in save
    self.experiment.flush()
  File "/root/miniforge3/envs/lightning/lib/python3.11/site-packages/torch/utils/tensorboard/writer.py", line 1194, in flush
    writer.flush()
  File "/root/miniforge3/envs/lightning/lib/python3.11/site-packages/torch/utils/tensorboard/writer.py", line 153, in flush
    self.event_writer.flush()
  File "/root/miniforge3/envs/lightning/lib/python3.11/site-packages/tensorboard/summary/writer/event_file_writer.py", line 127, in flush
    self._async_writer.flush()
  File "/root/miniforge3/envs/lightning/lib/python3.11/site-packages/tensorboard/summary/writer/event_file_writer.py", line 185, in flush
    traceback.print_stack()
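The trace shows `logger.save()` reaching `flush()` directly from `log_metrics`, which suggests the flush is explicit and is not governed by `flush_secs`. One possible mitigation, sketched here under that assumption, is to rate-limit explicit flushes by elapsed time. `ThrottledFlusher` and `CountingWriter` are hypothetical names for illustration, not Lightning or TensorBoard classes, and the injected `clock` only makes the behaviour easy to check without sleeping.

```python
import time


class ThrottledFlusher:
    """Forward flush() to a target at most once per `min_interval` seconds.

    Hypothetical wrapper for illustration only.
    """

    def __init__(self, target, min_interval, clock=time.monotonic):
        self._target = target
        self._min_interval = min_interval
        self._clock = clock
        self._last_flush = None  # time of the last forwarded flush

    def flush(self, force=False):
        now = self._clock()
        due = (
            self._last_flush is None
            or now - self._last_flush >= self._min_interval
        )
        if force or due:
            self._target.flush()
            self._last_flush = now
            return True
        return False  # skipped: the last flush was too recent


class CountingWriter:
    """Stand-in writer that only counts how often it is flushed."""

    def __init__(self):
        self.flushes = 0

    def flush(self):
        self.flushes += 1


fake_time = [0.0]
writer = CountingWriter()
flusher = ThrottledFlusher(writer, min_interval=20.0, clock=lambda: fake_time[0])
for step in range(100):
    fake_time[0] = float(step)  # pretend each training step takes one second
    flusher.flush()
flusher.flush(force=True)  # final flush so nothing is left buffered
print(writer.flushes)
```

With one simulated second per step and a 20 second interval, 100 logging calls produce only a handful of real flushes, plus the forced one at the end. Whether skipping flushes is acceptable depends on how much buffered data can be lost if the process crashes.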
Environment
Current environment
#- PyTorch Lightning Version (e.g., 2.5.0): 2.4.0
#- PyTorch Version (e.g., 2.5):
#- Python version (e.g., 3.12):
#- OS (e.g., Linux): Linux
#- CUDA/cuDNN version:
#- GPU models and configuration:
#- How you installed Lightning(`conda`, `pip`, source):
cc @lantiga