Outline & Motivation
The usage of log_dir and save_dir is confusing. Imagine we are implementing a new logger; let's see how we could define the save_dir property.
```python
import os

from typing_extensions import override
from lightning.pytorch.loggers import Logger


class CustomLogger(Logger):
    def __init__(self, all_experiment_path):
        self._root_dir = all_experiment_path

    @property
    @override
    def save_dir(self) -> str:
        option1 = os.path.join(self._root_dir, self.name, str(self.version))
        # option2 = self._root_dir
        # option3 = os.path.join(self._root_dir, self.name)
        return option1  # or option2 or option3
```

The first option includes the experiment name and version as subdirectories, and the second one does not.
Since CustomLogger is not a subclass of TensorBoardLogger or CSVLogger, Trainer.log_dir is defined to be logger.save_dir when the first logger is a CustomLogger, according to here.
The config.yaml file will be saved to logger.save_dir/config.yaml. The checkpoint files will be saved to logger.save_dir/logger.name/logger.version, which expands to logger._root_dir/logger.name/logger.version/logger.name/logger.version, i.e. a duplicated directory hierarchy, according to the __resolve_ckpt_dir() function.
I DO NOT like this layout. The following tree view shows the duplication clearly.
<all_experiment_root>
└── experiment_name
├── version_0
│ ├── config.yaml
│ └── experiment_name
│ └── version_0
│ └── checkpoints
│ └── epoch=7-step=3000.ckpt
└── version_1
├── config.yaml
└── experiment_name
└── version_1
└── checkpoints
└── epoch=28-step=10875.ckpt
However, there is currently no way to remove the duplicated hierarchy. If the second or third option is adopted, the config.yaml file will be saved in a directory unrelated to the experiment version and will be overwritten by the next run. This is not acceptable.
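The duplication can be reproduced with plain path arithmetic. This is a minimal sketch, assuming option 1 from the example above; the join mirrors what __resolve_ckpt_dir() effectively does when save_dir already embeds the name and version:

```python
import os

# Placeholder values matching the tree view above.
root, name, version = "all_experiments", "experiment_name", "version_0"

# Option 1: save_dir already embeds name and version.
save_dir = os.path.join(root, name, version)

# The checkpoint directory appends name/version/checkpoints again,
# producing the duplicated hierarchy shown in the tree above.
ckpt_dir = os.path.join(save_dir, name, version, "checkpoints")
print(ckpt_dir)
# on POSIX: all_experiments/experiment_name/version_0/experiment_name/version_0/checkpoints
```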
Pitch
I expect the resulting hierarchy to look like this:
<all_experiment_root>
└── experiment_name
├── version_0
│ ├── config.yaml
│ └── checkpoints
│ └── epoch=7-step=3000.ckpt
└── version_1
├── config.yaml
└── checkpoints
└── epoch=28-step=10875.ckpt
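One way to get this layout would be for the logger to expose two distinct directories: a root shared by all experiments (save_dir) and a versioned directory (log_dir) that receives both config.yaml and checkpoints/. The following is a hypothetical sketch of just the directory properties, not a full Logger subclass; the version_{n} naming mirrors Lightning's convention, and whether the Trainer would actually consult log_dir for a custom logger is exactly the open question of this issue:

```python
import os


class CustomLoggerDirs:
    """Sketch of the two directory properties only (not a full Logger)."""

    def __init__(self, all_experiment_path, name="experiment_name", version=0):
        self._root_dir = all_experiment_path
        self._name = name
        self._version = version

    @property
    def save_dir(self) -> str:
        # Root under which every experiment lives.
        return self._root_dir

    @property
    def log_dir(self) -> str:
        # Versioned directory: config.yaml and checkpoints/ would go here.
        return os.path.join(self._root_dir, self._name, f"version_{self._version}")


dirs = CustomLoggerDirs("all_experiments")
print(os.path.join(dirs.log_dir, "config.yaml"))
print(os.path.join(dirs.log_dir, "checkpoints"))
# on POSIX:
# all_experiments/experiment_name/version_0/config.yaml
# all_experiments/experiment_name/version_0/checkpoints
```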
I tried to fix this problem and was willing to submit a PR, but fixing this issue turned out to be more complicated than I imagined.
Additional context
Because I'm new to Lightning, I'm not sure I fully understand the Logger design. In the current code, several attributes such as root_dir, log_dir, and save_dir are not clearly defined. For example, in the definition of the lightning.fabric.loggers.tensorboard.TensorBoardLogger class, the experiment name is not included in the root_dir path, but for lightning.pytorch.loggers.tensorboard.TensorBoardLogger, it is.
The fact that the root_dir and log_dir properties can be None introduces unnecessary complexity. The documentation of log_dir says:
> Return the root directory where all versions of an experiment get saved, or `None` if the logger does not save data locally.
If I understand it correctly, the properties root_dir and log_dir should never be None for any concrete logger. This is because config.yaml should always be saved somewhere like root_dir/name/version. Even if a logger does not save config.yaml locally, we could still initialize root_dir to the current directory ('.') and simply not use it.
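A sketch of that suggestion (hypothetical, not existing Lightning behavior): a logger that never writes locally still exposes a harmless, non-None root_dir.

```python
class NoLocalWriteDirs:
    """Sketch: a logger that doesn't save locally still has a valid root_dir."""

    def __init__(self, root_dir=None):
        # Fall back to the current directory instead of returning None.
        self._root_dir = root_dir if root_dir is not None else "."

    @property
    def root_dir(self) -> str:
        return self._root_dir


print(NoLocalWriteDirs().root_dir)  # prints "."
```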
I have two questions:
- Why do we need the save_dir property? Why is always setting root_dir to <root>/name and log_dir to <root>/name/version not sufficient?
- What makes TensorBoardLogger and CSVLogger special, such that Trainer.log_dir is logger.log_dir instead of logger.save_dir?