Add old stack logging support to new stack #889
quic-abhamidi wants to merge 1 commit into quic:ft_experimental from
Conversation
Added the following support for easy visualization of training and validation statistics:
1. A train_logger callback which captures the per-epoch time, per-epoch loss metric, and per-epoch perplexity.
2. It also captures the number of trainable parameters and the number of samples in the training and eval datasets.
3. All of these are logged to a log file whose path the user can set via the --log_file_path flag in the input config .yaml file.
Signed-off-by: Anusha Bhamidipati <abhamidi@qti.qualcomm.com>
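The behavior described above can be sketched roughly as follows (a minimal, framework-agnostic illustration; TrainLoggerSketch and its method names are hypothetical stand-ins, not the PR's actual class):

```python
import math
import time

class TrainLoggerSketch:
    """Hypothetical stand-in for the train_logger callback: records
    per-epoch wall-clock time, loss, and perplexity."""

    def __init__(self):
        self.records = []
        self._epoch_start = None

    def on_epoch_begin(self):
        self._epoch_start = time.time()

    def on_epoch_end(self, epoch, loss):
        elapsed = time.time() - self._epoch_start
        try:
            perplexity = math.exp(loss)
        except OverflowError:
            # Guard against very large losses blowing up exp()
            perplexity = float("inf")
        self.records.append(
            {"epoch": epoch, "time_s": elapsed, "loss": loss, "perplexity": perplexity}
        )
```

The real callback would additionally record trainable-parameter and dataset-size counts and write each record to the configured log file.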
# Compute perplexity safely
train_metric = None
if train_loss is not None:
    train_metric = math.exp(train_loss)
Verify the train_metric values and check whether they match step-wise with the old FT stack. Use the same SDK, and the same seed and data_seed on both stacks, for reproducibility.
Also wrap this in a try block to handle the case where the metric value overflows.
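A guarded version of the computation above might look like this (a sketch; safe_perplexity is an illustrative name, not code from the PR):

```python
import math

def safe_perplexity(loss):
    """Return exp(loss), or None / inf when the input is missing or too large."""
    if loss is None:
        return None
    try:
        return math.exp(loss)
    except OverflowError:
        # math.exp raises OverflowError for arguments above ~709.78
        return float("inf")
```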
if self.rank != 0:
    return

epoch = int(state.epoch) + 1
Here it is +1, but in other methods it is just state.epoch.
Here we increment by one to map epoch 0 to 1, for better logging.
So logging will start from epoch 1, right?
If we run it for 5 epochs, the logs are labeled epochs 1-5 instead of epochs 0-4. The information is still logged from the first epoch (epoch 0) as usual.
if self.rank != 0:
    return

epoch = int(state.epoch)
Here, since this runs at the end of the epoch, state.epoch has already been incremented by 1: after training of epoch 0 completes, state.epoch becomes 1, so there is no need to add +1 here.
For the same reason we do not add 1 to state.epoch in the eval logs either: eval runs after the train epoch completes, and at that point state.epoch has already been incremented.
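The convention discussed in this thread can be illustrated with a toy loop (State here is a hypothetical stand-in for the framework's trainer state, whose epoch counter advances at the end of each epoch):

```python
class State:
    """Minimal stand-in for a trainer state whose epoch counter
    the framework advances at the end of each epoch."""
    def __init__(self):
        self.epoch = 0

state = State()
labels = []
for _ in range(5):
    # on_epoch_begin: state.epoch is still the 0-based index, so add 1.
    begin_label = int(state.epoch) + 1
    state.epoch += 1  # the framework increments after the epoch trains
    # on_epoch_end / eval: already incremented, so no +1 is needed.
    end_label = int(state.epoch)
    assert begin_label == end_label
    labels.append(end_label)
# Five epochs are labeled 1..5 rather than 0..4.
```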
if self.rank != 0:
    return
logger.log_rank_zero(text)
with open(self.log_file, "a") as f:
It would be better to put this inside a try block, to catch any write errors.
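One way to harden the write, as suggested (a sketch; the class shape and warning call are assumptions, not the PR's exact code):

```python
import logging

log = logging.getLogger(__name__)

class FileWriter:
    """Hypothetical rank-0 log writer with guarded file I/O."""

    def __init__(self, log_file, rank=0):
        self.log_file = log_file
        self.rank = rank

    def write(self, text):
        if self.rank != 0:
            return
        try:
            with open(self.log_file, "a") as f:
                f.write(text + "\n")
        except OSError as exc:
            # A logging failure should warn, not abort training.
            log.warning("Failed to write to %s: %s", self.log_file, exc)
```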
from QEfficient.finetune.experimental.core.utils.training_config_utils import prepare_training_config

logger = Logger(__name__)
train_logger = TrainingLogger(rank=0)
In the DDP case this will fail, I think. Please check; I believe we can't hardcode 0 here.
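Instead of hardcoding rank=0, the rank could be resolved from the environment, which torchrun-style launchers populate for each worker (a sketch; get_rank is an illustrative helper, and whether this project reads the RANK variable is an assumption):

```python
import os

def get_rank():
    """Return this process's distributed rank.

    DDP launchers such as torchrun export RANK for every worker;
    in a plain single-process run the variable is absent, so we
    fall back to rank 0.
    """
    return int(os.environ.get("RANK", 0))
```

With such a helper, TrainingLogger(rank=get_rank()) would give each worker its true rank instead of a hardcoded 0.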
# ----------------------------------------------------
# Safe write to log (only rank 0)
# ----------------------------------------------------
def _write(self, text):
A single leading underscore conventionally marks a method as private, but the _write method is called from outside the class in finetune_experimental. Please check.