
Add old stack logging support to new stack#889

Open
quic-abhamidi wants to merge 1 commit intoquic:ft_experimentalfrom
quic-abhamidi:loggerv2

Conversation

@quic-abhamidi

Added the following support for easy visualization of training and validation statistics:
1. A train_logger callback function that captures the per-epoch time, per-epoch loss metric, and per-epoch perplexity.
2. The callback also captures the number of trainable parameters and the number of samples in the training and eval datasets.
3. All of these are logged to a log file whose path the user can set with the --log_file_path flag in the input config .yaml file.


Signed-off-by: Anusha Bhamidipati <abhamidi@qti.qualcomm.com>
# Compute perplexity safely
train_metric = None
if train_loss is not None:
    train_metric = math.exp(train_loss)
Contributor

Verify the train_metric values and check whether they match step-wise with the old FT stack. Use the same SDK, and the same seed and data_seed, on both stacks for reproducibility.

Contributor

Also use a try block to handle the case where the metric value overflows.

Author

Yes, will add this check
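The overflow guard the reviewer asks for can be sketched as below. This is a minimal illustration of the suggestion, not the PR's actual code; the helper name `safe_perplexity` and the choice to return infinity on overflow are assumptions.

```python
import math

def safe_perplexity(loss):
    """Hypothetical helper: exp(loss) with overflow handling.

    math.exp raises OverflowError for large float inputs (roughly
    loss > 709), so we catch it and cap the metric at infinity.
    """
    if loss is None:
        return None
    try:
        return math.exp(loss)
    except OverflowError:
        # Loss too large to exponentiate; report inf instead of crashing.
        return float("inf")
```

Returning `float("inf")` keeps the log line well-formed; clamping to a large sentinel value would be an equally valid choice.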

Contributor

@quic-akuruvil left a comment

Also check that the PP, DDP, and multinode DDP APIs are logging properly with the new change.

if self.rank != 0:
    return

epoch = int(state.epoch) + 1
Contributor

Here it is +1, but in other methods it is just state.epoch.

Author

Here, we increment by one so that epoch 0 is shown as 1, for better logging.

Contributor

So logging will start from epoch 1, is that right?

Author

If we are running for 5 epochs, instead of representing the logs as epochs 0-4 we represent them as epochs 1-5. The information is still logged from the first epoch (epoch 0) as usual.
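The epoch-numbering convention discussed in this thread can be sketched as follows. This is a toy illustration, not the PR's callback class; it only assumes Trainer-style semantics where state.epoch is 0-based at epoch begin and already advanced at epoch end.

```python
class EpochDisplay:
    """Toy sketch of 1-based epoch labels for logging."""

    @staticmethod
    def label_at_begin(state_epoch):
        # At epoch begin, state.epoch is still the 0-based index,
        # so +1 gives the human-friendly 1-based label.
        return int(state_epoch) + 1

    @staticmethod
    def label_at_end(state_epoch):
        # At epoch end the trainer has already incremented state.epoch,
        # so no offset is needed.
        return int(state_epoch)
```

With this convention, the begin and end hooks of the same epoch produce the same label, e.g. both report "epoch 1" for the first epoch.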

if self.rank != 0:
    return

epoch = int(state.epoch)
Contributor

Check this

Author

Here, since this is at the end of the epoch, state.epoch has already been incremented by 1: after the training of epoch 0 is done, the state.epoch value changes to 1, so there is no need to add +1 here.

Author

This is also the reason for not incrementing state.epoch by 1 in the eval logs: they are written after the train epoch is completed, at which point the state.epoch value has already been incremented.

if self.rank != 0:
    return
logger.log_rank_zero(text)
with open(self.log_file, "a") as f:
Contributor

It would be better to put this inside a try block, to catch any write errors.
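The suggested guard can be sketched like this. The function name, signature, and return convention are illustrative assumptions, not the PR's actual API; the point is only that the file append is wrapped in try/except so a write failure does not crash training.

```python
import logging

def write_log_line(path, text, rank=0):
    """Sketch: append one line to the log file, rank 0 only,
    swallowing I/O errors with a warning instead of raising."""
    if rank != 0:
        return False
    try:
        with open(path, "a") as f:
            f.write(text + "\n")
        return True
    except OSError as e:
        # Disk full, permissions, missing directory, etc.
        logging.warning("Could not write to log file %s: %s", path, e)
        return False
```

Catching OSError covers FileNotFoundError and PermissionError as well, since both are its subclasses.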


from QEfficient.finetune.experimental.core.utils.training_config_utils import prepare_training_config

logger = Logger(__name__)
train_logger = TrainingLogger(rank=0)
Contributor

In the DDP case, this will fail, I think. Please check; I believe we can't hardcode 0 here.

Author

Will change this
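One way to avoid the hardcoded rank, sketched below: read the global rank from the RANK environment variable that torchrun and other torch.distributed launchers set, falling back to 0 for single-process runs. This is an illustration of the reviewer's point; the actual fix in the PR may differ.

```python
import os

def get_process_rank():
    """Return the global process rank for distributed runs.

    torchrun exports RANK for every worker; when it is absent
    (plain single-process training) we default to rank 0.
    """
    return int(os.environ.get("RANK", "0"))

# Hypothetical usage instead of hardcoding rank=0:
# train_logger = TrainingLogger(rank=get_process_rank())
```

An alternative is torch.distributed.get_rank(), but that requires the process group to be initialized first, so the environment variable is the safer choice at module import time.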

# ----------------------------------------------------
# Safe write to log (only rank 0)
# ----------------------------------------------------
def _write(self, text):
Contributor

Usually a single leading underscore marks a private method, but the _write method is called from outside the class in finetune_experimental. Please check.
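The naming concern could be addressed by keeping _write private and exposing a public entry point for external callers, sketched below. The class and method names here are illustrative, not the PR's actual interface.

```python
class TrainingLoggerSketch:
    """Toy sketch: public log() delegates to private _write()."""

    def __init__(self):
        self.lines = []

    def log(self, text):
        # Public API: what code outside the class should call.
        self._write(text)

    def _write(self, text):
        # Private detail: formatting and I/O live here.
        self.lines.append(text)
```

External code then calls log(), and _write() stays free to change its file-handling details without breaking callers.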
