Add old stack logging support to new stack #889
quic-abhamidi wants to merge 1 commit into quic:ft_experimental from
Conversation
Added the following support for easy visualization of training and validation statistics:
1. A train_logger callback which captures the per-epoch time, per-epoch loss metric, and per-epoch perplexity.
2. It also captures the number of trainable parameters and the number of samples in the training and eval datasets.
3. All of these are logged to a log file whose path the user can set via the --log_file_path flag in the input config .yaml file.
Signed-off-by: Anusha Bhamidipati <abhamidi@qti.qualcomm.com>
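The behavior described above can be sketched roughly as follows (a minimal, framework-agnostic illustration; TrainLoggerSketch and its method names are hypothetical stand-ins, not the PR's actual class):

```python
import math
import time

class TrainLoggerSketch:
    """Hypothetical stand-in for the train_logger callback: records
    per-epoch wall-clock time, loss, and perplexity."""

    def __init__(self):
        self.records = []
        self._epoch_start = None

    def on_epoch_begin(self):
        self._epoch_start = time.time()

    def on_epoch_end(self, epoch, loss):
        elapsed = time.time() - self._epoch_start
        try:
            perplexity = math.exp(loss)
        except OverflowError:
            # Guard against very large losses blowing up exp()
            perplexity = float("inf")
        self.records.append(
            {"epoch": epoch, "time_s": elapsed, "loss": loss, "perplexity": perplexity}
        )
```

The real callback would additionally record trainable-parameter and dataset-size counts and write each record to the configured log file.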
# Compute perplexity safely
train_metric = None
if train_loss is not None:
    train_metric = math.exp(train_loss)
Verify the train_metric values and check whether they match step-wise with the old FT stack. Use the same SDK, and the same seed and data_seed on both stacks, for reproducibility.
Also wrap this in a try block to handle the case where the metric value overflows.
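A guarded version of the computation above might look like this (a sketch; safe_perplexity is an illustrative name, not code from the PR):

```python
import math

def safe_perplexity(loss):
    """Return exp(loss), or None / inf when the input is missing or too large."""
    if loss is None:
        return None
    try:
        return math.exp(loss)
    except OverflowError:
        # math.exp raises OverflowError for arguments above ~709.78
        return float("inf")
```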
if self.rank != 0:
    return

epoch = int(state.epoch) + 1
Here it is +1, but in other methods it is just state.epoch.
Here we increment by one to map epoch 0 to 1, for better logging.
So logging will start from epoch 1, right?
If we run it for 5 epochs, the logs are labeled epochs 1-5 instead of epochs 0-4. The information is still logged from the first epoch (epoch 0) as usual.
if self.rank != 0:
    return

epoch = int(state.epoch)
Here, since this runs at the end of the epoch, state.epoch has already been incremented by 1: after training of epoch 0 completes, state.epoch becomes 1, so there is no need to add +1 here.
For the same reason we do not add 1 to state.epoch in the eval logs either: eval runs after the train epoch completes, and at that point state.epoch has already been incremented.
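The convention discussed in this thread can be illustrated with a toy loop (State here is a hypothetical stand-in for the framework's trainer state, whose epoch counter advances at the end of each epoch):

```python
class State:
    """Minimal stand-in for a trainer state whose epoch counter
    the framework advances at the end of each epoch."""
    def __init__(self):
        self.epoch = 0

state = State()
labels = []
for _ in range(5):
    # on_epoch_begin: state.epoch is still the 0-based index, so add 1.
    begin_label = int(state.epoch) + 1
    state.epoch += 1  # the framework increments after the epoch trains
    # on_epoch_end / eval: already incremented, so no +1 is needed.
    end_label = int(state.epoch)
    assert begin_label == end_label
    labels.append(end_label)
# Five epochs are labeled 1..5 rather than 0..4.
```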
if self.rank != 0:
    return
logger.log_rank_zero(text)
with open(self.log_file, "a") as f:
It would be better to put this inside a try block, to catch any write errors.
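One way to harden the write, as suggested (a sketch; the class shape and warning call are assumptions, not the PR's exact code):

```python
import logging

log = logging.getLogger(__name__)

class FileWriter:
    """Hypothetical rank-0 log writer with guarded file I/O."""

    def __init__(self, log_file, rank=0):
        self.log_file = log_file
        self.rank = rank

    def write(self, text):
        if self.rank != 0:
            return
        try:
            with open(self.log_file, "a") as f:
                f.write(text + "\n")
        except OSError as exc:
            # A logging failure should warn, not abort training.
            log.warning("Failed to write to %s: %s", self.log_file, exc)
```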
from QEfficient.finetune.experimental.core.utils.training_config_utils import prepare_training_config

logger = Logger(__name__)
train_logger = TrainingLogger(rank=0)
In the DDP case this will fail, I think. Please check; I believe we can't hardcode 0 here.
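Instead of hardcoding rank=0, the rank could be resolved from the environment, which torchrun-style launchers populate for each worker (a sketch; get_rank is an illustrative helper, and whether this project reads the RANK variable is an assumption):

```python
import os

def get_rank():
    """Return this process's distributed rank.

    DDP launchers such as torchrun export RANK for every worker;
    in a plain single-process run the variable is absent, so we
    fall back to rank 0.
    """
    return int(os.environ.get("RANK", 0))
```

With such a helper, TrainingLogger(rank=get_rank()) would give each worker its true rank instead of a hardcoded 0.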
# ----------------------------------------------------
# Safe write to log (only rank 0)
# ----------------------------------------------------
def _write(self, text):
A single leading underscore conventionally marks a method as private, but the _write method is called from outside the class in finetune_experimental. Please check.