Log subset of metrics to console as text #6182
Replies: 2 comments 2 replies
-
You can still use Python's built-in logging for every kind of text logging within your module. We don't log anything to console except for the progressbar (well we do, but not the metric related stuff, more like warnings). |
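As an illustration of that suggestion, a plain standard-library logger can already produce an aligned text log with comma-separated batch numbers. This is a minimal sketch; the function name, format string, and metric names are illustrative, not from this thread:

```python
import logging

# Module-level logger, independent of Lightning's own logging.
logger = logging.getLogger("my_model")
handler = logging.StreamHandler()
handler.setFormatter(logging.Formatter("%(asctime)s %(levelname)s %(message)s"))
logger.addHandler(handler)
logger.setLevel(logging.INFO)

def log_metrics(epoch, batch, metrics):
    """Format one training line with fixed-width columns and comma grouping."""
    cols = "  ".join(f"{k}={v:.4f}" for k, v in metrics.items())
    # {batch:>9,d} right-aligns and inserts thousands separators, so the
    # columns stay vertical even as the batch number grows.
    line = f"epoch {epoch:3d}  batch {batch:>9,d}  {cols}"
    logger.info(line)
    return line
```

For example, `log_metrics(1, 12345, {"loss": 0.532, "char_acc": 0.91})` emits one fixed-width line per call.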
-
Revisiting this and summarizing what I would now consider best practice. While the custom callback mentioned above can work, I don't think it's a great solution, since the formatting and metric names users choose to log depend heavily on the type of model used. In my opinion, a better option is to define the hooks on the LightningModule. This has a few benefits:
```python
import logging

import torch.nn.functional as F

logger = logging.getLogger(__name__)

# Hooks defined on the LightningModule (shown without the surrounding class body)
def training_step(self, batch, batch_idx):
    # Unpack batch and run forward prop
    ...
    # Calculate metrics
    outputs = {
        'loss': F.cross_entropy(logits, gt_labels),
        'char_acc': accuracy(...),
        'seq_acc': sequence_accuracy(...),
    }
    # Log values per-step (with train prefix) and per-epoch (with avg prefix)
    for k, v in outputs.items():
        self.log(f'train_{k}', v)
        self.log(f'avg_train_{k}', v, on_step=False, on_epoch=True)
    return outputs

def on_train_batch_end(self, outputs, batch, batch_idx, dataloader_idx):
    metrics = self.trainer.callback_metrics
    logger.info(f'Now we can access {metrics} as well as {self.trainer.num_training_batches}')

def on_validation_epoch_end(self):
    metrics = self.trainer.callback_metrics
    logger.info(f'The averages are pre-computed for us, like {metrics["avg_train_loss"]}')
```
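The `avg_train_*` values come from the `on_epoch=True` mean reduction that `self.log` requests. Conceptually it behaves like a running mean over the epoch's step values; the class below is a plain-Python sketch of that behavior, not Lightning's actual implementation:

```python
class RunningMean:
    """Roughly mimics the mean reduction self.log(..., on_epoch=True) applies."""
    def __init__(self):
        self.total = 0.0
        self.count = 0

    def update(self, value):
        # Called once per training step with that step's metric value.
        self.total += value
        self.count += 1

    def compute(self):
        # Called at epoch end to produce the avg_train_* value.
        return self.total / self.count

avg = RunningMean()
for step_loss in [0.9, 0.7, 0.5]:
    avg.update(step_loss)
```

Here `avg.compute()` returns the epoch mean of the three step losses.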
-
I am in the process of converting a vanilla PyTorch training system to Lightning and am having some difficulty understanding how to replicate my existing logging setup (as closely as possible). I like the benefit of built-in TensorBoard support, but I still need a good text log to allow inspection on machines without a GUI or easy browser connectivity.
Here's a sample of the first epoch from my current log file to illustrate what I'm looking to replicate. Note: char, seq, and region are short for character, sequence, and region accuracy
@awaelchli suggests Lightning's CSVLogger in #4876, but it falls short of a few desirable features:
- separators stay in vertical columns despite the number of digits in the batch number varying, and large numbers are dynamically formatted with commas
- there is no way to log only a subset of the metrics passed to self.log() inside my pl.LightningModule, since if the trainer has multiple loggers configured, they must all emit/save the same metrics at the same frequency

I followed the docs for creating a custom logger that uses Python's built-in logging framework. It works alright, but has all the same limitations as the CSVLogger. If it helps, I can share the complete code for my logging implementation.

Has anyone found a solution that can address the above shortcomings? Should this be a feature request?