where to add preprocessing initialization #7307
-
I would like to have a step that is called before the first training step, but that still needs access to the dataloader, e.g. (mock code):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
import pytorch_lightning as pl


class Scaler(nn.Module):
    """Center target data."""
    def __init__(self, dims):
        super().__init__()
        # running statistics, stored as buffers so they are not optimized
        self.register_buffer("mean", torch.zeros(dims))
        self.register_buffer("n", torch.zeros(1))

    def forward(self, batch):
        input, target = batch
        if self.training:
            # accumulate per-batch means
            self.mean += target.mean(0)
            self.n += 1
        else:
            return input, target - self.mean / self.n


class MySystem(pl.LightningModule):
    def __init__(self, scaler_dims, model_dims):
        super().__init__()
        self.model = nn.Linear(**model_dims)
        self.scaler = Scaler(scaler_dims).train()

    def on_first_epoch(self, dataloader):  # <---- not sure where this should live
        # learn to scale the dataset
        for batch in dataloader:
            self.scaler(batch)

    def training_step(self, batch, batch_idx):
        self.scaler.eval()
        input, target = self.scaler(batch)
        pred = self.model(input)
        loss = F.l1_loss(pred, target)
        return loss


dm = MyDataModule()
system = MySystem(scaler_dims=..., model_dims=...)  # dims elided
trainer = pl.Trainer()
trainer.fit(system, dm)
```

I'm not clear on how to do this with PL's API. Any advice? Thanks!
Replies: 2 comments
-
In this instance, would it be simpler to iterate through the dataset outside of Lightning, prior to starting training?
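A minimal sketch of that approach, reusing the `Scaler`, `MySystem`, and `MyDataModule` names from the question (the dims and the manual datamodule calls are assumptions, not part of the reply):

```python
# Fit the scaler by hand before handing control to Lightning.
dm = MyDataModule()
dm.prepare_data()
dm.setup("fit")

system = MySystem(scaler_dims=..., model_dims=...)  # dims elided, as in the question
system.scaler.train()
for batch in dm.train_dataloader():
    system.scaler(batch)  # accumulate target statistics
system.scaler.eval()

trainer = pl.Trainer()
trainer.fit(system, dm)
```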
-
thanks @ananthsub , I think the point of Lightning is to try to keep everything in the same system. Going through the docs, I think the best is either `self.prepare_data` (which is called only once in distributed training, as opposed to `self.setup`, which runs on every process) or `self.setup` itself.
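For reference, a rough sketch of what the `setup`-hook variant could look like. This is an assumption about where the fitting loop would go, not a confirmed recipe; in particular, reaching the dataloader via `self.trainer.datamodule` inside `setup` is assumed to work because the datamodule's own `setup("fit")` has already run at that point:

```python
class MySystem(pl.LightningModule):
    def __init__(self, scaler_dims, model_dims):
        super().__init__()
        self.model = nn.Linear(**model_dims)
        self.scaler = Scaler(scaler_dims)

    def setup(self, stage=None):
        # fit the scaler once before training starts
        if stage in (None, "fit"):
            self.scaler.train()
            for batch in self.trainer.datamodule.train_dataloader():
                self.scaler(batch)  # accumulate target statistics
            self.scaler.eval()

    def training_step(self, batch, batch_idx):
        input, target = self.scaler(batch)
        pred = self.model(input)
        return F.l1_loss(pred, target)
```

Note that `prepare_data` runs only once (not per process), so state set there is not reliably shared across processes; statistics like these are therefore usually computed in `setup`, which runs on every process.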