load_from_checkpoint giving different validation results #6678
-
I'm creating a classifier that first trains a VAE and then passes it into a convolutional network. The pseudocode below roughly describes it:

```python
import pytorch_lightning as pl
import torch.nn as nn


class VAE(pl.LightningModule):
    ...  # encoder/decoder details omitted


class ConvNetwork(pl.LightningModule):
    def __init__(self, vae):
        super().__init__()
        # Trying both ways: pass in the entire model vs. loading a checkpoint
        # self.vae = vae
        # self.vae = VAE.load_from_checkpoint(vae)
        freeze_training(self.vae)  # sets all params to requires_grad=False
        self.sub_network = nn.Sequential(
            # Mix of convolutional layers, ReLU activations, and batch normalization
        )

    def forward(self, data):
        vae_decoded_results = self.vae(data)
        results_that_differ_wildly = self.sub_network(vae_decoded_results)
        return results_that_differ_wildly
```
If I train the VAE and pass the entire model in before training the convolutional network, I get good training/validation results. What I would prefer, however, is to train the VAE in a separate script, save off checkpoints, and then pass the path of the checkpoint into the convolutional network. In the convolutional network's `__init__` I load the VAE, freeze training on it, and proceed to train the convolutional network. When I do this, my training results seem okay, but my validation results are all over the place. Some things I've checked:

- The parameters of the loaded VAE appear to match those of the VAE trained in the same script.
- The VAE's outputs appear to match in both cases.
I can't for the life of me figure out why the results from a loaded model would differ so wildly from the results of a model I train and pass in within a single script, especially when the parameters and VAE output appear to match. I'm sure I'm just missing something stupid.
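For reference, the separate-script workflow looks roughly like this (the file names, paths, and trainer settings below are placeholders, not my exact code):

```python
# Script 1: train the VAE and save a checkpoint.
import pytorch_lightning as pl

vae = VAE()
trainer = pl.Trainer(max_epochs=50)       # placeholder settings
trainer.fit(vae, vae_train_dataloader)    # placeholder dataloader
trainer.save_checkpoint("vae.ckpt")       # placeholder path

# Script 2: pass the checkpoint path to the convolutional network;
# its __init__ calls VAE.load_from_checkpoint on it.
conv_net = ConvNetwork(vae="vae.ckpt")
trainer = pl.Trainer(max_epochs=50)
trainer.fit(conv_net, conv_train_dataloader)
```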
-
Just a wild guess, but maybe the model is in `train` mode after loading from a checkpoint. Have you tried `model.eval()` in addition to setting the `requires_grad`? I'm thinking about BN layers and so on, where this is important (see here).
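For example, a `freeze_training` along these lines would cover both (just a sketch; adapt it to your actual helper):

```python
import torch.nn as nn


def freeze_training(module: nn.Module) -> None:
    # Stop gradient updates on every parameter.
    for param in module.parameters():
        param.requires_grad = False
    # Switch to eval mode so BatchNorm uses its running statistics
    # (and Dropout is disabled) instead of per-batch behavior.
    module.eval()
```

One caveat: `nn.Module.train()` recursively flips all submodules back to train mode, and Lightning puts the whole `LightningModule` into train mode when fitting starts, so a frozen submodule can silently end up back in train mode. Re-applying `self.vae.eval()` (e.g., in `on_train_start` or at the top of `training_step`) should guard against that.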