Why does JAX show different training dynamics than PyTorch? #6325
-
Hi all, I noticed that JAX, used together with dm-haiku, shows different training dynamics than PyTorch, even when using the same architecture, optimizer, hyperparameters, initialization scheme, seeds, dataloaders, etc. Specifically, JAX appears to converge faster than PyTorch and reaches (comparably) higher accuracy after 100 epochs. The difference is fairly significant and appears systematic across datasets and training runs. You can see the described behavior in the following two minimal implementations: In the implementations above, the following factors were considered to ensure the comparison is fair:
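One way to make the "same initialization" factor verifiable rather than assumed is to dump one framework's initial weights to disk and load the exact same arrays into the other. A minimal numpy-only sketch (the parameter names and the `init_weights.npz` filename are illustrative, not from the original implementations):

```python
import numpy as np

# Hypothetical initial parameters; in practice these would be exported
# from the PyTorch model's state_dict as numpy arrays.
rng = np.random.default_rng(0)
init = {
    "conv1_w": rng.normal(size=(3, 3, 1, 8)),
    "fc1_w": rng.normal(size=(128, 10)),
}

# Round-trip through an .npz file, as a JAX/haiku script would do
# when loading the shared initialization.
np.savez("init_weights.npz", **init)
loaded = dict(np.load("init_weights.npz"))

match = all(np.array_equal(init[k], loaded[k]) for k in init)
print(match)
```

With both models reading from the same file, any remaining divergence must come from the forward/backward computation itself, not from initialization.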
I also took a closer look into the Jax and PyTorch repos and found that:
After accounting for all of the above, I'm really at a loss as to what could be causing the fairly different training dynamics, especially since it makes it difficult to reproduce results for models originally implemented in PyTorch. I would really appreciate it if you could let me know whether I missed something, and/or whether you have an idea of what could be causing the different training dynamics. Thank you!
Replies: 5 comments 7 replies
-
Thanks for the question! Unfortunately I don't have any hypotheses to offer. You could try generating the exact same initializations, and on one minibatch try breaking things down layer-by-layer to find where the first numerical divergence occurs.
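The layer-by-layer check suggested above can be sketched as a small utility that scans activations dumped from both frameworks (as numpy arrays) and reports the first layer where they disagree. The layer names and the toy activation values here are purely illustrative:

```python
import numpy as np

def first_divergence(acts_a, acts_b, names, atol=1e-5):
    """Return (layer_name, max_abs_diff) for the first layer whose
    activations differ beyond atol, or None if all layers match."""
    for name, a, b in zip(names, acts_a, acts_b):
        if not np.allclose(a, b, atol=atol):
            return name, float(np.max(np.abs(a - b)))
    return None

# Toy example: two "runs" that agree on the first layer
# but diverge at the second.
run_a = [np.zeros((2, 3)), np.ones((2, 3))]
run_b = [np.zeros((2, 3)), np.ones((2, 3)) + 1e-3]
print(first_divergence(run_a, run_b, ["conv1", "fc1"]))
```

In practice one would hook or print the intermediate activations of both models on the same minibatch, convert them to numpy, and feed them through a check like this to localize the first mismatch.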
-
@tomhennigan @mtthss @inoryy
-
+1 to what @mattjj suggested, starting from an identical initialisation is the way to go. I've been working on porting some torchvision checkpoints to Flax, and have some simple utilities to help check numerical equivalence of intermediate activations. Perhaps they'll be useful to you to debug your problem. Also, I noticed your L2 regularisation is implemented slightly differently between the two implementations: in PyTorch you use the sum of squares, whereas in JAX you use […]. I'd also explicitly use a padding of […]. Good luck!
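The reply above cuts off before naming the JAX-side convention, so the exact difference is unknown; still, the general point can be illustrated. Common L2-penalty implementations differ in whether they sum or average the squared weights, and with the same coefficient those yield different effective regularisation strengths (all values below are made up for illustration):

```python
import numpy as np

# Hypothetical parameters and coefficient.
params = [np.ones(10), np.ones(20)]
lam = 1e-4

# PyTorch-style weight decay penalises the sum of squares ...
l2_sum = lam * sum(np.sum(p ** 2) for p in params)

# ... while a per-parameter mean of squares (another common convention)
# scales the penalty down by each parameter's size.
l2_mean = lam * sum(np.mean(p ** 2) for p in params)

print(l2_sum, l2_mean)  # same lam, very different penalty magnitudes
```

If the two implementations use different conventions, matching the nominal weight-decay coefficient does not match the actual regularisation strength, which alone can change convergence speed.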
-
Hi all! I have made some interesting discoveries based on the kind suggestions from @mattjj @tomhennigan @n2cholas. I used the same initialization for JAX and PyTorch in the original custom CNN model, and discovered that the values start to diverge at the first forward pass. I then stripped my network down to a single CNN layer and a single FC layer to trace the divergence, and found that the forward pass at initialization is exactly the same. Check the notebooks below for details. However, interestingly, if the CNN and FC layers are combined, the forward-pass values start to diverge. Check here. It is so strange that the CNN layer and the FC layer each show the same pattern as PyTorch separately, but divergence occurs when they are combined. I would appreciate it if you could let me know whether I missed something, and/or whether you have an idea of what could be causing this divergence. PS: I am new to this area, and I feel so happy that my question is being valued and that the replies are so nice and helpful. I really appreciate it! :)
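One plausible cause consistent with this exact symptom (each layer matches in isolation, but conv followed by FC diverges) is layout: PyTorch convolutions produce NCHW activations while Haiku defaults to NHWC, so flattening before the FC layer feeds the features in a different order. A numpy sketch of the effect, with made-up tensor sizes:

```python
import numpy as np

# Same data in the two common activation layouts.
x_nchw = np.arange(24).reshape(1, 2, 3, 4)      # (N, C, H, W), PyTorch-style
x_nhwc = np.transpose(x_nchw, (0, 2, 3, 1))     # (N, H, W, C), Haiku-style

# Flattening each layout orders the features differently, so an FC layer
# with identical weights sees permuted inputs and produces different outputs.
flat_nchw = x_nchw.reshape(1, -1)
flat_nhwc = x_nhwc.reshape(1, -1)
print(np.array_equal(flat_nchw, flat_nhwc))

# Transposing back to NCHW before flattening restores agreement.
flat_fixed = np.transpose(x_nhwc, (0, 3, 1, 2)).reshape(1, -1)
print(np.array_equal(flat_nchw, flat_fixed))
```

This would explain why the conv-only and FC-only tests pass: the mismatch only materialises at the reshape between them.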
-
Hi all! The training dynamics are now similar, thanks to a very useful comment from @n2cholas. The details can be found in this notebook here. Thank you all for your replies! Cheers!