Hi, I've had a problem for some time now, similar to #6325. Using seemingly identical setups, including exactly copied model-initialization weights and identical stochasticity, JAX/Flax training is stable for this toy Variational Diffusion Model (VDM) example, while PyTorch training consistently fails (tested on 50+ PyTorch seeds). Below are the two implementations.

In particular, it can be seen from the training-loop cell outputs that the KL-divergence loss term ( I believe I have accounted for the differences between

Any help or comments are greatly appreciated; thanks!
Solved: gradients weren't being propagated properly through the diffusion loss term. To replace `jax.jvp` in PyTorch, one should use `functorch.jvp` to properly propagate gradients, or use `torch.autograd.functional.jvp` with `create_graph=True`.
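For reference, a minimal sketch of the `create_graph=True` pattern (the `gamma` network, the tangent vector, and the squared-jvp loss below are placeholder stand-ins, not the actual VDM notebook code):

```python
import torch
from torch.autograd.functional import jvp

# Placeholder stand-in for the network whose Jacobian-vector product
# enters the diffusion loss (e.g. a learned noise schedule).
gamma_net = torch.nn.Linear(1, 1)

def gamma(t):
    return gamma_net(t)

t = torch.rand(8, 1)
tangent = torch.ones_like(t)  # direction vector for the jvp, here d/dt

# create_graph=True keeps the jvp itself inside the autograd graph, so a loss
# built from d_gamma_dt still backpropagates into gamma_net's parameters.
# Without it, the jvp result is computed outside the graph and this term
# contributes no gradient to gamma_net.
gamma_t, d_gamma_dt = jvp(gamma, (t,), (tangent,), create_graph=True)

loss = (d_gamma_dt ** 2).mean()  # placeholder loss term built from the jvp
loss.backward()
print(gamma_net.weight.grad is not None)  # True: gradients flow through the jvp
```

Note that `functorch.jvp` (now available as `torch.func.jvp` in recent PyTorch releases) uses forward-mode autodiff and is the closer analogue of `jax.jvp`, whereas `torch.autograd.functional.jvp` computes the same quantity via a double-backward trick.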