v0.4.0
Sequential differentiation improvements
This version improves how `backward` and `mtl_backward` differentiate when `parallel_chunk_size` prevents all tensors from being differentiated in parallel at once (for instance, `parallel_chunk_size=2` with 3 losses).
In particular, when a single tensor has to be differentiated (e.g. when using `parallel_chunk_size=1`), we now avoid relying on `torch.vmap`, which has several known limitations.
The `retain_graph` parameter of `backward` and `mtl_backward` has also been changed to apply only to the last differentiation. In most cases, you can now simply use the default `retain_graph=False` (prior to this change, you had to pass `retain_graph=True` whenever the differentiations were not all performed in parallel at once). This should reduce memory overhead.
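The chunking and graph-retention behavior described above can be sketched in plain Python. This is an illustrative sketch, not torchjd's internals; the function name and return format are hypothetical:

```python
def chunk_schedule(n_losses, parallel_chunk_size):
    """Split loss indices into chunks of at most `parallel_chunk_size`,
    and decide, for each differentiation, whether the autograd graph
    must be retained afterwards. Purely illustrative."""
    chunks = [
        list(range(start, min(start + parallel_chunk_size, n_losses)))
        for start in range(0, n_losses, parallel_chunk_size)
    ]
    # The graph only needs to stay alive until the last differentiation,
    # so only non-final chunks retain it.
    return [(chunk, i < len(chunks) - 1) for i, chunk in enumerate(chunks)]

# 3 losses with parallel_chunk_size=2: two differentiations, and only
# the first one retains the graph.
print(chunk_schedule(3, 2))  # [([0, 1], True), ([2], False)]
```

With `parallel_chunk_size=1` and 3 losses, this yields three sequential differentiations, of which only the last frees the graph, which is why the default `retain_graph=False` now suffices.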
Lastly, this update enables the use of torchjd for training recurrent neural networks. As @lth456321 discovered, `torch.vmap` can be incompatible with `torch.nn.RNN` when running on CUDA. With this update, you can simply set `parallel_chunk_size=1` to avoid `torch.vmap` and work around the problem. A usage example for RNNs has therefore been added to the documentation.
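To illustrate why `parallel_chunk_size=1` sidesteps the issue, here is a torch-only sketch of what a chunk size of 1 amounts to: differentiating one loss at a time with plain `torch.autograd.grad` calls, with no `torch.vmap` involved. The RNN dimensions and losses are made up for illustration, and this is not torchjd's actual implementation:

```python
import torch

# A small RNN and two scalar losses derived from its output.
rnn = torch.nn.RNN(input_size=4, hidden_size=3, batch_first=True)
x = torch.randn(2, 5, 4)  # (batch, seq_len, features)
out, _ = rnn(x)

losses = [out.pow(2).mean(), out.abs().mean()]
params = list(rnn.parameters())

# Differentiate each loss separately (the chunk-size-1 regime), keeping
# the graph alive for every differentiation except the last one.
jacobian_rows = []
for i, loss in enumerate(losses):
    grads = torch.autograd.grad(loss, params, retain_graph=i < len(losses) - 1)
    jacobian_rows.append(torch.cat([g.flatten() for g in grads]))

jacobian = torch.stack(jacobian_rows)  # one row per loss
```

Because each row is obtained with an ordinary autograd call, this loop works wherever `torch.autograd.grad` does, including for `torch.nn.RNN` on CUDA.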
Changelog
Changed
- Changed how the Jacobians are computed when calling `backward` or `mtl_backward` with `parallel_chunk_size=1`: they no longer rely on `torch.vmap` in this case. Whenever `vmap` does not support something (compiled functions, RNN on CUDA, etc.), users should now be able to avoid using `vmap` by calling `backward` or `mtl_backward` with `parallel_chunk_size=1`.
- Changed the effect of the `retain_graph` parameter of `backward` and `mtl_backward`. When set to `False`, the graph is now freed only after all gradients have been computed. In most cases, users should now leave the default `retain_graph=False`, no matter what the value of `parallel_chunk_size` is. This reduces memory overhead.
Added
- RNN training usage example in the documentation.