Integrating with TorchX would enable features such as elastic training.
TorchX looks like a very useful framework for distributed deep learning, and it currently has integrations with Slurm, Kubernetes, Ray, and others.
It would be worth investigating whether an Armada integration is feasible and how much work it would involve.
https://pytorch.org/torchx/latest/