
Conversation

daidahao
Contributor

Checklist before merging this PR:

  • Mentioned all issues that this PR fixes or addresses.
  • Summarized the updates of this PR under Summary.
  • Added an entry under Unreleased in the Changelog.

Fixes #2670.

Summary

Some versions of pytorch-lightning may no longer be compatible with torch_xla>=2.7.0, due to the deprecation of several XLA APIs in 2.7. This was fixed in Lightning 2.5.3; see Lightning's #20852 for the upstream bug fix.

However, when running the "21-TSMixer-examples" notebook, training never got past "Model Training". Installing the latest nightly build, XLA 2.9.0, as instructed on pytorch/xla, fixes the issue. The exact reason why is unknown.
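
For reference, below is a sketch of the kind of Colab cell used for testing. It is an illustration, not the exact notebook cell: the nightly torch_xla wheel location has to be taken from pytorch/xla's README and changes over time.

```python
# Sketch of a Colab test cell (illustrative; the nightly torch_xla wheel
# index comes from pytorch/xla's README and changes over time):
#
#   !pip install darts "pytorch-lightning>=2.5.3"
#   !pip install --pre torch_xla   # plus the wheel index from pytorch/xla's README
#
# Afterwards, confirm what the resolver actually picked:
import pytorch_lightning
import torch_xla

print("pytorch-lightning:", pytorch_lightning.__version__)
print("torch_xla:", torch_xla.__version__)
```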

@dennisbader
Collaborator

Thanks for the investigation and the suggested solution @daidahao 🚀 Very nice to see!

As you mentioned, we currently pin pytorch-lightning <= 2.5.3 because there was an issue where the lr_find() method didn't give the same learning-rate suggestion as in earlier versions.
I have not yet had the time to raise this issue on their GitHub but will do so soon.
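
For context, here is a minimal sketch of the lr_find() call in question; the model, dataset, and parameters are illustrative and not taken from this PR:

```python
# Illustrative sketch of darts' lr_find() wrapper (model and dataset are
# assumptions, not from this PR).
from darts.datasets import AirPassengersDataset
from darts.models import TSMixerModel

series = AirPassengersDataset().load().astype("float32")
model = TSMixerModel(input_chunk_length=24, output_chunk_length=12)

# Runs PyTorch Lightning's learning-rate finder under the hood; it is this
# suggested value that differs between Lightning versions.
lr_finder = model.lr_find(series=series)
print("suggested lr:", lr_finder.suggestion())
```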

With your update, users wouldn't be able to install the current darts package because pip would complain about incompatible versions. Or did this work for you / could you test it?

Regardless, as soon as we relax the lightning cap, and once your suggestion is tested, we can merge this PR :)

@daidahao
Contributor Author

Hi Dennis, I have tested installing the latest darts and "pytorch-lightning>=2.5.3" at the same time on Colab. As you can see in this example, 21-TSMixer-examples.ipynb, the requirement from darts 0.37.1 is overridden and pytorch-lightning 2.5.3 gets installed.

This is not a perfect solution, because of Lightning's new lr_find() behaviour, but neither is installing a nightly build of XLA. I have tried installing older versions of XLA, and they always either caused errors from Lightning calling older XLA APIs like get_ordinal(), or got stuck during model training.
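
For anyone debugging this, here is a quick diagnostic sketch (an assumption on my part, based on the APIs named in Lightning's fix) to check which ordinal API the installed torch_xla exposes:

```python
# Probe the installed torch_xla for the API clash: older Lightning releases
# call xla_model.get_ordinal(), which newer torch_xla versions removed in
# favour of runtime.global_ordinal().
import torch_xla
import torch_xla.core.xla_model as xm
import torch_xla.runtime as xr

print("torch_xla:", torch_xla.__version__)
print("legacy xm.get_ordinal available:", hasattr(xm, "get_ordinal"))
print("xr.global_ordinal available:   ", hasattr(xr, "global_ordinal"))
```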

@dennisbader
Collaborator

Thanks @daidahao. In that case, I would suggest we wait until they release a stable version, and then we can continue with this PR. Is that alright with you?

@daidahao
Contributor Author

Yes, that is very reasonable! In the meantime, we could point to this PR as a temporary fix if any TPU issue is raised.

@dennisbader
Collaborator

@daidahao, yes, sounds great :) In the meantime, I should find some time to raise the lr_find issue with Lightning.

@dennisbader
Collaborator

@daidahao I have opened the issue now. I'll keep an eye on it :)

@dennisbader
Collaborator

@daidahao, looks like Lightning-AI/pytorch-lightning#21171 will fix the issue :) I'll keep an eye on it, and once everything is in place we can go ahead with this PR.

@daidahao
Contributor Author

@dennisbader Thanks Dennis!
