
[BUG] - TPU training not working in Google Colab #2670

@jbelhamc1

Describe the bug
I am trying to use TPUs to train on Google Colab.
After updates to Google Colab and the PyTorch/XLA wheels, the following code no longer runs. I believe this is because Colab TPUs no longer run as separate nodes; they are intrinsically linked to the VMs they run on:

To Reproduce

!pip install cloud-tpu-client==0.10 https://storage.googleapis.com/tpu-pytorch/wheels/torch_xla-1.9-cp37-cp37m-linux_x86_64.whl
!pip install torch==1.9.0+cu111 torchvision==0.10.0+cu111 torchtext==0.10.0 -f https://download.pytorch.org/whl/cu111/torch_stable.html
!pip install pyyaml==5.4.1

Expected behavior
Could alternative code be provided and the documentation updated? I am currently trying to use:

!pip install torch~=2.6.0 'torch_xla[tpu]~=2.6.0' \
  -f https://storage.googleapis.com/libtpu-releases/index.html \
  -f https://storage.googleapis.com/libtpu-wheels/index.html

!pip install cloud-tpu-client==0.10

# Install the latest PyTorch packages (CUDA 11.8 builds, matching the index URL below)
!pip install torch==2.6.0 torchvision==0.21.0 torchtext==0.18 \
    -f https://download.pytorch.org/whl/cu118/torch_stable.html

# Install the latest version of PyYAML
!pip install pyyaml==6.0
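
For reference, this is the kind of sanity check I run after the installs above (my own snippet, not from the darts docs) to confirm that torch_xla can actually see the Colab TPU:

# Minimal sanity check (not from the darts docs): confirm torch_xla sees the TPU.
import torch
import torch_xla.core.xla_model as xm

dev = xm.xla_device()          # should return an 'xla' device on a TPU runtime
print(dev)

x = torch.randn(3, 3).to(dev)  # move a small tensor to the TPU
print(x.device)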

While this then shows the TPU as available, training is incredibly slow at around 0.18 it/s (when it works at all). When running the code here, it never gets past cell [11]: it gets stuck there for unknown reasons and no progress bar ever appears.
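
For context, the training is set up roughly like the sketch below (the model choice and the synthetic series are illustrative, not the exact code from the linked notebook); the TPU is requested through pl_trainer_kwargs, which darts forwards to the PyTorch Lightning Trainer:

# Illustrative sketch only, not the exact notebook code.
import numpy as np
import pandas as pd
from darts import TimeSeries
from darts.models import NBEATSModel

series = TimeSeries.from_series(pd.Series(np.sin(np.arange(400) / 10.0)))

model = NBEATSModel(
    input_chunk_length=24,
    output_chunk_length=12,
    n_epochs=2,
    pl_trainer_kwargs={"accelerator": "tpu", "devices": "auto"},  # forwarded to the Lightning Trainer
)
model.fit(series)  # this is the step that stalls / runs at ~0.18 it/s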

System (please complete the following information):

  • Python version: 3.11.11
  • darts version: 0.32.0

Additional context
The code in "To Reproduce" is taken from here.

Labels: bug (Something isn't working), gpu (Question or bug occurring with gpu)
