how to run horovod strategy? #13663
-
What I ran is:
python pl_examples/basic_examples/mnist_examples/image_classifier_5_lightning_datamodule.py --trainer.accelerator 'gpu' --trainer.devices 4 --trainer.strategy 'horovod'
The nvidia-smi output is:
Thu Jul 14 14:26:11 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 510.47.03 Driver Version: 510.47.03 CUDA Version: 11.6 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 Tesla T4 On | 00000000:00:1B.0 Off | 0 |
| N/A 35C P0 34W / 70W | 1547MiB / 15360MiB | 37% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 1 Tesla T4 On | 00000000:00:1C.0 Off | 0 |
| N/A 31C P8 16W / 70W | 2MiB / 15360MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 2 Tesla T4 On | 00000000:00:1D.0 Off | 0 |
| N/A 31C P8 16W / 70W | 2MiB / 15360MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 3 Tesla T4 On | 00000000:00:1E.0 Off | 0 |
| N/A 32C P8 16W / 70W | 2MiB / 15360MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| 0 N/A N/A 39649 C 1545MiB |
+-----------------------------------------------------------------------------+
The command-line output is:
(base) ray@ip-172-31-36-78:~/horovod-gpu/lightning$ python pl_examples/basic_examples/mnist_examples/image_classifier_5_lightning_datamodule.py --trainer.accelerator 'gpu' --trainer.devices 4 --trainer.strategy 'horovod'
[ASCII-art banner printed by the example script]
Global seed set to 42
/home/ray/anaconda3/lib/python3.8/site-packages/pytorch_lightning/loops/utilities.py:92: PossibleUserWarning: `max_epochs` was not set. Setting it to 1000 epochs. To train without an epoch limit, set `max_epochs=-1`.
rank_zero_warn(
GPU available: True, used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
/home/ray/anaconda3/lib/python3.8/site-packages/torchvision/datasets/mnist.py:498: UserWarning: The given NumPy array is not writeable, and PyTorch does not support non-writeable tensors. This means you can write to the underlying (supposedly non-writeable) NumPy array using the tensor. You may want to copy the array to protect its data or make it writeable before converting it to a tensor. This type of warning will be suppressed for the rest of this program. (Triggered internally at /pytorch/torch/csrc/utils/tensor_numpy.cpp:180.)
return torch.from_numpy(parsed.astype(m[2], copy=False)).view(*s)
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0,1,2,3]
| Name | Type | Params
--------------------------------------
0 | model | Net | 1.2 M
1 | test_acc | Accuracy | 0
--------------------------------------
1.2 M Trainable params
0 Non-trainable params
1.2 M Total params
4.800 Total estimated model params size (MB)
/home/ray/anaconda3/lib/python3.8/site-packages/pytorch_lightning/trainer/connectors/data_connector.py:240: PossibleUserWarning: The dataloader, train_dataloader, does not have many workers which may be a bottleneck. Consider increasing the value of the `num_workers` argument` (try 48 which is the number of cpus on this machine) in the `DataLoader` init to improve performance.
rank_zero_warn(
Epoch 0:   0%|
What it should be: training on multiple GPUs, yet only GPU 0 is in use. How should I run this?
-
Try using gpus to specify the number of devices to train on: --trainer.gpus=4
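For the horovod strategy specifically, the Lightning 1.6.x docs launch one worker process per GPU with the Horovod driver rather than relying on the Trainer flag alone. A sketch, assuming the same example script and the documented pattern:

horovodrun -np 4 python pl_examples/basic_examples/mnist_examples/image_classifier_5_lightning_datamodule.py --trainer.accelerator 'gpu' --trainer.devices 1 --trainer.strategy 'horovod'

Here horovodrun spawns 4 workers and each worker drives a single GPU, which is why devices is 1 rather than 4.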
-
Thu Jul 14 17:43:52 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 510.47.03 Driver Version: 510.47.03 CUDA Version: 11.6 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 Tesla T4 On | 00000000:00:1B.0 Off | 0 |
| N/A 34C P0 34W / 70W | 1547MiB / 15360MiB | 38% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 1 Tesla T4 On | 00000000:00:1C.0 Off | 0 |
| N/A 32C P8 16W / 70W | 2MiB / 15360MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 2 Tesla T4 On | 00000000:00:1D.0 Off | 0 |
| N/A 32C P8 16W / 70W | 2MiB / 15360MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 3 Tesla T4 On | 00000000:00:1E.0 Off | 0 |
| N/A 31C P8 16W / 70W | 2MiB / 15360MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| 0 N/A N/A 20520 C 1545MiB |
+-----------------------------------------------------------------------------+
-
I still see this:
(base) ray@ip-172-31-93-242:~/horovod-gpu/lightning$ python pl_examples/basic_examples/mnist_examples/image_classifier_5_lightning_datamodule.py --trainer.accelerator 'gpu' --trainer.gpus 4 --trainer.strategy 'horovod'
[ASCII-art banner printed by the example script]
Global seed set to 42
/home/ray/horovod-gpu/lightning/pytorch_lightning/loops/utilities.py:92: PossibleUserWarning: `max_epochs` was not set. Setting it to 1000 epochs. To train without an epoch limit, set `max_epochs=-1`.
rank_zero_warn(
GPU available: True, used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
Downloading http://yann.lecun.com/exdb/mnist/train-images-idx3-ubyte.gz
Downloading http://yann.lecun.com/exdb/mnist/train-images-idx3-ubyte.gz to /home/ray/horovod-gpu/lightning/Datasets/MNIST/raw/train-images-idx3-ubyte.gz
9913344it [00:00, 23040462.18it/s]
Extracting /home/ray/horovod-gpu/lightning/Datasets/MNIST/raw/train-images-idx3-ubyte.gz to /home/ray/horovod-gpu/lightning/Datasets/MNIST/raw
Downloading http://yann.lecun.com/exdb/mnist/train-labels-idx1-ubyte.gz
Downloading http://yann.lecun.com/exdb/mnist/train-labels-idx1-ubyte.gz to /home/ray/horovod-gpu/lightning/Datasets/MNIST/raw/train-labels-idx1-ubyte.gz
29696it [00:00, 119648464.54it/s]
Extracting /home/ray/horovod-gpu/lightning/Datasets/MNIST/raw/train-labels-idx1-ubyte.gz to /home/ray/horovod-gpu/lightning/Datasets/MNIST/raw
Downloading http://yann.lecun.com/exdb/mnist/t10k-images-idx3-ubyte.gz
Downloading http://yann.lecun.com/exdb/mnist/t10k-images-idx3-ubyte.gz to /home/ray/horovod-gpu/lightning/Datasets/MNIST/raw/t10k-images-idx3-ubyte.gz
1649664it [00:00, 10574307.41it/s]
Extracting /home/ray/horovod-gpu/lightning/Datasets/MNIST/raw/t10k-images-idx3-ubyte.gz to /home/ray/horovod-gpu/lightning/Datasets/MNIST/raw
Downloading http://yann.lecun.com/exdb/mnist/t10k-labels-idx1-ubyte.gz
Downloading http://yann.lecun.com/exdb/mnist/t10k-labels-idx1-ubyte.gz to /home/ray/horovod-gpu/lightning/Datasets/MNIST/raw/t10k-labels-idx1-ubyte.gz
5120it [00:00, 48475928.85it/s]
Extracting /home/ray/horovod-gpu/lightning/Datasets/MNIST/raw/t10k-labels-idx1-ubyte.gz to /home/ray/horovod-gpu/lightning/Datasets/MNIST/raw
/home/ray/anaconda3/lib/python3.8/site-packages/torchvision/datasets/mnist.py:498: UserWarning: The given NumPy array is not writeable, and PyTorch does not support non-writeable tensors. This means you can write to the underlying (supposedly non-writeable) NumPy array using the tensor. You may want to copy the array to protect its data or make it writeable before converting it to a tensor. This type of warning will be suppressed for the rest of this program. (Triggered internally at /pytorch/torch/csrc/utils/tensor_numpy.cpp:180.)
return torch.from_numpy(parsed.astype(m[2], copy=False)).view(*s)
Downloading http://yann.lecun.com/exdb/mnist/train-images-idx3-ubyte.gz
Downloading http://yann.lecun.com/exdb/mnist/train-images-idx3-ubyte.gz to ./data/MNIST/raw/train-images-idx3-ubyte.gz
9913344it [00:00, 35476547.86it/s]
Extracting ./data/MNIST/raw/train-images-idx3-ubyte.gz to ./data/MNIST/raw
Downloading http://yann.lecun.com/exdb/mnist/train-labels-idx1-ubyte.gz
Downloading http://yann.lecun.com/exdb/mnist/train-labels-idx1-ubyte.gz to ./data/MNIST/raw/train-labels-idx1-ubyte.gz
29696it [00:00, 118510039.57it/s]
Extracting ./data/MNIST/raw/train-labels-idx1-ubyte.gz to ./data/MNIST/raw
Downloading http://yann.lecun.com/exdb/mnist/t10k-images-idx3-ubyte.gz
Downloading http://yann.lecun.com/exdb/mnist/t10k-images-idx3-ubyte.gz to ./data/MNIST/raw/t10k-images-idx3-ubyte.gz
1649664it [00:00, 10670010.40it/s]
Extracting ./data/MNIST/raw/t10k-images-idx3-ubyte.gz to ./data/MNIST/raw
Downloading http://yann.lecun.com/exdb/mnist/t10k-labels-idx1-ubyte.gz
Downloading http://yann.lecun.com/exdb/mnist/t10k-labels-idx1-ubyte.gz to ./data/MNIST/raw/t10k-labels-idx1-ubyte.gz
5120it [00:00, 49941480.19it/s]
Extracting ./data/MNIST/raw/t10k-labels-idx1-ubyte.gz to ./data/MNIST/raw
Missing logger folder: /home/ray/horovod-gpu/lightning/lightning_logs
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0,1,2,3]
| Name | Type | Params
--------------------------------------
0 | model | Net | 1.2 M
1 | test_acc | Accuracy | 0
--------------------------------------
1.2 M Trainable params
0 Non-trainable params
1.2 M Total params
4.800 Total estimated model params size (MB)
/home/ray/horovod-gpu/lightning/pytorch_lightning/trainer/connectors/data_connector.py:240: PossibleUserWarning: The dataloader, train_dataloader, does not have many workers which may be a bottleneck. Consider increasing the value of the `num_workers` argument` (try 48 which is the number of cpus on this machine) in the `DataLoader` init to improve performance.
rank_zero_warn(
Epoch 0: 0%| | 0/1875 [00:00<?, ?it/s]/home/ray/anaconda3/lib/python3.8/site-packages/torch/nn/functional.py:718: UserWarning: Named tensors and all their associated APIs are an experimental feature and subject to change. Please do not use them for anything important until they are released as stable. (Triggered internally at /pytorch/c10/core/TensorImpl.h:1156.)
return torch.max_pool2d(input, kernel_size, stride, padding, dilation, ceil_mode)
Epoch 2: 66%|██████████████████████████████████████████████████████████████████████████████████████████████████▊ | 1243/1875 [00:11<00:05, 107.94it/s, loss=0.103, v_num=0]^C/home/ray/horovod-gpu/lightning/pytorch_lightning/trainer/trainer.py:726: UserWarning: Detected KeyboardInterrupt, attempting graceful shutdown...
rank_zero_warn("Detected KeyboardInterrupt, attempting graceful shutdown...")
Restoring states from the checkpoint path at /home/ray/horovod-gpu/lightning/lightning_logs/version_0/checkpoints/epoch=1-step=3750.ckpt
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0,1,2,3]
Loaded model weights from checkpoint at /home/ray/horovod-gpu/lightning/lightning_logs/version_0/checkpoints/epoch=1-step=3750.ckpt
/home/ray/horovod-gpu/lightning/pytorch_lightning/trainer/connectors/data_connector.py:330: PossibleUserWarning: Using `DistributedSampler` with the dataloaders. During `trainer.test()`, it is recommended to use `Trainer(devices=1)` to ensure each sample/batch gets evaluated exactly once. Otherwise, multi-device settings use `DistributedSampler` that replicates some samples to make sure all devices have same batch size in case of uneven inputs.
rank_zero_warn(
/home/ray/horovod-gpu/lightning/pytorch_lightning/trainer/connectors/data_connector.py:240: PossibleUserWarning: The dataloader, test_dataloader 0, does not have many workers which may be a bottleneck. Consider increasing the value of the `num_workers` argument` (try 48 which is the number of cpus on this machine) in the `DataLoader` init to improve performance.
rank_zero_warn(
Testing DataLoader 0: 0%| | 0/313 [00:00<?, ?it/s]/home/ray/anaconda3/lib/python3.8/site-packages/torchmetrics/utilities/prints.py:36: UserWarning: Torchmetrics v0.9 introduced a new argument class property called `full_state_update` that has
not been set for this class (_ResultMetric). The property determines if `update` by
default needs access to the full metric state. If this is not the case, significant speedups can be
achieved and we recommend setting this to `False`.
We provide an checking function
`from torchmetrics.utilities import check_forward_no_full_state`
that can be used to check if the `full_state_update=True` (old and potential slower behaviour,
default for now) or if `full_state_update=False` can be used safely.
warnings.warn(*args, **kwargs)
Testing DataLoader 0: 5%|████████▋ | 17/313 [00:00<00:02, 143.89it/s]^C^C
-
I am using PyTorch Lightning 1.6.5.
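For 1.6.x, here is a minimal self-contained sketch of the documented Horovod pattern (the model, dataset, and file name are placeholders for illustration, not from this thread):

# minimal_horovod.py (hypothetical file name)
import torch
import torch.nn.functional as F
from torch.utils.data import DataLoader, TensorDataset
import pytorch_lightning as pl

class TinyModel(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(32, 2)

    def training_step(self, batch, batch_idx):
        x, y = batch
        return F.cross_entropy(self.layer(x), y)

    def configure_optimizers(self):
        return torch.optim.SGD(self.parameters(), lr=0.1)

if __name__ == "__main__":
    dataset = TensorDataset(torch.randn(256, 32), torch.randint(0, 2, (256,)))
    # One device per process: horovodrun supplies the worker count.
    trainer = pl.Trainer(accelerator="gpu", devices=1, strategy="horovod", max_epochs=1)
    trainer.fit(TinyModel(), DataLoader(dataset, batch_size=32))

Launched as horovodrun -np 4 python minimal_horovod.py, each of the 4 workers should then show up in nvidia-smi on its own GPU.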