[BUG] Notebook example multi gpu parallel training using horovod fails #1114

@vs385

Bug description

I am trying to run the notebook example and keep getting the error below.

Steps/Code to reproduce bug

  1. Running notebook in Databricks Runtime 13.0 ML GPU on a g5.12xlarge instance type (2 workers)
  2. pip install cudf-cu11 dask-cudf-cu11 --extra-index-url=https://pypi.nvidia.com
    pip install cuml-cu11 --extra-index-url=https://pypi.nvidia.com
    pip install cugraph-cu11 --extra-index-url=https://pypi.nvidia.com
    pip install merlin-models nvtabular transformers4rec[pytorch,nvtabular,dataloader]==23.2.0 protobuf==3.20.*

Then launched a cluster

  1. Tried running the notebook example
  2. P.S. I have also been trying to run Horovod on my own models, aside from the example, and get the exact same error with the data loader. I printed str(MPI_RANK) to make sure the correct parquet partitions are being loaded:
    [1,0]:MPI_RANK is : 0
    [1,3]:MPI_RANK is : 3
    [1,2]:MPI_RANK is : 2
    [1,1]:MPI_RANK is : 1
    .......
    [1,1]:/dbfs/Workspace/.../tmp/data/processed_nvt/full_dataset_positive_events_train/part_1.parquet
    [1,0]:/dbfs/Workspace/.../tmp/data/processed_nvt/full_dataset_positive_events_train/part_0.parquet
    [1,2]:/dbfs/Workspace/.../tmp//data/processed_nvt/full_dataset_positive_events_train/part_2.parquet
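The rank-to-partition check described above can be sketched roughly as follows. This is a minimal illustration, assuming the rank is read from Open MPI's `OMPI_COMM_WORLD_RANK` environment variable and that files are named `part_<rank>.parquet`, as in the printed paths. The helper `partition_path_for_rank` and the `/data/train` directory are hypothetical names, not taken from the notebook.

```python
import os


def partition_path_for_rank(data_dir, env=None):
    """Return (rank, path) for the parquet file a worker would load."""
    env = os.environ if env is None else env
    # Open MPI exports each process's global rank in this variable.
    # Defaulting to "0" lets the same code run outside mpirun.
    rank = int(env.get("OMPI_COMM_WORLD_RANK", "0"))
    return rank, os.path.join(data_dir, f"part_{rank}.parquet")


if __name__ == "__main__":
    rank, path = partition_path_for_rank("/data/train")
    print(f"MPI_RANK is : {rank}")
    print(path)
```

Printing both values on every rank, as done in the issue, is a quick way to confirm that no two workers point at the same file.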

P.S. I have also tried running this without Horovod, and the model trains fine, so the issue seems to be with the data loader when creating the train_loader and valid_loader objects.

nvidia-smi
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.103.01 Driver Version: 470.103.01 CUDA Version: 11.4 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA A10G Off | 00000000:00:1B.0 Off | 0 |
| 0% 25C P8 19W / 300W | 0MiB / 22731MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 1 NVIDIA A10G Off | 00000000:00:1C.0 Off | 0 |
| 0% 25C P8 16W / 300W | 0MiB / 22731MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 2 NVIDIA A10G Off | 00000000:00:1D.0 Off | 0 |
| 0% 24C P8 15W / 300W | 0MiB / 22731MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 3 NVIDIA A10G Off | 00000000:00:1E.0 Off | 0 |
| 0% 25C P8 16W / 300W | 0MiB / 22731MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| No running processes found

Expected behavior

The notebook should run successfully.

Environment details

  • Merlin version: 23.4.0
  • Platform: Databricks
  • Python version: 3.10.6
  • PyTorch version (GPU?): NA
  • Tensorflow version (GPU?): 2.11.0

Additional context

The error traceback is below:

[1,1]: File "/Workspace/Repos/<user_name>/merlin-models/examples/usecases/tf_trainer.py", line 59, in <module>
[1,0]: print("Number batches: " + str(len(train_loader)))
[1,0]: File "/databricks/python/lib/python3.10/site-packages/merlin/dataloader/tensorflow.py", line 83, in __len__
[1,0]: return LoaderBase.__len__(self)
[1,0]: File "/databricks/python/lib/python3.10/site-packages/merlin/dataloader/loader_base.py", line 192, in __len__
[1,0]: batches = _num_steps(self._buff_len, self.batch_size)
[1,0]: File "/databricks/python/lib/python3.10/site-packages/merlin/dataloader/loader_base.py", line 153, in _buff_len
[1,0]: self.__buff_len = len(self._buff)
[1,0]: File "/databricks/python/lib/python3.10/site-packages/merlin/dataloader/loader_base.py", line 140, in _buff
[1,0]: self.__buff = ChunkQueue(
[1,0]: File "/databricks/python/lib/python3.10/site-packages/merlin/dataloader/loader_base.py", line 709, in __init__
[1,0]: self.itr = dataloader._data_iter(epochs)
[1,0]: File "/databricks/python/lib/python3.10/site-packages/merlin/dataloader/loader_base.py", line 268, in _data_iter
[1,1]: print("Number batches: " + str(len(train_loader)))
[1,1]: File "/databricks/python/lib/python3.10/site-packages/merlin/dataloader/tensorflow.py", line 83, in __len__
[1,0]: indices = self._indices_for_process()
[1,0]: File "/databricks/python/lib/python3.10/site-packages/merlin/dataloader/loader_base.py", line 224, in _indices_for_process
[1,0]: raise IndexError
[1,0]:IndexError
[1,1]: return LoaderBase.__len__(self)
[1,1]: File "/databricks/python/lib/python3.10/site-packages/merlin/dataloader/loader_base.py", line 192, in __len__
[1,1]: batches = _num_steps(self._buff_len, self.batch_size)
[1,1]: File "/databricks/python/lib/python3.10/site-packages/merlin/dataloader/loader_base.py", line 153, in _buff_len
[1,1]: self.__buff_len = len(self._buff)
[1,1]: File "/databricks/python/lib/python3.10/site-packages/merlin/dataloader/loader_base.py", line 140, in _buff
[1,1]: self.__buff = ChunkQueue(
[1,1]: File "/databricks/python/lib/python3.10/site-packages/merlin/dataloader/loader_base.py", line 709, in __init__
[1,1]: self.itr = dataloader._data_iter(epochs)
[1,1]: File "/databricks/python/lib/python3.10/site-packages/merlin/dataloader/loader_base.py", line 268, in _data_iter
[1,1]: indices = self._indices_for_process()
[1,1]: File "/databricks/python/lib/python3.10/site-packages/merlin/dataloader/loader_base.py", line 224, in _indices_for_process
[1,1]: raise IndexError
[1,1]:IndexError
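The traceback above interleaves output from ranks 0 and 1, which makes it hard to follow. A minimal sketch for separating it by rank, assuming every line carries the `[job,rank]:` prefix seen in the log; `split_by_rank` is a hypothetical helper name and the sample lines are copied from the output above.

```python
import re
from collections import defaultdict

# Matches the "[job,rank]:" prefix on each tagged output line.
TAG = re.compile(r"^\[(\d+),(\d+)\]:\s?(.*)$")


def split_by_rank(lines):
    """Group tagged lines by rank, keeping their order within each rank."""
    by_rank = defaultdict(list)
    for line in lines:
        match = TAG.match(line)
        if match:  # untagged lines (e.g. mpirun's summary) are skipped
            by_rank[int(match.group(2))].append(match.group(3))
    return dict(by_rank)


sample = [
    "[1,0]: indices = self._indices_for_process()",
    "[1,1]: indices = self._indices_for_process()",
    "[1,0]: raise IndexError",
    "[1,0]:IndexError",
]
for rank, rank_lines in sorted(split_by_rank(sample).items()):
    print(f"--- rank {rank} ---")
    print("\n".join(rank_lines))
```

Read rank by rank, both processes follow the same call chain and stop at the same `raise IndexError` in `_indices_for_process`.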

Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.


mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:

Process name: [[32806,1],0]
Exit code: 1
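
One possible reading of the traceback, offered as an assumption rather than a confirmed diagnosis: `_indices_for_process` may raise `IndexError` when a loader has fewer dataset partitions than there are Horovod processes. Each rank here loads a single `part_<rank>.parquet` file, which could mean one partition against four processes. The sketch below is a hypothetical re-creation of such a check, not the code in `loader_base.py`; `indices_for_process` and its parameters are illustrative names.

```python
def indices_for_process(n_partitions, global_size, global_rank):
    """Return the partition indices one process would read (hypothetical logic)."""
    if n_partitions < global_size:
        # Assumed failure mode: too few partitions to give every process one.
        raise IndexError(
            f"{global_size} processes but only {n_partitions} partition(s)"
        )
    # Ceil-divide so every partition is assigned to some process.
    per_worker = -(-n_partitions // global_size)
    start = global_rank * per_worker
    return list(range(start, min(start + per_worker, n_partitions)))


# Four partitions shared by four processes: each rank gets one index.
print([indices_for_process(4, 4, r) for r in range(4)])

# One partition per loader with four processes: the check fails.
try:
    indices_for_process(1, 4, 0)
except IndexError as err:
    print("IndexError:", err)
```

If this reading holds, it would also be consistent with training succeeding without Horovod, where only one process reads the data.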

Labels: P0, bug
