[BUG] Notebook example multi gpu parallel training using horovod fails

### Bug description
I am trying to run the [notebook example](https://github.com/NVIDIA-Merlin/models/blob/ed7caee08a7f015c49cad036ec6af4ac09763050/examples/usecases/multi-gpu-data-parallel-training.ipynb#L258) and keep getting the error shown under "Additional context" below.

### Steps/Code to reproduce bug

1. Ran the notebook on Databricks Runtime 13.0 ML GPU on a g5.12xlarge instance type (2 workers).
2. Installed the dependencies:

       pip install cudf-cu11 dask-cudf-cu11 --extra-index-url=https://pypi.nvidia.com
       pip install cuml-cu11 --extra-index-url=https://pypi.nvidia.com
       pip install cugraph-cu11 --extra-index-url=https://pypi.nvidia.com
       pip install merlin-models nvtabular transformers4rec[pytorch,nvtabular,dataloader]==23.2.0 protobuf==3.20.*

   Then launched a cluster.
3. Tried running the notebook example.
4. I have also been trying to run Horovod on my own models, separate from the example, and get the exact same error from the data loader. I printed `str(MPI_RANK)` to make sure the correct parquet partitions are being loaded:
[1,0]<stdout>:MPI_RANK is : 0
[1,3]<stdout>:MPI_RANK is : 3
[1,2]<stdout>:MPI_RANK is : 2
[1,1]<stdout>:MPI_RANK is : 1
.......
[1,1]<stdout>:/dbfs/Workspace/.../tmp/data/processed_nvt/full_dataset_positive_events_train/part_1.parquet
[1,0]<stdout>:/dbfs/Workspace/.../tmp/data/processed_nvt/full_dataset_positive_events_train/part_0.parquet
[1,2]<stdout>:/dbfs/Workspace/.../tmp//data/processed_nvt/full_dataset_positive_events_train/part_2.parquet

I have also tried running this without Horovod, and the model trains fine, so the issue seems to be with the data loader when creating the `train_loader` and `valid_loader` objects.
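The per-rank partition selection checked in step 4 can be sketched roughly as below. This is only an illustration of the idea, not the notebook's actual code: the helper names `partition_path` and `current_rank` are hypothetical, the base directory is a placeholder, and it assumes the job is launched with Open MPI's `mpirun`, which sets `OMPI_COMM_WORLD_RANK` for each process.

```python
import os


def current_rank(default: int = 0) -> int:
    """Return this process's MPI rank.

    Assumption: the job is started by Open MPI's mpirun, which exports
    OMPI_COMM_WORLD_RANK. Outside mpirun, `default` is returned.
    """
    return int(os.environ.get("OMPI_COMM_WORLD_RANK", default))


def partition_path(base_dir: str, rank: int) -> str:
    """Build the path of the parquet partition assigned to `rank`.

    Hypothetical helper: mirrors the part_<rank>.parquet naming that
    appears in the printed output above.
    """
    return os.path.join(base_dir, f"part_{rank}.parquet")
```

Calling `partition_path(base_dir, current_rank())` in each worker and printing the result gives the kind of per-rank check shown in the output above, with one distinct partition file per rank.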

nvidia-smi
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.103.01   Driver Version: 470.103.01   CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA A10G         Off  | 00000000:00:1B.0 Off |                    0 |
|  0%   25C    P8    19W / 300W |      0MiB / 22731MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA A10G         Off  | 00000000:00:1C.0 Off |                    0 |
|  0%   25C    P8    16W / 300W |      0MiB / 22731MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   2  NVIDIA A10G         Off  | 00000000:00:1D.0 Off |                    0 |
|  0%   24C    P8    15W / 300W |      0MiB / 22731MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   3  NVIDIA A10G         Off  | 00000000:00:1E.0 Off |                    0 |
|  0%   25C    P8    16W / 300W |      0MiB / 22731MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                            


### Expected behavior
The notebook should run successfully.

### Environment details
- Merlin version: 23.4.0
- Platform: Databricks
- Python version: 3.10.6
- PyTorch version (GPU?): NA
- Tensorflow version (GPU?): 2.11.0

### Additional context

The error output is below:

[1,1]<stderr>:  File "/Workspace/Repos/<user_name>/merlin-models/examples/usecases/tf_trainer.py", line 59, in <module>
[1,0]<stderr>:    print("Number batches: " + str(len(train_loader)))
[1,0]<stderr>:  File "/databricks/python/lib/python3.10/site-packages/merlin/dataloader/tensorflow.py", line 83, in __len__
[1,0]<stderr>:    return LoaderBase.__len__(self)
[1,0]<stderr>:  File "/databricks/python/lib/python3.10/site-packages/merlin/dataloader/loader_base.py", line 192, in __len__
[1,0]<stderr>:    batches = _num_steps(self._buff_len, self.batch_size)
[1,0]<stderr>:  File "/databricks/python/lib/python3.10/site-packages/merlin/dataloader/loader_base.py", line 153, in _buff_len
[1,0]<stderr>:    self.__buff_len = len(self._buff)
[1,0]<stderr>:  File "/databricks/python/lib/python3.10/site-packages/merlin/dataloader/loader_base.py", line 140, in _buff
[1,0]<stderr>:    self.__buff = ChunkQueue(
[1,0]<stderr>:  File "/databricks/python/lib/python3.10/site-packages/merlin/dataloader/loader_base.py", line 709, in __init__
[1,0]<stderr>:    self.itr = dataloader._data_iter(epochs)
[1,0]<stderr>:  File "/databricks/python/lib/python3.10/site-packages/merlin/dataloader/loader_base.py", line 268, in _data_iter
[1,1]<stderr>:    print("Number batches: " + str(len(train_loader)))
[1,1]<stderr>:  File "/databricks/python/lib/python3.10/site-packages/merlin/dataloader/tensorflow.py", line 83, in __len__
[1,0]<stderr>:    indices = self._indices_for_process()
[1,0]<stderr>:  File "/databricks/python/lib/python3.10/site-packages/merlin/dataloader/loader_base.py", line 224, in _indices_for_process
[1,0]<stderr>:    raise IndexError
[1,0]<stderr>:IndexError
[1,1]<stderr>:    return LoaderBase.__len__(self)
[1,1]<stderr>:  File "/databricks/python/lib/python3.10/site-packages/merlin/dataloader/loader_base.py", line 192, in __len__
[1,1]<stderr>:    batches = _num_steps(self._buff_len, self.batch_size)
[1,1]<stderr>:  File "/databricks/python/lib/python3.10/site-packages/merlin/dataloader/loader_base.py", line 153, in _buff_len
[1,1]<stderr>:    self.__buff_len = len(self._buff)
[1,1]<stderr>:  File "/databricks/python/lib/python3.10/site-packages/merlin/dataloader/loader_base.py", line 140, in _buff
[1,1]<stderr>:    self.__buff = ChunkQueue(
[1,1]<stderr>:  File "/databricks/python/lib/python3.10/site-packages/merlin/dataloader/loader_base.py", line 709, in __init__
[1,1]<stderr>:    self.itr = dataloader._data_iter(epochs)
[1,1]<stderr>:  File "/databricks/python/lib/python3.10/site-packages/merlin/dataloader/loader_base.py", line 268, in _data_iter
[1,1]<stderr>:    indices = self._indices_for_process()
[1,1]<stderr>:  File "/databricks/python/lib/python3.10/site-packages/merlin/dataloader/loader_base.py", line 224, in _indices_for_process
[1,1]<stderr>:    raise IndexError
[1,1]<stderr>:IndexError
--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:

  Process name: [[32806,1],0]
  Exit code:    1
--------------------------------------------------------------------------
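The stderr lines from ranks 0 and 1 are interleaved in the log above, which makes each traceback hard to follow. A small helper like the one below could regroup the lines by rank. It is a sketch that assumes every tagged line starts with the `[job,rank]<stream>:` prefix seen in the log; the name `group_by_rank` is hypothetical and not part of Horovod, MPI, or Merlin.

```python
import re
from collections import defaultdict

# Matches the "[job,rank]<stdout|stderr>:" prefix seen in the log above.
TAG = re.compile(r"^\[(\d+),(\d+)\]<(stdout|stderr)>:(.*)$")


def group_by_rank(lines):
    """Map each rank to its messages, preserving their original order.

    Lines without the rank prefix (for example the mpirun summary)
    are skipped.
    """
    grouped = defaultdict(list)
    for line in lines:
        match = TAG.match(line)
        if match:
            rank = int(match.group(2))
            grouped[rank].append(match.group(4))
    return dict(grouped)


# A few lines copied from the log above, in their interleaved order.
sample = [
    "[1,0]<stderr>:    indices = self._indices_for_process()",
    "[1,1]<stderr>:    return LoaderBase.__len__(self)",
    "[1,0]<stderr>:    raise IndexError",
    "[1,0]<stderr>:IndexError",
]

for rank, messages in sorted(group_by_rank(sample).items()):
    print(f"--- rank {rank} ---")
    print("\n".join(messages))
```

Applied to the full log, this shows that both ranks follow the same call path and end in the same `IndexError` raised from `_indices_for_process`.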
