Move horovod and sok init functions to user code#1135

Closed
edknv wants to merge 3 commits into NVIDIA-Merlin:main from edknv:horovod/init

edknv (Contributor) commented Jun 6, 2023

Goals ⚽

  • Remove the hvd.init() and sok.init() calls from the library and have users call them in their own code.
  • Introduce singlegpu and multigpu pytest markers.

Implementation Details 🚧

In some cases, when horovod is not built or configured correctly with MPI, hvd.init() or sok.init() raises MPI_LIB errors at the C level, which we cannot catch from Python. This happened in #1134, for example.

Following @oliverholworthy's proposal, this PR leaves hvd.init() and/or sok.init() up to the user.
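Because the failure happens below Python, wrapping hvd.init() in try/except does not help. As a rough illustration only (not part of this PR), user code could first try the call in a child process and initialize in the main process only if the child exits cleanly. The helper names `init_succeeds` and `init_horovod_if_usable` are hypothetical, and the `horovod.tensorflow` import path is an assumption about the user's setup. The sketch is meant for a plain single-process startup check, not for use inside an already-launched MPI job.

```python
import subprocess
import sys

# Assumption: user code imports Horovod through its TensorFlow binding.
HOROVOD_PROBE = "import horovod.tensorflow as hvd; hvd.init()"


def init_succeeds(snippet: str, timeout: float = 60.0) -> bool:
    """Run `snippet` in a fresh interpreter; True only if it exits with code 0.

    A C-level abort kills the child process, not the caller, so the parent
    can observe the failure through the return code.
    """
    try:
        result = subprocess.run(
            [sys.executable, "-c", snippet],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        return False
    return result.returncode == 0


def init_horovod_if_usable() -> bool:
    """Hypothetical user-side helper: call hvd.init() only if the probe passed."""
    if not init_succeeds(HOROVOD_PROBE):
        return False
    import horovod.tensorflow as hvd  # imported lazily, after the probe

    hvd.init()
    return True
```

A training script could call `init_horovod_if_usable()` once at startup and fall back to single-process execution when it returns False.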

We still need to initialize them to run the unit tests, but an incorrect installation in the ci-runner would still block CI. This PR therefore also introduces singlegpu and multigpu pytest markers so that the horovod tests can be skipped in the single-GPU case, similar to NVIDIA-Merlin/Merlin#999.
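For illustration, markers like these could be wired up in a conftest.py along the following lines. This is a minimal sketch, not the code from this PR: the helper `visible_gpu_count` is hypothetical, and counting GPUs from the `CUDA_VISIBLE_DEVICES` environment variable is an assumption made to keep the example free of framework imports (a real setup would more likely ask the framework or the driver).

```python
# conftest.py (sketch)
import os

import pytest


def visible_gpu_count() -> int:
    """Hypothetical helper: count the entries in CUDA_VISIBLE_DEVICES.

    Assumption: the CI runner sets this variable; if it is unset, the
    function reports 0 GPUs.
    """
    devices = os.environ.get("CUDA_VISIBLE_DEVICES", "")
    return len([d for d in devices.split(",") if d.strip()])


def pytest_configure(config):
    # Register the markers so that `pytest --strict-markers` accepts them.
    config.addinivalue_line("markers", "singlegpu: test runs on a single GPU")
    config.addinivalue_line(
        "markers", "multigpu: test needs more than one GPU (e.g. horovod tests)"
    )


def pytest_collection_modifyitems(config, items):
    # With two or more visible GPUs, every collected test may run.
    if visible_gpu_count() >= 2:
        return
    skip = pytest.mark.skip(reason="requires more than one visible GPU")
    for item in items:
        if item.get_closest_marker("multigpu") is not None:
            item.add_marker(skip)
```

A horovod test would then be decorated with `@pytest.mark.multigpu`. Alternatively, the CI job could select tests explicitly with `pytest -m singlegpu` or `pytest -m multigpu`.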

github-actions bot commented Jun 6, 2023

Documentation preview

https://nvidia-merlin.github.io/models/review/pr-1135

edknv changed the title from "Remove horovod and sok init functions" to "Move horovod and sok init functions to user code" Jun 6, 2023
edknv added the ci label Jun 6, 2023
edknv (Contributor, Author) commented Jun 7, 2023

CI is unblocked for now in #1136. I will close this PR and revisit it if necessary.

edknv closed this Jun 7, 2023
edknv deleted the horovod/init branch June 7, 2023 02:45