To streamline the installation process on GPU machines, we have published the reference Dockerfile so
you can get started with Horovod in minutes. The container includes examples in the /examples
directory.
Pre-built Docker containers with Horovod are available on DockerHub.
Before building, you can modify Dockerfile.gpu to your liking, e.g. select a different CUDA, TensorFlow or Python version.
$ mkdir horovod-docker-gpu
$ wget -O horovod-docker-gpu/Dockerfile https://raw.githubusercontent.com/horovod/horovod/master/Dockerfile.gpu
$ docker build -t horovod:latest horovod-docker-gpu
For users without GPUs available in their environments, we've also published a CPU Dockerfile you can build and run similarly.
After the container is built, run it using nvidia-docker.
Note: You can replace horovod:latest with a specific pre-built
Docker container with Horovod instead of building it yourself.
$ nvidia-docker run -it horovod:latest
root@c278c88dd552:/examples# horovodrun -np 4 -H localhost:4 python keras_mnist_advanced.py
If you don't run your container in privileged mode, you may see the following message:
[a8c9914754d2:00040] Read -1, expected 131072, errno = 1
You can ignore this message.
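The -np and -H values passed to horovodrun above are related: -H lists each host with its number of slots, and -np is the total number of processes. The following is a minimal sketch of deriving both from a host list, assuming every host offers the same number of slots. The function name horovodrun_args is hypothetical and not part of Horovod; it only prints the arguments.

```shell
# Print the -np and -H arguments for horovodrun.
# Usage: horovodrun_args SLOTS HOST [HOST...]
# SLOTS is the number of processes per host (assumed equal on all hosts).
# horovodrun_args is a hypothetical helper, not a Horovod command.
horovodrun_args() {
    slots="$1"
    shift
    hosts=""
    total=0
    for host in "$@"; do
        # Append "host:slots" to the comma-separated host list.
        hosts="${hosts:+$hosts,}$host:$slots"
        total=$((total + slots))
    done
    printf '%s\n' "-np $total -H $hosts"
}

horovodrun_args 4 localhost
horovodrun_args 4 host1 host2 host3 host4
```

The first call prints the arguments used in the single-machine command above. The second shows how the same pattern would extend to four machines with four slots each.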
Here we describe a simple example involving a shared filesystem /mnt/share using a common port number 12345 for the SSH
daemon that will be run on all the containers. /mnt/share/ssh would contain a typical id_rsa and authorized_keys
pair that allows passwordless authentication.
Note: These are not hard requirements but they make the example more concise. A shared filesystem can be replaced by rsyncing
SSH configuration and code across machines, and a common SSH port can be replaced by machine-specific ports
defined in the /root/.ssh/ssh_config file.
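As a rough illustration of the machine-specific ports mentioned in the note, the sketch below prints one OpenSSH client configuration entry per host, using the standard Host and Port keywords. The function name ssh_host_entries and the port numbers are hypothetical; the output would need to be reviewed and saved to the SSH client configuration file used inside the containers.

```shell
# Print an OpenSSH client config entry for each HOST:PORT pair.
# Usage: ssh_host_entries HOST:PORT [HOST:PORT...]
# ssh_host_entries is a hypothetical helper for illustration only.
ssh_host_entries() {
    for pair in "$@"; do
        host="${pair%%:*}"   # text before the first colon
        port="${pair##*:}"   # text after the last colon
        printf 'Host %s\n    Port %s\n' "$host" "$port"
    done
}

# Hypothetical ports, one per secondary worker.
ssh_host_entries host2:12346 host3:12347 host4:12348
```

With entries like these, the SSH daemon on each machine would be started on its own port rather than on a common one.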
Primary worker:
host1$ nvidia-docker run -it --network=host -v /mnt/share/ssh:/root/.ssh horovod:latest
root@c278c88dd552:/examples# horovodrun -np 16 -H host1:4,host2:4,host3:4,host4:4 -p 12345 python keras_mnist_advanced.py
Secondary workers:
host2$ nvidia-docker run -it --network=host -v /mnt/share/ssh:/root/.ssh horovod:latest \
    bash -c "/usr/sbin/sshd -p 12345; sleep infinity"
host3$ nvidia-docker run -it --network=host -v /mnt/share/ssh:/root/.ssh horovod:latest \
    bash -c "/usr/sbin/sshd -p 12345; sleep infinity"
host4$ nvidia-docker run -it --network=host -v /mnt/share/ssh:/root/.ssh horovod:latest \
    bash -c "/usr/sbin/sshd -p 12345; sleep infinity"
If you have Mellanox NICs, we recommend that you mount your Mellanox devices (/dev/infiniband) in the container
and enable the IPC_LOCK capability for memory registration:
$ nvidia-docker run -it --network=host -v /mnt/share/ssh:/root/.ssh --cap-add=IPC_LOCK --device=/dev/infiniband horovod:latest
root@c278c88dd552:/examples# ...
You need to specify these additional configuration options on both primary and secondary workers.
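Because the same options are needed on every worker, it can help to compute them once. The sketch below prints the two extra flags only when the device directory exists on the host, so one launch command could serve machines with and without Mellanox NICs. The function name infiniband_flags is hypothetical, and the final command is echoed as a dry run rather than executed.

```shell
# Print the extra docker flags for InfiniBand when the device path exists.
# Usage: infiniband_flags [DEVICE_PATH]   (defaults to /dev/infiniband)
# infiniband_flags is a hypothetical helper for illustration only.
infiniband_flags() {
    dev="${1:-/dev/infiniband}"
    if [ -e "$dev" ]; then
        printf '%s\n' "--cap-add=IPC_LOCK --device=$dev"
    fi
}

# Dry run: print the command instead of executing it.
# The substitution is left unquoted on purpose so the flags split into words.
echo nvidia-docker run -it --network=host -v /mnt/share/ssh:/root/.ssh \
    $(infiniband_flags) horovod:latest
```

Removing the leading echo would run the container with whichever flags apply on that machine.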