Note (September 27, 2018): the MPI examples are still not ready. Please ignore them for now!
This guide introduces how to run an Open MPI workload on OpenPAI. The following content shows some basic Open MPI examples; other customized MPI code can be run in a similar way.
- Prepare the data:
  - TensorFlow: Download the python version of the CIFAR-10 dataset from the official website:

    ```bash
    wget http://www.cs.toronto.edu/~kriz/cifar-10-python.tar.gz && tar zxvf cifar-10-python.tar.gz && rm cifar-10-python.tar.gz
    ```

    After downloading the data, upload it to HDFS:

    ```bash
    hdfs dfs -put filename hdfs://ip:port/examples/mpi/tensorflow/data
    ```

    or

    ```bash
    hdfs dfs -put filename hdfs://ip:port/examples/tensorflow/distributed-cifar-10/data
    ```

    Note that we use the same data as the TensorFlow distributed cifar-10 example. If you have already run that example, just reuse that data path.
  - CNTK: Download all the files in https://git.io/vbT5A:

    ```bash
    wget https://github.com/Microsoft/CNTK/raw/master/Examples/SequenceToSequence/CMUDict/Data/cmudict-0.7b
    ```

    and upload them to HDFS:

    ```bash
    hdfs dfs -put filename hdfs://ip:port/examples/cntk/data
    ```

    or

    ```bash
    hdfs dfs -put filename hdfs://ip:port/examples/mpi/cntk/data
    ```

    Note that we use the same data as the CNTK example. If you have already run that example, just reuse that data path.
- Prepare the executable code:
  - TensorFlow: We use the same code as the TensorFlow distributed cifar-10 example. You can follow that document.
  - CNTK: Download the script example from GitHub:

    ```bash
    wget https://github.com/Microsoft/pai/raw/master/examples/mpi/cntk-mpi.sh
    ```

    Then upload it to HDFS:

    ```bash
    hdfs dfs -put filename hdfs://ip:port/examples/mpi/cntk/code/
    ```
- Prepare a Docker image and upload it to Docker Hub. OpenPAI packages the Docker environment required by the job for users. You can refer to DOCKER.md to customize this example's Docker environment. If you have built a customized image and pushed it to Docker Hub, replace our pre-built images `openpai/pai.example.tensorflow-mpi` and `openpai/pai.example.cntk-mpi` with your own.
- Prepare a job configuration file and submit it through the web portal. The configuration examples are shown below.
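If you do customize the image, one common approach is to layer extra dependencies on top of a pre-built example image. The Dockerfile below is a hypothetical sketch, not the official build file; see DOCKER.md for the real build instructions:

```dockerfile
# Hypothetical sketch only -- see DOCKER.md for the official build files.
# Start from one of the pre-built example images and add your own dependencies.
FROM openpai/pai.example.tensorflow-mpi

# The example job commands install scipy at runtime with pip; baking it into
# the image instead saves startup time on every task.
RUN pip install scipy
```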
Note that you can simply run prepare.sh to do all of the preparation work above, but you must make sure the HDFS client is usable on your local machine. If it is, just run the shell script with your HDFS socket as the parameter:

```bash
/bin/bash prepare.sh ip:port
```
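To see what will be uploaded where before touching HDFS, the preparation steps above can be previewed as a dry run. This is a minimal sketch: `prep_cmds` is a hypothetical helper (not part of prepare.sh) that only prints the commands; the file names and HDFS paths mirror the steps above.

```shell
# Dry-run sketch: print the preparation commands for a given HDFS socket
# without executing anything.
prep_cmds() {
  hdfs_socket="$1"
  echo "wget http://www.cs.toronto.edu/~kriz/cifar-10-python.tar.gz"
  echo "hdfs dfs -put cifar-10-batches-py hdfs://${hdfs_socket}/examples/mpi/tensorflow/data"
  echo "wget https://github.com/Microsoft/CNTK/raw/master/Examples/SequenceToSequence/CMUDict/Data/cmudict-0.7b"
  echo "hdfs dfs -put cmudict-0.7b hdfs://${hdfs_socket}/examples/mpi/cntk/data"
  echo "wget https://github.com/Microsoft/pai/raw/master/examples/mpi/cntk-mpi.sh"
  echo "hdfs dfs -put cntk-mpi.sh hdfs://${hdfs_socket}/examples/mpi/cntk/code/"
}

# Preview the commands for an example socket (replace with your own ip:port).
prep_cmds "10.0.0.1:9000"
```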
Here are some configuration file examples:

tensorflow-mpi:

```js
{
  "jobName": "tensorflow-mpi",
  "image": "openpai/pai.example.tensorflow-mpi",

  // download cifar10 dataset from http://www.cs.toronto.edu/~kriz/cifar.html and upload to hdfs
  "dataDir": "$PAI_DEFAULT_FS_URI/path/tensorflow-mpi/data",
  // make a new dir for output on hdfs
  "outputDir": "$PAI_DEFAULT_FS_URI/path/tensorflow-mpi/output",
  // download code from tensorflow benchmark https://git.io/vF4wT and upload to hdfs
  "codeDir": "$PAI_DEFAULT_FS_URI/path/tensorflow-mpi/code",
  "taskRoles": [
    {
      "name": "ps_server",
      "taskNumber": 2,
      "cpuNumber": 2,
      "memoryMB": 8192,
      "gpuNumber": 0,
      "command": "pip --quiet install scipy && python code/tf_cnn_benchmarks.py --local_parameter_device=cpu --batch_size=32 --model=resnet20 --variable_update=parameter_server --data_dir=$PAI_DATA_DIR --data_name=cifar10 --train_dir=$PAI_OUTPUT_DIR --ps_hosts=$PAI_TASK_ROLE_ps_server_HOST_LIST --worker_hosts=$PAI_TASK_ROLE_worker_HOST_LIST --job_name=ps --task_index=$PAI_CURRENT_TASK_ROLE_CURRENT_TASK_INDEX --server_protocol=grpc+mpi"
    },
    {
      "name": "worker",
      "taskNumber": 2,
      "cpuNumber": 2,
      "memoryMB": 16384,
      "gpuNumber": 4,
      "command": "pip --quiet install scipy && python code/tf_cnn_benchmarks.py --local_parameter_device=cpu --batch_size=32 --model=resnet20 --variable_update=parameter_server --data_dir=$PAI_DATA_DIR --data_name=cifar10 --train_dir=$PAI_OUTPUT_DIR --ps_hosts=$PAI_TASK_ROLE_ps_server_HOST_LIST --worker_hosts=$PAI_TASK_ROLE_worker_HOST_LIST --job_name=worker --task_index=$PAI_CURRENT_TASK_ROLE_CURRENT_TASK_INDEX --server_protocol=grpc+mpi",
      "minSucceededTaskCount": 2
    }
  ],
  "retryCount": 0
}
```

cntk-mpi:

```js
{
  "jobName": "cntk-mpi",
  "image": "openpai/pai.example.cntk-mpi",

  // prepare cmudict corpus in CNTK format https://git.io/vbT5A and upload to hdfs
  "dataDir": "$PAI_DEFAULT_FS_URI/path/cntk-mpi/data",
  // make a new dir for output on hdfs
  "outputDir": "$PAI_DEFAULT_FS_URI/path/cntk-mpi/output",
  // prepare g2p distributed training script cntk-mpi.sh and upload to hdfs
  "codeDir": "$PAI_DEFAULT_FS_URI/path/cntk-mpi/code",
  "virtualCluster": "default",
  "taskRoles": [
    {
      "name": "mpi",
      "taskNumber": 1,
      "cpuNumber": 8,
      "memoryMB": 16384,
      "gpuNumber": 0,
      "command": "cd code && mpirun --allow-run-as-root -np 2 --host worker-0,worker-1 /bin/bash cntk-mpi.sh",
      "minSucceededTaskCount": 1
    },
    {
      "name": "worker",
      "taskNumber": 2,
      "cpuNumber": 8,
      "memoryMB": 16384,
      "gpuNumber": 2,
      "command": "/bin/bash"
    }
  ],
  "retryCount": 0
}
```

For more details on how to write a job configuration file, please refer to the job tutorial.
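Note that the examples above include `//` comments for explanation; strict JSON does not allow comments, so strip them before submitting. A quick hedged sanity check (`job.json` is a placeholder file name, and Python 3's standard-library `json.tool` module is assumed to be available):

```shell
# Write a comment-free version of the cntk-mpi example config to a file.
cat > job.json <<'EOF'
{
  "jobName": "cntk-mpi",
  "image": "openpai/pai.example.cntk-mpi",
  "taskRoles": [
    { "name": "mpi", "taskNumber": 1, "cpuNumber": 8, "memoryMB": 16384,
      "gpuNumber": 0,
      "command": "cd code && mpirun --allow-run-as-root -np 2 --host worker-0,worker-1 /bin/bash cntk-mpi.sh" }
  ]
}
EOF

# Strip whole-line // comments (harmless if none remain), then let Python's
# JSON parser verify that the result is valid before you paste it into the
# web portal.
sed 's|^[[:space:]]*//.*||' job.json | python3 -m json.tool > /dev/null && echo "config OK"
```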