Commit 0c57b93

Merge pull request #444 from yahoo/leewyang_wiki
move TF version specific instructions from wiki to README
2 parents e2f5cc4 + ffdf0c5

File tree

2 files changed: +109 −10 lines
README.md

Lines changed: 8 additions & 8 deletions
@@ -21,7 +21,7 @@ cluster with the following steps:
1. **Startup** - launches the TensorFlow main function on the executors, along with listeners for data/control messages.
1. **Data ingestion**
   - **InputMode.TENSORFLOW** - leverages TensorFlow's built-in APIs to read data files directly from HDFS.
-  - **InputMode.SPARK** - sends Spark RDD data to the TensorFlow nodes via the [feed_dict](https://www.tensorflow.org/how_tos/reading_data/#feeding) mechanism. Note that we leverage the [Hadoop Input/Output Format](https://github.com/tensorflow/ecosystem/tree/master/hadoop) to access TFRecords on HDFS.
+  - **InputMode.SPARK** - sends Spark RDD data to the TensorFlow nodes via a `TFNode.DataFeed` class. Note that we leverage the [Hadoop Input/Output Format](https://github.com/tensorflow/ecosystem/tree/master/hadoop) to access TFRecords on HDFS.
1. **Shutdown** - shuts down the TensorFlow workers and PS nodes on the executors.

## Table of Contents
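For context on the `TFNode.DataFeed` class referenced in the new line above: it is the executor-side half of `InputMode.SPARK`. A minimal sketch of a TensorFlow map function consuming RDD records through it (hedged; the method names follow the TensorFlowOnSpark API, but exact signatures vary between releases):

```
# Sketch only -- assumes TFNode.DataFeed with should_stop()/next_batch();
# check the installed TensorFlowOnSpark version for exact signatures.
from tensorflowonspark import TFNode

def map_fun(args, ctx):
    # ctx describes this executor's role (ps/worker) in the TF cluster
    tf_feed = TFNode.DataFeed(ctx.mgr)       # records pushed by Spark to this node
    while not tf_feed.should_stop():
        batch = tf_feed.next_batch(100)      # up to 100 RDD records as a list
        if len(batch) == 0:
            break
        # ... run one training step on `batch` ...
```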
@@ -36,17 +36,17 @@ cluster with the following steps:
## Background

TensorFlowOnSpark was developed by Yahoo for large-scale distributed
-deep learning on our Hadoop clusters in Yahoo's private cloud.
+deep learning on our Hadoop clusters in Yahoo's private cloud.

TensorFlowOnSpark provides some important benefits (see [our
blog](http://yahoohadoop.tumblr.com/post/157196317141/open-sourcing-tensorflowonspark-distributed-deep))
over alternative deep learning solutions.
-* Easily migrate all existing TensorFlow programs with <10 lines of code change;
-* Support all TensorFlow functionalities: synchronous/asynchronous training, model/data parallelism, inferencing and TensorBoard;
-* Server-to-server direct communication achieves faster learning when available;
-* Allow datasets on HDFS and other sources pushed by Spark or pulled by TensorFlow;
-* Easily integrate with your existing data processing pipelines and machine learning algorithms (ex. MLlib, CaffeOnSpark);
-* Easily deployed on cloud or on-premise: CPU & GPU, Ethernet and Infiniband.
+* Easily migrate existing TensorFlow programs with <10 lines of code change
+* Support all TensorFlow functionalities: synchronous/asynchronous training, model/data parallelism, inferencing and TensorBoard
+* Server-to-server direct communication achieves faster learning when available
+* Allow datasets on HDFS and other sources pushed by Spark or pulled by TensorFlow
+* Easily integrate with your existing Spark data processing pipelines
+* Easily deployed on cloud or on-premise and on CPUs or GPUs.

## Install

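To ground the "<10 lines of code change" bullet: the migration typically amounts to wrapping an existing TensorFlow `map_fun` in a driver-side launcher. A minimal sketch, assuming the `TFCluster` API (`run`/`train`/`shutdown`) as named in this repo; exact keyword names may vary across versions:

```
# Minimal driver-side sketch; `map_fun` is the (mostly unchanged) TensorFlow
# program, and `args`/`dataRDD` come from normal Spark code -- all assumed.
from pyspark import SparkContext
from tensorflowonspark import TFCluster

sc = SparkContext(appName="tfos_sketch")
cluster = TFCluster.run(sc, map_fun, args, num_executors=2, num_ps=1,
                        tensorboard=False, input_mode=TFCluster.InputMode.SPARK)
cluster.train(dataRDD)      # only needed for InputMode.SPARK
cluster.shutdown()
```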
examples/mnist/README.md

Lines changed: 101 additions & 2 deletions
@@ -2,6 +2,105 @@

Original Source: https://github.com/tensorflow/tensorflow/blob/master/tensorflow/tools/dist_test/python/mnist_replica.py

-Note: this has been heavily modified to support different input formats (CSV and TFRecords) as well as to demonstrate the different data ingestion methods (feed_dict and QueueRunner).
+Notes:
+- This assumes that you have already [installed Spark, TensorFlow, and TensorFlowOnSpark](https://github.com/yahoo/TensorFlowOnSpark/wiki/GetStarted_Standalone).
+- This code has been heavily modified to support different input formats (CSV and TFRecords) and different data ingestion methods (`InputMode.TENSORFLOW` and `InputMode.SPARK`).

-Please follow [these instructions](https://github.com/yahoo/TensorFlowOnSpark/wiki/GetStarted_YARN) to run this example.
### Download MNIST data

```
mkdir ${TFoS_HOME}/mnist
pushd ${TFoS_HOME}/mnist
curl -O "http://yann.lecun.com/exdb/mnist/train-images-idx3-ubyte.gz"
curl -O "http://yann.lecun.com/exdb/mnist/train-labels-idx1-ubyte.gz"
curl -O "http://yann.lecun.com/exdb/mnist/t10k-images-idx3-ubyte.gz"
curl -O "http://yann.lecun.com/exdb/mnist/t10k-labels-idx1-ubyte.gz"
popd
```
### Convert the MNIST gzip files using Spark

```
cd ${TFoS_HOME}
# rm -rf examples/mnist/csv
${SPARK_HOME}/bin/spark-submit \
--master ${MASTER} \
${TFoS_HOME}/examples/mnist/mnist_data_setup.py \
--output examples/mnist/csv \
--format csv
ls -lR examples/mnist/csv
```
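The conversion logic itself lives in `mnist_data_setup.py`; conceptually it parses the gzipped image/label files into arrays and writes each split out through Spark. A simplified, hypothetical sketch of the CSV-writing step (the real script also supports `--format tfr` and handles both train and test splits):

```
# Hypothetical sketch; random arrays stand in for the parsed MNIST data.
import numpy
from pyspark import SparkContext

sc = SparkContext(appName="mnist_csv_sketch")
images = numpy.random.randint(0, 256, (1000, 784))          # N x 784 pixel rows
labels = numpy.eye(10)[numpy.random.randint(0, 10, 1000)]   # N x 10 one-hot rows

def to_csv(row):
    return ",".join(str(int(x)) for x in row)

sc.parallelize(images, 10).map(to_csv).saveAsTextFile("examples/mnist/csv/train/images")
sc.parallelize(labels, 10).map(to_csv).saveAsTextFile("examples/mnist/csv/train/labels")
```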
### Start Spark Standalone Cluster

```
export MASTER=spark://$(hostname):7077
export SPARK_WORKER_INSTANCES=2
export CORES_PER_WORKER=1
export TOTAL_CORES=$((${CORES_PER_WORKER}*${SPARK_WORKER_INSTANCES}))
${SPARK_HOME}/sbin/start-master.sh; ${SPARK_HOME}/sbin/start-slave.sh -c $CORES_PER_WORKER -m 3G ${MASTER}
```
### Run distributed MNIST training using `InputMode.SPARK`

```
# rm -rf mnist_model
${SPARK_HOME}/bin/spark-submit \
--master ${MASTER} \
--py-files ${TFoS_HOME}/examples/mnist/spark/mnist_dist.py \
--conf spark.cores.max=${TOTAL_CORES} \
--conf spark.task.cpus=${CORES_PER_WORKER} \
--conf spark.executorEnv.JAVA_HOME="$JAVA_HOME" \
${TFoS_HOME}/examples/mnist/spark/mnist_spark.py \
--cluster_size ${SPARK_WORKER_INSTANCES} \
--images examples/mnist/csv/train/images \
--labels examples/mnist/csv/train/labels \
--format csv \
--mode train \
--model mnist_model

ls -l mnist_model
```
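For orientation: `mnist_spark.py` is the Spark driver program, while `mnist_dist.py` (shipped to the executors via `--py-files`) holds the per-node TensorFlow code. A hedged sketch of the driver's train-mode flow, paraphrasing the example's structure rather than quoting it:

```
# Hedged sketch of the mnist_spark.py train-mode flow (not verbatim source);
# assumes `sc`, `args`, and `mnist_dist` as set up by the real driver.
from tensorflowonspark import TFCluster

images = sc.textFile(args.images).map(lambda ln: [float(x) for x in ln.split(",")])
labels = sc.textFile(args.labels).map(lambda ln: [float(x) for x in ln.split(",")])
dataRDD = images.zip(labels)     # pair each image row with its one-hot label

cluster = TFCluster.run(sc, mnist_dist.map_fun, args, args.cluster_size, num_ps=1,
                        tensorboard=False, input_mode=TFCluster.InputMode.SPARK)
cluster.train(dataRDD)           # stream partitions to the waiting executors
cluster.shutdown()
```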
### Run distributed MNIST inference using `InputMode.SPARK`

```
# rm -rf predictions
${SPARK_HOME}/bin/spark-submit \
--master ${MASTER} \
--py-files ${TFoS_HOME}/examples/mnist/spark/mnist_dist.py \
--conf spark.cores.max=${TOTAL_CORES} \
--conf spark.task.cpus=${CORES_PER_WORKER} \
--conf spark.executorEnv.JAVA_HOME="$JAVA_HOME" \
${TFoS_HOME}/examples/mnist/spark/mnist_spark.py \
--cluster_size ${SPARK_WORKER_INSTANCES} \
--images examples/mnist/csv/test/images \
--labels examples/mnist/csv/test/labels \
--mode inference \
--format csv \
--model mnist_model \
--output predictions

less predictions/part-00000
```
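In inference mode the driver receives predictions back as an RDD rather than saving a model. A hedged sketch, assuming a `TFCluster.inference` method and reusing `cluster`/`dataRDD` from the training sketch above:

```
# Hedged sketch of the inference path; each returned record is a prediction
# line like the samples shown below, written to the --output directory.
predictions = cluster.inference(dataRDD)
predictions.saveAsTextFile("predictions")
```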
The prediction result should look like:
```
2017-02-10T23:29:17.009563 Label: 7, Prediction: 7
2017-02-10T23:29:17.009677 Label: 2, Prediction: 2
2017-02-10T23:29:17.009721 Label: 1, Prediction: 1
2017-02-10T23:29:17.009761 Label: 0, Prediction: 0
2017-02-10T23:29:17.009799 Label: 4, Prediction: 4
2017-02-10T23:29:17.009838 Label: 1, Prediction: 1
2017-02-10T23:29:17.009876 Label: 4, Prediction: 4
2017-02-10T23:29:17.009914 Label: 9, Prediction: 9
2017-02-10T23:29:17.009951 Label: 5, Prediction: 6
2017-02-10T23:29:17.009989 Label: 9, Prediction: 9
2017-02-10T23:29:17.010026 Label: 0, Prediction: 0
```
### Shutdown Spark cluster

```
${SPARK_HOME}/sbin/stop-slave.sh; ${SPARK_HOME}/sbin/stop-master.sh
```
