Commit 0c57b93

Merge pull request #444 from yahoo/leewyang_wiki
move TF version specific instructions from wiki to README
2 parents e2f5cc4 + ffdf0c5

File tree

2 files changed: +109 −10 lines
README.md

Lines changed: 8 additions & 8 deletions
@@ -21,7 +21,7 @@ cluster with the following steps:
1. **Startup** - launches the TensorFlow main function on the executors, along with listeners for data/control messages.
1. **Data ingestion**
   - **InputMode.TENSORFLOW** - leverages TensorFlow's built-in APIs to read data files directly from HDFS.
-  - **InputMode.SPARK** - sends Spark RDD data to the TensorFlow nodes via the [feed_dict](https://www.tensorflow.org/how_tos/reading_data/#feeding) mechanism. Note that we leverage the [Hadoop Input/Output Format](https://github.com/tensorflow/ecosystem/tree/master/hadoop) to access TFRecords on HDFS.
+  - **InputMode.SPARK** - sends Spark RDD data to the TensorFlow nodes via a `TFNode.DataFeed` class. Note that we leverage the [Hadoop Input/Output Format](https://github.com/tensorflow/ecosystem/tree/master/hadoop) to access TFRecords on HDFS.
1. **Shutdown** - shuts down the TensorFlow workers and PS nodes on the executors.

## Table of Contents
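For context on the `TFNode.DataFeed` class referenced in the new line above: it is the executor-side half of `InputMode.SPARK`. A minimal sketch of a TensorFlow map function consuming RDD records through it (hedged; the method names follow the TensorFlowOnSpark API, but exact signatures vary between releases):

```
# Sketch only -- assumes TFNode.DataFeed with should_stop()/next_batch();
# check the installed TensorFlowOnSpark version for exact signatures.
from tensorflowonspark import TFNode

def map_fun(args, ctx):
    # ctx describes this executor's role (ps/worker) in the TF cluster
    tf_feed = TFNode.DataFeed(ctx.mgr)       # records pushed by Spark to this node
    while not tf_feed.should_stop():
        batch = tf_feed.next_batch(100)      # up to 100 RDD records as a list
        if len(batch) == 0:
            break
        # ... run one training step on `batch` ...
```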
@@ -36,17 +36,17 @@ cluster with the following steps:
## Background

TensorFlowOnSpark was developed by Yahoo for large-scale distributed
-deep learning on our Hadoop clusters in Yahoo's private cloud.
+deep learning on our Hadoop clusters in Yahoo's private cloud.

TensorFlowOnSpark provides some important benefits (see [our
blog](http://yahoohadoop.tumblr.com/post/157196317141/open-sourcing-tensorflowonspark-distributed-deep))
over alternative deep learning solutions.
-* Easily migrate all existing TensorFlow programs with <10 lines of code change;
-* Support all TensorFlow functionalities: synchronous/asynchronous training, model/data parallelism, inferencing and TensorBoard;
-* Server-to-server direct communication achieves faster learning when available;
-* Allow datasets on HDFS and other sources pushed by Spark or pulled by TensorFlow;
-* Easily integrate with your existing data processing pipelines and machine learning algorithms (ex. MLlib, CaffeOnSpark);
-* Easily deployed on cloud or on-premise: CPU & GPU, Ethernet and Infiniband.
+* Easily migrate existing TensorFlow programs with <10 lines of code change
+* Support all TensorFlow functionalities: synchronous/asynchronous training, model/data parallelism, inferencing and TensorBoard
+* Server-to-server direct communication achieves faster learning when available
+* Allow datasets on HDFS and other sources pushed by Spark or pulled by TensorFlow
+* Easily integrate with your existing Spark data processing pipelines
+* Easily deployed on cloud or on-premise and on CPUs or GPUs.

## Install

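To ground the "<10 lines of code change" bullet: the migration typically amounts to wrapping an existing TensorFlow `map_fun` in a driver-side launcher. A minimal sketch, assuming the `TFCluster` API (`run`/`train`/`shutdown`) as named in this repo; exact keyword names may vary across versions:

```
# Minimal driver-side sketch; `map_fun` is the (mostly unchanged) TensorFlow
# program, and `args`/`dataRDD` come from normal Spark code -- all assumed.
from pyspark import SparkContext
from tensorflowonspark import TFCluster

sc = SparkContext(appName="tfos_sketch")
cluster = TFCluster.run(sc, map_fun, args, num_executors=2, num_ps=1,
                        tensorboard=False, input_mode=TFCluster.InputMode.SPARK)
cluster.train(dataRDD)      # only needed for InputMode.SPARK
cluster.shutdown()
```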
examples/mnist/README.md

Lines changed: 101 additions & 2 deletions
@@ -2,6 +2,105 @@

Original Source: https://github.com/tensorflow/tensorflow/blob/master/tensorflow/tools/dist_test/python/mnist_replica.py

-Note: this has been heavily modified to support different input formats (CSV and TFRecords) as well as to demonstrate the different data ingestion methods (feed_dict and QueueRunner).
+Notes:
+- This assumes that you have already [installed Spark, TensorFlow, and TensorFlowOnSpark](https://github.com/yahoo/TensorFlowOnSpark/wiki/GetStarted_Standalone).
+- This code has been heavily modified to support different input formats (CSV and TFRecords) and different data ingestion methods (`InputMode.TENSORFLOW` and `InputMode.SPARK`).

-Please follow [these instructions](https://github.com/yahoo/TensorFlowOnSpark/wiki/GetStarted_YARN) to run this example.
### Download MNIST data

```
mkdir ${TFoS_HOME}/mnist
pushd ${TFoS_HOME}/mnist
curl -O "http://yann.lecun.com/exdb/mnist/train-images-idx3-ubyte.gz"
curl -O "http://yann.lecun.com/exdb/mnist/train-labels-idx1-ubyte.gz"
curl -O "http://yann.lecun.com/exdb/mnist/t10k-images-idx3-ubyte.gz"
curl -O "http://yann.lecun.com/exdb/mnist/t10k-labels-idx1-ubyte.gz"
popd
```
### Convert the MNIST gzip files using Spark

```
cd ${TFoS_HOME}
# rm -rf examples/mnist/csv
${SPARK_HOME}/bin/spark-submit \
--master ${MASTER} \
${TFoS_HOME}/examples/mnist/mnist_data_setup.py \
--output examples/mnist/csv \
--format csv
ls -lR examples/mnist/csv
```
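The conversion logic itself lives in `mnist_data_setup.py`; conceptually it parses the gzipped image/label files into arrays and writes each split out through Spark. A simplified, hypothetical sketch of the CSV-writing step (the real script also supports `--format tfr` and handles both train and test splits):

```
# Hypothetical sketch; random arrays stand in for the parsed MNIST data.
import numpy
from pyspark import SparkContext

sc = SparkContext(appName="mnist_csv_sketch")
images = numpy.random.randint(0, 256, (1000, 784))          # N x 784 pixel rows
labels = numpy.eye(10)[numpy.random.randint(0, 10, 1000)]   # N x 10 one-hot rows

def to_csv(row):
    return ",".join(str(int(x)) for x in row)

sc.parallelize(images, 10).map(to_csv).saveAsTextFile("examples/mnist/csv/train/images")
sc.parallelize(labels, 10).map(to_csv).saveAsTextFile("examples/mnist/csv/train/labels")
```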
### Start Spark Standalone Cluster

```
export MASTER=spark://$(hostname):7077
export SPARK_WORKER_INSTANCES=2
export CORES_PER_WORKER=1
export TOTAL_CORES=$((${CORES_PER_WORKER}*${SPARK_WORKER_INSTANCES}))
${SPARK_HOME}/sbin/start-master.sh; ${SPARK_HOME}/sbin/start-slave.sh -c $CORES_PER_WORKER -m 3G ${MASTER}
```
### Run distributed MNIST training using `InputMode.SPARK`

```
# rm -rf mnist_model
${SPARK_HOME}/bin/spark-submit \
--master ${MASTER} \
--py-files ${TFoS_HOME}/examples/mnist/spark/mnist_dist.py \
--conf spark.cores.max=${TOTAL_CORES} \
--conf spark.task.cpus=${CORES_PER_WORKER} \
--conf spark.executorEnv.JAVA_HOME="$JAVA_HOME" \
${TFoS_HOME}/examples/mnist/spark/mnist_spark.py \
--cluster_size ${SPARK_WORKER_INSTANCES} \
--images examples/mnist/csv/train/images \
--labels examples/mnist/csv/train/labels \
--format csv \
--mode train \
--model mnist_model

ls -l mnist_model
```
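For orientation: `mnist_spark.py` is the Spark driver program, while `mnist_dist.py` (shipped to the executors via `--py-files`) holds the per-node TensorFlow code. A hedged sketch of the driver's train-mode flow, paraphrasing the example's structure rather than quoting it:

```
# Hedged sketch of the mnist_spark.py train-mode flow (not verbatim source);
# assumes `sc`, `args`, and `mnist_dist` as set up by the real driver.
from tensorflowonspark import TFCluster

images = sc.textFile(args.images).map(lambda ln: [float(x) for x in ln.split(",")])
labels = sc.textFile(args.labels).map(lambda ln: [float(x) for x in ln.split(",")])
dataRDD = images.zip(labels)     # pair each image row with its one-hot label

cluster = TFCluster.run(sc, mnist_dist.map_fun, args, args.cluster_size, num_ps=1,
                        tensorboard=False, input_mode=TFCluster.InputMode.SPARK)
cluster.train(dataRDD)           # stream partitions to the waiting executors
cluster.shutdown()
```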
### Run distributed MNIST inference using `InputMode.SPARK`

```
# rm -rf predictions
${SPARK_HOME}/bin/spark-submit \
--master ${MASTER} \
--py-files ${TFoS_HOME}/examples/mnist/spark/mnist_dist.py \
--conf spark.cores.max=${TOTAL_CORES} \
--conf spark.task.cpus=${CORES_PER_WORKER} \
--conf spark.executorEnv.JAVA_HOME="$JAVA_HOME" \
${TFoS_HOME}/examples/mnist/spark/mnist_spark.py \
--cluster_size ${SPARK_WORKER_INSTANCES} \
--images examples/mnist/csv/test/images \
--labels examples/mnist/csv/test/labels \
--mode inference \
--format csv \
--model mnist_model \
--output predictions

less predictions/part-00000
```
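In inference mode the driver receives predictions back as an RDD rather than saving a model. A hedged sketch, assuming a `TFCluster.inference` method and reusing `cluster`/`dataRDD` from the training sketch above:

```
# Hedged sketch of the inference path; each returned record is a prediction
# line like the samples shown below, written to the --output directory.
predictions = cluster.inference(dataRDD)
predictions.saveAsTextFile("predictions")
```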
The prediction result should look like:
```
2017-02-10T23:29:17.009563 Label: 7, Prediction: 7
2017-02-10T23:29:17.009677 Label: 2, Prediction: 2
2017-02-10T23:29:17.009721 Label: 1, Prediction: 1
2017-02-10T23:29:17.009761 Label: 0, Prediction: 0
2017-02-10T23:29:17.009799 Label: 4, Prediction: 4
2017-02-10T23:29:17.009838 Label: 1, Prediction: 1
2017-02-10T23:29:17.009876 Label: 4, Prediction: 4
2017-02-10T23:29:17.009914 Label: 9, Prediction: 9
2017-02-10T23:29:17.009951 Label: 5, Prediction: 6
2017-02-10T23:29:17.009989 Label: 9, Prediction: 9
2017-02-10T23:29:17.010026 Label: 0, Prediction: 0
```
### Shutdown Spark cluster

```
${SPARK_HOME}/sbin/stop-slave.sh; ${SPARK_HOME}/sbin/stop-master.sh
```
