
Commit 3dd0812: Merge pull request #89161 from dagiro/cats159 (cats159). 2 parents: 4e381f0 + d1701ef.

1 file changed: +13 -19 lines

articles/hdinsight/spark/apache-spark-deep-learning-caffe.md
# Use Caffe on Azure HDInsight Spark for distributed deep learning

## Introduction

Deep learning is impacting everything from healthcare to transportation to manufacturing. Companies are turning to deep learning to solve hard problems, like [image classification](https://blogs.microsoft.com/next/2015/12/10/microsoft-researchers-win-imagenet-computer-vision-challenge/), [speech recognition](https://googleresearch.blogspot.jp/2015/08/the-neural-networks-behind-google-voice.html), object recognition, and machine translation.

There are [many popular frameworks](https://en.wikipedia.org/wiki/Comparison_of_deep_learning_software), including [Microsoft Cognitive Toolkit](https://www.microsoft.com/en-us/research/product/cognitive-toolkit/), [TensorFlow](https://www.tensorflow.org/), [Apache MXNet](https://mxnet.apache.org/), and Theano. [Caffe](https://caffe.berkeleyvision.org/) is one of the most famous non-symbolic (imperative) neural network frameworks, and is widely used in many areas including computer vision. Furthermore, [CaffeOnSpark](https://yahoohadoop.tumblr.com/post/139916563586/caffeonspark-open-sourced-for-distributed-deep) combines Caffe with Apache Spark, so deep learning can be used easily on an existing Hadoop cluster. You can use deep learning together with Spark ETL pipelines, reducing system complexity and the end-to-end latency of the learning solution.

To get started, you need to install the dependencies.

    sudo ldconfig
    echo "protobuf installation done"

There are two steps in the script action. The first step installs all the required libraries, including the libraries needed both to compile Caffe (such as gflags and glog) and to run Caffe (such as NumPy). The script uses libatlas for CPU optimization, but you can always follow the CaffeOnSpark wiki to install other optimization libraries, such as MKL or CUDA (for GPU).
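
As a rough sketch of that first step, the installs look something like the following; these Ubuntu 16.04 package names are representative assumptions, not the script action's exact list:

    # Representative dependency install (assumed package subset, not the full list)
    sudo apt-get update
    sudo apt-get install -y libgflags-dev libgoogle-glog-dev libatlas-base-dev python-numpy
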
The second step downloads, compiles, and installs protobuf 2.5.0 for Caffe at runtime. Protobuf 2.5.0 [is required](https://github.com/yahoo/CaffeOnSpark/issues/87); however, this version is not available as a package on Ubuntu 16, so you need to compile it from the source code. There are also a few resources on the Internet on how to compile it; for more information, see [this walkthrough](https://jugnu-life.blogspot.com/2013/09/install-protobuf-25-on-ubuntu.html).
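
A minimal sketch of that compile-from-source step, using the classic autotools sequence; the download URL is an assumption based on the protobuf release archive:

    # Build and install protobuf 2.5.0 from source (sketch; URL assumed)
    wget https://github.com/google/protobuf/releases/download/v2.5.0/protobuf-2.5.0.tar.gz
    tar -xzf protobuf-2.5.0.tar.gz
    cd protobuf-2.5.0
    ./configure
    make -j"$(nproc)"
    sudo make install
    sudo ldconfig    # refresh the shared library cache, matching the script above
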
To get started, you can just run this script action against your cluster.

![Script Actions to Install Dependencies](./media/apache-spark-deep-learning-caffe/submit-script-action.png)

## Step 2: Build Caffe on Apache Spark for HDInsight on the head node
The second step is to build Caffe on the head node, and then distribute the compiled libraries to all the worker nodes. In this step, you must [ssh into your head node](https://docs.microsoft.com/azure/hdinsight/hdinsight-hadoop-linux-use-ssh-unix). After that, follow the [CaffeOnSpark build process](https://github.com/yahoo/CaffeOnSpark/wiki/GetStarted_yarn). Below is the script you can use to build CaffeOnSpark, with a few additional steps.

    #!/bin/bash
    git clone https://github.com/yahoo/CaffeOnSpark.git --recursive
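
The script continues beyond this excerpt. As a hedged sketch, the core build commands follow the shape of the CaffeOnSpark wiki; the exact Makefile.config edits vary by cluster and are assumptions here:

    cd CaffeOnSpark
    export CAFFE_ON_SPARK=$(pwd)
    # Configure the vendored Caffe before building (sketch; adjust for CPU_ONLY, BLAS, etc.)
    pushd caffe-public
    cp Makefile.config.example Makefile.config
    echo "INCLUDE_DIRS += ${JAVA_HOME}/include" >> Makefile.config
    popd
    make build
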
You may need to do more than what the documentation of CaffeOnSpark says. The changes are:

- Put the datasets in Blob storage, which is a shared location that is accessible to all worker nodes for later use.
- Put the compiled Caffe libraries in Blob storage, and later copy those libraries to all the nodes using script actions to avoid additional compilation time.
### Troubleshooting: An Ant BuildException has occurred: exec returned: 2

When first trying to build CaffeOnSpark, the build sometimes fails with the Ant BuildException named in this section's title.
Sometimes Maven gives a connection time-out error. You must retry after a few minutes.
### Troubleshooting: Test failure for Caffe

You probably see a test failure when doing the final check for CaffeOnSpark. This is probably related to UTF-8 encoding, but it should not affect the usage of Caffe.
Caffe uses an "expressive architecture": to compose a model, you define configuration files rather than write code.

The model that you train is a sample model for MNIST training. The MNIST database of handwritten digits has a training set of 60,000 examples and a test set of 10,000 examples. It is a subset of a larger set available from NIST. The digits have been size-normalized and centered in a fixed-size image. CaffeOnSpark has some scripts to download the dataset and convert it into the right format.
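
As a hedged sketch, that usually reduces to the helper scripts bundled with Caffe itself; the paths below assume CaffeOnSpark vendors Caffe under caffe-public:

    # Fetch MNIST and convert it to LMDB (script paths assumed)
    cd ${CAFFE_ON_SPARK}/caffe-public
    ./data/mnist/get_mnist.sh          # downloads the raw MNIST files
    ./examples/mnist/create_mnist.sh   # converts them into mnist_train_lmdb / mnist_test_lmdb
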
CaffeOnSpark provides some example network topologies for MNIST training. It has a nice design that splits the network architecture (the topology of the network) from the optimization. Two files are required in this case:

The "Solver" file (${CAFFE_ON_SPARK}/data/lenet_memory_solver.prototxt) is used for overseeing the optimization and generating parameter updates. For example, it defines whether CPU or GPU is used, what the momentum is, how many iterations to run, and so on. It also defines which neural network topology the program should use (which is the second file you need). For more information about the Solver, see the [Caffe documentation](https://caffe.berkeleyvision.org/tutorial/solver.html).
For this example, since you are using CPU rather than GPU, you should change the solver mode:

    # solver mode: CPU or GPU
    solver_mode: CPU

![HDInsight Caffe configuration example](./media/apache-spark-deep-learning-caffe/caffe-configuration1.png)
You can change other lines as needed.
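
If you want to script this edit, here is a hedged one-liner; it assumes the file currently reads solver_mode: GPU:

    # Flip the solver to CPU mode in place (assumes the GPU setting is present)
    sed -i 's/^solver_mode: GPU/solver_mode: CPU/' ${CAFFE_ON_SPARK}/data/lenet_memory_solver.prototxt
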
The second file (${CAFFE_ON_SPARK}/data/lenet_memory_train_test.prototxt) defines the network topology. You need to point its dataset paths at the shared Blob storage locations, as in the sed sketch after this list:

- change "file:/Users/mridul/bigml/demodl/mnist_train_lmdb" to "wasb:///projects/machine_learning/image_dataset/mnist_train_lmdb"
- change "file:/Users/mridul/bigml/demodl/mnist_test_lmdb/" to "wasb:///projects/machine_learning/image_dataset/mnist_test_lmdb"
![HDInsight Caffe configuration example again](./media/apache-spark-deep-learning-caffe/caffe-configuration2.png)

For more information on how to define the network, check the [Caffe documentation on the MNIST dataset](https://caffe.berkeleyvision.org/gathered/examples/mnist.html).
Since you are using YARN cluster mode, where the Spark driver is scheduled to an arbitrary container, you only see output like the following in the console:

    17/02/01 23:22:16 INFO Client: Application report for application_1485916338528_0015 (state: RUNNING)

If you want to know what happened, you usually need to get the Spark driver's log, which has more information. In this case, you need to go to the YARN UI to find the relevant YARN logs. You can reach the YARN UI at this URL:

    https://yourclustername.azurehdinsight.net/yarnui

![Apache YARN scheduler browser view](./media/apache-spark-deep-learning-caffe/apache-yarn-window-1.png)

You can take a look at how many resources are allocated for this particular application. Click the "Scheduler" link, and you will see that there are nine containers running for this application: you asked YARN to provide eight executors, and another container is for the driver process.
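
The executor count comes from the submit command. Below is a hedged sketch of a CaffeOnSpark submission with eight executors, following the shape of the CaffeOnSpark wiki; the class name, jar path, and flags are assumptions here, not the article's exact command:

    spark-submit --master yarn --deploy-mode cluster \
        --num-executors 8 \
        --files ${CAFFE_ON_SPARK}/data/lenet_memory_solver.prototxt,${CAFFE_ON_SPARK}/data/lenet_memory_train_test.prototxt \
        --class com.yahoo.ml.caffe.CaffeOnSpark \
        ${CAFFE_ON_SPARK}/caffe-grid/target/caffe-grid-0.1-SNAPSHOT-jar-with-dependencies.jar \
        -train -conf lenet_memory_solver.prototxt \
        -model wasb:///mnist.model -output wasb:///mnist_features_result
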
![HDInsight Apache YARN Scheduler view](./media/apache-spark-deep-learning-caffe/apache-yarn-scheduler.png)

You may want to check the driver logs or container logs if there are failures. For driver logs, click the application ID in the YARN UI, and then click the "Logs" button. The driver logs are written to stderr.

![Apache YARN window browser view](./media/apache-spark-deep-learning-caffe/apache-yarn-window-2.png)
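
If you prefer the command line, the YARN CLI can fetch the same aggregated logs from an SSH session on the head node; the application ID below comes from the sample output earlier:

    # Dump the aggregated container logs for the sample application
    yarn logs -applicationId application_1485916338528_0015
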
For example, you might see some of the errors below in the driver logs, indicating that you allocated too many executors.
After checking the container failure from the head node, you find that it is caused by using GPU mode:

    WARNING: Logging before InitGoogleLogging() is written to STDERR
    F0201 07:10:48.309725 11624 common.cpp:79] Cannot use GPU in CPU-only Caffe: check mode.

## Getting results

Since you are allocating 8 executors and the network topology is simple, it should take only around 30 minutes to get the result. From the command line, you can see that you put the model at wasb:///mnist.model, and the results in a folder named wasb:///mnist_features_result.
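
A quick, hedged way to confirm both outputs from the head node, assuming the cluster's default storage account is mounted at wasb:///:

    # List the trained model and the feature-extraction results
    hdfs dfs -ls wasb:///mnist.model
    hdfs dfs -ls wasb:///mnist_features_result
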
The SampleID represents the ID in the MNIST dataset, and the label is the number that the model identified.

## Conclusion

In this documentation, you installed CaffeOnSpark and ran a simple example. HDInsight is a fully managed cloud distributed compute platform, and is the best place for running machine learning and advanced analytics workloads on large data sets. For distributed deep learning, you can use Caffe on HDInsight Spark to perform deep learning tasks.
## <a name="seealso"></a>See also

* [Overview: Apache Spark on Azure HDInsight](apache-spark-overview.md)

### Scenarios

* [Apache Spark with Machine Learning: Use Spark in HDInsight for analyzing building temperature using HVAC data](apache-spark-ipython-notebook-machine-learning.md)
* [Apache Spark with Machine Learning: Use Spark in HDInsight to predict food inspection results](apache-spark-machine-learning-mllib-ipython.md)

### Manage resources

* [Manage resources for the Apache Spark cluster in Azure HDInsight](apache-spark-resource-manager.md)
