# Use Caffe on Azure HDInsight Spark for distributed deep learning
## Introduction
Deep learning is impacting everything from healthcare to transportation to manufacturing, and more. Companies are turning to deep learning to solve hard problems, like [image classification](https://blogs.microsoft.com/next/2015/12/10/microsoft-researchers-win-imagenet-computer-vision-challenge/), [speech recognition](https://googleresearch.blogspot.jp/2015/08/the-neural-networks-behind-google-voice.html), object recognition, and machine translation.
There are [many popular frameworks](https://en.wikipedia.org/wiki/Comparison_of_deep_learning_software), including [Microsoft Cognitive Toolkit](https://www.microsoft.com/en-us/research/product/cognitive-toolkit/), [TensorFlow](https://www.tensorflow.org/), [Apache MXNet](https://mxnet.apache.org/), and Theano. [Caffe](https://caffe.berkeleyvision.org/) is one of the most famous non-symbolic (imperative) neural network frameworks, and it is widely used in many areas including computer vision. Furthermore, [CaffeOnSpark](https://yahoohadoop.tumblr.com/post/139916563586/caffeonspark-open-sourced-for-distributed-deep) combines Caffe with Apache Spark, so deep learning can be used easily on an existing Hadoop cluster. You can use deep learning together with Spark ETL pipelines, reducing system complexity and end-to-end latency for complete learning solutions.
## Step 1: Install the dependencies

To get started, you need to install the dependencies; the Caffe site and the [CaffeOnSpark](https://github.com/yahoo/CaffeOnSpark) documentation describe them in detail. The script action that installs them ends like this:

```bash
sudo ldconfig
echo "protobuf installation done"
```
There are two steps in the script action. The first step is to install all the required libraries. Those libraries include the libraries needed both for compiling Caffe (such as gflags and glog) and for running Caffe (such as numpy). The script uses libatlas for CPU optimization, but you can always follow the CaffeOnSpark wiki to install other optimization libraries, such as MKL or CUDA (for GPU).
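For illustration, the install step typically reduces to something like this sketch; the exact package list is an assumption based on the standard Caffe prerequisites and the libraries named above:

```bash
# A hedged sketch of the first step of the script action: install the
# libraries needed to compile Caffe (gflags, glog, and friends) and to run it
# (numpy, libatlas). Package names assume Ubuntu 16.
sudo apt-get update
sudo apt-get install -y build-essential \
    libgflags-dev libgoogle-glog-dev \
    libhdf5-serial-dev libleveldb-dev liblmdb-dev libsnappy-dev \
    libopencv-dev libboost-all-dev libatlas-base-dev \
    python-dev python-numpy python-pip
```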
The second step is to download, compile, and install protobuf 2.5.0 for Caffe at runtime. Protobuf 2.5.0 [is required](https://github.com/yahoo/CaffeOnSpark/issues/87); however, this version is not available as a package on Ubuntu 16, so you need to compile it from the source code. There are also a few resources on the Internet about how to compile it. For more information, see [here](https://jugnu-life.blogspot.com/2013/09/install-protobuf-25-on-ubuntu.html).
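A minimal sketch of that compile-from-source step follows; the release tarball URL and the install prefix are assumptions:

```bash
# A hedged sketch: build and install protobuf 2.5.0 from source.
wget https://github.com/google/protobuf/releases/download/v2.5.0/protobuf-2.5.0.tar.gz
tar -xzf protobuf-2.5.0.tar.gz
cd protobuf-2.5.0
./configure --prefix=/usr
make -j"$(nproc)"
sudo make install
sudo ldconfig    # refresh the linker cache so Caffe can find the new library
echo "protobuf installation done"
```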
To get started, you can just run this script action against your cluster, applying it to all the head nodes and worker nodes.
## Step 2: Build Caffe on Apache Spark for HDInsight on the head node
The second step is to build Caffe on the head node, and then distribute the compiled libraries to all the worker nodes. In this step, you must [SSH into your head node](https://docs.microsoft.com/azure/hdinsight/hdinsight-hadoop-linux-use-ssh-unix). After that, follow the [CaffeOnSpark build process](https://github.com/yahoo/CaffeOnSpark/wiki/GetStarted_yarn). Below is a sketch of the kind of script you can use to build CaffeOnSpark, with a few additional steps.
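This sketch follows the GetStarted_yarn wiki; the `Makefile.config` edits are assumptions for a CPU-only cluster:

```bash
# A hedged sketch of the CaffeOnSpark build, following the GetStarted_yarn
# wiki. The Makefile.config edits are assumptions for a CPU-only cluster.
git clone https://github.com/yahoo/CaffeOnSpark.git --recursive
export CAFFE_ON_SPARK=$(pwd)/CaffeOnSpark

pushd ${CAFFE_ON_SPARK}/caffe-public
cp Makefile.config.example Makefile.config
# Build for CPU, since the cluster has no GPUs.
sed -i 's/^# CPU_ONLY := 1/CPU_ONLY := 1/' Makefile.config
echo "INCLUDE_DIRS += ${JAVA_HOME}/include" >> Makefile.config
popd

cd ${CAFFE_ON_SPARK}
make build
```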
You may need to do more than what the CaffeOnSpark documentation says. The changes are:
- Put the datasets in Blob storage, which is a shared location accessible to all worker nodes, for later use.
- Put the compiled Caffe libraries in Blob storage, and later copy them to all the nodes using script actions to avoid additional compilation time (see the sketch after this list).
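A hedged sketch of those two copy steps; the `wasb:///` destinations are illustrative assumptions, not paths required by CaffeOnSpark:

```bash
# A hedged sketch: stage the converted MNIST datasets and the compiled
# libraries in the cluster's default Blob storage. Paths are assumptions.
hadoop fs -mkdir -p wasb:///projects/mnist
hadoop fs -put -f ${CAFFE_ON_SPARK}/data/mnist_train_lmdb wasb:///projects/mnist/
hadoop fs -put -f ${CAFFE_ON_SPARK}/data/mnist_test_lmdb  wasb:///projects/mnist/

# Package the compiled libraries so a later script action can copy them to
# every node without recompiling. The source path is an assumption; adjust it
# to wherever your build placed the libraries.
tar -czf caffe-libs.tar.gz -C ${CAFFE_ON_SPARK} caffe-public/distribute
hadoop fs -put -f caffe-libs.tar.gz wasb:///CaffeOnSpark/
```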
### Troubleshooting: An Ant BuildException has occurred: exec returned: 2
When first trying to build CaffeOnSpark, the build sometimes fails with this Ant `BuildException`.
### Troubleshooting: Maven gives a connection time-out error

Sometimes Maven gives a connection time-out error while downloading dependencies. If that happens, retry after a few minutes.
### Troubleshooting: Test failure for Caffe
You probably see a test failure when doing the final check for CaffeOnSpark. This is probably related to UTF-8 encoding, but it should not impact the usage of Caffe.
## Step 3: Perform distributed deep learning training

Caffe uses an "expressive architecture": to compose a model, you just define the model in a configuration file, with no code at all.
The model that you train is a sample model for MNIST training. The MNIST database of handwritten digits has a training set of 60,000 examples and a test set of 10,000 examples. It is a subset of a larger set available from NIST. The digits have been size-normalized and centered in a fixed-size image. CaffeOnSpark has some scripts to download the dataset and convert it into the right format.
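For illustration, a hedged sketch of running those scripts, assuming the `setup-mnist.sh` helper named in the CaffeOnSpark wiki:

```bash
# A hedged sketch: download MNIST and convert it into the LMDB format that
# Caffe reads. The script name and output location are assumptions taken
# from the CaffeOnSpark wiki.
cd ${CAFFE_ON_SPARK}
./scripts/setup-mnist.sh
ls data/    # expect mnist_train_lmdb and mnist_test_lmdb here
```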
CaffeOnSpark provides some example network topologies for MNIST training. It has a nice design that splits the network architecture (the topology of the network) from the optimization. In this case, two files are required:
the "Solver" file (${CAFFE_ON_SPARK}/data/lenet_memory_solver.prototxt) is used for overseeing the optimization and generating parameter updates. For example, it defines whether CPU or GPU is used, what's the momentum, how many iterations are, etc. It also defines which neuron network topology should the program use (which is the second file you need). For more information about Solver, see [Caffe documentation](https://caffe.berkeleyvision.org/tutorial/solver.html).
168
163
For this example, since you are using CPU rather than GPU, you should change the last line of the solver file to `solver_mode: CPU`.
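A one-line sketch of that edit, assuming the file currently ends with `solver_mode: GPU` as the standard LeNet solver does:

```bash
# A hedged sketch: force CPU mode in the solver file. Assumes the file has a
# "solver_mode: GPU" line, as the standard Caffe LeNet solver does.
sed -i 's/^solver_mode: GPU/solver_mode: CPU/' \
    ${CAFFE_ON_SPARK}/data/lenet_memory_solver.prototxt
grep solver_mode ${CAFFE_ON_SPARK}/data/lenet_memory_solver.prototxt
```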
For more information on how to define the network, see the [Caffe documentation on the MNIST dataset](https://caffe.berkeleyvision.org/gathered/examples/mnist.html).
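With the solver and network files ready, you can submit the training job. The following hedged sketch is adapted from the CaffeOnSpark wiki; the executor count and the `wasb:///` paths match the ones discussed below, while the remaining flags are assumptions:

```bash
# A hedged sketch of a CaffeOnSpark training run in YARN cluster mode,
# adapted from the project's wiki; flags and paths are assumptions.
spark-submit --master yarn --deploy-mode cluster \
    --num-executors 8 \
    --files ${CAFFE_ON_SPARK}/data/lenet_memory_solver.prototxt,${CAFFE_ON_SPARK}/data/lenet_memory_train_test.prototxt \
    --class com.yahoo.ml.caffe.CaffeOnSpark \
    ${CAFFE_ON_SPARK}/caffe-grid/target/caffe-grid-0.1-SNAPSHOT-jar-with-dependencies.jar \
    -train \
    -features accuracy,loss -label label \
    -conf lenet_memory_solver.prototxt \
    -devices 1 -connection ethernet \
    -model wasb:///mnist.model \
    -output wasb:///mnist_features_result
```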
Since you are using YARN cluster mode, the Spark driver is scheduled to an arbitrary container on an arbitrary worker node, so from the submission console you only see output like the following:
```
17/02/01 23:22:16 INFO Client: Application report for application_1485916338528_0015 (state: RUNNING)
```
If you want to know what happened, you usually need to get the Spark driver's log, which has more information. In this case, you need to go to the YARN UI to find the relevant YARN logs. You can reach the YARN UI at `https://<clustername>.azurehdinsight.net/yarnui`.
You can take a look at how many resources are allocated for this particular application. Click the "Scheduler" link, and you see that nine containers are running for this application: you asked YARN to provide eight executors, and the remaining container runs the driver process.

You may want to check the driver logs or container logs if there are failures. For driver logs, click the application ID in the YARN UI, and then click the "Logs" button. The driver logs are written to stderr.

For example, the driver logs might show an error indicating that you allocated too many executors.
You can also check the container logs from the head node (for example, with `yarn logs -applicationId <application_id>`). In this case, the container failure is caused by using GPU mode in CPU-only Caffe:
```
WARNING: Logging before InitGoogleLogging() is written to STDERR
F0201 07:10:48.309725 11624 common.cpp:79] Cannot use GPU in CPU-only Caffe: check mode.
```
## Getting results
Since you are allocating eight executors and the network topology is simple, it should take only around 30 minutes to finish. From the command line, you can see that the model is written to wasb:///mnist.model and the results to a folder named wasb:///mnist_features_result.
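A short sketch for inspecting the output from the head node, assuming the result files are readable as text:

```bash
# List the trained model and read the first few result records from the
# output folder named at submission time.
hadoop fs -ls wasb:///mnist.model
hadoop fs -text wasb:///mnist_features_result/* | head
```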
The result is a list of records; the SampleID represents the ID in the MNIST dataset, and the label is the number that the model identifies.
## Conclusion
In this article, you installed CaffeOnSpark and ran a simple example. HDInsight is a fully managed cloud distributed compute platform, and it is well suited to running machine learning and advanced analytics workloads on large data sets. For distributed deep learning, you can use Caffe on HDInsight Spark to perform deep learning tasks.
## <aname="seealso"></a>See also
282
+
290
283
*[Overview: Apache Spark on Azure HDInsight](apache-spark-overview.md)
291
284
292
285
### Scenarios
* [Apache Spark with Machine Learning: Use Spark in HDInsight for analyzing building temperature using HVAC data](apache-spark-ipython-notebook-machine-learning.md)
* [Apache Spark with Machine Learning: Use Spark in HDInsight to predict food inspection results](apache-spark-machine-learning-mllib-ipython.md)
### Manage resources
* [Manage resources for the Apache Spark cluster in Azure HDInsight](apache-spark-resource-manager.md)