
Commit b09974b

lu-wang-dlsueann authored and committed
Update README to 1.0 (#123)
- Update the Spark version compatibility and release info for Release 1.0.0
- Add the example for hyperparameter tuning with KerasImageFileEstimator
- Update links to Databricks notebooks (for release 1.0.0)
1 parent 16415f7 commit b09974b

File tree: 1 file changed, +69 -13 lines


README.md

Lines changed: 69 additions & 13 deletions
@@ -17,6 +17,7 @@ Deep Learning Pipelines provides high-level APIs for scalable deep learning in P
 - [Quick user guide](#quick-user-guide)
   - [Working with images in Spark](#working-with-images-in-spark)
   - [Transfer learning](#transfer-learning)
+  - [Distributed hyperparameter tuning](#distributed-hyperparameter-tuning)
   - [Applying deep learning models at scale](#applying-deep-learning-models-at-scale)
   - [Deploying models as SQL functions](#deploying-models-as-sql-functions)
 - [License](#license)
@@ -41,7 +42,7 @@ For an overview of the library, see the Databricks [blog post](https://databrick
 
 The library is in its early days, and we welcome everyone's feedback and contributions.
 
-Maintainers: Bago Amirbekian, Joseph Bradley, Sue Ann Hong, Tim Hunter, Siddharth Murching, Tomas Nykodym
+Maintainers: Bago Amirbekian, Joseph Bradley, Yogesh Garg, Sue Ann Hong, Tim Hunter, Siddharth Murching, Tomas Nykodym, Lu Wang
 
 
 ## Building and running unit tests
@@ -52,12 +53,12 @@ To run the Python unit tests, run the `run-tests.sh` script from the `python/` d
 
 ```bash
 # Be sure to run build/sbt assembly before running the Python tests
-sparkdl$ SPARK_HOME=/usr/local/lib/spark-2.1.1-bin-hadoop2.7 PYSPARK_PYTHON=python2 SCALA_VERSION=2.11.8 SPARK_VERSION=2.1.1 ./python/run-tests.sh
+sparkdl$ SPARK_HOME=/usr/local/lib/spark-2.3.0-bin-hadoop2.7 PYSPARK_PYTHON=python3 SCALA_VERSION=2.11.8 SPARK_VERSION=2.3.0 ./python/run-tests.sh
 ```
 
 ## Spark version compatibility
 
-Spark 2.2.0 and Python 3.6 are recommended for working with the latest code. See the [travis config](https://github.com/databricks/spark-deep-learning/blob/master/.travis.yml) for the regularly-tested combinations.
+To work with the latest code, Spark 2.3.0 is required; Python 3.6 and Scala 2.11 are recommended. See the [travis config](https://github.com/databricks/spark-deep-learning/blob/master/.travis.yml) for the regularly-tested combinations.
 
 Compatibility requirements for each release are listed in the [Releases](#releases) section.
 
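For readers setting up a matching environment: a minimal sketch of pulling the released package into a PySpark session via `spark.jars.packages`. The coordinate below is assumed from the project's spark-packages naming pattern and should be verified against the release page.

```python
from pyspark.sql import SparkSession

# Coordinate assumed from the spark-packages naming pattern for the 1.0.0
# release; verify against the release page before relying on it.
spark = (SparkSession.builder
         .config("spark.jars.packages",
                 "databricks:spark-deep-learning:1.0.0-spark2.3-s_2.11")
         .getOrCreate())
```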

@@ -70,13 +71,10 @@ You can also post bug reports and feature requests in Github issues.
 
 
 ## Releases
-<!--
-TODO: might want to add TensorFlow compatibility information.
-- 1.0.0 release: Spark 2.3 required. Python 3.6 & Scala 2.11 recommended. TensorFlow 1.5.0+ required.
-  1. Using the definition of images from Spark 2.3. The new definition uses the BGR channel ordering
+- [1.0.0](https://github.com/databricks/spark-deep-learning/releases/tag/v1.0.0) release: Spark 2.3.0 required. Python 3.6 & Scala 2.11 recommended. TensorFlow 1.6.0 required.
+  1. Using the definition of images from Spark 2.3.0. The new definition uses the BGR channel ordering
      for 3-channel images instead of the RGB ordering used in this project before the change.
   2. Persistence for DeepImageFeaturizer (both Python and Scala).
--->
 - [0.3.0](https://github.com/databricks/spark-deep-learning/releases/tag/v0.3.0) release: Spark 2.2.0, Python 3.6 & Scala 2.11 recommended. TensorFlow 1.4.1- required.
   1. KerasTransformer & TFTransformer for large-scale batch inference on non-image (tensor) data.
   2. Scala API for transfer learning (`DeepImageFeaturizer`). InceptionV3 is supported.
@@ -94,20 +92,20 @@ Deep Learning Pipelines provides a suite of tools around working with and proces
 
 - [Working with images in Spark](#working-with-images-in-spark): natively in Spark DataFrames
 - [Transfer learning](#transfer-learning): a super quick way to leverage deep learning
-- Distributed hyper-parameter tuning: via Spark MLlib Pipelines (coming soon)
+- [Distributed hyperparameter tuning](#distributed-hyperparameter-tuning): via Spark MLlib Pipelines
 - [Applying deep learning models at scale - to images](#applying-deep-learning-models-at-scale): apply your own or known popular models to make predictions or transform them into features
 - [Applying deep learning models at scale - to tensors](#applying-deep-learning-models-at-scale-to-tensors): of up to 2 dimensions
 - [Deploying models as SQL functions](#deploying-models-as-sql-functions): empower everyone by making deep learning available in SQL.
 
 To try running the examples below, check out the Databricks notebook in the [Databricks docs for Deep Learning Pipelines](https://docs.databricks.com/applications/deep-learning/deep-learning-pipelines.html), which works with the latest release of Deep Learning Pipelines. Here are some Databricks notebooks compatible with earlier releases:
 [0.1.0](https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/5669198905533692/3647723071348946/3983381308530741/latest.html),
 [0.2.0](https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/5669198905533692/1674891575666800/3983381308530741/latest.html),
-[0.3.0](https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/4856334613426202/3381529530484660/4079725938146156/latest.html).
-
+[0.3.0](https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/4856334613426202/3381529530484660/4079725938146156/latest.html),
+[1.0.0](https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/6026450283250196/3874201704285756/7409402632610251/latest.html).
 
 ### Working with images in Spark
 
-The first step to applying deep learning on images is the ability to load the images. Spark and Deep Learning Pipelines include utility functions that can load millions of images into a Spark DataFrame and decode them automatically in a distributed fashion, allowing manipulation at scale.
+The first step in applying deep learning to images is the ability to load them. Spark and Deep Learning Pipelines include utility functions that can load millions of images into a Spark DataFrame and decode them automatically in a distributed fashion, allowing manipulation at scale.
 
 
 Using Spark's ImageSchema
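As a minimal sketch of the ImageSchema API named above (the directory path is hypothetical):

```python
from pyspark.ml.image import ImageSchema

# Read a directory of images into a DataFrame with a single "image" column
# (a struct of origin, height, width, nChannels, mode, and data).
image_df = ImageSchema.readImages("/data/myimages")
image_df.show()
```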
@@ -155,6 +153,64 @@ evaluator = MulticlassClassificationEvaluator(metricName="accuracy")
 print("Training set accuracy = " + str(evaluator.evaluate(predictionAndLabels)))
 ```
 
+
+### Distributed hyperparameter tuning
+
+Getting the best results in deep learning requires experimenting with different values for training parameters, an important step called hyperparameter tuning. Because Deep Learning Pipelines exposes deep learning training as a step in Spark's machine learning pipelines, users can rely on the hyperparameter tuning infrastructure already built into Spark MLlib.
+
+##### For Keras users
+To perform hyperparameter tuning with a Keras model, `KerasImageFileEstimator` can be used to build an Estimator, and MLlib's tuning tooling (e.g. `CrossValidator`) can then be applied to it. `KerasImageFileEstimator` works with image URI columns (not ImageSchema columns) in order to allow for the custom image loading and processing functions often used with Keras.
+
+To build the estimator with `KerasImageFileEstimator`, we need to have a Keras model stored as a file. The model can be a Keras built-in model or a user-trained model.
+
+```python
+from keras.applications import InceptionV3
+
+model = InceptionV3(weights="imagenet")
+model.save('/tmp/model-full.h5')
+```
+We also need an image loading function that reads the image data from a URI, preprocesses it, and returns a numerical tensor in the Keras model's input format.
+Then, we can create a `KerasImageFileEstimator` that takes our saved model file.
+```python
+import PIL.Image
+import numpy as np
+from keras.applications.imagenet_utils import preprocess_input
+from sparkdl.estimators.keras_image_file_estimator import KerasImageFileEstimator
+
+def load_image_from_uri(local_uri):
+    # Read the image, resize to InceptionV3's 299x299 input, and preprocess
+    # into the batch-of-one tensor format the Keras model expects.
+    img = (PIL.Image.open(local_uri).convert('RGB').resize((299, 299), PIL.Image.ANTIALIAS))
+    img_arr = np.array(img).astype(np.float32)
+    img_tnsr = preprocess_input(img_arr[np.newaxis, :])
+    return img_tnsr
+
+estimator = KerasImageFileEstimator(inputCol="uri",
+                                    outputCol="prediction",
+                                    labelCol="one_hot_label",
+                                    imageLoader=load_image_from_uri,
+                                    kerasOptimizer='adam',
+                                    kerasLoss='categorical_crossentropy',
+                                    modelFile='/tmp/model-full.h5')  # local path of the model saved above
+```
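Note that `labelCol` must contain one-hot encoded label vectors. A minimal sketch of deriving such a column, assuming a hypothetical `raw_train_df` with a numeric `label` column (in Spark 2.3, `OneHotEncoder` is a plain Transformer):

```python
from pyspark.ml.feature import OneHotEncoder

# dropLast=False keeps the full one-hot vector that a Keras softmax output expects.
encoder = OneHotEncoder(inputCol="label", outputCol="one_hot_label", dropLast=False)
train_df = encoder.transform(raw_train_df)
```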
+We can use it for hyperparameter tuning by doing a grid search with `CrossValidator`.
+
+```python
+from pyspark.ml.evaluation import BinaryClassificationEvaluator
+from pyspark.ml.tuning import CrossValidator, ParamGridBuilder
+
+paramGrid = (
+    ParamGridBuilder()
+    .addGrid(estimator.kerasFitParams, [{"batch_size": 32, "verbose": 0},
+                                        {"batch_size": 64, "verbose": 0}])
+    .build()
+)
+bc = BinaryClassificationEvaluator(rawPredictionCol="prediction", labelCol="label")
+cv = CrossValidator(estimator=estimator, estimatorParamMaps=paramGrid, evaluator=bc, numFolds=2)
+
+cvModel = cv.fit(train_df)
+```
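The fitted `cvModel` can then be applied like any MLlib model. A usage sketch, assuming a hypothetical `test_df` with the same `uri`, `one_hot_label`, and `label` columns as `train_df`:

```python
# Apply the best model found by the grid search to held-out data
# and score it with the same evaluator used during tuning.
predictions = cvModel.transform(test_df)
print("Best model AUC = " + str(bc.evaluate(predictions)))
```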
+
+
 ### Applying deep learning models at scale
 
 Spark DataFrames are a natural construct for applying deep learning models to a large-scale dataset. Deep Learning Pipelines provides a set of Spark MLlib Transformers for applying TensorFlow Graphs and TensorFlow-backed Keras Models at scale. The Transformers, backed by the Tensorframes library, efficiently handle the distribution of models and data to Spark workers.
@@ -211,7 +267,7 @@ For applying Keras models in a distributed manner using Spark, [`KerasImageFileT
 
 The difference in the API from `TFImageTransformer` above stems from the fact that usual Keras workflows have very specific ways to load and resize images that are not part of the TensorFlow Graph.
 
-To use the transformer, we first need to have a Keras model stored as a file. For this notebook we'll just save the Keras built-in InceptionV3 model instead of training one.
+To use the transformer, we first need to have a Keras model stored as a file. We can just save the Keras built-in InceptionV3 model instead of training one.
 
 
 ```python
