Skip to content

Commit b447aa2

Browse files
authored
Sparkml (#238)
* SparkML POC; unit tests pass for all 4 operators * Added Pipeline logic+tests; Digressed Pipeline Conversion from sklearn; all unit tests pass * Adding documentation; Changing locations of some files * fixes after relocating file; Added Profiling test to sparkml * fixing broken link * added verification step to the profiling test for sparkml * removed the zipmap step from LogisticRegression conversion; conversion has sped up already * fixing profile sparkml to include all columns * fixing the SPARK_HOME detection code * adding winutils and setting HADOOP_PATH; changes to the profiler pipeline to make it 2.7 compatible
1 parent 8ae70fc commit b447aa2

File tree

79 files changed

+40654
-9
lines changed

Some content is hidden

Large Commits have some content hidden by default. Use the searchbox below for content that may be hidden.

79 files changed

+40654
-9
lines changed

README.md

Lines changed: 10 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -10,6 +10,7 @@ ONNXMLTools enables you to convert models from different machine learning toolki
1010
* Apple Core ML
1111
* scikit-learn (subset of models convertible to ONNX)
1212
* Keras
13+
* Spark ML (experimental)
1314
* LightGBM
1415
* libsvm
1516

@@ -32,8 +33,9 @@ This package relies on ONNX, NumPy, and ProtoBuf. If you are converting a model
3233
2. CoreMLTools
3334
3. Keras (version 2.0.8 or higher) with the corresponding Tensorflow version
3435
4. LightGBM (scikit-learn interface)
35-
5. XGBoost (scikit-learn interface)
36-
6. libsvm
36+
5. SparkML (pyspark version 2.3.3 only)
37+
6. XGBoost (scikit-learn interface)
38+
7. libsvm
3739

3840
# Examples
3941
If you want the converted ONNX model to be compatible with a certain ONNX version, please specify the target_opset parameter upon invoking the convert function. The following Keras model conversion example demonstrates this below. You can identify the mapping from ONNX Operator Sets (referred to as opsets) to ONNX releases in the [versioning documentation](https://github.com/onnx/onnx/blob/master/docs/Versioning.md#released-versions).
@@ -91,6 +93,12 @@ keras_model = Model(inputs=[input1, input2], output=sub_sum)
9193
onnx_model = onnxmltools.convert_keras(keras_model, target_opset=7)
9294
```
9395

96+
## Spark ML to ONNX Conversion
97+
Please refer to the following documents:
98+
* [Conversion Framework](onnxmltools/README.md) and
99+
* [Spark ML to Onnx Model Conversion](onnxmltools/convert/sparkml/README.md)
100+
101+
94102
# Testing model converters
95103

96104
*onnxmltools* converts models into the ONNX format which

onnxmltools/__init__.py

Lines changed: 1 addition & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -21,6 +21,7 @@
2121
from .convert import convert_keras
2222
from .convert import convert_lightgbm
2323
from .convert import convert_sklearn
24+
from .convert import convert_sparkml
2425

2526
# from .convert.common.interface import *
2627

onnxmltools/convert/__init__.py

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -9,5 +9,5 @@
99
from .main import convert_libsvm
1010
from .main import convert_lightgbm
1111
from .main import convert_sklearn
12+
from .main import convert_sparkml
1213
from .main import convert_xgboost
13-

onnxmltools/convert/common/_container.py

Lines changed: 27 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -39,6 +39,33 @@ def output_names(self):
3939
raise NotImplementedError()
4040

4141

42+
class SparkmlModelContainer(RawModelContainer):
43+
44+
def __init__(self, sparkml_model):
45+
super(SparkmlModelContainer, self).__init__(sparkml_model)
46+
# Sparkml models have no input and output specified, so we create them and store them in this container.
47+
self._inputs = []
48+
self._outputs = []
49+
50+
@property
51+
def input_names(self):
52+
return [variable.raw_name for variable in self._inputs]
53+
54+
@property
55+
def output_names(self):
56+
return [variable.raw_name for variable in self._outputs]
57+
58+
def add_input(self, variable):
59+
# The order of adding variables matters. The final model's input names are sequentially added as this list
60+
if variable not in self._inputs:
61+
self._inputs.append(variable)
62+
63+
def add_output(self, variable):
64+
# The order of adding variables matters. The final model's output names are sequentially added as this list
65+
if variable not in self._outputs:
66+
self._outputs.append(variable)
67+
68+
4269
class CoremlModelContainer(RawModelContainer):
4370

4471
def __init__(self, coreml_model):

onnxmltools/convert/common/utils.py

Lines changed: 10 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -10,6 +10,16 @@
1010
from distutils.version import LooseVersion, StrictVersion
1111

1212

13+
def sparkml_installed():
14+
"""
15+
Checks that *spark* is available.
16+
"""
17+
try:
18+
import pyspark
19+
return True
20+
except ImportError:
21+
return False
22+
1323
def sklearn_installed():
1424
"""
1525
Checks that *scikit-learn* is available.

onnxmltools/convert/main.py

Lines changed: 8 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -64,6 +64,14 @@ def convert_sklearn(model, name=None, initial_types=None, doc_string='', target_
6464
return convert_skl2onnx(model, name, initial_types, doc_string, target_opset,
6565
custom_conversion_functions, custom_shape_calculators)
6666

67+
def convert_sparkml(model, name=None, initial_types=None, doc_string='', target_opset=None,
68+
targeted_onnx=onnx.__version__, custom_conversion_functions=None, custom_shape_calculators=None):
69+
if not utils.sparkml_installed():
70+
raise RuntimeError('Spark is not installed. Please install Spark to use this feature.')
71+
72+
from .sparkml.convert import convert
73+
return convert(model, name, initial_types, doc_string, target_opset, targeted_onnx,
74+
custom_conversion_functions, custom_shape_calculators)
6775

6876
def convert_xgboost(*args, **kwargs):
6977
if not utils.xgboost_installed():
@@ -72,4 +80,3 @@ def convert_xgboost(*args, **kwargs):
7280
from .xgboost.convert import convert
7381
return convert(*args, **kwargs)
7482

75-
Lines changed: 79 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,79 @@
1+
# Spark ML to Onnx Model Conversion
2+
3+
As of this writing there are only 4 SparkML Transformers/Evaluators
4+
are converted and for most of those only basic options are supported.
5+
6+
There are prep work needed above and beyond calling the API. In short these steps are:
7+
8+
* providing the API with the types of Tensors being input to the Session.
9+
* creating proper Tensors from the DataFrame you are going to use for prediction.
10+
* taking the output Tensor(s) and converting it(them) back to a DataFrame if further processing is required.
11+
12+
## Instructions
13+
For examples, please see the unit tests under `test/sparkml`
14+
15+
1- Create a list of input types needed to be supplied to the model conversion call.
16+
For simple cases you can use `buildInitialTypesSimple()` function in `convert/sparkml/utils.py`.
17+
To use this function just pass your test DataFrame.
18+
19+
Otherwise, the conversion code requires a list of tuples of input name and its Tensor type such as:
20+
```python
21+
initial_types = [
22+
("label", StringTensorType([1, 1])),
23+
# (repeat for the required inputs)
24+
]
25+
```
26+
Note that the input names are the same as columns names from your DataFrame and they must match the "inputCol(s)" values
27+
28+
you provided when you created your Pipeline.
29+
30+
2- Now you can create the ONNX model from your pipeline model like so:
31+
```python
32+
pipeline_model = pipeline.fit(training_data)
33+
onnx_model = convert_sparkml(pipeline_model, 'My Sparkml Pipeline', initial_types)
34+
```
35+
36+
3- (optional) You could save the ONNX model for future use or further examination by using the `SerializeToString()`
37+
method of ONNX model
38+
39+
```python
40+
with open("model.onnx", "wb") as f:
41+
f.write(onnx_model.SerializeToString())
42+
```
43+
44+
4- Before running this model (e.g. using `onnxruntime`) you need to create a `dict` from the input data. This dictionay
45+
will have entries for each input name and its corresponding TensorData. For simple cases you could use the function
46+
`buildInputDictSimple()` and pass your testing DataFrame to it. Otherwise, you need to create something like the following:
47+
48+
```python
49+
input_data = {}
50+
input_data['label'] = test_df.select('label').toPandas().values
51+
# ... (repeat for all desired inputs)
52+
```
53+
54+
55+
5- (optional) You could save the converted input data for possible debugging or future reuse. See below:
56+
```python
57+
with open("input_data", "wb") as f:
58+
pickle.dump(input, f)
59+
```
60+
61+
6- And finally run the newly converted ONNX model in the runtime:
62+
```python
63+
sess = onnxruntime.InferenceSession(onnx_model)
64+
output = sess.run(None, input_data)
65+
66+
```
67+
This output may need further conversion back to a DataFrame.
68+
69+
70+
## Known Issues
71+
72+
1. StringIndexer must not drop any records: StringIndexer in Spark has a `handleInvalid` option.
73+
Do not set this to 'drop'.
74+
75+
2. OneHotEncoderEstimator must not drop the last bit: OneHotEncoderEstimator has an option
76+
which you can use to make sure the last bit is included in the vector: `dropLast=False`
77+
78+
3. Use FloatTensorType for all numbers (instead of Int6t4Tensor or other variations)
79+
Lines changed: 8 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,8 @@
1+
# -------------------------------------------------------------------------
2+
# Copyright (c) Microsoft Corporation. All rights reserved.
3+
# Licensed under the MIT License. See License.txt in the project root for
4+
# license information.
5+
# --------------------------------------------------------------------------
6+
7+
from .convert import convert
8+
from .utils import *

0 commit comments

Comments
 (0)