
Commit f0fdf12

docs: update sparkml doc; cleanups. (#559)
Signed-off-by: Jason Wang <[email protected]>
1 parent e298dfb commit f0fdf12

File tree

2 files changed: +75 −57 lines changed


README.md

Lines changed: 27 additions & 11 deletions
@@ -1,14 +1,16 @@
 <!--- SPDX-License-Identifier: Apache-2.0 -->
+#
 
+![ONNXMLTools_logo_main](docs/ONNXMLTools_logo_main.png)
 
-<p align="center"><img width="40%" src="docs/ONNXMLTools_logo_main.png" /></p>
+| Linux | Windows |
+|--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
+| [![Build Status](https://dev.azure.com/onnxmltools/onnxmltools/_apis/build/status/onnxmltools-linux-conda-ci?branchName=master)](https://dev.azure.com/onnxmltools/onnxmltools/_build/latest?definitionId=3?branchName=master) | [![Build Status](https://dev.azure.com/onnxmltools/onnxmltools/_apis/build/status/onnxmltools-win32-conda-ci?branchName=master)](https://dev.azure.com/onnxmltools/onnxmltools/_build/latest?definitionId=3?branchName=master) |
 
-| Linux | Windows |
-|-------|---------|
-| [![Build Status](https://dev.azure.com/onnxmltools/onnxmltools/_apis/build/status/onnxmltools-linux-conda-ci?branchName=master)](https://dev.azure.com/onnxmltools/onnxmltools/_build/latest?definitionId=3?branchName=master)| [![Build Status](https://dev.azure.com/onnxmltools/onnxmltools/_apis/build/status/onnxmltools-win32-conda-ci?branchName=master)](https://dev.azure.com/onnxmltools/onnxmltools/_build/latest?definitionId=3?branchName=master)|
+## Introduction
 
-# Introduction
 ONNXMLTools enables you to convert models from different machine learning toolkits into [ONNX](https://onnx.ai). Currently the following toolkits are supported:
+
 * Tensorflow (a wrapper of [tf2onnx converter](https://github.com/onnx/tensorflow-onnx/))
 * scikit-learn (a wrapper of [skl2onnx converter](https://github.com/onnx/sklearn-onnx/))
 * Apple Core ML
@@ -18,22 +20,30 @@ ONNXMLTools enables you to convert models from different machine learning toolki
 * XGBoost
 * H2O
 * CatBoost
-<p>Pytorch has its builtin ONNX exporter check <a href="https://pytorch.org/docs/stable/onnx.html">here</a> for details</p>
+
+Pytorch has its builtin ONNX exporter check [here](https://pytorch.org/docs/stable/onnx.html) for details.
 
 ## Install
+
 You can install latest release of ONNXMLTools from [PyPi](https://pypi.org/project/onnxmltools/):
-```
+
+```bash
 pip install onnxmltools
 ```
+
 or install from source:
-```
+
+```bash
 pip install git+https://github.com/microsoft/onnxconverter-common
 pip install git+https://github.com/onnx/onnxmltools
 ```
+
 If you choose to install `onnxmltools` from its source code, you must set the environment variable `ONNX_ML=1` before installing the `onnx` package.
 
 ## Dependencies
+
 This package relies on ONNX, NumPy, and ProtoBuf. If you are converting a model from scikit-learn, Core ML, Keras, LightGBM, SparkML, XGBoost, H2O, CatBoost or LibSVM, you will need an environment with the respective package installed from the list below:
+
 1. scikit-learn
 2. CoreMLTools (version 3.1 or lower)
 3. Keras (version 2.0.8 or higher) with the corresponding Tensorflow version
@@ -47,9 +57,11 @@ This package relies on ONNX, NumPy, and ProtoBuf. If you are converting a model
 ONNXMLTools is tested with Python **3.7+**.
 
 # Examples
+
 If you want the converted ONNX model to be compatible with a certain ONNX version, please specify the target_opset parameter upon invoking the convert function. The following Keras model conversion example demonstrates this below. You can identify the mapping from ONNX Operator Sets (referred to as opsets) to ONNX releases in the [versioning documentation](https://github.com/onnx/onnx/blob/master/docs/Versioning.md#released-versions).
 
 ## Keras to ONNX Conversion
+
 Next, we show an example of converting a Keras model into an ONNX model with `target_opset=7`, which corresponds to ONNX release version 1.2.
 
 ```python
@@ -83,6 +95,7 @@ onnx_model = onnxmltools.convert_keras(keras_model, target_opset=7)
 ```
 
 ## CoreML to ONNX Conversion
+
 Here is a simple code snippet to convert a Core ML model into an ONNX model.
 
 ```python
@@ -100,7 +113,8 @@ onnxmltools.utils.save_model(onnx_model, 'example.onnx')
 ```
 
 ## H2O to ONNX Conversion
-Below is a code snippet to convert a H2O MOJO model into an ONNX model. The only pre-requisity is to have a MOJO model saved on the local file-system.
+
+Below is a code snippet to convert a H2O MOJO model into an ONNX model. The only prerequisite is to have a MOJO model saved on the local file-system.
 
 ```python
 import onnxmltools
@@ -122,7 +136,7 @@ backend of your choice.
 
 You can check the operator set of your converted ONNX model using [Netron](https://github.com/lutzroeder/Netron), a viewer for Neural Network models. Alternatively, you could identify your converted model's opset version through the following line of code.
 
-```
+```python
 opset_version = onnx_model.opset_import[0].version
 ```
 
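The opset check in the hunk above pairs with the opset-to-release mapping that the Examples section points to. As a quick reference, here is a sketch of a few released mappings; the values are transcribed from the ONNX versioning documentation linked in the README, so verify them there before relying on them:

```python
# Sketch: a few ONNX opset -> first-release pairs, per the ONNX
# versioning documentation (treat exact values as assumptions to verify).
OPSET_TO_RELEASE = {
    7: "1.2",   # the target_opset used in the Keras example above
    8: "1.3",
    9: "1.4",
    10: "1.5",
    11: "1.6",
}

def release_for_opset(opset: int) -> str:
    """Return the first ONNX release that shipped the given opset."""
    return OPSET_TO_RELEASE[opset]

print(release_for_opset(7))  # the README's Keras example targets ONNX 1.2
```

This is the same mapping you would consult after reading `onnx_model.opset_import[0].version` from a converted model.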
@@ -138,7 +152,8 @@ All converter unit test can generate the original model and converted model to a
 [onnxruntime](https://pypi.org/project/onnxruntime/) or
 [onnxruntime-gpu](https://pypi.org/project/onnxruntime-gpu/).
 The unit test cases are all the normal python unit test cases, you can run it with pytest command line, for example:
-```
+
+```bash
 python -m pytest --ignore .\tests\
 ```
 
@@ -159,4 +174,5 @@ be added in *tests_backend* to compute the prediction
 with the runtime.
 
 # License
+
 [Apache License v2.0](LICENSE)

onnxmltools/convert/sparkml/README.md

Lines changed: 48 additions & 46 deletions
@@ -1,6 +1,6 @@
 <!--- SPDX-License-Identifier: Apache-2.0 -->
 
-# Spark ML to Onnx Model Conversion
+# Spark ML to ONNX Model Conversion
 
 There is prep work needed above and beyond calling the API. In short these steps are:
 
@@ -9,72 +9,74 @@ There is prep work needed above and beyond calling the API. In short these steps
 * taking the output Tensor(s) and converting it(them) back to a DataFrame if further processing is required.
 
 ## Instructions
+
 For examples, please see the unit tests under `tests/sparkml`
 
-1- Create a list of input types needed to be supplied to the `convert_sparkml()` call.
-For simple cases you can use `buildInitialTypesSimple()` function in `convert/sparkml/utils.py`.
-To use this function just pass your test DataFrame.
+1. Create a list of input types needed to be supplied to the `convert_sparkml()` call.
+
+    For simple cases you can use `buildInitialTypesSimple()` function in `convert/sparkml/utils.py`.
+    To use this function just pass your test DataFrame.
+
+    Otherwise, the conversion code requires a list of tuples with input names and their corresponding Tensor types, as shown below:
 
-Otherwise, the conversion code requires a list of tuples with input names and their corresponding Tensor types, as shown below:
-```python
-initial_types = [
-    ("label", StringTensorType([1, 1])),
-    # (repeat for the required inputs)
-]
-```
-Note that the input names are the same as columns names from your DataFrame and they must match the "inputCol(s)" values
+    ```python
+    initial_types = [
+        ("label", StringTensorType([1, 1])),
+        # (repeat for the required inputs)
+    ]
+    ```
 
-you provided when you created your Pipeline.
+    Note that the input names are the same as columns names from your DataFrame and they must match the "inputCol(s)" values
 
-2- Now you can create the ONNX model from your pipeline model like so:
-```python
-pipeline_model = pipeline.fit(training_data)
-onnx_model = convert_sparkml(pipeline_model, 'My Sparkml Pipeline', initial_types)
-```
+    you provided when you created your Pipeline.
 
-3- (optional) You could save the ONNX model for future use or further examination by using the `SerializeToString()`
+2. Now you can create the ONNX model from your pipeline model like so:
+
+    ```python
+    pipeline_model = pipeline.fit(training_data)
+    onnx_model = convert_sparkml(pipeline_model, 'My Sparkml Pipeline', initial_types)
+    ```
+
+3. (optional) You could save the ONNX model for future use or further examination by using the `SerializeToString()`
 method of ONNX model
 
-```python
-with open("model.onnx", "wb") as f:
-    f.write(onnx_model.SerializeToString())
-```
+    ```python
+    with open("model.onnx", "wb") as f:
+        f.write(onnx_model.SerializeToString())
+    ```
 
-4- Before running this model (e.g. using `onnxruntime`) you need to create a `dict` from the input data. This dictionay
+4. Before running this model (e.g. using `onnxruntime`) you need to create a `dict` from the input data. This dictionary
 will have entries for each input name and its corresponding TensorData. For simple cases you could use the function
 `buildInputDictSimple()` and pass your testing DataFrame to it. Otherwise, you need to create something like the following:
 
-```python
-input_data = {}
-input_data['label'] = test_df.select('label').toPandas().values
-# ... (repeat for all desired inputs)
-```
+    ```python
+    input_data = {}
+    input_data['label'] = test_df.select('label').toPandas().values
+    # ... (repeat for all desired inputs)
+    ```
 
+5. (optional) You could save the converted input data for possible debugging or future reuse. See below:
 
-5- (optional) You could save the converted input data for possible debugging or future reuse. See below:
-```python
-with open("input_data", "wb") as f:
-    pickle.dump(input, f)
-```
+
+    ```python
+    with open("input_data", "wb") as f:
+        pickle.dump(input, f)
+    ```
 
-6- And finally run the newly converted ONNX model in the runtime:
-```python
-sess = onnxruntime.InferenceSession(onnx_model)
-output = sess.run(None, input_data)
+6. And finally run the newly converted ONNX model in the runtime:
 
-```
-This output may need further conversion back to a DataFrame.
+    ```python
+    sess = onnxruntime.InferenceSession(onnx_model)
+    output = sess.run(None, input_data)
+    ```
 
+    This output may need further conversion back to a DataFrame.
 
 ## Known Issues
 
-1. Overall invalid data handling is problematic and not implemented in most cases.
-Make sure your data is clean.
+1. Overall invalid data handling is problematic and not implemented in most cases. Make sure your data is clean.
 
-2. OneHotEncoderEstimator must not drop the last bit: OneHotEncoderEstimator has an option
-which you can use to make sure the last bit is included in the vector: `dropLast=False`
+2. When converting `OneHotEncoderModel` to ONNX, if `handleInvalid` is set to `"keep"`, then `dropLast` must be set to `True`. If `handleInvalid` is set to `"error"`, then `dropLast` must be set to `False`.
 
-3. Use FloatTensorType for all numbers (instead of Int6t4Tensor or other variations)
+3. Use `FloatTensorType` for all numbers (instead of `Int64Tensor` or other variations)
 
 4. Some conversions, such as the one for Word2Vec, can only handle batch size of 1 (one input row)
-
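Step 4 of the instructions above (building the input `dict` that `onnxruntime` expects) can be sketched without a Spark session. In this sketch plain lists stand in for the arrays that `test_df.select(name).toPandas().values` would return, and the column names `label` and `feature` are hypothetical:

```python
# Plain-Python sketch of step 4: one dict entry per ONNX input name,
# each holding an (n_rows, 1)-shaped column, mimicking the arrays that
# test_df.select(name).toPandas().values would produce.
rows = [("a", 1.0), ("b", 2.0)]     # stand-in for a two-row test DataFrame
columns = ["label", "feature"]      # hypothetical input names

input_data = {
    name: [[row[i]] for row in rows]   # column i, one value per row
    for i, name in enumerate(columns)
}

print(input_data["label"])   # [['a'], ['b']]
```

The resulting dict keys must match the input names passed in `initial_types`, since that is how the runtime binds each array to a model input.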