# Spark ML to ONNX Model Conversion

As of this writing, only 4 SparkML Transformers/Evaluators are converted,
and for most of those only basic options are supported.

There is some prep work needed above and beyond calling the API. In short, these steps are:

* providing the API with the types of the Tensors being input to the Session.
* creating proper Tensors from the DataFrame you are going to use for prediction.
* taking the output Tensor(s) and converting them back to a DataFrame if further processing is required.

## Instructions
For examples, please see the unit tests under `test/sparkml`.

1- Create a list of the input types that need to be supplied to the model conversion call.
For simple cases you can use the `buildInitialTypesSimple()` function in `convert/sparkml/utils.py`;
just pass it your test DataFrame.
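A minimal sketch of that simple case, assuming the module above is importable as `onnxmltools.convert.sparkml.utils` and that `test_df` is the DataFrame whose columns are the pipeline's inputs:
```python
# Assumed import path, based on convert/sparkml/utils.py mentioned above
from onnxmltools.convert.sparkml.utils import buildInitialTypesSimple

# builds the (name, tensor type) list from the DataFrame's columns
initial_types = buildInitialTypesSimple(test_df)
```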

Otherwise, the conversion code requires a list of tuples of input name and Tensor type, such as:
```python
# tensor types are provided by onnxmltools' data_types module
from onnxmltools.convert.common.data_types import StringTensorType

initial_types = [
    ("label", StringTensorType([1, 1])),
    # (repeat for the required inputs)
]
```
Note that the input names are the same as the column names from your DataFrame, and they must match the
`inputCol(s)` values you provided when you created your Pipeline.

2- Now you can create the ONNX model from your pipeline model like so:
```python
from onnxmltools import convert_sparkml

pipeline_model = pipeline.fit(training_data)
onnx_model = convert_sparkml(pipeline_model, 'My Sparkml Pipeline', initial_types)
```

3- (optional) You could save the ONNX model for future use or further examination by using the `SerializeToString()`
method of the ONNX model:

```python
with open("model.onnx", "wb") as f:
    f.write(onnx_model.SerializeToString())
```

4- Before running this model (e.g. using `onnxruntime`) you need to create a `dict` from the input data. This dictionary
will have an entry for each input name and its corresponding tensor data. For simple cases you can use the
`buildInputDictSimple()` function and pass your testing DataFrame to it. Otherwise, you need to create something like the following:

```python
input_data = {}
input_data['label'] = test_df.select('label').toPandas().values
# ... (repeat for all desired inputs)
```
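For the simple case mentioned above, a sketch using `buildInputDictSimple()` (assuming, as with the types helper, that it lives in `onnxmltools.convert.sparkml.utils`) could be as short as:
```python
# Assumed import path, matching the types helper above
from onnxmltools.convert.sparkml.utils import buildInputDictSimple

# builds {input name: numpy array} entries from the test DataFrame
input_data = buildInputDictSimple(test_df)
```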


5- (optional) You could save the converted input data for possible debugging or future reuse. See below:
```python
import pickle

with open("input_data", "wb") as f:
    pickle.dump(input_data, f)
```

6- And finally, run the newly converted ONNX model in the runtime:
```python
import onnxruntime

sess = onnxruntime.InferenceSession(onnx_model.SerializeToString())
output = sess.run(None, input_data)
```
This output may need further conversion back to a DataFrame.
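For example, a minimal sketch of that conversion, assuming the model's first output is a single prediction column (the output names and shapes depend on your pipeline):
```python
import pandas as pd

# sess.run returns a list of numpy arrays, one per model output;
# the column name below is only illustrative
predictions_df = pd.DataFrame(output[0], columns=["prediction"])
```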


## Known Issues

1. StringIndexer must not drop any records: StringIndexer in Spark has a `handleInvalid` option.
Do not set it to a value that drops rows (such as `skip`).
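For instance, a configuration sketch that keeps every record (column names are illustrative):
```python
from pyspark.ml.feature import StringIndexer

# 'keep' assigns invalid/unseen labels to an extra index instead of dropping rows
indexer = StringIndexer(inputCol="category", outputCol="categoryIndex",
                        handleInvalid="keep")
```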
2. OneHotEncoderEstimator must not drop the last bit: OneHotEncoderEstimator has an option
which you can use to make sure the last bit is included in the vector: `dropLast=False`.
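A sketch of that setting, assuming Spark 2.x where the class is still named `OneHotEncoderEstimator` (column names are illustrative):
```python
from pyspark.ml.feature import OneHotEncoderEstimator

# dropLast=False keeps the full one-hot vector
encoder = OneHotEncoderEstimator(inputCols=["categoryIndex"],
                                 outputCols=["categoryVec"],
                                 dropLast=False)
```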
3. Use FloatTensorType for all numbers (instead of Int64TensorType or other variations).
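For example, a numeric input would be declared as a float tensor (the column name and shape below are illustrative):
```python
from onnxmltools.convert.common.data_types import FloatTensorType

initial_types = [
    ("amount", FloatTensorType([1, 1])),  # declare numeric inputs as float
]
```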