
Commit daefee3
Ted Themistokleous authored (with co-author)

Add migx bert squad quant example (microsoft#441)
* Initial commit of BERT quantization with SQuAD v1.1 and v2.0. Added argparse options to select which version of SQuAD we want to test, batch size, sequence length and some other useful things.
* Update output of script to show MIGraphX.
* Add latency measurement for inferences.
* Add additional input options for debugging as well as io_binding for runs. Seeing a stall on larger batch sizes; adding flags for debugging.
* Fix path for vocab.txt.
* Add additional flags for resnet50 int8 run: allow for mixed precision via fp16 flag, variable batch, and variable calibration data size.
* Update compute_data().
* Fix error with JSON serialization of calibration data.
* Update script to enforce calibration cache name and change flags based on version.
* Use model path directly.
* Remove usage of path for e2e BERT model.
* Load ONNX model before calibration begins.
* Remove false from ONNX load.
* Remove strided data reader. Running into an issue with shapes in the calibration tools if this calibration read is broken up; it needs a large amount of memory to create the histogram.
* Arg changes for BERT script.
* Fix merge conflicts.
* Use ort_session instead of session.
* Fix error message with sample size.
* Additional changes to handle various model inputs, useful if we want to use another flavor of BERT for now. TODO: need to handle/fix some of the input/output arg maps vs the input data vs model inputs/outputs.
* Add sequence length to mxr file output naming.
* Add query length input parameter, another knob to tune/play with in perf runs. Right now just allow this to be the default.
* Modify script to add padding for now, due to the varying batch size used to handle features in each example. Our MIGraphX EP requires a recompile of the model if we constantly change the input dimensions or batch size of the parameters. Without this we actually cause a slowdown on the larger batch size runs as we tend to go above the feature index. The workaround is to ensure that the batch size stays constant as we feed data into the model under test for inference timing and accuracy results, repeating the same sample until we have enough data for a proper batch.
* Set toggle for save_load of models.
* Add EP option and gate out model run with save/load.
* Fix querry_length to query_len.
* Add option for calibration EP data selection.
* Additional fixes for save_load and adding CPU EP option.
* Only print quantizer info for int8 runs.
* Update README.

Co-authored-by: Ted Themistokleous <[email protected]>

1 parent f35bed1

File tree: 9 files changed, +32606 -0 lines changed

Lines changed: 81 additions & 0 deletions

# BERT QDQ Quantization in ONNX for MIGraphX

There are two main steps for the quantization:

1. Calibration is done on the SQuAD dataset to get the dynamic range of floating-point tensors in the model.
2. Q/DQ nodes with dynamic range (scale and zero-point) are inserted into the model.

After quantization is done, you can evaluate the QDQ model by running the evaluation script we provide or your own script.

The **e2e_migraphx_bert_example.py** is an end-to-end example for you to reference and run.

## Requirements
* Please build from the latest ONNX Runtime source (see [here](https://onnxruntime.ai/docs/build/eps.html#migraphx)) for now.
* MIGraphX 2.8 and above
* ROCm 5.7 and above (for calibration data)
* Python 3+
* numpy
* The ONNX model used in the script is converted from a Hugging Face BERT model (see the example command after this list): https://huggingface.co/transformers/serialization.html#converting-an-onnx-model-using-the-transformers-onnx-package
* We use the [SQuAD](https://rajpurkar.github.io/SQuAD-explorer/) dataset as the default dataset, and it is included in the repo. If you want to use another dataset for calibration/evaluation, please either follow the format of squad/dev-1.1.json or write your own pre-processing method to parse your dataset.
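
One way to produce such a model is with the Hugging Face `transformers.onnx` export package described at the link above. The checkpoint and output directory below are only an illustration, not necessarily what the example script expects:

```bash
# Export a SQuAD-finetuned BERT checkpoint (illustrative choice) to ONNX for
# question answering; adjust the checkpoint name and output directory as needed.
python -m transformers.onnx \
    --model=bert-large-uncased-whole-word-masking-finetuned-squad \
    --feature=question-answering \
    ./onnx_model/
```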

Some utility functions for dataset processing, the data reader and evaluation are from the Nvidia TensorRT demo BERT repo:
https://github.com/NVIDIA/TensorRT/tree/master/demo/BERT

Code from the TensorRT example has been reused for the MIGraphX Execution Provider to showcase how simple it is to convert CUDA and TensorRT code over to MIGraphX and ROCm within ONNX Runtime. Just change the desired Execution Provider, install the proper requirements (ROCm and MIGraphX), and run your script as you did with CUDA.

We've also added a few more input args to the script to help fine-tune the inference you'd like to run. Feel free to use --help when running:

    usage: e2e_migraphx_bert_example.py [-h] [--fp16] [--int8] [--ep EP] [--cal_ep CAL_EP] [--model MODEL]
                                        [--vocab VOCAB] [--token TOKEN] [--version VERSION] [--no_eval]
                                        [--ort_verbose] [--ort_quant] [--save_load] [--batch BATCH]
                                        [--seq_len SEQ_LEN] [--query_len QUERY_LEN] [--doc_stride DOC_STRIDE]
                                        [--cal_num CAL_NUM] [--samples SAMPLES] [--verbose]

    options:
      -h, --help            show this help message and exit
      --fp16                Perform fp16 quantization on the model before running inference
      --int8                Perform int8 quantization on the model before running inference
      --ep EP               The desired execution provider [MIGraphX, ROCm] are the options; Default is MIGraphX
      --cal_ep CAL_EP       The desired execution provider [MIGraphX, ROCm, CPU] for int8 quantization; Default is MIGraphX
      --model MODEL         Path to the desired model to be run. Default is ./model.onnx
      --vocab VOCAB         Path to the vocab of the model. Default is ./squad/vocab.txt
      --token TOKEN         Path to the tokenized inputs. Default is None and will be taken from vocab file
      --version VERSION     SQuAD dataset version. Default is 1.1. Choices are 1.1 and 2.0
      --no_eval             Turn off evaluation of output result for f1 and exact match score. Default False
      --ort_verbose         Turn on onnxruntime verbose flags
      --ort_quant           Turn on Onnxruntime Quantizer instead of MIGraphX Quantizer
      --save_load           Turn on Onnxruntime model save/load to speed up inference
      --batch BATCH         Batch size per inference
      --seq_len SEQ_LEN     Sequence length of the model. Default is 384
      --query_len QUERY_LEN
                            Max query length of the model. Default is 64
      --doc_stride DOC_STRIDE
                            Document stride of the model. Default is 128
      --cal_num CAL_NUM     Number of calibration samples for QDQ quantization in int8. Default is 100
      --samples SAMPLES     Number of samples to test with. Default is 0 (all the samples in the dataset)
      --verbose             Show verbose output
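
For example, a hypothetical int8 run with the MIGraphX EP on SQuAD v1.1 (the paths and sizes are only illustrations of the flags above):

```bash
# Quantize to int8, calibrate with the MIGraphX EP, and evaluate f1/exact-match
# on SQuAD v1.1 with batch size 1, sequence length 384 and doc stride 128.
python e2e_migraphx_bert_example.py --int8 --ep MIGraphX --cal_ep MIGraphX \
    --model ./model.onnx --vocab ./squad/vocab.txt --version 1.1 \
    --batch 1 --seq_len 384 --doc_stride 128 --cal_num 100
```
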
## Model Calibration
Before running the calibration, please set the configuration properly in order to get better performance.

* **sequence_lengths** and **doc_stride**: Always consider them together. To get a better accuracy result, choose a doc stride of 128 when using sequence length 384, and a doc stride of 32 when using sequence length 128. Generally speaking, larger sequence_lengths and doc_stride give better accuracy.
* **calib_num**: Default is 100. It's the number of examples from the dataset used for calibration.

When calling `create_calibrator(...)`, the following parameters are also configurable (a hedged sketch of the call follows this list).

* **op_type_to_quantize**: Default is ['MatMul', 'Add']. One thing to remember is that even though quantizing more node types can improve inference latency, it can also cause a significant accuracy drop, so we don't suggest quantizing every op type in the model.
* **calibrate_method**: Default is CalibrationMethod.MinMax. MinMax (CalibrationMethod.MinMax), Percentile (CalibrationMethod.Percentile) and Entropy (CalibrationMethod.Entropy) are supported. As a general rule, use the entropy algorithm for object detection models and the percentile algorithm for NLP BERT models.
* **extra_options**: Default is {}. It can accept `num_bins`, `num_quantized_bins` and `percentile` as options. If no options are given, internal default settings are used for the calibration. When using the entropy algorithm, `num_bins` (the number of histogram bins used to collect floating-point tensor data) and `num_quantized_bins` (the number of histogram bins after quantization) can be set in different combinations to fine-tune the calibration for an optimal result, for example {'num_bins': 8001, 'num_quantized_bins': 255}. When using the percentile algorithm, `num_bins` and `percentile` can be set to different values to fine-tune the calibration for a better result, for example {'num_bins': 2048, 'percentile': 99.999}.
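
As a rough sketch of how these pieces fit together (the model path, input names, and dummy data below are placeholders, and the calibrator methods differ slightly between ONNX Runtime releases, so the e2e script remains the reference):

```python
# Minimal calibration sketch using onnxruntime's quantization tooling.
import numpy as np
from onnxruntime.quantization import (
    CalibrationDataReader,
    CalibrationMethod,
    create_calibrator,
    write_calibration_table,
)

class SquadDataReader(CalibrationDataReader):
    """Feeds pre-tokenized SQuAD features to the calibrator, one batch per get_next()."""
    def __init__(self, batches):
        self._iter = iter(batches)

    def get_next(self):
        return next(self._iter, None)

# One dummy batch with assumed Hugging Face style input names (batch=1, seq_len=384).
dummy_batch = {
    "input_ids": np.zeros((1, 384), dtype=np.int64),
    "attention_mask": np.ones((1, 384), dtype=np.int64),
    "token_type_ids": np.zeros((1, 384), dtype=np.int64),
}

calibrator = create_calibrator(
    "model.onnx",                      # FP32 BERT model (placeholder path)
    ["MatMul", "Add"],                 # op types to calibrate, matching the defaults above
    augmented_model_path="augmented_model.onnx",
    calibrate_method=CalibrationMethod.Percentile,
    extra_options={"num_bins": 2048, "percentile": 99.999},
)
calibrator.set_execution_providers(["MIGraphXExecutionProvider"])  # or ROCm/CPU, cf. --cal_ep
calibrator.collect_data(SquadDataReader([dummy_batch]))
write_calibration_table(calibrator.compute_data())  # writes the calibration cache files
```
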
## QDQ Model Generation
In order to get the best performance from MIGraphX, some optimizations are applied when inserting Q/DQ nodes into the model:

* When inserting QDQ nodes around 'Add' nodes, only insert them for an 'Add' node that is followed by a ReduceMean node.
* Enable per-channel quantization on 'MatMul' nodes' weights. Please see `QDQQuantizer(...)` in the e2e example: the per_channel argument should be True, and 'QDQOpTypePerChannelSupportToAxis': {'MatMul': 1} should be specified in the extra_options argument. You can also extend 'QDQOpTypePerChannelSupportToAxis' to other op types and channel axes if they increase performance. A hedged sketch of this call follows below.

Once QDQ model generation is done, the qdq_model.onnx will be saved.
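
Continuing that sketch, the Q/DQ insertion itself can look roughly like this. The QDQQuantizer constructor's argument list changes between ONNX Runtime releases, and the e2e example also restricts which 'Add' nodes get Q/DQ pairs (only those followed by ReduceMean), so treat this purely as an outline:

```python
# Rough outline of QDQ insertion; `calibrator` comes from the calibration sketch above.
import onnx
from onnxruntime.quantization import QuantizationMode, QuantType
from onnxruntime.quantization.qdq_quantizer import QDQQuantizer

model = onnx.load("model.onnx")               # FP32 model (placeholder path)
tensor_ranges = calibrator.compute_data()     # dynamic ranges from the calibration step

quantizer = QDQQuantizer(
    model,
    True,                                     # per_channel: per-channel MatMul weights
    False,                                    # reduce_range
    QuantizationMode.QLinearOps,
    True,                                     # static quantization
    QuantType.QInt8,                          # weight type
    QuantType.QInt8,                          # activation type
    tensor_ranges,
    [],                                       # nodes_to_quantize (the example narrows 'Add' nodes here)
    [],                                       # nodes_to_exclude
    ["MatMul", "Add"],                        # op_types_to_quantize
    {"QDQOpTypePerChannelSupportToAxis": {"MatMul": 1}},  # extra_options
)
quantizer.quantize_model()
quantizer.model.save_model_to_file("qdq_model.onnx", use_external_data_format=False)
```
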
## QDQ Model Evaluation
Remember to set the environment variables ORT_MIGRAPHX_FP16_ENABLE=1 and ORT_MIGRAPHX_INT8_ENABLE=1 to run the QDQ model, for example as shown below.
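
For instance, in a bash shell (the script invocation below is just one of the flag combinations described earlier):

```bash
# Enable the MIGraphX fp16 and int8 paths in ONNX Runtime before running the QDQ model.
export ORT_MIGRAPHX_FP16_ENABLE=1
export ORT_MIGRAPHX_INT8_ENABLE=1
python e2e_migraphx_bert_example.py --int8 --fp16
```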

We use the evaluation tool from the Nvidia TensorRT demo BERT repo to evaluate the result based on SQuAD v1.1 and SQuAD v2.0.

Note: The input names of the model in the e2e example are based on the Hugging Face model's naming. If the input names are not correct for your model, please modify the code ort_session.run(["output_start_logits","output_end_logits"], inputs) in the example.
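
As a rough illustration of that call (the session setup, input names and shapes below are assumptions following the Hugging Face naming mentioned above, not the exact code in the example):

```python
# Minimal sketch of running the QDQ model with the MIGraphX EP.
import numpy as np
import onnxruntime as ort

ort_session = ort.InferenceSession(
    "qdq_model.onnx",                          # QDQ model produced above (placeholder path)
    providers=["MIGraphXExecutionProvider"],
)

inputs = {
    "input_ids": np.zeros((1, 384), dtype=np.int64),
    "attention_mask": np.ones((1, 384), dtype=np.int64),
    "token_type_ids": np.zeros((1, 384), dtype=np.int64),
}
start_logits, end_logits = ort_session.run(
    ["output_start_logits", "output_end_logits"], inputs
)
print(start_logits.shape, end_logits.shape)
```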
