Add migx bert squad quant example #441
Merged

tianleiwu merged 30 commits into microsoft:main from TedThemistokleous:add_migx_bert_squad_quant_example on Oct 25, 2024
Changes from 16 commits
Commits (30)
- 359d3eb Initial Commit of bert quantization with squad v1.1 and v2.0
- 2c5add5 Update output of script to show MIGraphX
- 8501a42 Add Latency measurement for inferences
- 149a86b Add additional input options for debugging as well as io_binding for …
- ae7d18e Fix path for vocab.txt
- cefcaef Add additional flags for resnet50 int8 run
- dd8b66d update compute_data()
- c2c8861 Fix error with json serialization of calibration data
- 24d7067 Update script to enforce calibration cache name and change flags base…
- 14b6ec4 Use model path directly
- 5383c0a Remove usage of path for e2e bert model
- f501474 Load onnx model before calibration begins
- 92b848b Remove false from onnx load
- 56586a2 Remove strided data reader
- ef737c0 Arg changes for bert script
- 167c805 Merge branch 'main' into add_migx_bert_squad_quant_example
- 109fdfe Fix merge conflicts
- 3eeccd1 use ort_session instead of session
- 7173cee Fix error message with sample size
- 31de72c Additional changes to handle various model inputs
- de45c1b Add sequence length to mxr file output naming
- 62ea38d Add query length input parameter
- e9dcf09 Modify script to add padding for now due to varying batch size used t…
- 6125ba8 Set toggle for save_load of models
- 518dcb7 Add EP option and gate out model run with save/load
- 5240ddc Fix querry_length to query_len
- 7757055 Add option for calibration EP data selection
- 09814cf Additional Fixes for save_load and adding CPU EP option
- a258158 Only print quantizer info for int8 runs
- a026198 Update README
@@ -0,0 +1,67 @@
# BERT QDQ Quantization in ONNX for MIGraphX

There are two main steps for the quantization:
1. Calibration is done based on the SQuAD dataset to get the dynamic range of the floating-point tensors in the model.
2. Q/DQ nodes with the dynamic range (scale and zero-point) are inserted into the model.

After quantization is done, you can evaluate the QDQ model by running the evaluation script we provide or your own script.

The **e2e_migraphx_bert_example.py** is an end-to-end example for you to reference and run.
## Requirements
* Please build from the latest ONNX Runtime source (see [here](https://onnxruntime.ai/docs/build/eps.html#migraphx)) for now.
  We plan to include TensorRT QDQ support later in ONNX Runtime 1.11 for the [ORT Python GPU Package](https://pypi.org/project/onnxruntime-gpu/)
* MIGraphX 2.8 and above
* ROCm 5.7 and above (for calibration data)
* Python 3+
* numpy
* The ONNX model used in the script is converted from the Hugging Face BERT model: https://huggingface.co/transformers/serialization.html#converting-an-onnx-model-using-the-transformers-onnx-package
* We use the [SQuAD](https://rajpurkar.github.io/SQuAD-explorer/) dataset as the default dataset, which is included in the repo. If you want to use another dataset for calibration/evaluation, please either follow the format of squad/dev-1.1.json to add your dataset, or write your own pre-processing method to parse it.

Some utility functions for dataset processing, the data reader, and evaluation are from the Nvidia TensorRT demo BERT repo:
https://github.com/NVIDIA/TensorRT/tree/master/demo/BERT

The TensorRT example code has been reused for the MIGraphX Execution Provider to showcase how simple it is to convert CUDA and TensorRT code to MIGraphX and ROCm within ONNX Runtime. Just change the desired Execution Provider, install the proper requirements (ROCm and MIGraphX), and run your script as you did with CUDA (a minimal sketch follows).
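A minimal sketch of that provider swap, assuming a model at ./model.onnx and a CPU fallback (this is not code taken from the PR):

```python
# Minimal sketch: the only change from a CUDA/TensorRT script is the providers list.
import onnxruntime as ort

model_path = "./model.onnx"  # assumed path; the example script takes this via --model

# CUDA / TensorRT version (for comparison):
# session = ort.InferenceSession(model_path, providers=["TensorrtExecutionProvider", "CUDAExecutionProvider"])

# MIGraphX / ROCm version, with a CPU fallback:
session = ort.InferenceSession(
    model_path,
    providers=["MIGraphXExecutionProvider", "CPUExecutionProvider"],
)
print(session.get_providers())  # confirm the MIGraphX EP was registered
```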
We've also added a few more input args to the script to help fine-tune the inference you'd like to run. Feel free to use --help when running the script.
    usage: e2e_migraphx_bert_example.py [-h] [--fp16] [--int8] [--model] [--version VERSION] [--batch BATCH] [--seq_len SEQ_LEN] [--doc_stride DOC_STRIDE] [--cal_num CAL_NUM] [--verbose]

    options:
      -h, --help            show this help message and exit
      --fp16                Perform fp16 quantization on the model before running inference
      --int8                Perform int8 quantization on the model before running inference
      --model               Path to the desired model to be run. Default is ./model.onnx
      --version VERSION     SQuAD dataset version. Default is 1.1. Choices are 1.1 and 2.0
      --batch BATCH         Batch size per inference
      --seq_len SEQ_LEN     Sequence length of the model. Default is 384
      --doc_stride DOC_STRIDE
                            Document stride of the model. Default is 128
      --cal_num CAL_NUM     Number of calibration samples for QDQ quantization in int8. Default is 100
      --verbose             Show verbose output
## Model Calibration
Before running the calibration, please set the configuration properly in order to get better performance.

* **sequence_lengths** and **doc_stride**: Always consider them together. To get better accuracy, choose a doc stride of 128 when using sequence length 384, and a doc stride of 32 when using sequence length 128. Generally speaking, larger sequence_lengths and doc_stride give better accuracy.
* **calib_num**: Default is 100. It's the number of examples in the dataset used for calibration.

When calling `create_calibrator(...)`, the following parameters are also configurable (see the sketch after this list).
* **op_type_to_quantize**: Default is ['MatMul', 'Add']. Keep in mind that even though quantizing more node types improves inference latency, it can result in a significant accuracy drop, so we don't suggest quantizing all op types in the model.
* **calibrate_method**: Default is CalibrationMethod.MinMax. MinMax (CalibrationMethod.MinMax), Percentile (CalibrationMethod.Percentile) and Entropy (CalibrationMethod.Entropy) are supported. Generally, use the entropy algorithm for object detection models and the percentile algorithm for NLP BERT models.
* **extra_options**: Default is {}. It can accept `num_bins`, `num_quantized_bins` and `percentile` as options. If no options are given, internal default settings are used for the calibration. When using the entropy algorithm, `num_bins` (the number of histogram bins for collecting floating-point tensor data) and `num_quantized_bins` (the number of histogram bins after quantization) can be set in different combinations to fine-tune the calibration for an optimal result, for example, {'num_bins': 8001, 'num_quantized_bins': 255}. When using the percentile algorithm, `num_bins` and `percentile` can be set to different values to fine-tune the calibration for a better result, for example, {'num_bins': 2048, 'percentile': 99.999}.
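A hedged sketch of how those parameters fit together, assuming the public `onnxruntime.quantization` API and an illustrative model path (this is not code copied from the example script):

```python
# Sketch only: the model path and data reader are placeholders for the SQuAD-based reader in the example.
from onnxruntime.quantization import CalibrationMethod, create_calibrator

calibrator = create_calibrator(
    "model.onnx",                                   # float ONNX model (assumed path)
    ["MatMul", "Add"],                              # op types to calibrate, matching the default above
    augmented_model_path="augmented_model.onnx",
    calibrate_method=CalibrationMethod.Percentile,  # percentile is suggested for BERT-style NLP models
    extra_options={"num_bins": 2048, "percentile": 99.999},
)

# data_reader would be a CalibrationDataReader yielding {input_name: ndarray} dicts built from SQuAD:
# calibrator.collect_data(data_reader)
# dynamic_ranges = calibrator.compute_data()   # tensor ranges later used to insert Q/DQ nodes
```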
## QDQ Model Generation
In order to get the best performance from MIGraphX, some optimizations are applied when inserting Q/DQ nodes into the model:
* When inserting QDQ nodes for an 'Add' node, only insert them if the 'Add' node is followed by a ReduceMean node.
* Enable per-channel quantization on 'MatMul' nodes' weights. Please see `QDQQuantizer(...)` in the e2e example: the per_channel argument should be True, and 'QDQOpTypePerChannelSupportToAxis': {'MatMul': 1} should be specified in the extra_options argument. You can also add other op types and channel axes to 'QDQOpTypePerChannelSupportToAxis' if they increase performance.

Once QDQ model generation is done, the qdq_model.onnx will be saved (see the sketch below).
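The example script drives `QDQQuantizer(...)` directly, and its constructor arguments vary between ONNX Runtime versions. As a version-stable illustration of the same settings, here is a hedged sketch using the public `quantize_static` helper in QDQ mode (a swapped-in convenience API, not the exact call from the script):

```python
# Sketch only: paths are placeholders, and SquadDataReader stands in for the example's SQuAD reader.
from onnxruntime.quantization import CalibrationDataReader, QuantFormat, QuantType, quantize_static

class SquadDataReader(CalibrationDataReader):
    """Placeholder reader; the real example yields {input_name: ndarray} dicts built from SQuAD."""
    def __init__(self, samples):
        self._iter = iter(samples)

    def get_next(self):
        return next(self._iter, None)

quantize_static(
    "model.onnx",                       # float model (assumed path)
    "qdq_model.onnx",                   # QDQ model written here
    SquadDataReader(samples=[]),        # supply real calibration samples in practice
    quant_format=QuantFormat.QDQ,
    op_types_to_quantize=["MatMul", "Add"],
    per_channel=True,                   # per-channel quantization of MatMul weights
    activation_type=QuantType.QInt8,
    weight_type=QuantType.QInt8,
    extra_options={"QDQOpTypePerChannelSupportToAxis": {"MatMul": 1}},
)
```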
## QDQ Model Evaluation
Remember to set the environment variables ORT_TENSORRT_FP16_ENABLE=1 and ORT_TENSORRT_INT8_ENABLE=1 to run the QDQ model.
We use the evaluation tool from the Nvidia TensorRT demo BERT repo to evaluate the result based on SQuAD v1.1 and SQuAD v2.0.

Note: The input names of the model in the e2e example are based on the Hugging Face model's naming. If the input names are not correct for your model, please modify the ort_session.run(["output_start_logits","output_end_logits"], inputs) call in the example (a minimal sketch follows).
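A minimal sketch of that call, assuming the Hugging Face export's input names (input_ids, attention_mask, token_type_ids) and a QDQ model saved as qdq_model.onnx:

```python
# Sketch only: dummy inputs stand in for tokenized SQuAD features.
import numpy as np
import onnxruntime as ort

ort_session = ort.InferenceSession(
    "qdq_model.onnx",
    providers=["MIGraphXExecutionProvider", "CPUExecutionProvider"],
)

seq_len = 384  # must match the sequence length used when building the model
inputs = {
    "input_ids": np.zeros((1, seq_len), dtype=np.int64),       # token ids from the SQuAD tokenizer
    "attention_mask": np.ones((1, seq_len), dtype=np.int64),
    "token_type_ids": np.zeros((1, seq_len), dtype=np.int64),  # segment ids
}

start_logits, end_logits = ort_session.run(
    ["output_start_logits", "output_end_logits"],  # rename these if your model uses other output names
    inputs,
)
print(start_logits.shape, end_logits.shape)
```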