Merged

Commits (30):
- `359d3eb` Initial Commit of bert quantization with squad v1.1 and v2.0 (Jan 22, 2024)
- `2c5add5` Update output of script to show MIGraphX (Jan 22, 2024)
- `8501a42` Add Latency measurement for inferences (Jan 22, 2024)
- `149a86b` Add additional input options for debugging as well as io_binding for … (Jan 23, 2024)
- `ae7d18e` Fix path for vocab.txt (Jan 25, 2024)
- `cefcaef` Add additional flags for resnet50 int8 run (Mar 18, 2024)
- `dd8b66d` update compute_data() (Apr 2, 2024)
- `c2c8861` Fix error with json serialization of calibration data (Apr 19, 2024)
- `24d7067` Update script to enforce calibration cache name and change flags base… (Apr 24, 2024)
- `14b6ec4` Use model path directly (TedThemistokleous, Apr 26, 2024)
- `5383c0a` Remove usage of path for e2e bert model (May 4, 2024)
- `f501474` Load onnx model before calibration begins (May 4, 2024)
- `92b848b` Remove false from onnx load (May 4, 2024)
- `56586a2` Remove strided data reader (TedThemistokleous, May 8, 2024)
- `ef737c0` Arg changes for bert script (TedThemistokleous, Jun 11, 2024)
- `167c805` Merge branch 'main' into add_migx_bert_squad_quant_example (TedThemistokleous, Jun 11, 2024)
- `109fdfe` Fix merge conflicts (TedThemistokleous, Jun 12, 2024)
- `3eeccd1` use ort_session instead of session (TedThemistokleous, Jun 20, 2024)
- `7173cee` Fix error message with sample size (TedThemistokleous, Jun 21, 2024)
- `31de72c` Additional changes to handle various model inputs (TedThemistokleous, Jul 3, 2024)
- `de45c1b` Add sequence length to mxr file output naming (Jul 3, 2024)
- `62ea38d` Add query length input parameter (Jul 4, 2024)
- `e9dcf09` Modify script to add padding for now due to varying batch size used t… (Jul 5, 2024)
- `6125ba8` Set toggle for save_load of models (Jul 9, 2024)
- `518dcb7` Add EP option and gate out model run with save/load (TedThemistokleous, Jul 11, 2024)
- `5240ddc` Fix querry_length to query_len (TedThemistokleous, Jul 16, 2024)
- `7757055` Add option for calibration EP data selection (Aug 6, 2024)
- `09814cf` Additional Fixes for save_load and adding CPU EP option (Aug 7, 2024)
- `a258158` Only print quantizer info for int8 runs (Aug 9, 2024)
- `a026198` Update README (TedThemistokleous, Oct 23, 2024)

67 changes: 67 additions & 0 deletions quantization/nlp/bert/migraphx/README.md
@@ -0,0 +1,67 @@
# BERT QDQ Quantization in ONNX for MIGraphX
There are two main steps in the quantization:
1. Calibration is run on the SQuAD dataset to obtain the dynamic range of the floating-point tensors in the model
2. Q/DQ nodes with the dynamic range (scale and zero-point) are inserted into the model

After quantization is done, you can evaluate the QDQ model by running the evaluation script we provide, or your own.

The **e2e_migraphx_bert_example.py** is an end-to-end example for you to reference and run.

## Requirements
* For now, please build ONNX Runtime from the latest source with the MIGraphX execution provider enabled (see [here](https://onnxruntime.ai/docs/build/eps.html#migraphx))
* MIGraphX 2.8 and above
* ROCm 5.7 and above (For calibration data)
* Python 3+
* numpy
* The ONNX model used in the script is converted from the Hugging Face BERT model: https://huggingface.co/transformers/serialization.html#converting-an-onnx-model-using-the-transformers-onnx-package
* We use the [SQuAD](https://rajpurkar.github.io/SQuAD-explorer/) dataset, which is included in the repo, as the default dataset. If you want to use another dataset for calibration/evaluation, either follow the format of squad/dev-1.1.json to add your dataset or write your own pre-processing method to parse it.

Some utility functions for dataset processing, the data reader, and evaluation are taken from the Nvidia TensorRT demo BERT repo:
https://github.com/NVIDIA/TensorRT/tree/master/demo/BERT

Code from the TensorRT example has been reused for the MIGraphX Execution Provider to showcase how simple it is to convert CUDA and TensorRT code to MIGraphX and ROCm within ONNX Runtime. Just change the desired Execution Provider, install the proper requirements (ROCm and MIGraphX), and run your script as you did with CUDA.
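
For illustration, a minimal sketch of that provider swap (the model path and exact provider list here are assumptions, not code taken from the script):

```python
import onnxruntime as ort

# Minimal sketch: request the MIGraphX EP first, falling back to ROCm and CPU.
# "model.onnx" is a placeholder; the provider names are the ones ONNX Runtime
# registers when built with the corresponding execution providers enabled.
ort_session = ort.InferenceSession(
    "model.onnx",
    providers=[
        "MIGraphXExecutionProvider",
        "ROCMExecutionProvider",
        "CPUExecutionProvider",
    ],
)
```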

We've also added a few more input args to the script to help fine-tune the inference you'd like to run. Feel free to use the `--help` flag when running the script:


```
usage: e2e_migraphx_bert_example.py [-h] [--fp16] [--int8] [--model] [--version VERSION] [--batch BATCH] [--seq_len SEQ_LEN] [--doc_stride DOC_STRIDE] [--cal_num CAL_NUM] [--verbose]

options:
  -h, --help            show this help message and exit
  --fp16                Perform fp16 quantization on the model before running inference
  --int8                Perform int8 quantization on the model before running inference
  --model               Path to the desired model to be run. Default is ./model.onnx
  --version VERSION     SQuAD dataset version. Default is 1.1. Choices are 1.1 and 2.0
  --batch BATCH         Batch size per inference
  --seq_len SEQ_LEN     Sequence length of the model. Default is 384
  --doc_stride DOC_STRIDE
                        Document stride of the model. Default is 128
  --cal_num CAL_NUM     Number of calibration samples for QDQ quantization in int8. Default is 100
  --verbose             Show verbose output
```


## Model Calibration
Before running the calibration, please set the configuration properly in order to get better results.

* **sequence_lengths** and **doc_stride** : Always consider them together. For better accuracy, choose a doc stride of 128 when using sequence length 384, and a doc stride of 32 when using sequence length 128. Generally speaking, larger sequence_lengths and doc_stride values give better accuracy.
* **calib_num** : Default is 100. It's the number of examples from the dataset used for calibration; the examples are fed to the calibrator through a data reader, sketched below.
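
The calibration examples are supplied through ONNX Runtime's `CalibrationDataReader` interface. A minimal sketch (the class name and the `batches` argument are hypothetical; the script's own reader also performs the SQuAD pre-processing):

```python
from onnxruntime.quantization import CalibrationDataReader

class BertCalibrationDataReader(CalibrationDataReader):  # hypothetical name
    """Feeds calib_num pre-processed SQuAD examples to the calibrator."""

    def __init__(self, batches):
        # `batches`: an iterable of {input_name: numpy array} feed dicts,
        # e.g. {"input_ids": ..., "attention_mask": ..., "token_type_ids": ...}
        self._iter = iter(batches)

    def get_next(self):
        # Return the next feed dict, or None when calibration data is exhausted.
        return next(self._iter, None)
```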

When calling `create_calibrator(...)`, the following parameters are also configurable (a combined sketch follows this list).
* **op_type_to_quantize** : Default is ['MatMul', 'Add']. One thing to remember is that even though quantizing more nodes improves inference latency, it can cause a significant accuracy drop, so we don't suggest quantizing all op types in the model.
* **calibrate_method** : Default is CalibrationMethod.MinMax. MinMax (CalibrationMethod.MinMax), Percentile (CalibrationMethod.Percentile) and Entropy (CalibrationMethod.Entropy) are supported. Note that, generally, the entropy algorithm is used for object detection models and the percentile algorithm for NLP BERT models.
* **extra_options** : Default is {}. It accepts `num_bins`, `num_quantized_bins` and `percentile` as options. If no options are given, internal default settings are used for the calibration. When using the entropy algorithm, `num_bins` (the number of histogram bins for collecting floating-point tensor data) and `num_quantized_bins` (the number of histogram bins after quantization) can be set in different combinations to fine-tune the calibration for an optimal result, for example {'num_bins': 8001, 'num_quantized_bins': 255}. When using the percentile algorithm, `num_bins` and `percentile` can be set to different values to fine-tune the calibration, for example {'num_bins': 2048, 'percentile': 99.999}.
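
Putting these parameters together, a minimal sketch of the calibration step under the percentile settings above (paths are placeholders, and `data_reader` is the reader sketched earlier):

```python
from onnxruntime.quantization import CalibrationMethod, create_calibrator

# Illustrative values; paths are placeholders.
calibrator = create_calibrator(
    "model.onnx",                                 # FP32 model to calibrate
    ["MatMul", "Add"],                            # op types to collect ranges for
    augmented_model_path="augmented_model.onnx",  # model with added range-collection outputs
    calibrate_method=CalibrationMethod.Percentile,
    extra_options={"num_bins": 2048, "percentile": 99.999},
)
calibrator.collect_data(data_reader)       # run calibration batches through the model
compute_range = calibrator.compute_data()  # recent ORT; older releases use compute_range()
```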

## QDQ Model Generation
In order to get the best performance from MIGraphX, some optimizations are done when inserting Q/DQ nodes into the model (see the sketch after this list).
* When inserting Q/DQ nodes around 'Add' nodes, only insert them for 'Add' nodes that are followed by a ReduceMean node
* Enable per-channel quantization on 'MatMul' nodes' weights. Please see `QDQQuantizer(...)` in the e2e example: the per_channel argument should be True, and 'QDQOpTypePerChannelSupportToAxis': {'MatMul': 1} should be specified in the extra_options argument. You can also extend 'QDQOpTypePerChannelSupportToAxis' to other op types and channel axes if they increase performance.
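
The e2e example drives ONNX Runtime's internal `QDQQuantizer` class directly. As a hedged sketch only, roughly the same QDQ model can be produced through the public `quantize_static` API (paths are placeholders; `data_reader` is the calibration reader from earlier; this is not the script's exact code):

```python
from onnxruntime.quantization import (
    CalibrationMethod,
    QuantFormat,
    QuantType,
    quantize_static,
)

# Sketch via the public API; the e2e script instantiates QDQQuantizer itself.
quantize_static(
    "model.onnx",                 # input FP32 model (placeholder path)
    "qdq_model.onnx",             # output QDQ model
    data_reader,                  # CalibrationDataReader feeding SQuAD samples
    quant_format=QuantFormat.QDQ,
    op_types_to_quantize=["MatMul", "Add"],
    per_channel=True,             # per-channel MatMul weights, as noted above
    weight_type=QuantType.QInt8,
    activation_type=QuantType.QInt8,
    calibrate_method=CalibrationMethod.Percentile,
    extra_options={"QDQOpTypePerChannelSupportToAxis": {"MatMul": 1}},
)
```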

Once QDQ model generation is done, the model is saved as qdq_model.onnx.

## QDQ Model Evaluation
Remember to set the environment variables ORT_MIGRAPHX_FP16_ENABLE=1 and ORT_MIGRAPHX_INT8_ENABLE=1 to run the QDQ model with the MIGraphX EP.
We use the evaluation tool from the Nvidia TensorRT demo BERT repo to evaluate the results on SQuAD v1.1 and SQuAD v2.0.

Note: the model input and output names in the e2e example are based on the Hugging Face model's naming. If the names don't match your model, please modify the `ort_session.run(["output_start_logits","output_end_logits"], inputs)` call in the example; the snippet below shows one way to check them.
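
For reference, a small sketch (model path assumed) that sets the MIGraphX precision toggles before session creation and prints the model's actual input/output names so the `run(...)` call can be adjusted:

```python
import os
import onnxruntime as ort

# MIGraphX EP precision toggles; set them before the session is created.
os.environ["ORT_MIGRAPHX_FP16_ENABLE"] = "1"
os.environ["ORT_MIGRAPHX_INT8_ENABLE"] = "1"

ort_session = ort.InferenceSession(
    "qdq_model.onnx",  # placeholder path to the generated QDQ model
    providers=["MIGraphXExecutionProvider"],
)
print([i.name for i in ort_session.get_inputs()])   # e.g. input_ids, attention_mask, ...
print([o.name for o in ort_session.get_outputs()])  # adjust ort_session.run(...) to match
```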
