```bash
(optimum-onnx) (base) ilyas@hf-dgx-01:~/optimum-onnx$ time optimum-cli export onnx -h
usage: optimum-cli export onnx [-h] -m MODEL [--task TASK] [--opset OPSET] [--device DEVICE] [--dtype {fp32,fp16,bf16}]
[--optimize {O1,O2,O3,O4}] [--monolith] [--no-post-process] [--variant VARIANT] [--framework {pt}] [--atol ATOL]
[--cache_dir CACHE_DIR] [--trust-remote-code] [--pad_token_id PAD_TOKEN_ID]
[--library-name {transformers,diffusers,timm,sentence_transformers}] [--model-kwargs MODEL_KWARGS]
[--no-dynamic-axes] [--no-constant-folding] [--slim] [--batch_size BATCH_SIZE] [--sequence_length SEQUENCE_LENGTH]
[--num_choices NUM_CHOICES] [--width WIDTH] [--height HEIGHT] [--num_channels NUM_CHANNELS]
[--feature_size FEATURE_SIZE] [--nb_max_frames NB_MAX_FRAMES] [--audio_sequence_length AUDIO_SEQUENCE_LENGTH]
[--point_batch_size POINT_BATCH_SIZE] [--nb_points_per_image NB_POINTS_PER_IMAGE]
[--visual_seq_length VISUAL_SEQ_LENGTH]
output
options:
-h, --help show this help message and exit
Required arguments:
-m MODEL, --model MODEL
Model ID on huggingface.co or path on disk to load model from.
output Path indicating the directory where to store the generated ONNX model.
Optional arguments:
--task TASK The task to export the model for. If not specified, the task will be auto-inferred from the model's metadata or files.
For decoder models, use `xxx-with-past` to export the model using past key values in the decoder.Available tasks depend
on the model, but are among the following list: ['audio-classification', 'audio-frame-classification', 'audio-xvector',
'automatic-speech-recognition', 'depth-estimation', 'document-question-answering', 'feature-extraction', 'fill-mask',
'image-classification', 'image-segmentation', 'image-text-to-text', 'image-to-image', 'image-to-text', 'inpainting',
'keypoint-detection', 'mask-generation', 'masked-im', 'multiple-choice', 'object-detection', 'question-answering',
'reinforcement-learning', 'semantic-segmentation', 'sentence-similarity', 'text-classification', 'text-generation',
'text-to-audio', 'text-to-image', 'text2text-generation', 'time-series-forecasting', 'token-classification', 'visual-
question-answering', 'zero-shot-image-classification', 'zero-shot-object-detection'].
--opset OPSET If specified, ONNX opset version to export the model with. Otherwise, the default opset for the given model architecture
will be used.
--device DEVICE The device to use to do the export. Defaults to "cpu".
--dtype {fp32,fp16,bf16}
The floating point precision to use for the export. Supported options: fp32 (float32), fp16 (float16), bf16 (bfloat16).
--optimize {O1,O2,O3,O4}
Allows to run ONNX Runtime optimizations directly during the export. Some of these optimizations are specific to ONNX
Runtime, and the resulting ONNX will not be usable with other runtime as OpenVINO or TensorRT. Possible options: - O1:
Basic general optimizations - O2: Basic and extended general optimizations, transformers-specific fusions - O3: Same as
O2 with GELU approximation - O4: Same as O3 with mixed precision (fp16, GPU-only, requires `--device cuda`)
--monolith Forces to export the model as a single ONNX file. By default, the ONNX exporter may break the model in several ONNX
files, for example for encoder-decoder models where the encoder should be run only once while the decoder is looped over.
--no-post-process Allows to disable any post-processing done by default on the exported ONNX models. For example, the merging of decoder
and decoder-with-past models into a single ONNX model file to reduce memory usage.
--variant VARIANT Select a variant of the model to export.
--framework {pt} The framework to use for the ONNX export. If not provided, will attempt to use the local checkpoint's original framework
or what is available in the environment.
--atol ATOL If specified, the absolute difference tolerance when validating the model. Otherwise, the default atol for the model will
be used.
--cache_dir CACHE_DIR
Path indicating where to store cache.
--trust-remote-code Allows to use custom code for the modeling hosted in the model repository. This option should only be set for
repositories you trust and in which you have read the code, as it will execute on your local machine arbitrary code
present in the model repository.
--pad_token_id PAD_TOKEN_ID
This is needed by some models, for some tasks. If not provided, will attempt to use the tokenizer to guess it.
--library-name {transformers,diffusers,timm,sentence_transformers}
The library on the model. If not provided, will attempt to infer the local checkpoint's library
--model-kwargs MODEL_KWARGS
Any kwargs passed to the model forward, or used to customize the export for a given model.
--no-dynamic-axes Disable dynamic axes during ONNX export
--no-constant-folding
PyTorch-only argument. Disables PyTorch ONNX export constant folding.
--slim Enables onnxslim optimization.
Input shapes (if necessary, this allows to override the shapes of the input given to the ONNX exporter, that requires an example input).:
--batch_size BATCH_SIZE
Text tasks only. Batch size to use in the example input given to the ONNX export.
--sequence_length SEQUENCE_LENGTH
Text tasks only. Sequence length to use in the example input given to the ONNX export.
--num_choices NUM_CHOICES
Text tasks only. Num choices to use in the example input given to the ONNX export.
--width WIDTH Image tasks only. Width to use in the example input given to the ONNX export.
--height HEIGHT Image tasks only. Height to use in the example input given to the ONNX export.
--num_channels NUM_CHANNELS
Image tasks only. Number of channels to use in the example input given to the ONNX export.
--feature_size FEATURE_SIZE
Audio tasks only. Feature size to use in the example input given to the ONNX export.
--nb_max_frames NB_MAX_FRAMES
Audio tasks only. Maximum number of frames to use in the example input given to the ONNX export.
--audio_sequence_length AUDIO_SEQUENCE_LENGTH
Audio tasks only. Audio sequence length to use in the example input given to the ONNX export.
--point_batch_size POINT_BATCH_SIZE
For Segment Anything. It corresponds to how many segmentation masks we want the model to predict per input point.
--nb_points_per_image NB_POINTS_PER_IMAGE
For Segment Anything. It corresponds to the number of points per segmentation masks.
--visual_seq_length VISUAL_SEQ_LENGTH
Visual sequence length
real 0m0,086s
user 0m0,082s
sys 0m0,004s
```
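For reference, a minimal export needs only the model and an output directory; the sketch below sticks to flags shown in the help above (the model ID, opset value, and output path are placeholders, not taken from this PR):

```bash
# Export a Hub model to ONNX, letting the exporter infer the task.
optimum-cli export onnx --model distilbert-base-uncased distilbert_onnx/

# The same export, pinning the task and opset explicitly.
optimum-cli export onnx \
  --model distilbert-base-uncased \
  --task text-classification \
  --opset 17 \
  distilbert_onnx/
```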
README.md (+1 −1):

```diff
@@ -38,7 +38,7 @@ For more information on the ONNX export, please check the [documentation](https:
 
 #### Inference
 
-Once the model is exported to the ONNX format, we provide Python classes enabling you to run the exported ONNX model in a seemless manner using [ONNX Runtime](https://onnxruntime.ai/) in the backend:
+Once the model is exported to the ONNX format, we provide Python classes enabling you to run the exported ONNX model in a seamless manner using [ONNX Runtime](https://onnxruntime.ai/) in the backend:
```
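The sentence being corrected refers to the ORT wrapper classes; as a rough, hedged sketch of that usage (the task-specific class, checkpoint directory, and example text are illustrative assumptions, not part of this diff):

```python
# Minimal sketch: run an exported ONNX model through the optimum ORT classes.
from transformers import AutoTokenizer, pipeline
from optimum.onnxruntime import ORTModelForSequenceClassification

# Directory produced by `optimum-cli export onnx` (placeholder path).
model = ORTModelForSequenceClassification.from_pretrained("distilbert_onnx/")
tokenizer = AutoTokenizer.from_pretrained("distilbert_onnx/")

# The ORT model plugs into the regular transformers pipeline API.
classifier = pipeline("text-classification", model=model, tokenizer=tokenizer)
print(classifier("ONNX Runtime makes inference fast."))
```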
docs/source/onnx/usage_guides/export_a_model.mdx (+1 −1):

```diff
@@ -317,7 +317,7 @@ For tasks that require only a single ONNX file (e.g. encoder-only), an exported
 
 ### Customize the export of Transformers models with custom modeling
 
-Optimum supports the export of Transformers models with custom modeling that use [`trust_remote_code=True`](https://huggingface.co/docs/transformers/en/model_doc/auto#transformers.AutoModel.from_pretrained.trust_remote_code), not officially supported in the Transormers library but usable with its functionality as [pipelines](https://huggingface.co/docs/transformers/main_classes/pipelines) and [generation](https://huggingface.co/docs/transformers/main_classes/text_generation#transformers.GenerationMixin.generate).
+Optimum supports the export of Transformers models with custom modeling that use [`trust_remote_code=True`](https://huggingface.co/docs/transformers/en/model_doc/auto#transformers.AutoModel.from_pretrained.trust_remote_code), not officially supported in the Transformers library but usable with its functionality as [pipelines](https://huggingface.co/docs/transformers/main_classes/pipelines) and [generation](https://huggingface.co/docs/transformers/main_classes/text_generation#transformers.GenerationMixin.generate).
 
 Examples of such models are [THUDM/chatglm2-6b](https://huggingface.co/THUDM/chatglm2-6b) and [mosaicml/mpt-30b](https://huggingface.co/mosaicml/mpt-30b).
```
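For context, exporting one of the custom-modeling checkpoints named in that paragraph would look roughly like this on the CLI; `--trust-remote-code` and `--task` appear in the help output above, while the task value and output path are assumptions for this model:

```bash
# Sketch: export a checkpoint whose modeling code lives in the model repo.
optimum-cli export onnx \
  --model THUDM/chatglm2-6b \
  --task text-generation-with-past \
  --trust-remote-code \
  chatglm2_onnx/
```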
docs/source/onnxruntime/usage_guides/gpu.mdx (+1 −1):

```diff
@@ -126,7 +126,7 @@ Due to current limitations in ONNX Runtime, it is not possible to use quantized
 
 [IOBinding](https://onnxruntime.ai/docs/api/python/api_summary.html#iobinding) is an efficient way to avoid expensive data copying when using GPUs. By default, ONNX Runtime will copy the input from the CPU (even if the tensors are already copied to the targeted device), and assume that outputs also need to be copied back to the CPU from GPUs after the run. These data copying overheads between the host and devices are expensive, and __can lead to worse inference latency than vanilla PyTorch__ especially for the decoding process.
 
-To avoid the slowdown, 🤗 Optimum adopts the IOBinding to copy inputs onto GPUs and pre-allocate memory for outputs prior the inference. When instanciating the `ORTModel`, set the value of the argument `use_io_binding` to choose whether to turn on the IOBinding during the inference. `use_io_binding` is set to `True` by default, if you choose CUDA as execution provider.
+To avoid the slowdown, 🤗 Optimum adopts the IOBinding to copy inputs onto GPUs and pre-allocate memory for outputs prior the inference. When instantiating the `ORTModel`, set the value of the argument `use_io_binding` to choose whether to turn on the IOBinding during the inference. `use_io_binding` is set to `True` by default, if you choose CUDA as execution provider.
```
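A minimal sketch of the `use_io_binding` argument that sentence describes; the class choice, checkpoint directory, and provider string are illustrative assumptions rather than part of this diff:

```python
from optimum.onnxruntime import ORTModelForCausalLM

model = ORTModelForCausalLM.from_pretrained(
    "gpt2_onnx/",                      # directory with the exported ONNX files (placeholder)
    provider="CUDAExecutionProvider",  # run inference on GPU with ONNX Runtime
    use_io_binding=True,               # the default when the CUDA provider is selected
)
```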
"The task to export the model for. If not specified, the task will be auto-inferred based on the model. Available tasks depend on the model, but are among:"
48
-
f" {TasksManager.get_all_tasks()}. For decoder models, use `xxx-with-past` to export the model using past key values in the decoder."
51
+
"The task to export the model for. If not specified, the task will be auto-inferred from the model's metadata or files. "
52
+
"For tasks that generate text, add the `xxx-with-past` suffix to export the model using past key values caching. "
53
+
f"Available tasks depend on the model, but are among the following list: {ALL_TASKS}."
49
54
),
50
55
)
51
56
optional_group.add_argument(
@@ -107,12 +112,8 @@ def parse_args_onnx(parser):
107
112
"--framework",
108
113
type=str,
109
114
choices=["pt"],
110
-
default=None,
111
-
help=(
112
-
"The framework to use for the ONNX export."
113
-
" If not provided, will attempt to use the local checkpoint's original framework"
114
-
" or what is available in the environment."
115
-
),
115
+
default="pt",
116
+
help="The framework to use for the export. Defaults to 'pt' for PyTorch.",