diff --git a/docs/source/backends-qualcomm.md b/docs/source/backends-qualcomm.md
index 74089885fcf..f401cfe0658 100644
--- a/docs/source/backends-qualcomm.md
+++ b/docs/source/backends-qualcomm.md
@@ -286,6 +286,13 @@ The model, inputs, and output location are passed to `qnn_executorch_runner` by
-Please refer to `$EXECUTORCH_ROOT/examples/qualcomm/scripts/` and `$EXECUTORCH_ROOT/examples/qualcomm/oss_scripts/` to the list of supported models.
+Please refer to `$EXECUTORCH_ROOT/examples/qualcomm/scripts/` and `$EXECUTORCH_ROOT/examples/qualcomm/oss_scripts/` for the list of supported models.
 
+Each script demonstrates:
+- Model export (`torch.export`)
+- Quantization (PTQ/QAT)
+- Lowering and compilation to the QNN delegate
+- Deployment on device or on the HTP emulator
+
 ## How to Support a Custom Model in HTP Backend
 
 ### Step-by-Step Implementation Guide
@@ -389,12 +396,86 @@ with open(model_name, "wb") as f:
 print(f"Model successfully exported to {model_name}")
 ```
 
-## What is coming?
+## Deep Dive
+
+### Partitioner API
+
+The **QnnPartitioner** identifies and groups supported subgraphs for execution on the QNN backend.
+It uses `QnnOperatorSupport` to check node-level compatibility with the Qualcomm backend via QNN SDK APIs.
+
+The partitioner tags supported nodes with a `delegation_tag` and handles constants, buffers, and mutable states appropriately.
+Please check out [QnnPartitioner](https://github.com/pytorch/executorch/blob/main/backends/qualcomm/partition/qnn_partitioner.py#L125) for the latest changes. Its constructor takes the following four arguments, of which only the compiler specs are required:
+```python
+class QnnPartitioner(Partitioner):
+    """
+    QnnPartitioner identifies subgraphs that can be lowered to the QNN backend by tagging nodes for delegation,
+    and manages special cases such as mutable buffers and consumed constants.
+    """
+
+    def __init__(
+        self,
+        compiler_specs: List[CompileSpec],
+        skip_node_id_set: set = None,
+        skip_node_op_set: set = None,
+        skip_mutable_buffer: bool = False,
+    ):
+        ...
+```
+A usage sketch that combines the partitioner with the quantizer follows the Quantization section below.
+
+### Quantization
+Quantization in the QNN backend supports multiple bit-widths and both post-training quantization (PTQ) and quantization-aware training (QAT).
+The `QnnQuantizer` defines quantization configurations and annotations compatible with Qualcomm hardware.
+
+Supported schemes include:
+- 8a8w (default)
+- 16a16w
+- 16a8w
+- 16a4w
+- 16a4w_block
+
+Highlights:
+- `QuantDtype` enumerates bit-width combinations for activations and weights.
+- `ModuleQConfig` manages per-layer quantization behavior and observers.
+- `QnnQuantizer` integrates with the PT2E prepare/convert flow to annotate and quantize models.
+
-  - Improve the performance for llama3-8B-Instruct and support batch prefill.
-  - We will support pre-compiled binaries from [Qualcomm AI Hub](https://aihub.qualcomm.com/).
+Supported features:
+- Per-channel and per-block quantization
+- Custom quantization annotation via `custom_quant_annotations`
+- Skipping specific nodes or ops
+- Per-module customization via `submodule_qconfig_list`
+
+For details, see `backends/qualcomm/quantizer/quantizer.py`.
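+
+The following is a minimal sketch of how these pieces compose: PT2E quantization with `QnnQuantizer`, then lowering with `QnnPartitioner`. It is illustrative rather than canonical — the compile-spec helpers (`generate_htp_compiler_spec`, `generate_qnn_executorch_compiler_spec`, `QcomChipset`) follow the tutorial earlier on this page, and exact import paths may shift between ExecuTorch releases.
+```python
+import torch
+from executorch.backends.qualcomm.partition.qnn_partitioner import QnnPartitioner
+from executorch.backends.qualcomm.quantizer.quantizer import QnnQuantizer
+from executorch.backends.qualcomm.serialization.qc_schema import QcomChipset
+from executorch.backends.qualcomm.utils.utils import (
+    generate_htp_compiler_spec,
+    generate_qnn_executorch_compiler_spec,
+)
+from executorch.exir import to_edge_transform_and_lower
+from torch.ao.quantization.quantize_pt2e import convert_pt2e, prepare_pt2e
+
+class SmallModel(torch.nn.Module):  # toy model for illustration
+    def forward(self, x):
+        return torch.relu(x + 1.0)
+
+model, inputs = SmallModel().eval(), (torch.randn(1, 8),)
+
+# 1. Quantize with the PT2E flow (PTQ shown; QnnQuantizer defaults to 8a8w).
+quantizer = QnnQuantizer()
+prepared = prepare_pt2e(torch.export.export(model, inputs).module(), quantizer)
+prepared(*inputs)  # calibrate with representative inputs
+quantized = convert_pt2e(prepared)
+
+# 2. Lower to the QNN delegate; only the compiler specs are required by QnnPartitioner.
+compiler_specs = generate_qnn_executorch_compiler_spec(
+    soc_model=QcomChipset.SM8650,  # pick the SoC you deploy to
+    backend_options=generate_htp_compiler_spec(use_fp16=False),
+)
+edge = to_edge_transform_and_lower(
+    torch.export.export(quantized, inputs),
+    partitioner=[QnnPartitioner(compiler_specs)],
+)
+
+with open("small_model.pte", "wb") as f:
+    f.write(edge.to_executorch().buffer)
+```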
+
+### Operator Support
+
+[The full operator support matrix](https://github.com/pytorch/executorch/tree/f32cdc3de6f7176d70a80228f1a60bcd45d93437/backends/qualcomm/builders#operator-support-status) is tracked and frequently updated in the ExecuTorch repository.
+
+It lists:
+- Supported PyTorch ops (`aten.*`, custom ops)
+- Planned ops
+- Deprecated ops
+
+This matrix corresponds directly to the implementations in [executorch/backends/qualcomm/builders](https://github.com/pytorch/executorch/tree/main/backends/qualcomm/builders).
+
+### Custom Ops Support
+
+You can extend the QNN backend to support your own operators by following the [tutorial](https://github.com/pytorch/executorch/tree/f32cdc3de6f7176d70a80228f1a60bcd45d93437/examples/qualcomm/custom_op#custom-operator-support).
+
+It covers:
+- Writing a new `NodeVisitor` for your op
+- Registering it via `@register_node_visitor`
+- Creating and linking `libQnnOp*.so` for the delegate
+- Testing and verifying custom kernels on HTP
 
 ## FAQ
-If you encounter any issues while reproducing the tutorial, please file a github [issue](https://github.com/pytorch/executorch/issues) on ExecuTorch repo and tag use `#qcom_aisw` tag
+If you encounter any issues while reproducing the tutorial, please file a GitHub [issue](https://github.com/pytorch/executorch/issues) in the ExecuTorch repo and tag it with `#qcom_aisw`.
+
+### Debugging tips
+- Before trying any complicated models, try out [a simple model example](https://github.com/pytorch/executorch/tree/f32cdc3de6f7176d70a80228f1a60bcd45d93437/examples/qualcomm#simple-examples-to-verify-the-backend-is-working) and see if it works on device (a minimal sanity-check sketch is shown below).
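+
+Before touching a device, it can also help to confirm locally that anything is being delegated at all. Below is a hypothetical sketch in that spirit (not taken from the examples linked above); the compile-spec helpers are used as in the quantize-and-lower sketch earlier on this page, and import paths may differ between releases.
+```python
+import torch
+from executorch.backends.qualcomm.partition.qnn_partitioner import QnnPartitioner
+from executorch.backends.qualcomm.serialization.qc_schema import QcomChipset
+from executorch.backends.qualcomm.utils.utils import (
+    generate_htp_compiler_spec,
+    generate_qnn_executorch_compiler_spec,
+)
+from executorch.exir import to_edge_transform_and_lower
+
+class AddRelu(torch.nn.Module):  # trivial model, easy to delegate fully
+    def forward(self, x):
+        return torch.relu(x + 1.0)
+
+# fp16 on HTP; no quantization needed for this smoke test
+compiler_specs = generate_qnn_executorch_compiler_spec(
+    soc_model=QcomChipset.SM8650,
+    backend_options=generate_htp_compiler_spec(use_fp16=True),
+)
+ep = torch.export.export(AddRelu().eval(), (torch.randn(1, 8),))
+edge = to_edge_transform_and_lower(ep, partitioner=[QnnPartitioner(compiler_specs)])
+
+# If partitioning succeeded, the lowered graph contains executorch_call_delegate calls.
+delegated = [
+    n for n in edge.exported_program().graph.nodes
+    if n.op == "call_function" and "executorch_call_delegate" in str(n.target)
+]
+print(f"delegated subgraphs: {len(delegated)}")  # expect >= 1 if the backend is working
+```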