# ExecuTorch Optimum Module

This module provides ExecuTorch-specific optimizations and integrations for transformer models. It focuses on runtime-specific features that are not available in the upstream transformers or optimum-executorch libraries.

## Overview

This streamlined module contains only ExecuTorch-specific components:

- Custom cache implementations optimized for the ExecuTorch runtime
- Custom SDPA implementations for ExecuTorch operators
- XNNPACK backend integration and optimization passes
- ExecuTorch-specific utilities

For general model export functionality, use `optimum-executorch`, which provides a comprehensive recipe system and CLI interface.

## Key Components

### Custom Cache Implementations

#### `ETCustomStaticCache` and `ETCustomHybridCache`
Custom KV cache implementations that inherit from Hugging Face's caches but use ExecuTorch's `CustomKVCache` and `CustomRingKVCache` for optimal runtime performance.

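The ring-buffer mechanism behind `CustomRingKVCache` can be sketched in plain Python. This is an illustrative toy class, not the ExecuTorch API: once the fixed-size buffer fills, new entries overwrite the oldest ones, which bounds cache memory for sliding-window attention.

```python
# Hypothetical sketch of a ring-buffer KV cache (not the real ExecuTorch class).
class RingKVCache:
    def __init__(self, window_size):
        self.window_size = window_size
        self.keys = [None] * window_size
        self.values = [None] * window_size
        self.pos = 0      # next logical write position
        self.filled = 0   # number of valid entries

    def update(self, key, value):
        # Write into the ring slot, wrapping around when the buffer is full.
        slot = self.pos % self.window_size
        self.keys[slot] = key
        self.values[slot] = value
        self.pos += 1
        self.filled = min(self.filled + 1, self.window_size)

    def get(self):
        # Return the valid entries in oldest-to-newest order.
        start = self.pos - self.filled
        order = [i % self.window_size for i in range(start, self.pos)]
        return ([self.keys[i] for i in order],
                [self.values[i] for i in order])

cache = RingKVCache(window_size=3)
for t in range(5):
    cache.update(f"k{t}", f"v{t}")
keys, values = cache.get()
print(keys)  # oldest two tokens were evicted → ['k2', 'k3', 'k4']
```

The real implementations additionally pre-allocate tensors with static shapes so the cache update is compatible with `torch.export`.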
### Custom SDPA

#### `get_custom_sdpa_for_ring_kv_cache`
Custom Scaled Dot-Product Attention implementation optimized for ExecuTorch's ring buffer caches and sliding window attention.

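The masking pattern such an SDPA variant supports can be illustrated in pure Python (a toy sketch, not the ExecuTorch operator): position `i` attends only to the last `window` positions up to and including itself.

```python
# Toy scaled dot-product attention with a causal sliding-window mask.
import math

def sdpa_sliding_window(q, k, v, window):
    d = len(q[0])
    out = []
    for i, qi in enumerate(q):
        # Causal sliding window: the last `window` positions up to i.
        lo = max(0, i - window + 1)
        scores = [sum(qi[t] * k[j][t] for t in range(d)) / math.sqrt(d)
                  for j in range(lo, i + 1)]
        # Numerically stable softmax over the unmasked scores.
        m = max(scores)
        exps = [math.exp(s - m) for s in scores]
        z = sum(exps)
        weights = [e / z for e in exps]
        # Weighted sum of the corresponding values.
        out.append([sum(w * v[j][t] for w, j in zip(weights, range(lo, i + 1)))
                    for t in range(d)])
    return out

q = k = v = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
out = sdpa_sliding_window(q, k, v, window=2)
print(out[0])  # only one unmasked position, so row 0 equals v[0] → [1.0, 0.0]
```

A ring-buffer cache only ever holds the positions inside this window, so the custom SDPA and the cache implement the same attention pattern from two sides.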
### XNNPACK Integration

#### `export_to_executorch_with_xnnpack`
ExecuTorch-specific XNNPACK backend integration with custom optimization passes:
- `RemovePaddingIdxEmbeddingPass`: removes `padding_idx` from embedding operations
- Memory planning and quantization optimizations
- Backend delegation analysis and debugging

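Conceptually, a pass like `RemovePaddingIdxEmbeddingPass` walks the graph and strips an argument the backend does not need. The sketch below uses a made-up dictionary IR rather than the real ExecuTorch pass infrastructure, purely to show the shape of such a rewrite:

```python
# Toy graph pass over a hypothetical dict-based IR (not the real pass API).
def remove_padding_idx(graph):
    for node in graph:
        # Drop padding_idx from embedding ops before lowering to the backend.
        if node["op"] == "embedding" and "padding_idx" in node["args"]:
            node["args"].pop("padding_idx")
    return graph

graph = [
    {"op": "embedding", "args": {"weight": "W", "padding_idx": 0}},
    {"op": "linear", "args": {"weight": "W2"}},
]
graph = remove_padding_idx(graph)
print(graph[0]["args"])  # → {'weight': 'W'}
```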
### Utilities

- `save_config_to_constant_methods`: ExecuTorch-specific configuration utilities
- Model metadata extraction for runtime optimization

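The idea behind baking configuration into constant methods can be sketched as follows. The helper name, key names, and signature here are hypothetical, not the real `save_config_to_constant_methods` API: the point is that selected config values become named constants embedded in the exported program, so the runtime can query them without the Hugging Face config.

```python
# Hypothetical sketch: flatten a model config into constant-method metadata.
def config_to_constant_methods(config, extra=None):
    metadata = {
        "get_vocab_size": config.get("vocab_size"),
        "get_max_seq_len": config.get("max_position_embeddings"),
        "get_bos_id": config.get("bos_token_id"),
        "get_eos_id": config.get("eos_token_id"),
    }
    metadata.update(extra or {})
    # Keep only the entries the config actually defines.
    return {k: v for k, v in metadata.items() if v is not None}

config = {"vocab_size": 262144, "max_position_embeddings": 8192,
          "eos_token_id": 1}
print(config_to_constant_methods(config, {"use_kv_cache": True}))
```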
## Usage

For multimodal model export, use optimum-executorch:

```bash
# Export with the optimum-executorch CLI
optimum-cli export executorch \
  --model google/gemma-3-4b-it \
  --task image-text-to-text \
  --recipe xnnpack \
  --use_custom_sdpa \
  --use_custom_kv_cache
```

```python
# Or via the Python API
from optimum.executorch import ExecuTorchModelForCausalLM

model = ExecuTorchModelForCausalLM.from_pretrained(
    "google/gemma-3-4b-it",
    task="image-text-to-text",
    recipe="xnnpack",
    use_custom_sdpa=True,
    use_custom_kv_cache=True,
)
```

For ExecuTorch-specific XNNPACK optimizations:

```python
from optimum.exporters.executorch.integrations import ImageTextToTextExportableModule
from executorch.extension.llm.optimum.xnnpack import export_to_executorch_with_xnnpack

# Wrap a model loaded beforehand (e.g., via optimum-executorch)
module = ImageTextToTextExportableModule(model, use_custom_kv_cache=True, use_custom_sdpa=True)

# Apply ExecuTorch-specific XNNPACK optimizations
executorch_program = export_to_executorch_with_xnnpack(module)
```

## Architecture

This module follows the recommended approach:
1. **General export functionality**: use `optimum-executorch`
2. **Multimodal support**: use the enhanced `transformers.integrations.executorch`
3. **ExecuTorch-specific optimizations**: use this module

This separation ensures:
- No code duplication between repositories
- Reuse of mature optimum-executorch infrastructure
- A focus on runtime-specific optimizations in this module
- A unified user experience through the optimum-executorch CLI/API

## Testing

Run tests with: