# Exporting LLMs with Hugging Face's Optimum ExecuTorch

[Optimum ExecuTorch](https://github.com/huggingface/optimum-executorch) provides a streamlined way to export Hugging Face transformer models to ExecuTorch format. It offers seamless integration with the Hugging Face ecosystem, making it easy to export models directly from the Hugging Face Hub.
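
As a quick sketch of that flow, the snippet below downloads a model from the Hub and lowers it to ExecuTorch in a single call. It assumes the `recipe` keyword and the XNNPACK recipe name shown in the optimum-executorch README; the model ID is only an example:

```python
from optimum.executorch import ExecuTorchModelForCausalLM

# Download the checkpoint from the Hugging Face Hub and lower it to an
# ExecuTorch program in one step. "xnnpack" targets the XNNPACK CPU backend.
model = ExecuTorchModelForCausalLM.from_pretrained(
    "HuggingFaceTB/SmolLM2-135M-Instruct",  # example model ID
    recipe="xnnpack",
)
```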

## Overview

Optimum ExecuTorch supports a much wider variety of model architectures than ExecuTorch's native `export_llm` API. While `export_llm` focuses on a limited set of highly optimized models (Llama, Qwen, Phi, and SmolLM) with advanced features like SpinQuant and attention sink, Optimum ExecuTorch can export diverse architectures including Gemma, Mistral, GPT-2, BERT, T5, Whisper, Voxtral, and many others.

### Use Optimum ExecuTorch when:
- You need to export models beyond the limited set supported by `export_llm`

For detailed examples of exporting each model type, see the [Optimum ExecuTorch](https://github.com/huggingface/optimum-executorch) repository.

## Running Exported Models

### Verifying Output with Python

After exporting, you can verify the model's output in Python before deploying to device, using the classes from `modeling.py`, such as `ExecuTorchModelForCausalLM` for LLMs:

```python
from optimum.executorch import ExecuTorchModelForCausalLM
from transformers import AutoTokenizer

# Load the exported model
model = ExecuTorchModelForCausalLM.from_pretrained("./smollm2_exported")
tokenizer = AutoTokenizer.from_pretrained("HuggingFaceTB/SmolLM2-135M-Instruct")

# Generate text
generated_text = model.text_generation(
    tokenizer=tokenizer,
    prompt="Once upon a time",
    max_seq_len=128,
)
print(generated_text)
```

### Running on Device

After verifying your model works correctly, deploy it to device:

- [Running with C++](run-with-c-plus-plus.md) - Run exported models using ExecuTorch's C++ runtime
- [Running on Android](https://github.com/meta-pytorch/executorch-examples/tree/main/llm/android) - Deploy to Android devices