|
| 1 | +# Python Bindings for MultimodalRunner |
| 2 | + |
| 3 | +## Overview |
| 4 | + |
| 5 | +This project provides Python bindings for the ExecuTorch MultimodalRunner, so Python developers can run the multimodal LLM runner on mixed inputs (text, images, audio) and generate text output. |
| 6 | + |
| 7 | +## Architecture |
| 8 | + |
| 9 | +The MultimodalRunner targets large language models that accept multimodal inputs and generate text outputs. It supports models such as: |
| 10 | +- LLaVA (vision-language models) |
| 11 | +- CLIP-based models |
| 12 | +- Speech-to-text models |
| 13 | +- Other multimodal transformers |
| 14 | + |
| 15 | +### Key Components |
| 16 | + |
| 17 | +1. **MultimodalRunner** - Main runner class for multimodal inference |
| 18 | +2. **MultimodalInput** - Handles different input modalities (text, image, audio) |
| 19 | +3. **GenerationConfig** - Configuration for text generation parameters |
| 20 | +4. **Stats** - Performance monitoring and statistics |
| 21 | +5. **Tokenizer** - Text tokenization and decoding |
| 22 | + |
| 23 | +## Project Structure |
| 24 | + |
```
extension/llm/runner/
├── multimodal_runner_pybindings.cpp   # Python bindings implementation (NEW)
├── __init__.py                        # Python package initialization (NEW)
├── multimodal_runner.py               # Python wrapper classes (NEW)
├── utils.py                           # Utility functions (NEW)
├── CMakeLists.txt                     # Existing - update to include Python bindings
└── test/
    ├── test_multimodal_runner.py      # Unit tests for Python bindings (NEW)
    ├── test_generation.py             # Generation tests (NEW)
    └── [existing test files]          # Existing C++ tests remain here
```
| 37 | + |
| 38 | +Note: We'll reuse the root-level `setup.py` and update the existing `CMakeLists.txt` rather than creating new ones. |
| 39 | + |
| 40 | +## Action Items |
| 41 | + |
| 42 | +### 1. Core Implementation Tasks |
| 43 | + |
| 44 | +#### High Priority |
| 45 | +- [x] ~~**Create Python bindings file** (`multimodal_runner_pybindings.cpp`)~~ |
| 46 | + - [x] ~~Bind MultimodalRunner class~~ |
| 47 | + - [x] ~~Bind MultimodalInput and helper functions~~ |
| 48 | + - [x] ~~Bind GenerationConfig struct~~ |
| 49 | + - [x] ~~Bind Stats class for performance monitoring~~ |
| 50 | + - [x] ~~Implement error handling and exception translation~~ |
| 51 | + |
| 52 | +#### Medium Priority |
| 53 | +- [x] ~~**Update existing CMakeLists.txt** in `extension/llm/runner/`~~ |
| 54 | + - [x] ~~Add Python bindings target when EXECUTORCH_BUILD_PYBIND is enabled~~ |
| 55 | + - [x] ~~Configure pybind11 integration~~ |
| 56 | + - [x] ~~Link with extension_llm_runner library~~ |
| 57 | + - [x] ~~Handle tokenizers dependency~~ |
| 58 | + - [x] ~~Set up proper include paths~~ |
| 59 | + |
| 60 | +- [x] ~~**Update root-level setup.py**~~ |
| 61 | + - [x] ~~Add multimodal_runner to the extensions list~~ |
| 62 | + - [x] ~~Ensure proper build configuration~~ |
| 63 | + - [x] ~~Handle platform-specific configurations~~ |
| 64 | + |
| 65 | +#### Low Priority |
| 66 | +- [x] ~~**Create Python wrapper files** in `extension/llm/runner/`~~ |
| 67 | + - [x] ~~`__init__.py` - Package initialization~~ |
| 68 | + - [x] ~~`multimodal_runner.py` - High-level Python API~~ |
| 69 | + - [x] ~~`utils.py` - Utility functions for input preprocessing~~ |
| 70 | + |
| 71 | +### 2. Build System Integration |
| 72 | + |
| 73 | +- [ ] **Integrate with main CMake build** |
| 74 | + - [ ] Add Python bindings compilation when EXECUTORCH_BUILD_PYBIND is enabled |
| 75 | + - [ ] Update extension/llm/runner/CMakeLists.txt to build multimodal_runner_pybindings.cpp |
| 76 | + - [ ] Ensure proper dependency resolution |
| 77 | + |
| 78 | +- [ ] **Handle dependencies** |
| 79 | + - [ ] Link against existing tokenizers Python bindings |
| 80 | + - [ ] Ensure Module and other dependencies are available |
| 81 | + - [ ] Handle pybind11 version requirements |
| 82 | + |
| 83 | +### 3. Input/Output Handling |
| 84 | + |
| 85 | +- [ ] **Implement MultimodalInput Python bindings** |
| 86 | + - [ ] Support for text inputs |
| 87 | + - [ ] Support for image inputs (numpy arrays, PIL Images) |
| 88 | + - [ ] Support for audio inputs (if applicable) |
| 89 | + - [ ] Mixed input ordering support |
| 90 | + |
| 91 | +- [ ] **Implement callbacks** |
| 92 | + - [ ] Token generation callback |
| 93 | + - [ ] Statistics callback |
| 94 | + - [ ] Progress reporting |
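
As a rough sketch of how the token and statistics callbacks listed above might be consumed from Python: `fake_generate` below is a hypothetical stand-in for the bound `generate` call (no ExecuTorch code is involved), and `TokenCollector` and the stats dictionary keys are illustrative assumptions, not part of the planned API.

```python
import time
from typing import Callable, Dict, List, Optional


class TokenCollector:
    """Accumulates streamed token strings (illustrative helper)."""

    def __init__(self) -> None:
        self.pieces: List[str] = []

    def on_token(self, token: str) -> None:
        self.pieces.append(token)

    @property
    def text(self) -> str:
        return "".join(self.pieces)


def fake_generate(
    prompt: str,
    token_callback: Callable[[str], None],
    stats_callback: Optional[Callable[[Dict[str, float]], None]] = None,
) -> None:
    """Hypothetical stand-in for generate(): echoes the prompt word by word."""
    start = time.perf_counter()
    words = prompt.split()
    for i, word in enumerate(words):
        # Re-attach the separating space so the joined text matches the prompt.
        token_callback(word if i == 0 else " " + word)
    elapsed = time.perf_counter() - start
    if stats_callback is not None:
        stats_callback({"num_tokens": float(len(words)), "elapsed_s": elapsed})


collector = TokenCollector()
reported: List[Dict[str, float]] = []
fake_generate("Describe this image", collector.on_token, reported.append)
print(collector.text)
```

Keeping the collector separate from the runner means the same callback object can back both streaming output and the simpler "return the whole string" path.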
| 95 | + |
| 96 | +### 4. Testing and Documentation |
| 97 | + |
| 98 | +- [ ] **Create comprehensive tests** |
| 99 | + - [ ] Unit tests for bindings |
| 100 | + - [ ] Integration tests with sample models |
| 101 | + - [ ] Performance benchmarks |
| 102 | + - [ ] Memory leak tests |
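
Unit tests for the Python wrapper layer could run without a model file by replacing the native runner with a mock. The sketch below assumes a hypothetical `generate_text` wrapper that mirrors the convenience call in the usage example, and assumes the native `generate` accepts a `token_callback` keyword; neither is confirmed by the bindings themselves.

```python
from unittest import mock


def generate_text(native_runner, inputs, config) -> str:
    """Hypothetical wrapper: collect streamed tokens into one string."""
    pieces = []
    native_runner.generate(inputs, config, token_callback=pieces.append)
    return "".join(pieces)


def test_generate_text_joins_streamed_tokens():
    native = mock.Mock()

    # Simulate the native runner streaming three tokens through the callback.
    def stream(inputs, config, token_callback):
        for tok in ["A", " cat", "."]:
            token_callback(tok)

    native.generate.side_effect = stream
    result = generate_text(native, inputs=["prompt"], config=None)

    assert result == "A cat."
    native.generate.assert_called_once()
```

Tests of this shape cover the Python-side logic only; integration tests with a real `.pte` model are still needed to exercise the bindings.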
| 103 | + |
| 104 | +- [ ] **Write documentation** |
| 105 | + - [ ] API documentation with examples |
| 106 | + - [ ] Installation guide |
| 107 | + - [ ] Usage tutorials |
| 108 | + - [ ] Model compatibility guide |
| 109 | + |
| 110 | +### 5. Example Scripts |
| 111 | + |
| 112 | +- [ ] **Create example scripts** |
| 113 | + - [ ] Basic text generation |
| 114 | + - [ ] Image + text (vision-language) example |
| 115 | + - [ ] Batch processing example |
| 116 | + - [ ] Streaming generation example |
| 117 | + |
| 118 | +## Installation Instructions |
| 119 | + |
| 120 | +### Prerequisites |
| 121 | + |
| 122 | +- Python >= 3.8 |
| 123 | +- CMake >= 3.18 |
| 124 | +- C++17 compatible compiler |
| 125 | +- PyTorch (for tensor operations) |
| 126 | +- pybind11 >= 2.6.0 |
| 127 | + |
| 128 | +### Building from Source |
| 129 | + |
```bash
# Clone the repository
git clone https://github.com/pytorch/executorch.git
cd executorch

# Install dependencies
pip install -r requirements.txt

# Build with Python bindings enabled (extra CMake options are passed via CMAKE_ARGS)
CMAKE_ARGS="-DEXECUTORCH_BUILD_PYBIND=ON" pip install .

# Or for development
pip install -e . --config-settings editable_mode=compat
```
| 144 | + |
| 145 | +### Running Tests |
| 146 | + |
```bash
# Run the multimodal runner Python tests
python -m pytest extension/llm/runner/test/test_multimodal_runner.py -v
```
| 151 | + |
| 152 | +## Usage Example |
| 153 | + |
```python
from executorch.extension.llm.runner import MultimodalRunner, GenerationConfig
from executorch.extension.llm.runner.utils import make_text_input, make_image_input
import numpy as np

# Initialize the runner
runner = MultimodalRunner(
    model_path="path/to/model.pte",
    tokenizer_path="path/to/tokenizer.bin"
)

# Create multimodal inputs
image_array = np.random.rand(224, 224, 3)  # Example image
inputs = [
    make_text_input("Describe this image:"),
    make_image_input(image_array)  # numpy array or PIL Image
]

# Configure generation
config = GenerationConfig(
    max_new_tokens=100,
    temperature=0.7,
    top_p=0.9
)

# Generate text with callbacks
def on_token(token):
    print(token, end='', flush=True)

def on_stats(stats):
    print(f"\nTokens/sec: {stats.tokens_per_second:.2f}")

runner.generate(inputs, config, token_callback=on_token, stats_callback=on_stats)

# Or simpler usage without callbacks
response = runner.generate_text(inputs, config)
print(response)
```
| 192 | + |
| 193 | +## Technical Considerations |
| 194 | + |
| 195 | +### Memory Management |
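| 196 | +- Make memory ownership explicit at the binding boundary: decide whether Python or C++ owns the runner, tokenizer, and input buffers |
| 197 | +- Hold C++ objects through `std::shared_ptr` or `std::unique_ptr` as appropriate so they are freed once the last reference is dropped |
| 198 | +- Release model and tokenizer resources in destructors |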
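
On the Python side, one way to make cleanup deterministic is a context manager that drops the reference to the native object. This is a minimal sketch under assumptions: `ManagedRunner` and `FakeNativeRunner` are hypothetical names, and the fake class only imitates a bound object by recording when it is finalized.

```python
class ManagedRunner:
    """Illustrative wrapper that releases a native handle deterministically."""

    def __init__(self, native_handle) -> None:
        self._native = native_handle

    def close(self) -> None:
        # Drop the reference so the underlying destructor can run once no
        # other Python references remain; safe to call more than once.
        self._native = None

    @property
    def closed(self) -> bool:
        return self._native is None

    def __enter__(self) -> "ManagedRunner":
        return self

    def __exit__(self, exc_type, exc, tb) -> None:
        self.close()


class FakeNativeRunner:
    """Stand-in for a bound C++ object; records when it is finalized."""

    destroyed = False

    def __del__(self) -> None:
        FakeNativeRunner.destroyed = True


with ManagedRunner(FakeNativeRunner()) as runner:
    pass  # inference calls would go here

print(runner.closed, FakeNativeRunner.destroyed)
```

Immediate finalization on `close()` relies on CPython's reference counting; other interpreters may run the finalizer later.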
| 199 | + |
| 200 | +### Threading and GIL |
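| 201 | +- Consider releasing the GIL during long-running operations such as generation, and re-acquire it before calling back into Python |
| 202 | +- Ensure thread safety for callbacks, which may run while other Python threads are active |
| 203 | +- Handle Python exceptions raised inside callbacks in the C++ code rather than letting them unwind through native frames |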
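
If generation releases the GIL, a caller can run it on a worker thread and consume tokens on its own thread. The sketch below shows one thread-safe hand-off using `queue.Queue`; `fake_generate` is a hypothetical stand-in for the blocking native call, and `stream_tokens` is an illustrative helper rather than a planned API.

```python
import queue
import threading
from typing import Callable, Iterator, List

_DONE = object()  # sentinel marking the end of the stream


def fake_generate(prompt: str, token_callback: Callable[[str], None]) -> None:
    """Hypothetical stand-in for a blocking native generate() call."""
    for word in prompt.split():
        token_callback(word)


def stream_tokens(prompt: str) -> Iterator[str]:
    """Run generation in a worker thread and yield tokens to the caller."""
    q: "queue.Queue[object]" = queue.Queue()
    errors: List[BaseException] = []

    def worker() -> None:
        try:
            # q.put is thread-safe, so it can serve directly as the callback.
            fake_generate(prompt, q.put)
        except BaseException as exc:  # surface worker failures to the consumer
            errors.append(exc)
        finally:
            q.put(_DONE)

    thread = threading.Thread(target=worker, daemon=True)
    thread.start()
    while True:
        item = q.get()
        if item is _DONE:
            break
        yield item
    thread.join()
    if errors:
        raise errors[0]


tokens = list(stream_tokens("a photo of a cat"))
print(tokens)
```

The sentinel is always enqueued in the `finally` clause, so the consumer loop terminates even when the worker raises.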
| 204 | + |
| 205 | +### Performance |
| 206 | +- Minimize data copying between Python and C++ |
| 207 | +- Use move semantics where possible |
| 208 | +- Consider zero-copy tensor operations |
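
The difference between a zero-copy view and a copy can be illustrated with the standard library alone. Python's buffer protocol, which pybind11 can also consume, lets two objects share one block of memory. The snippet below is only an illustration of that idea with `memoryview`; it does not show how the bindings handle image data.

```python
# A bytearray stands in for image bytes produced on the Python side.
pixels = bytearray(range(12))  # e.g. a 2x2 RGB image, 12 bytes

view = memoryview(pixels)      # zero-copy: shares the same underlying buffer
copied = bytes(pixels)         # copy: independent snapshot of the data

pixels[0] = 255                # mutate the original buffer

# The view observes the change; the copy does not.
print(view[0], copied[0])

# Slicing a memoryview is also zero-copy.
first_pixel = view[:3]
print(first_pixel.nbytes)
```

A view keeps the source buffer alive and exposes later mutations, so zero-copy inputs are only safe if the caller does not modify the data while inference is running.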
| 209 | + |
| 210 | +## Dependencies |
| 211 | + |
| 212 | +### Required |
| 213 | +- executorch core libraries |
| 214 | +- extension_llm_runner |
| 215 | +- tokenizers library |
| 216 | +- pybind11 |
| 217 | + |
| 218 | +### Optional |
| 219 | +- numpy (for array handling) |
| 220 | +- PIL/Pillow (for image processing) |
| 221 | +- torch (for tensor operations) |
| 222 | + |
| 223 | +## Contributing |
| 224 | + |
| 225 | +Please follow the ExecuTorch contribution guidelines. Key points: |
| 226 | +- Code should be formatted with clang-format |
| 227 | +- Python code should follow PEP 8 |
| 228 | +- Add comprehensive tests for new features |
| 229 | +- Update documentation as needed |
| 230 | + |
| 231 | +## License |
| 232 | + |
| 233 | +This project is licensed under the BSD-style license found in the LICENSE file in the root directory of the ExecuTorch repository. |
| 234 | + |
| 235 | +## Next Steps |
| 236 | + |
| 237 | +1. **Review and approve this plan** with the team |
| 238 | +2. **Start with core bindings** implementation |
| 239 | +3. **Test with existing models** (LLaVA, etc.) |
| 240 | +4. **Gather feedback** from early users |
| 241 | +5. **Iterate and improve** based on usage patterns |
| 242 | + |
| 243 | +## Questions for Discussion |
| 244 | + |
| 245 | +1. Should we support async generation? |
| 246 | +2. What level of integration with PyTorch tensors is needed? |
| 247 | +3. Should we provide pre-built wheels or source-only distribution? |
| 248 | +4. How should we handle model loading and caching? |
| 249 | +5. What additional utilities would be helpful for users? |