ElasticMM is an efficient and scalable serving system for large multimodal models (LMMs). It introduces Elastic Multimodal Parallelism (EMP), a new parallelization strategy that optimizes resource utilization and system throughput for both text-only and multimodal inference workloads, achieving up to 4.2× lower time-to-first-token (TTFT) and 3.2–4.5× higher throughput than vLLM.
- [2025/11] – Added V1 backend (beta) for next-generation scheduling and serving.
- [2025/10] – Open-sourced ElasticMM with full support for the vLLM V0 backend.
- [2025/09] – Our paper ElasticMM: Efficient Multimodal LLMs Serving with Elastic Multimodal Parallelism was accepted as an Oral presentation at NeurIPS 2025.
- Elastic Resource Allocation: Dynamic GPU allocation between text and multimodal workloads
- Hierarchical Scheduling: Two-level scheduling architecture for optimal resource management
- Modality-Aware Load Balancing: Intelligent load balancing based on workload patterns
- Real-time Auto-scaling: Automatic scaling based on demand and performance metrics
- Multi-GPU Support: Efficient utilization of multiple GPU instances
- OpenAI-Compatible API: Easy integration with existing applications
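Because the server speaks the OpenAI chat-completions protocol, requests can be built exactly as they would be for any OpenAI-compatible endpoint. The sketch below constructs such a request body for text-only and multimodal inputs; the model name and image URL are placeholders, not ElasticMM defaults.

```python
from typing import Optional

def build_chat_payload(model: str, prompt: str,
                       image_url: Optional[str] = None) -> dict:
    """Build an OpenAI-compatible /v1/chat/completions request body.

    With image_url=None this is a plain text-only request; otherwise the
    message carries a text part plus an image_url part, the standard
    multimodal content format.
    """
    content = [{"type": "text", "text": prompt}]
    if image_url is not None:
        content.append({"type": "image_url", "image_url": {"url": image_url}})
    return {
        "model": model,
        "messages": [{"role": "user", "content": content}],
        "max_tokens": 128,
    }

# A multimodal request (placeholder model/URL):
payload = build_chat_payload("my-lmm", "Describe this image.",
                             image_url="https://example.com/cat.png")
```

A payload like this can then be POSTed to the server's `/v1/chat/completions` route with any HTTP client or the official `openai` SDK.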
- Python: 3.8 or higher
- CUDA: 11.8 or higher (for GPU support)
- NCCL: For multi-GPU communication (usually included with PyTorch)
# Clone the repository
git clone https://github.com/CapitalLiu/ElasticMM.git
cd ElasticMM
# Install the package
pip install -e .
# Or install dependencies directly
pip install -r requirements.txt

Before running inference, we recommend calibrating the system parameters for your hardware configuration. This offline profiling step helps optimize performance for your specific GPU setup.
# Run calibration to profile your machine's parameters
python examples/calibrate_gain_cost.py

The calibration process will:
- Profile GPU memory and compute capabilities
- Measure KV cache transfer bandwidth
- Calculate optimal resource allocation parameters
- Generate configuration files for your hardware
Note: This step is optional but highly recommended for optimal performance.
For a basic example demonstrating ElasticMM's core functionality:
python examples/simple_usage.py

This example shows:
- Basic system initialization
- Request submission and processing
- Output collection and handling
For a complete online service demonstration with dynamic request generation:
python examples/demo_with_workload.py

This demo includes:
- Full system deployment with proxy and scheduler
- Dynamic request generation (text-only and multimodal)
- Real-time load balancing and auto-scaling
- Performance monitoring and statistics
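To illustrate the kind of demand-driven trigger the auto-scaling demo exercises, here is a minimal sketch of a scale-up decision based on tail latency and utilization. The signals and thresholds are assumptions for illustration, not ElasticMM's actual policy.

```python
def should_scale_up(ttft_p95_ms: float, slo_ms: float,
                    gpu_util: float, util_high: float = 0.85) -> bool:
    """Scale up when p95 TTFT breaches the latency SLO or GPUs run hot.

    Hypothetical policy: either signal alone is enough to trigger scaling,
    so the system reacts to latency regressions before saturation.
    """
    return ttft_p95_ms > slo_ms or gpu_util > util_high

# Latency SLO of 300 ms breached -> scale up even at low utilization.
print(should_scale_up(ttft_p95_ms=520.0, slo_ms=300.0, gpu_util=0.3))  # → True
```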
- Minimum: 8 GPUs
- Recommended: 8+ GPUs with high-bandwidth interconnects (NVLink/InfiniBand) for optimal performance
- Memory: Sufficient GPU memory for your target model (typically 20GB+ per GPU)
ElasticMM implements a hierarchical architecture with two main levels:
- Modality Level: Allocates GPU instances between text-only and multimodal workloads
- Stage Level: Manages encoding, prefill, and decoding stages within each modality group
The system automatically balances resources based on real-time demand and performance metrics, ensuring optimal utilization across different workload types.
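The modality-level idea can be sketched as splitting a fixed GPU pool between the text-only and multimodal groups in proportion to their current request rates, keeping at least one instance per group. This is a conceptual illustration only, not ElasticMM's actual scheduling code.

```python
def split_gpus(total_gpus: int, text_rps: float, mm_rps: float,
               min_per_group: int = 1) -> tuple:
    """Return (text_gpus, mm_gpus) roughly proportional to request rates.

    Each modality group always keeps at least `min_per_group` GPUs so
    neither workload type is starved when demand is skewed.
    """
    if total_gpus < 2 * min_per_group:
        raise ValueError("need at least one GPU per modality group")
    demand = text_rps + mm_rps
    if demand == 0:
        text = total_gpus // 2  # no signal: split evenly
    else:
        text = round(total_gpus * text_rps / demand)
    # Clamp so both groups keep their minimum share.
    text = max(min_per_group, min(total_gpus - min_per_group, text))
    return text, total_gpus - text

# 8 GPUs, text traffic 3x the multimodal traffic:
print(split_gpus(8, text_rps=300.0, mm_rps=100.0))  # → (6, 2)
```

In the real system this decision would be re-evaluated continuously and weighted by per-modality cost, since multimodal requests also pay for the encoding stage.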
- High Throughput: Optimized for maximum requests per second
- Low Latency: Minimized time-to-first-token (TTFT)
- Efficient Resource Usage: Dynamic allocation prevents resource waste
- Scalable: Supports from single GPU to multi-node deployments
This project is licensed under the Apache License 2.0 - see the LICENSE file for details.
- Built on top of vLLM for efficient LLM serving
- Inspired by research in elastic computing and multimodal systems
- Thanks to the open-source community for various dependencies
If you find ElasticMM useful in your research or production deployments, please cite our NeurIPS 2025 paper:
@inproceedings{liu2025elasticmm,
title = {ElasticMM: Efficient Multimodal LLMs Serving with Elastic Multimodal Parallelism},
author = {Liu, Zedong and Cheng, Shenggan and Tan, Guangming and You, Yang and Tao, Dingwen},
booktitle = {Advances in Neural Information Processing Systems (NeurIPS)},
year = {2025},
url = {https://arxiv.org/abs/2507.10069}
}
ElasticMM - Making multimodal AI more efficient and accessible.