
ElasticMM: Elastic and Efficient MLLM Serving System


ElasticMM is an efficient and scalable serving system for multimodal large language models (MLLMs). It introduces Elastic Multimodal Parallelism (EMP), a new parallelization strategy that optimizes resource utilization and system throughput for both text-only and multimodal inference workloads, achieving up to 4.2× lower TTFT and 3.2–4.5× higher throughput than vLLM.

Latest News 🔥

  • [2025/11] – Added V1 backend (beta) for next-generation scheduling and serving.
  • [2025/10] – Open-sourced ElasticMM with full support for the vLLM V0 backend.
  • [2025/09] – Our paper ElasticMM: Efficient Multimodal LLMs Serving with Elastic Multimodal Parallelism was accepted as an Oral presentation at NeurIPS 2025.


🚀 Key Features

  • Elastic Resource Allocation: Dynamic GPU allocation between text and multimodal workloads
  • Hierarchical Scheduling: Two-level scheduling architecture for optimal resource management
  • Modality-Aware Load Balancing: Intelligent load balancing based on workload patterns
  • Real-time Auto-scaling: Automatic scaling based on demand and performance metrics
  • Multi-GPU Support: Efficient utilization of multiple GPU instances
  • OpenAI-Compatible API: Easy integration with existing applications (see the client sketch below)
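
Because the API is OpenAI-compatible, a running ElasticMM server can be queried with the standard openai Python client. The sketch below is illustrative only: the endpoint URL, port, API key, and model name are assumptions, so substitute the values of your own deployment.

# Minimal client sketch against the OpenAI-compatible endpoint.
# The base_url, api_key, and model name are illustrative assumptions.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",  # assumed server address
    api_key="EMPTY",                      # local servers typically ignore the key
)

response = client.chat.completions.create(
    model="your-model-name",  # replace with the model your server loads
    messages=[{"role": "user", "content": "Explain elastic multimodal parallelism briefly."}],
)
print(response.choices[0].message.content)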

🛠️ Installation

Prerequisites

  • Python: 3.8 or higher
  • CUDA: 11.8 or higher (for GPU support)
  • NCCL: For multi-GPU communication (usually included with PyTorch)

Install from Source

# Clone the repository
git clone https://github.com/CapitalLiu/ElasticMM.git
cd ElasticMM

# Install the package
pip install -e .

# Or install dependencies directly
pip install -r requirements.txt

🚀 Quick Start

Step 1: System Calibration (Recommended)

Before running inference, we recommend calibrating the system parameters for your hardware configuration. This offline profiling step helps optimize performance for your specific GPU setup.

# Run calibration to profile your machine's parameters
python examples/calibrate_gain_cost.py

The calibration process will:

  • Profile GPU memory and compute capabilities
  • Measure KV cache transfer bandwidth
  • Calculate optimal resource allocation parameters
  • Generate configuration files for your hardware

Note: This step is optional but highly recommended for optimal performance.

Step 2: Simple Usage Example

For a basic example demonstrating ElasticMM's core functionality:

python examples/simple_usage.py

This example shows (a flow sketch follows the list):

  • Basic system initialization
  • Request submission and processing
  • Output collection and handling
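
The script itself defines ElasticMM's actual interface. For rough orientation only, the sketch below mirrors the same initialize, submit, and collect flow using vLLM's offline API (LLM and SamplingParams), on which ElasticMM is built; it is not ElasticMM's own API, and the model name is an assumption.

# Orientation sketch using vLLM's offline API, which ElasticMM builds on,
# to mirror the initialize -> submit -> collect flow. See
# examples/simple_usage.py for ElasticMM's actual interface.
from vllm import LLM, SamplingParams

llm = LLM(model="your-model-name")  # assumed model; initialization step
params = SamplingParams(temperature=0.8, max_tokens=128)

# Submit a request and collect the output.
outputs = llm.generate(["What is elastic scheduling?"], params)
for out in outputs:
    print(out.outputs[0].text)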

Step 3: Online Service with Dynamic Workload

For a complete online service demonstration with dynamic request generation:

python examples/demo_with_workload.py

This demo includes (a workload sketch follows the list):

  • Full system deployment with proxy and scheduler
  • Dynamic request generation (text-only and multimodal)
  • Real-time load balancing and auto-scaling
  • Performance monitoring and statistics
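
To get a feel for the request pattern this demo drives, here is a hedged sketch of a mixed text/multimodal workload generator using the async OpenAI client against the server's OpenAI-compatible endpoint. The URL, model name, image URL, request count, and arrival rate are all illustrative assumptions, not the demo's own code.

# Illustrative workload generator (not the demo's code): mixes text-only
# and multimodal requests with random inter-arrival times.
# Endpoint, model name, image URL, and rates are assumptions.
import asyncio
import random

from openai import AsyncOpenAI

client = AsyncOpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

TEXT_MSG = [{"role": "user", "content": "Summarize elastic scheduling in one sentence."}]
MM_MSG = [{
    "role": "user",
    "content": [
        {"type": "text", "text": "What is shown in this image?"},
        {"type": "image_url", "image_url": {"url": "https://example.com/sample.jpg"}},
    ],
}]

async def one_request(i: int) -> None:
    # Pick a modality at random, roughly half text-only, half multimodal.
    messages = MM_MSG if random.random() < 0.5 else TEXT_MSG
    resp = await client.chat.completions.create(model="your-model-name", messages=messages)
    print(f"request {i} finished: {resp.choices[0].message.content[:60]!r}")

async def main() -> None:
    tasks = []
    for i in range(16):
        tasks.append(asyncio.create_task(one_request(i)))
        await asyncio.sleep(random.expovariate(4.0))  # ~4 requests/s on average
    await asyncio.gather(*tasks)

asyncio.run(main())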

System Requirements

⚠️ Important: ElasticMM requires a minimum of 8 GPUs to run with the default configuration (2 for text-only workloads + 6 for multimodal workloads).

  • Minimum: 8 GPUs
  • Recommended: 8+ GPUs with high-bandwidth interconnects (NVLink/InfiniBand) for optimal performance
  • Memory: Sufficient GPU memory for your target model (typically 20GB+ per GPU)
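
A quick way to confirm the machine satisfies the default 2 + 6 split is to count visible CUDA devices with PyTorch:

# Verify that at least 8 GPUs are visible before launching the default config.
import torch

n = torch.cuda.device_count()
print(f"Visible GPUs: {n}")
assert n >= 8, "default configuration expects >= 8 GPUs (2 text + 6 multimodal)"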

🏗️ Architecture

ElasticMM implements a hierarchical architecture with two main levels:

  1. Modality Level: Allocates GPU instances between text-only and multimodal workloads
  2. Stage Level: Manages encoding, prefill, and decoding stages within each modality group

The system automatically balances resources based on real-time demand and performance metrics, ensuring optimal utilization across different workload types.
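
As a conceptual illustration only (not ElasticMM's actual scheduler code), the toy sketch below shows how a two-level plan might partition 8 GPUs: first between modality groups, then across the stages within each group. The split ratio and stage assignments follow the defaults described above; the function names are hypothetical.

# Toy two-level allocation sketch; all names and ratios are illustrative.
from typing import Dict, List

def modality_split(gpus: List[int], text_share: float) -> Dict[str, List[int]]:
    """Level 1: partition GPU ids between text-only and multimodal groups."""
    k = max(1, round(len(gpus) * text_share))
    return {"text": gpus[:k], "multimodal": gpus[k:]}

def stage_split(gpus: List[int], stages: List[str]) -> Dict[str, List[int]]:
    """Level 2: spread a group's GPUs round-robin across its stages."""
    return {stage: gpus[i::len(stages)] for i, stage in enumerate(stages)}

groups = modality_split(list(range(8)), text_share=0.25)  # default 2 + 6 split
plan = {
    "text": stage_split(groups["text"], ["prefill", "decode"]),
    "multimodal": stage_split(groups["multimodal"], ["encode", "prefill", "decode"]),
}
print(plan)  # e.g. {'text': {'prefill': [0], 'decode': [1]}, ...}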

📊 Performance

  • High Throughput: Optimized for maximum requests per second
  • Low Latency: Minimized time-to-first-token (TTFT)
  • Efficient Resource Usage: Dynamic allocation prevents resource waste
  • Scalable: Scales from a single node to multi-node deployments

📄 License

This project is licensed under the Apache License 2.0 - see the LICENSE file for details.

🙏 Acknowledgments

  • Built on top of vLLM for efficient LLM serving
  • Inspired by research in elastic computing and multimodal systems
  • Thanks to the open-source community for various dependencies

📚 Citation

If you find ElasticMM useful in your research or production deployments, please cite our NeurIPS 2025 paper:

@inproceedings{liu2025elasticmm,
  title     = {ElasticMM: Efficient Multimodal LLMs Serving with Elastic Multimodal Parallelism},
  author    = {Liu, Zedong and Cheng, Shenggan and Tan, Guangming and You, Yang and Tao, Dingwen},
  booktitle = {Advances in Neural Information Processing Systems (NeurIPS)},
  year      = {2025},
  url       = {https://arxiv.org/abs/2507.10069}
}

ElasticMM - Making multimodal AI more efficient and accessible.