Estimate GPU Memory Requirements for Large Language Model (LLM) Serving with vLLM
- Website: https://llm-gpu-calc-59lxcbck4pxs2zifsgilwq.streamlit.app/
- Related Blog: https://medium.com/@kimdoil1211/how-much-gpu-memory-do-you-really-need-for-efficient-llm-serving-4d26d5b8b95b
LLM-GPU-Calc is a lightweight tool designed to estimate GPU memory usage for LLM inference, specifically for vLLM-based serving. It helps AI practitioners, engineers, and researchers optimize GPU resource allocation based on key model parameters, KV cache size, and concurrent user requirements.
🔹 Key Features
- 📊 Estimate Required GPU Memory for LLM inference
- ⚡ Breakdown of KV Cache, Model Weights, and Activation Memory
- 🔍 Supports Parameter Data Types (FP16, FP8, INT4, etc.)
- 🖥️ Optimize Concurrent User Handling for Efficient Serving
- 🔗 Integrates with Hugging Face Model API for Configurations
🔹 Installation
Clone the repository and install dependencies:
git clone https://github.com/gjgjos/LLM-GPU-Calc.git
cd LLM-GPU-Calc
pip install -r requirements.txt
🔹 Usage
Run the Streamlit-based UI:
streamlit run app.py
or
python -m streamlit run app.py
🔹 GPU Memory Estimation
The required GPU memory for inference is calculated using the following formula:
required_gpu_memory = (model_weight + non_torch_memory + pytorch_activation_peak_memory
                       + kv_cache_memory_per_batch * concurrent_users) / gpu_memory_utilization
| Component | Description |
|---|---|
| Model Weight | Memory occupied by the model parameters |
| KV Cache Memory | Stores key-value pairs for transformer attention |
| Non-Torch Memory | Memory used by the CUDA context and other non-PyTorch allocations |
| PyTorch Activation Peak Memory | Peak memory used for intermediate activations during a forward pass |
| GPU Memory Utilization Factor | Fraction of total GPU memory allocated for inference (vLLM's gpu_memory_utilization) |
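For illustration, here is a minimal Python sketch of the formula above; the function name and the example values are placeholders, not measurements from any specific model or GPU:

```python
def required_gpu_memory_gb(
    model_weight_gb: float,
    non_torch_memory_gb: float,
    activation_peak_gb: float,
    kv_cache_per_request_gb: float,
    concurrent_users: int,
    gpu_memory_utilization: float = 0.9,
) -> float:
    """Estimate the GPU memory (GB) needed to serve `concurrent_users` requests at once."""
    fixed = model_weight_gb + non_torch_memory_gb + activation_peak_gb
    dynamic = kv_cache_per_request_gb * concurrent_users
    return (fixed + dynamic) / gpu_memory_utilization

# Placeholder numbers for illustration only: 15 GB weights, 0.4 GB non-torch memory,
# 1 GB activation peak, 0.2 GB of KV cache per request, 50 concurrent users.
print(required_gpu_memory_gb(15, 0.4, 1.0, 0.2, 50))  # ≈ 29.3 GB
```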
The per-request KV cache memory is computed as:
kv_cache_memory_per_batch = (2 * kv_attention_heads * head_dim * num_layers * kv_data_type_size) * sequence_length
- `kv_attention_heads`: Number of key-value attention heads
- `head_dim`: Dimensionality of each attention head
- `num_layers`: Number of transformer layers
- `kv_data_type_size`: Size in bytes of the KV cache data type (e.g., 2 for FP16)
- `sequence_length`: Sum of input and output tokens
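As a rough sketch of how these values can be pulled from a Hugging Face config (per the Hugging Face integration mentioned above), the snippet below uses `transformers.AutoConfig`; the model id, the FP16 (2-byte) KV cache default, and the `hidden_size // num_attention_heads` head-dimension convention are illustrative assumptions, and some architectures expose these fields under different names:

```python
from transformers import AutoConfig

def kv_cache_per_request_gb(model_name: str, sequence_length: int,
                            kv_dtype_bytes: int = 2) -> float:
    """Per-request KV cache size in GB; assumes 2-byte (FP16) KV entries by default."""
    cfg = AutoConfig.from_pretrained(model_name)
    num_layers = cfg.num_hidden_layers
    # Grouped-query attention models expose num_key_value_heads; otherwise fall back
    # to the full number of attention heads.
    kv_heads = getattr(cfg, "num_key_value_heads", cfg.num_attention_heads)
    head_dim = cfg.hidden_size // cfg.num_attention_heads  # common convention
    per_token_bytes = 2 * kv_heads * head_dim * num_layers * kv_dtype_bytes
    return per_token_bytes * sequence_length / 1e9

# Illustrative model id: for a Mistral-7B-like config (8 KV heads, 32 layers,
# head_dim 128), 4096 total tokens work out to roughly 0.54 GB per request.
print(kv_cache_per_request_gb("mistralai/Mistral-7B-v0.1", sequence_length=4096))
```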
When available KV cache memory is exhausted, requests are queued, increasing Time to First Token (TTFT). The maximum number of concurrent users a GPU can support is:
max_concurrent_users = available_kv_cache_memory // kv_cache_memory_per_batch
Assumptions:
- Total GPU Memory: 40GB
- GPU Utilization: 90% (0.9)
- Model Weight: 15GB
- Non-Torch Memory: 400MB
- PyTorch Activation Peak Memory: 1GB
Calculation:
available_kv_cache_memory = (40 * 0.9 - 15 - 0.4 - 1) = 19.6 GB
If each batch requires 200MB for KV cache:
max_concurrent_users = 19.6GB // 200MB = 98 users
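The same worked example, written out in Python with the values copied from the assumptions above (all units in GB):

```python
total_gpu_gb = 40
gpu_memory_utilization = 0.9
model_weight_gb = 15
non_torch_gb = 0.4
activation_peak_gb = 1.0
kv_cache_per_request_gb = 0.2  # 200 MB of KV cache per request

available_kv_cache_gb = (total_gpu_gb * gpu_memory_utilization
                         - model_weight_gb - non_torch_gb - activation_peak_gb)
max_concurrent_users = int(available_kv_cache_gb // kv_cache_per_request_gb)
print(available_kv_cache_gb, max_concurrent_users)  # 19.6  98
```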
🔹 Roadmap
- ✅ Support for multi-GPU parallel inference
- ✅ Advanced profiling tools for real-time monitoring
- ✅ Integration with Kubernetes for scalable LLM deployment
🔹 Contributing
We welcome contributions! Feel free to open an issue, submit a pull request, or improve the documentation.
📌 Author: gjgjos
📌 GitHub: LLM-GPU-Calc