
📌 LLM-GPU-Calc

Estimate GPU Memory Requirements for Large Language Model (LLM) Serving with vLLM

🚀 Overview

LLM-GPU-Calc is a lightweight tool designed to estimate GPU memory usage for LLM inference, specifically for vLLM-based serving. It helps AI practitioners, engineers, and researchers optimize GPU resource allocation based on key model parameters, KV cache size, and concurrent user requirements.

🔹 Key Features

  • 📊 Estimate Required GPU Memory for LLM inference
  • Breakdown of KV Cache, Model Weights, and Activation Memory
  • 🔍 Supports Parameter Data Types (FP16, FP8, INT4, etc.)
  • 🖥️ Optimize Concurrent User Handling for Efficient Serving
  • 🔗 Integrates with the Hugging Face Model API for configurations (see the sketch after this list)
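The exact integration used by this repository isn't shown here, but as a minimal, hypothetical sketch, the configuration fields that the formulas below need (layer count, KV heads, head dimension) can be pulled from the Hub with the transformers library. The model id is purely illustrative:

from transformers import AutoConfig

# Illustrative model id; any Hub model with a standard config works.
config = AutoConfig.from_pretrained("mistralai/Mistral-7B-v0.1")

num_layers = config.num_hidden_layers
# Models without grouped-query attention fall back to the full head count.
kv_attention_heads = getattr(config, "num_key_value_heads", config.num_attention_heads)
head_dim = config.hidden_size // config.num_attention_heads  # typical convention

print(num_layers, kv_attention_heads, head_dim)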

🔧 Installation

Clone the repository and install dependencies:

git clone https://github.com/gjgjos/LLM-GPU-Calc.git
cd LLM-GPU-Calc
pip install -r requirements.txt

📌 Usage

Run the Streamlit-based UI:

streamlit run app.py
or
python -m streamlit run app.py

🧠 Understanding GPU Memory Calculation

The required GPU memory for inference is calculated using the following formula:

Required GPU Memory = [(model_weight + non_torch_memory + pytorch_activation_peak_memory)
+ kv_cache_memory_per_batch * concurrent_users] / gpu_memory_utilization

Key Components

  • Model Weight: Memory occupied by the model parameters
  • KV Cache Memory: Stores the key-value pairs used by transformer attention
  • Non-Torch Memory: Memory used by the CUDA context and other non-PyTorch allocations
  • PyTorch Activation Peak Memory: Memory used for intermediate activations during a forward pass
  • GPU Memory Utilization: Fraction of total GPU memory allocated to the inference server
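As a minimal sketch (not the repository's actual code; all names and the 0.9 default are illustrative), the formula maps directly onto a small helper with every quantity expressed in GB:

def estimate_required_gpu_memory_gb(
    model_weight_gb,
    non_torch_memory_gb,
    activation_peak_gb,
    kv_cache_per_batch_gb,
    concurrent_users,
    gpu_memory_utilization=0.9,
):
    # Fixed per-model footprint plus per-user KV cache, divided by the
    # fraction of total GPU memory the server is allowed to use.
    fixed = model_weight_gb + non_torch_memory_gb + activation_peak_gb
    dynamic = kv_cache_per_batch_gb * concurrent_users
    return (fixed + dynamic) / gpu_memory_utilization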

KV Cache Calculation

kv_cache_memory_per_batch = (2 * kv_attention_heads * head_dim * num_layers * kv_data_type_size) * sequence_length
  • kv_attention_heads: Number of key-value attention heads
  • head_dim: Dimensionality of each attention head
  • num_layers: Number of transformer layers
  • kv_data_type_size: Size in bytes of the KV cache data type (e.g., 2 for FP16)
  • sequence_length: Sum of input and output tokens
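The same calculation as an illustrative Python sketch, with all sizes in bytes:

def kv_cache_memory_per_batch_bytes(
    kv_attention_heads, head_dim, num_layers, kv_data_type_size, sequence_length
):
    # The factor of 2 covers both the key and the value stored per token.
    return (2 * kv_attention_heads * head_dim * num_layers
            * kv_data_type_size * sequence_length)

# Illustrative GQA configuration: 8 KV heads, head_dim 128, 32 layers,
# FP16 cache (2 bytes), 4096-token sequence -> 536,870,912 bytes (about 0.5 GiB).
print(kv_cache_memory_per_batch_bytes(8, 128, 32, 2, 4096))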

📈 Estimating Maximum Concurrent Users

When available KV cache memory is exhausted, requests are queued, increasing Time to First Token (TTFT). The maximum number of concurrent users a GPU can support is:

max_concurrent_users = available_kv_cache_memory // kv_cache_memory_per_batch

Example Calculation

Assumptions:

  • Total GPU Memory: 40GB
  • GPU Utilization: 90% (0.9)
  • Model Weight: 15GB
  • Non-Torch Memory: 400MB
  • PyTorch Activation Peak Memory: 1GB

Calculation:

available_kv_cache_memory = (40 * 0.9 - 15 - 0.4 - 1) = 19.6 GB

If each batch requires 200MB for KV cache:

max_concurrent_users = 19.6GB // 200MB = 98 users
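The same arithmetic in a few lines of Python, using the assumed values above:

total_gpu_memory_gb = 40
gpu_memory_utilization = 0.9
model_weight_gb = 15
non_torch_memory_gb = 0.4
activation_peak_gb = 1
kv_cache_per_batch_gb = 0.2  # 200 MB per request

available_kv_cache_gb = (
    total_gpu_memory_gb * gpu_memory_utilization
    - model_weight_gb - non_torch_memory_gb - activation_peak_gb
)  # 19.6 GB

max_concurrent_users = int(available_kv_cache_gb // kv_cache_per_batch_gb)  # 98
print(available_kv_cache_gb, max_concurrent_users)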

🛠 Future Improvements

  • ✅ Support for multi-GPU parallel inference
  • ✅ Advanced profiling tools for real-time monitoring
  • ✅ Integration with Kubernetes for scalable LLM deployment

🙌 Contributions

We welcome contributions! Feel free to open an issue, submit a pull request, or improve the documentation. 😊


📬 Contact

📌 Author: gjgjos
📌 GitHub: LLM-GPU-Calc
