A comprehensive ecosystem for simulating and analyzing Large Language Model (LLM) performance across diverse hardware platforms. This repository provides accurate memory estimation, inference simulation, performance modeling, and training simulation for LLM deployment and fine-tuning.
simulator/
├── BudSimulator/ # Full-stack web application for LLM analysis
│ ├── frontend/ # React TypeScript UI
│ │ └── src/
│ │ ├── components/ # Reusable UI components
│ │ ├── services/ # API service layer
│ │ └── types/ # TypeScript interfaces
│ ├── apis/ # FastAPI backend
│ │ └── routers/ # API route handlers
│ │ ├── models.py # Model validation, memory calculation
│ │ ├── hardware.py # Hardware management, recommendations
│ │ ├── usecases.py # Usecase management, SLO validation
│ │ └── training.py # Training simulation APIs
│ └── Website/ # Streamlit dashboard
│ └── pages/ # Comparison tools
│
└── llm-memory-calculator/ # Core LLM performance modeling engine
└── src/ # Python package with GenZ framework
├── genz/ # Roofline-based performance modeling
│ ├── LLM_inference/ # Prefill/decode simulation
│ └── LLM_training/ # Training simulation
└── training/ # Training memory & cluster optimization
| Feature | Inference | Training |
|---|---|---|
| Memory Estimation | ✅ Weights + KV Cache + Activations | ✅ + Gradients + Optimizer States |
| Performance Modeling | ✅ Prefill, Decode, Speculative | ✅ Forward, Backward, Communication |
| Parallelism | ✅ TP, PP, EP | ✅ TP, PP, DP, EP, ZeRO 0-3 |
| Hardware Support | ✅ 57 profiles (GPU/TPU/ASIC/CPU) | ✅ Same |
| Cost Estimation | ✅ Per-request | ✅ Per-training-run |
| SLO Validation | ✅ TTFT, E2E, Throughput | ✅ Time to completion |
from llm_memory_calculator.genz.LLM_inference import (
prefill_moddeling, # First token latency simulation
decode_moddeling, # Token generation simulation
spec_prefill_modeling, # Speculative decoding
)
from llm_memory_calculator.genz.LLM_inference.best_parallelization import (
get_best_parallization_strategy, # Find optimal TP/PP
get_pareto_optimal_performance, # Pareto frontier analysis
)
from llm_memory_calculator.genz.LLM_inference.platform_size import (
get_minimum_system_size, # Minimum nodes required
)

from llm_memory_calculator.genz.LLM_inference import prefill_moddeling
result = prefill_moddeling(
model='meta-llama/Llama-3.1-8B',
batch_size=4,
input_tokens=2048,
system_name='H100_GPU',
bits='bf16',
tensor_parallel=1,
pipeline_parallel=1,
)
print(f"TTFT: {result['Latency(ms)']:.1f} ms")
print(f"Throughput: {result['Throughput_tokens_per_sec']:.0f} tokens/s")from llm_memory_calculator.genz.LLM_inference import decode_moddeling
result = decode_moddeling(
model='meta-llama/Llama-3.1-8B',
batch_size=4,
input_tokens=2048,
output_tokens=256,
Bb=4, # Beam size
system_name='H100_GPU',
bits='bf16',
tensor_parallel=1,
)
print(f"Decode Latency: {result['Latency(ms)']:.1f} ms")
print(f"Output Throughput: {result['Throughput_tokens_per_sec']:.0f} tokens/s")from llm_memory_calculator.genz.LLM_inference.best_parallelization import (
get_best_parallization_strategy
)
df = get_best_parallization_strategy(
stage='decode',
model='meta-llama/Llama-3.1-70B',
total_nodes=8,
batch_size=16,
beam_size=4,
input_tokens=2048,
output_tokens=256,
system_name='H100_GPU',
bits='bf16',
)
print(df) # DataFrame with TP, PP, Latency, Throughput

from llm_memory_calculator import calculate_memory
memory = calculate_memory(
model="meta-llama/Llama-3.1-8B", # HuggingFace ID or config dict
batch_size=4,
sequence_length=2048,
precision="bf16",
)
print(f"Model Weights: {memory.weights_memory_gb:.2f} GB")
print(f"KV Cache: {memory.kv_cache_gb:.2f} GB")
print(f"Activations: {memory.activations_gb:.2f} GB")
print(f"Total: {memory.total_memory_gb:.2f} GB")| Stage | Description | Models Required |
|---|---|---|
| SFT | Supervised Fine-Tuning | 1 (policy) |
| DPO | Direct Preference Optimization | 2 (policy + reference) |
| PPO | Proximal Policy Optimization | 4 (actor + critic + reference + reward) |
| GRPO | Group Relative Policy Optimization | 1 (with group sampling) |
| KTO | Kahneman-Tversky Optimization | 2 (policy + reference) |
| ORPO | Odds Ratio Preference Optimization | 1 (combined loss) |
| SimPO | Simple Preference Optimization | 1 (reference-free) |
| IPO | Identity Preference Optimization | 1 (reference-free) |
| RM | Reward Modeling | 1 (reward model) |
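Stages that involve more than one model (e.g. DPO, PPO, KTO) are selected through the `training_stage` argument of `training_modeling`, which is documented with a full SFT example later in this README. A minimal sketch for a DPO run, reusing the same parameters as that example:

```python
from llm_memory_calculator.genz.LLM_training import training_modeling

# DPO keeps a frozen reference model alongside the trained policy,
# so the per-GPU memory estimate covers both (see the stage table above).
result = training_modeling(
    model='meta-llama/Llama-3.1-8B',
    training_stage='dpo',
    method='lora',
    batch_size=4,
    seq_length=2048,
    system_name='H100_GPU',
    num_gpus=8,
    tensor_parallel=1,
    data_parallel=8,
    zero_stage=2,
    optimizer='adamw',
    lora_rank=16,
)
print(f"Step Time: {result.step_time_ms:.1f} ms")
print(f"Memory/GPU: {result.memory_per_gpu_gb:.1f} GB")
```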
| Method | Trainable % | Memory Savings |
|---|---|---|
| Full | 100% | None |
| LoRA | ~0.5% | ~70% |
| QLoRA | ~0.5% | ~85% |
| DoRA | ~0.5% | ~70% |
| PiSSA | ~0.5% | ~70% |
| Freeze | Variable | Variable |
from llm_memory_calculator.training import TrainingMemoryCalculator
calculator = TrainingMemoryCalculator()
estimate = calculator.calculate_training_memory(
config="meta-llama/Llama-3.1-8B",
batch_size=4,
seq_length=2048,
precision="bf16",
method="lora", # full, lora, qlora, freeze, dora, pissa
optimizer="adamw",
gradient_checkpointing=True,
lora_rank=16,
)
print(f"Weight Memory: {estimate.weight_memory_gb:.2f} GB")
print(f"Gradient Memory: {estimate.gradient_memory_gb:.2f} GB")
print(f"Optimizer Memory: {estimate.optimizer_memory_gb:.2f} GB")
print(f"Activation Memory: {estimate.activation_memory_gb:.2f} GB")
print(f"Total Memory: {estimate.total_memory_gb:.2f} GB")from llm_memory_calculator.genz.LLM_training import training_modeling
result = training_modeling(
model='meta-llama/Llama-3.1-8B',
training_stage='sft', # sft, dpo, ppo, grpo, kto, orpo, simpo, rm
method='lora',
batch_size=4,
seq_length=2048,
system_name='H100_GPU',
num_gpus=8,
tensor_parallel=1,
data_parallel=8,
zero_stage=2,
optimizer='adamw',
lora_rank=16,
)
print(f"Step Time: {result.step_time_ms:.1f} ms")
print(f"Throughput: {result.tokens_per_second:.0f} tokens/s")
print(f"Memory/GPU: {result.memory_per_gpu_gb:.1f} GB")
print(f"MFU: {result.model_flops_utilization:.1%}")from llm_memory_calculator.genz.LLM_training import get_best_training_parallelization
config, result = get_best_training_parallelization(
model='meta-llama/Llama-3.1-70B',
total_gpus=64,
batch_size=4,
seq_length=4096,
system_name='H100_GPU',
)
print(f"Optimal: TP={config.tensor_parallel}, PP={config.pipeline_parallel}, DP={config.data_parallel}")
print(f"Throughput: {result.tokens_per_second:.0f} tokens/s")- NVIDIA: A100 (40GB, 80GB), H100, H200, GH200, B100, GB200, V100, RTX 4090/4080/3090, L40S, A10G
- AMD: MI300X, MI325X, MI210, MI100
- Google TPUs: TPUv4, TPUv5e, TPUv5p, TPUv6
- Intel: Gaudi3, MAX 1550, MAX 1100
- AWS: Trainium, Inferentia
- Specialty: Cerebras WSE-2/3, Groq LPU, SambaNova SN40L
- Intel Xeon: Sapphire Rapids, Emerald Rapids, Granite Rapids
- AMD EPYC: Milan, Genoa, Bergamo
- ARM: NVIDIA Grace, AWS Graviton3/4
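To relate these profiles to the memory estimates above, the sketch below uses the `calculate_memory` API shown earlier to check whether an 8B model's inference footprint fits on a single 80 GB accelerator (the 80 GB figure is simply the capacity of an A100 80GB or H100 SXM):

```python
from llm_memory_calculator import calculate_memory

GPU_MEMORY_GB = 80  # e.g., A100 80GB or H100 SXM

memory = calculate_memory(
    model="meta-llama/Llama-3.1-8B",
    batch_size=4,
    sequence_length=2048,
    precision="bf16",
)

fits = memory.total_memory_gb <= GPU_MEMORY_GB
print(f"Total: {memory.total_memory_gb:.2f} GB -> fits on one 80 GB GPU: {fits}")
```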
| Endpoint | Method | Description |
|---|---|---|
| `/api/models/validate` | POST | Validate model URL/ID from HuggingFace |
| `/api/models/{model_id}/config` | GET | Get model architecture details |
| `/api/models/calculate` | POST | Calculate inference memory requirements |
| `/api/models/compare` | POST | Compare multiple models' memory |
| `/api/models/analyze` | POST | Analyze efficiency across sequence lengths |
| `/api/models/list` | GET | List all available models |
| `/api/models/popular` | GET | Get popular models with logos |
| `/api/models/filter` | GET | Advanced filtering (author, type, params) |
| `/api/hardware` | GET | List hardware with filters |
| `/api/hardware/filter` | GET | Advanced hardware filtering |
| `/api/hardware/recommend` | POST | Get hardware recommendations |
| `/api/usecases` | GET/POST | Usecase CRUD operations |
| `/api/usecases/{id}` | GET/PUT/DELETE | Single usecase operations |
| `/api/usecases/{id}/recommendations` | POST | Model/hardware recommendations for usecase |
| `/api/usecases/{id}/optimize-hardware` | POST | GenZ-based optimization sweep |
| Endpoint | Method | Description |
|---|---|---|
| `/api/simulator/hardware` | GET | List all 57 hardware profiles |
| `/api/simulator/estimate-training` | POST | Estimate training memory |
| `/api/simulator/recommend-cluster` | POST | Cluster recommendations (cost/speed) |
| `/api/simulator/check-fit` | POST | Check if training fits on hardware |
| `/api/simulator/estimate-time` | POST | Estimate training time and cost |
curl "http://localhost:8000/api/hardware?type=gpu&min_memory=40"curl -X POST http://localhost:8000/api/models/calculate \
-H "Content-Type: application/json" \
-d '{
"model_id": "meta-llama/Llama-3.1-8B",
"batch_size": 8,
"seq_length": 4096,
"precision": "bf16"
}'curl -X POST "http://localhost:8000/api/usecases/chatbot-1/recommendations" \
-H "Content-Type: application/json" \
-d '{
"batch_sizes": [1, 4, 8],
"model_categories": ["8B", "70B"]
}'

curl -X POST http://localhost:8000/api/simulator/estimate-training \
-H "Content-Type: application/json" \
-d '{
"model": "meta-llama/Llama-3.1-8B",
"method": "lora",
"batch_size": 4,
"seq_length": 2048,
"optimizer": "adamw",
"lora_rank": 16
}'

curl -X POST http://localhost:8000/api/simulator/recommend-cluster \
-H "Content-Type: application/json" \
-d '{
"model": "meta-llama/Llama-3.1-8B",
"method": "full",
"prefer_cost": true,
"max_gpus": 32
}'

A modern TypeScript React application for interactive LLM analysis.
- **Hardware Browser**: Searchable catalog with advanced filtering
  - Filter by type, manufacturer, memory, FLOPS, bandwidth
  - Sort by performance, cost, efficiency
  - Detailed specs with tooltips
  - Vendor and cloud pricing information
  - Model compatibility matrix
- **Usecase Management**: Configure inference workloads
  - Industry and tag-based filtering
  - Latency profiles: real-time, interactive, responsive, batch
  - SLO configuration (TTFT, E2E, inter-token latency)
  - Token range configuration
- **AI Optimization**: GenZ-powered recommendations (see the sketch after this list)
  - Batch size and model size selection
  - Optimization modes: Cost, Speed, Balanced
  - SLO compliance indicators
  - Deployment guidance
- **Model Details**: Architecture analysis
  - Parameters, attention type, model type
  - Memory requirements at various sequence lengths
  - Links to HuggingFace
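The AI Optimization view is backed by the usecase recommendation endpoint shown in the API examples above. The same call from Python, assuming a local backend on port 8000 and the `requests` package:

```python
import requests

# Same request body as the curl example above; 'chatbot-1' is the
# example usecase id used earlier in this README.
resp = requests.post(
    "http://localhost:8000/api/usecases/chatbot-1/recommendations",
    json={
        "batch_sizes": [1, 4, 8],
        "model_categories": ["8B", "70B"],
    },
    timeout=60,
)
resp.raise_for_status()
print(resp.json())  # recommended model/hardware combinations for the usecase
```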
cd BudSimulator/frontend
npm install
npm start # Opens at http://localhost:3000

An interactive analytical dashboard for performance visualization.
- **Home**: GenZ framework overview and documentation
  - Supported models and hardware
  - Quantization options (FP32, BF16, INT8, INT4, INT2)
  - Parallelism strategies visualization
- **Usecase Comparison**: Compare performance across use cases
  - Pre-configured use cases: Q&A, Summarization, Chatbots, Code Gen
  - Scatter plots: TTFT vs Throughput with performance zones
  - Bar charts: Latency, Throughput, Total Response Time
- **Model Comparison**: Compare models at varying batch sizes
  - Multi-model selection
  - Batch sweep visualization (1-256)
  - Prefill/Decode phase analysis
  - Demand curve generation
- **Platform Comparison**: Compare hardware accelerators
  - Multi-platform selection with custom specs
  - Hardware datasheet links
  - Performance quadrant analysis
  - Memory requirement checks
cd BudSimulator/Website
streamlit run Home.py # Opens at http://localhost:8501

| Function | Description |
|---|---|
| `prefill_moddeling()` | Simulate first token latency (TTFT) |
| `decode_moddeling()` | Simulate token generation with KV cache growth |
| `spec_prefill_modeling()` | Speculative decoding simulation |
| `get_best_parallization_strategy()` | Find optimal TP/PP for inference |
| `get_pareto_optimal_performance()` | Pareto frontier analysis |
| `get_minimum_system_size()` | Calculate minimum nodes required |
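Of these, `get_minimum_system_size()` is the only entry point not exercised elsewhere in this README. A minimal usage sketch; the keyword arguments are assumed to mirror the other inference functions and may differ from the actual signature:

```python
from llm_memory_calculator.genz.LLM_inference.platform_size import get_minimum_system_size

# Assumed keyword arguments, mirroring prefill_moddeling/decode_moddeling;
# check the function's docstring for the authoritative signature.
min_nodes = get_minimum_system_size(
    model='meta-llama/Llama-3.1-70B',
    batch_size=16,
    input_tokens=2048,
    output_tokens=256,
    system_name='H100_GPU',
    bits='bf16',
)
print(f"Minimum nodes required: {min_nodes}")
```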
| Function | Description |
|---|---|
| `TrainingMemoryCalculator` | Calculate training memory requirements |
| `TrainingClusterSelector` | Recommend optimal cluster configurations |
| `estimate_training_time()` | Estimate training time and cost |
| `auto_configure_training()` | Auto-configure optimal training setup |
| `build_llamafactory_config()` | Generate LlamaFactory YAML config |
| `build_deepspeed_config()` | Generate DeepSpeed JSON config |
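A sketch of how `estimate_training_time()` might be called alongside the memory calculator shown earlier; the argument names below are illustrative assumptions, not the verified API, so consult the package docstrings for the real signature:

```python
# Hypothetical usage sketch; argument names are assumptions.
from llm_memory_calculator.training import estimate_training_time  # assumed export, like TrainingMemoryCalculator

timing = estimate_training_time(
    model="meta-llama/Llama-3.1-8B",   # HuggingFace ID, as elsewhere in this README
    method="lora",
    num_gpus=8,
    system_name="H100_GPU",
    dataset_tokens=1_000_000_000,      # assumed way to specify corpus size
    batch_size=4,
    seq_length=2048,
)
print(timing)  # expected to report estimated wall-clock time and cost
```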
| Function | Description |
|---|---|
| `training_modeling()` | Full training step simulation |
| `training_modeling_for_stage()` | Stage-aware training simulation |
| `get_best_training_parallelization()` | Find optimal parallelism strategy |
| `estimate_dpo_training()` | DPO-specific estimation |
| `estimate_ppo_training()` | PPO-specific estimation |
| `validate_against_benchmark()` | Validate against published benchmarks |
| Function | Description |
|---|---|
| `calculate_memory()` | Calculate inference memory requirements |
| `estimate_max_batch_size()` | Max batch for given GPU memory |
| `estimate_max_sequence_length()` | Max sequence for given constraints |
| `analyze_attention_efficiency()` | Analyze attention type efficiency |
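For example, `estimate_max_batch_size()` answers the inverse question to `calculate_memory()`. A minimal sketch; the keyword arguments are assumptions for illustration and may not match the exact signature:

```python
from llm_memory_calculator import estimate_max_batch_size  # assumed top-level export, like calculate_memory

# Hypothetical argument names; consult the package docs for the real signature.
max_bs = estimate_max_batch_size(
    model="meta-llama/Llama-3.1-8B",
    sequence_length=2048,
    precision="bf16",
    gpu_memory_gb=80,   # e.g., a single A100 80GB or H100
)
print(f"Largest batch size that fits: {max_bs}")
```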
| Phase | Latency | Throughput | Memory |
|---|---|---|---|
| Prefill | 45 ms | 181,689 tok/s | 17.2 GB |
| Decode (256 tokens) | 312 ms | 3,282 tok/s | 18.1 GB |
| Stage | Method | Weight | Gradient | Optimizer | Activation | Total/GPU |
|---|---|---|---|---|---|---|
| SFT | Full | 17.7 GB | 35.3 GB | 70.7 GB | 10.9 GB | 148.0 GB |
| SFT | LoRA | 17.7 GB | 0.1 GB | 0.3 GB | 10.9 GB | 31.9 GB |
| SFT | QLoRA | 4.4 GB | 0.1 GB | 0.3 GB | 10.9 GB | 17.3 GB |
| PPO | Full | 17.7 GB | 35.3 GB | 70.7 GB | 10.9 GB | 323.0 GB* |
| DPO | LoRA | 17.7 GB | 0.1 GB | 0.3 GB | 10.9 GB | 51.3 GB* |
*Includes reference/reward models
| Method | Time | Cost | Throughput | MFU |
|---|---|---|---|---|
| Full | 6.8h | $259 | 40,824 tok/s | 24.9% |
| LoRA | 6.8h | $259 | 40,824 tok/s | 24.9% |
cd BudSimulator
python setup.py # Automated setup
# Server runs at http://localhost:8000
# API docs at http://localhost:8000/docs
# Frontend at http://localhost:3000 (after npm start)

cd llm-memory-calculator
pip install -e .

cd BudSimulator/Website
pip install -r requirements.txt
streamlit run Home.py

# Test inference
cd llm-memory-calculator
pytest tests/ -v -k "inference"
# Test training module
pytest tests/training/ -v
# Test full API
cd BudSimulator
python comprehensive_api_test.py
# Quick validation
python -c "
from llm_memory_calculator.genz.LLM_inference import prefill_moddeling
result = prefill_moddeling(
model='meta-llama/Llama-3.1-8B',
batch_size=4,
input_tokens=2048,
system_name='H100_GPU',
bits='bf16',
)
print(f'TTFT: {result[\"Latency(ms)\"]:.1f}ms')
"The simulator has been validated against published benchmarks:
- MLPerf Training results for LLaMA-2 70B
- DeepSpeed ZeRO efficiency measurements
- Megatron-LM throughput benchmarks
- Hardware vendor specifications (NVIDIA, AMD, Google)
Typical accuracy:
- Memory estimation: ±10%
- Throughput estimation: ±15%
- Training time: ±20%
┌─────────────────────────────────────────────────────────────────────────┐
│ User Interfaces │
├─────────────────┬─────────────────────────┬─────────────────────────────┤
│ React Web UI │ Streamlit Dashboard │ REST API Clients │
│ (Port 3000) │ (Port 8501) │ (curl, Python, etc) │
└────────┬────────┴───────────┬─────────────┴──────────────┬──────────────┘
│ │ │
▼ ▼ ▼
┌─────────────────────────────────────────────────────────────────────────┐
│ FastAPI Backend (Port 8000) │
├─────────────────┬─────────────────┬─────────────────┬───────────────────┤
│ /api/models │ /api/hardware │ /api/usecases │ /api/simulator │
│ - validate │ - list │ - CRUD │ - estimate │
│ - config │ - filter │ - recommend │ - recommend │
│ - calculate │ - recommend │ - optimize │ - check-fit │
│ - compare │ │ │ - estimate-time │
└────────┬────────┴────────┬────────┴────────┬────────┴────────┬──────────┘
│ │ │ │
▼ ▼ ▼ ▼
┌─────────────────────────────────────────────────────────────────────────┐
│ llm-memory-calculator Package │
│ │
│ ┌─────────────────────────────────────────────────────────────────┐ │
│ │ GenZ Engine │ │
│ ├────────────────────────┬────────────────────────────────────────┤ │
│ │ LLM_inference/ │ LLM_training/ │ │
│ │ - prefill_moddeling │ - training_modeling │ │
│ │ - decode_moddeling │ - get_best_parallelization │ │
│ │ - spec_decode │ - training_stages │ │
│ │ - best_parallelism │ - validation │ │
│ └────────────────────────┴────────────────────────────────────────┘ │
│ │
│ ┌─────────────────────────────────────────────────────────────────┐ │
│ │ Training Module │ │
│ │ - TrainingMemoryCalculator - auto_configure_training │ │
│ │ - TrainingClusterSelector - build_llamafactory_config │ │
│ │ - estimate_training_time - build_deepspeed_config │ │
│ └─────────────────────────────────────────────────────────────────┘ │
│ │
│ ┌─────────────────────────────────────────────────────────────────┐ │
│ │ Hardware Configs (57 Profiles) │ │
│ │ GPUs | TPUs | ASICs | Accelerators | CPUs │ │
│ └─────────────────────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────────────────┘
We welcome contributions! Please see our Contributing Guide.
# Clone and setup
git clone https://github.com/BudEcosystem/simulator.git
cd simulator
pip install -e llm-memory-calculator/
cd BudSimulator && pip install -r requirements.txt

This project is licensed under the MIT License - see the LICENSE file for details.
- Built on the GenZ-LLM Analyzer framework
- Validated against MLPerf Training benchmarks
- Hardware specs from official vendor documentation
- Model configs from HuggingFace Hub
- Issues: GitHub Issues
- Discussions: GitHub Discussions
Built with care by the Bud Ecosystem team