diff --git a/docs/HARDWARE_REQUIREMENTS.md b/docs/HARDWARE_REQUIREMENTS.md new file mode 100644 index 000000000..f0d66c7f7 --- /dev/null +++ b/docs/HARDWARE_REQUIREMENTS.md @@ -0,0 +1,142 @@ +# Hardware Requirements for Bud-Stack Platform + +## Executive Summary + +Bud-Stack is a comprehensive multi-service platform for AI/ML model deployment and cluster management. This document provides infrastructure requirements for Cloud Service Providers (CSPs) and organizations planning to deploy the platform. + +### Platform Overview + +The platform consists of: +- **14 Microservices** (Application, cluster management, ML optimization, model registry, etc.) +- **Core Infrastructure** (Databases, message queues, object storage, authentication) +- **Observability Stack** (Metrics, logging, distributed tracing) +- **High-Performance Gateway** (Rust-based API routing) + +--- + +## Infrastructure Requirements Summary + +### AI-In-A-Box - OEM + +| Resource | Requirement | +|----------|-------------| +| **CPU Cores** | 32 cores | +| **Memory (RAM)** | 64 GiB | +| **Storage (SSD)** | 200 GiB | +| **Operating System** | Linux (Ubuntu 22.04+, RHEL 8+, or OpenShift 4.12+) | +| **Kubernetes** | Version 1.29+ | + +**Max concurrency**: Upto 100 concurrent users + +--- + +### Enterprise deployment + +| Resource | Requirement | +|----------|-------------| +| **CPU Cores** | 96 cores | +| **Memory (RAM)** | 384 GiB | +| **Storage (SSD)** | 5 TiB | +| **Network Bandwidth** | 10 Gbps | +| **Operating System** | Linux (Ubuntu 22.04+, RHEL 8+, or OpenShift 4.12+) | +| **Kubernetes** | Version 1.29+ | + +**Max concurreny**: Upto 1000 concurrent users + +--- + +### CSP Deployment + +| Resource | Requirement | +|----------|-------------| +| **CPU Cores** | 120-200 cores | +| **Memory (RAM)** | 0.5 - 1 TiB | +| **Storage (SSD)** | 10 - 20 TiB | +| **Network Bandwidth** | 10-40 Gbps | +| **Operating System** | Linux (Ubuntu 22.04+, RHEL 8+, or OpenShift 4.12+) | +| **Kubernetes** | Version 1.29+ | + +**Concurrency**: 10000+ + +--- + +## Detailed Architecture + +### Node Pool Breakdown + +Production deployments use specialized node pools for optimal resource allocation: + +| Node Pool | Purpose | Node Spec | Count | Total Resources | +|-----------|---------|-----------|-------|-----------------| +| **Control Plane** | Databases, state management | 8 vCPU, 32GB RAM, 500GB SSD | 3-5 | 24-40 vCPU, 96-160GB RAM | +| **Application** | Microservices, APIs | 16 vCPU, 32GB RAM, 200GB SSD | 5-10 | 80-160 vCPU, 160-320GB RAM | +| **Data Plane** | Analytics, storage, messaging | 16 vCPU, 64GB RAM, 1TB SSD | 3-5 | 48-80 vCPU, 192-320GB RAM | +| **Gateway** | API gateway, ingress | 8 vCPU, 16GB RAM, 100GB SSD | 2-3 | 16-24 vCPU, 32-48GB RAM | + + +--- + +### Persistent Storage Breakdown + +| Component | Size (Min) | Size (Recommended) | Performance | +|-----------|------------|-------------------|-------------| +| **Databases** (PostgreSQL) | 10 GiB | 100-200 GiB | 3,000-10,000 IOPS, <10ms latency | +| **Analytics** (ClickHouse) | 30 GiB | 200-500 GiB | 5,000-20,000 IOPS, <5ms latency | +| **Object Storage** (Models, Datasets) | 50 GiB | 500 GiB-1 TiB | 1,000-5,000 IOPS, <20ms latency | +| **Message Queue** (Kafka) | 20 GiB | 100-200 GiB | 2,000-10,000 IOPS, <10ms latency | +| **Application Data** | 50 GiB | 100-200 GiB | Standard SSD | +| **Backups** | - | 500 GiB-1 TiB | Standard/Archive | + +**Total Storage**: +- **Minimum**: 256 GiB +- **Recommended**: 2 TiB + +### Storage Type Requirements + +- **Premium SSD/NVMe**: Required for databases (PostgreSQL, ClickHouse) +- **Standard SSD**: Acceptable for application data, metrics +- **Network Storage**: Supported for shared volumes (NFS, Azure Files, EFS) + +--- + +## Network Requirements + +| Traffic Type | Minimum | Recommended | Notes | +|--------------|---------|-------------|-------| +| **Inter-Node** | 5 Gbps | 10 Gbps | Between cluster nodes | +| **Internet Ingress** | 1 Gbps | 5 Gbps | API traffic, model uploads | +| **Internet Egress** | 1 Gbps | 5 Gbps | Model downloads, webhooks | + +--- + +## High Availability Scenarios + +### Standard HA Configuration + +| Component | Configuration | Failover Time | +|-----------|--------------|---------------| +| **Kubernetes Masters** | 3 nodes (multi-zone) | <30 seconds | +| **PostgreSQL** | 1 master + 2 replicas | <1 minute | +| **ClickHouse** | 3-node cluster | <2 minutes | +| **Redis** | 3-node Sentinel | <10 seconds | +| **Microservices** | 3+ replicas | Immediate | +| **Gateway** | 3+ replicas | Immediate | + + +### Key HA Features + +- **Auto-Scaling**: HPA enabled for all stateless services (CPU/memory threshold: 75%) +- **Health Checks**: Liveness/readiness probes on all pods (5-second intervals) +- **Anti-Affinity**: Pods distributed across zones to prevent single point of failure +- **PodDisruptionBudget**: Minimum 50% pods available during updates +- **Backup Schedule**: Daily database backups, 30-day retention, WAL archiving + + + +## Required Software + +- **Kubernetes**: Version 1.29+ +- **Helm**: Version 3.10 or higher +- **Container Runtime**: containerd 1.6+ +- **kubectl**: Matching Kubernetes version +- **Operating System**: Ubuntu 22.04+, RHEL 8+, or OpenShift 4.12+