129 changes: 129 additions & 0 deletions docs/HARDWARE_REQUIREMENTS.md
@@ -0,0 +1,129 @@
# Hardware Requirements for Bud-Stack Platform

## Executive Summary

Bud-Stack is a comprehensive multi-service platform for AI/ML model deployment and cluster management. This document provides infrastructure requirements for Cloud Service Providers (CSPs) and organizations planning to deploy the platform.

### Platform Overview

The platform consists of:
- **14 Microservices** (Application, cluster management, ML optimization, model registry, etc.)
- **Core Infrastructure** (Databases, message queues, object storage, authentication)
- **Observability Stack** (Metrics, logging, distributed tracing)
- **High-Performance Gateway** (Rust-based API routing)

---

## Infrastructure Requirements Summary

### Minimum Requirements (Development/Testing)

| Resource | Requirement |
|----------|-------------|
| **CPU Cores** | 32 cores |
| **Memory (RAM)** | 64 GiB |
| **Storage (SSD)** | 200 GiB |
| **Network Bandwidth** | 1 Gbps |
| **Operating System** | Linux (Ubuntu 22.04+, RHEL 8+, or OpenShift 4.12+) |
| **Kubernetes** | Version 1.29+ |

**Typical Configuration**: 3 nodes × (8 vCPU, 16GB RAM, 100GB SSD)
Contributor

high

The 'Typical Configuration' does not meet the minimum requirements listed in the table above.

  • CPU: 3 nodes × 8 vCPU = 24 vCPU, which is less than the required 32 cores.
  • RAM: 3 nodes × 16GB RAM = 48GB RAM, which is less than the required 64 GiB.

This is misleading for users setting up a development environment. Please adjust the typical configuration to meet or exceed the minimums. For example, you could use 4 nodes.

Suggested change:
- **Typical Configuration**: 3 nodes × (8 vCPU, 16GB RAM, 100GB SSD)
+ **Typical Configuration**: 4 nodes × (8 vCPU, 16GB RAM, 100GB SSD)
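
The arithmetic in this comment can be re-checked with a few lines of Python. This is only a sketch: the helper names `cluster_totals` and `shortfalls` are made up for illustration, and it treats the GB figures in the typical configuration as comparable to the GiB figures in the table, which is an assumption.

```python
# Hypothetical helpers for checking a uniform node layout against the
# minimum table; none of these names come from Bud-Stack itself.
MINIMUM = {"cpu": 32, "ram": 64, "storage": 200}  # cores, GiB, GiB (GB treated as GiB)


def cluster_totals(nodes, cpu, ram, storage):
    """Total resources for `nodes` identical nodes."""
    return {"cpu": nodes * cpu, "ram": nodes * ram, "storage": nodes * storage}


def shortfalls(totals, required):
    """Map each under-provisioned resource to the amount that is missing."""
    return {k: required[k] - totals[k] for k in required if totals[k] < required[k]}


three = cluster_totals(3, cpu=8, ram=16, storage=100)
four = cluster_totals(4, cpu=8, ram=16, storage=100)
print(shortfalls(three, MINIMUM))  # {'cpu': 8, 'ram': 16}
print(shortfalls(four, MINIMUM))   # {}
```

With three nodes the layout is 8 cores and 16 GB short of the table; with four nodes it meets the minimums exactly, with no headroom.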


---

### Recommended Requirements (Staging/Small Production)

| Resource | Requirement |
|----------|-------------|
| **CPU Cores** | 60-80 cores |
| **Memory (RAM)** | 80-120 GiB |
| **Storage (SSD)** | 500-1,000 GiB |
| **Network Bandwidth** | 5-10 Gbps |
| **Operating System** | Linux (Ubuntu 20.04+, RHEL 8+, or OpenShift 4.12+) |
Contributor

medium

The Ubuntu version for 'Recommended Requirements' is listed as 20.04+, which is older than the 22.04+ requirement for 'Minimum' and 'Production' environments. For consistency, please update this to 22.04+.

Suggested change:
- | **Operating System** | Linux (Ubuntu 20.04+, RHEL 8+, or OpenShift 4.12+) |
+ | **Operating System** | Linux (Ubuntu 22.04+, RHEL 8+, or OpenShift 4.12+) |

| **Kubernetes** | Version 1.25+ |
Contributor

medium

The Kubernetes version for 'Recommended Requirements' is listed as 1.25+, which is inconsistent with the 1.29+ requirement for 'Minimum' and 'Production' environments. To ensure consistency across the document, please update this to 1.29+.

Suggested change:
- | **Kubernetes** | Version 1.25+ |
+ | **Kubernetes** | Version 1.29+ |


**Typical Configuration**: 5-7 nodes × (16 vCPU, 32GB RAM, 200GB SSD)
Contributor

medium

The upper bound of the 'Typical Configuration' (7 nodes) significantly exceeds the ranges specified in the 'Recommended Requirements' table. For instance, a 7-node cluster provides 112 vCPU (vs. 60-80 recommended) and 1.4 TB storage (vs. 500-1000 GiB recommended). This is confusing. Please revise the typical configuration to align better with the recommended ranges.
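
To see which node counts stay inside the recommended ranges, a rough check along these lines may help. The function name `out_of_range` is hypothetical, and the sketch again assumes the GB figures in the typical configuration can be compared directly with the GiB ranges in the table.

```python
# Recommended ranges from the table under review: (lower, upper).
RECOMMENDED = {"cpu": (60, 80), "ram": (80, 120), "storage": (500, 1000)}
# Per-node spec from the typical configuration (vCPU, GB RAM, GB SSD).
NODE = {"cpu": 16, "ram": 32, "storage": 200}


def out_of_range(node_count):
    """Return {resource: (total, upper bound)} for totals above the range."""
    over = {}
    for name, per_node in NODE.items():
        total = node_count * per_node
        upper = RECOMMENDED[name][1]
        if total > upper:
            over[name] = (total, upper)
    return over


for n in (5, 7):
    print(n, out_of_range(n))
```

Under these assumptions, 7 nodes overshoots all three ranges (112 vCPU, 224 GB RAM, 1,400 GB storage), and even 5 nodes lands above the RAM range (160 vs. 120).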


**Use Case**: Staging environments, small production (<100 AI models, moderate traffic)

---

### Production Requirements (Large Scale)

| Resource | Requirement |
|----------|-------------|
| **CPU Cores** | 120-200 cores |
| **Memory (RAM)** | 250-500 GiB |
| **Storage (SSD)** | 2-5 TiB |
| **Network Bandwidth** | 10-40 Gbps |
| **Operating System** | Linux (Ubuntu 22.04+, RHEL 8+, or OpenShift 4.12+) |
| **Kubernetes** | Version 1.29+ |
Comment on lines 50 to 57
Contributor

critical

The resource requirements in this summary table are inconsistent with the totals derived from the 'Detailed Production Architecture' section below. For example, you specify 120-200 CPU cores here, but the detailed breakdown sums to 168-304 vCPU. Similar discrepancies exist for RAM and Storage. To avoid confusion for capacity planning, this summary table should be updated to accurately reflect the totals from the detailed breakdown (after correcting the calculation errors in that section).


**Typical Configuration**: 15-25 nodes with specialized node pools (see below)

**Use Case**: Production environments (>100 AI models, high traffic, mission-critical)

---

## Detailed Production Architecture

### Node Pool Breakdown

Production deployments use specialized node pools for optimal resource allocation:

| Node Pool | Purpose | Node Spec | Count | Total Resources |
|-----------|---------|-----------|-------|-----------------|
| **Control Plane** | Databases, state management | 8 vCPU, 32GB RAM, 500GB SSD | 3-5 | 24-40 vCPU, 96-160GB RAM |
| **Application** | Microservices, APIs | 16 vCPU, 32GB RAM, 200GB SSD | 5-10 | 80-160 vCPU, 160-320GB RAM |
| **Data Plane** | Analytics, storage, messaging | 16 vCPU, 64GB RAM, 1TB SSD | 3-5 | 48-80 vCPU, 192-320GB RAM |
| **Gateway** | API gateway, ingress | 8 vCPU, 16GB RAM, 100GB SSD | 2-3 | 16-24 vCPU, 32-48GB RAM |


**Total Production Resources**: 168-304 vCPU, 480-848GB RAM, 3-6TB storage
Contributor

high

The total storage calculation for production resources appears incorrect. Based on the 'Node Pool Breakdown' table, the storage range should be 5.7-9.8 TB, not 3-6 TB.

Calculation:

  • Min: (3 * 500GB) + (5 * 200GB) + (3 * 1TB) + (2 * 100GB) = 5.7 TB
  • Max: (5 * 500GB) + (10 * 200GB) + (5 * 1TB) + (3 * 100GB) = 9.8 TB

Please update this line to reflect the correct total.

Suggested change:
- **Total Production Resources**: 168-304 vCPU, 480-848GB RAM, 3-6TB storage
+ **Total Production Resources**: 168-304 vCPU, 480-848GB RAM, 5.7-9.8TB storage
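
The totals can be regenerated from the node pool table rather than summed by hand. A minimal sketch, assuming 1 TB = 1,000 GB and using made-up names (`POOLS`, `totals`) that are not part of the document:

```python
# Node pools as listed in the table:
# (vCPU, RAM GB, SSD GB, min nodes, max nodes). 1 TB is taken as 1000 GB.
POOLS = {
    "control_plane": (8, 32, 500, 3, 5),
    "application": (16, 32, 200, 5, 10),
    "data_plane": (16, 64, 1000, 3, 5),
    "gateway": (8, 16, 100, 2, 3),
}


def totals(use_max):
    """Sum nodes, vCPU, RAM and storage across pools at min or max counts."""
    nodes = cpu = ram = ssd = 0
    for vcpu, ram_gb, ssd_gb, lo, hi in POOLS.values():
        count = hi if use_max else lo
        nodes += count
        cpu += count * vcpu
        ram += count * ram_gb
        ssd += count * ssd_gb
    return {"nodes": nodes, "vcpu": cpu, "ram_gb": ram, "storage_tb": ssd / 1000}


print(totals(use_max=False))
print(totals(use_max=True))
```

This reproduces 168-304 vCPU and 480-848 GB RAM, and gives 5.7-9.8 TB for storage. The same sums put the node count at 13-23, which may be worth comparing against the 15-25 nodes quoted in the production summary.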

P1: Fix production storage totals to match node pool sums

The totals listed after the node pool table claim 3-6TB of storage, but summing the node specs just above gives roughly 5.7-9.8TB (e.g., 3–5 × 500GB control plane + 5–10 × 200GB application + 3–5 × 1TB data + 2–3 × 100GB gateway). The incorrect figure understates disk requirements by roughly 39-47% (3 vs. 5.7TB at the low end, 6 vs. 9.8TB at the high end), which can lead infrastructure planners to severely under-provision storage for production deployments.

---

## Storage Requirements

### Persistent Storage Breakdown

| Component | Size (Min) | Size (Recommended) | Performance |
|-----------|------------|-------------------|-------------|
| **Databases** (PostgreSQL) | 10 GiB | 100-200 GiB | 3,000-10,000 IOPS, <10ms latency |
| **Analytics** (ClickHouse) | 30 GiB | 200-500 GiB | 5,000-20,000 IOPS, <5ms latency |
| **Object Storage** (Models, Datasets) | 50 GiB | 500 GiB-1 TiB | 1,000-5,000 IOPS, <20ms latency |
| **Message Queue** (Kafka) | 20 GiB | 100-200 GiB | 2,000-10,000 IOPS, <10ms latency |
| **Application Data** | 50 GiB | 100-200 GiB | Standard SSD |
| **Backups** | - | 500 GiB-1 TiB | Standard/Archive |

**Total Storage**:
- **Minimum**: 256 GiB
Contributor

medium

The 'Minimum' total storage is listed as 256 GiB, but summing the 'Size (Min)' column in the 'Persistent Storage Breakdown' table gives 160 GiB (10+30+50+20+50). This discrepancy is confusing. Please either correct the total or clarify what other storage components are included in this figure.
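
The column sums are easy to reproduce. In this sketch, 1 TiB is taken as 1,024 GiB and the "-" in the Backups row is counted as 0 for the minimum; the `STORAGE` list is just a transcription of the table, not something defined in the document.

```python
# (component, minimum GiB, recommended low GiB, recommended high GiB)
# 1 TiB = 1024 GiB; Backups has no minimum in the table, so it counts as 0.
STORAGE = [
    ("PostgreSQL", 10, 100, 200),
    ("ClickHouse", 30, 200, 500),
    ("Object storage", 50, 500, 1024),
    ("Kafka", 20, 100, 200),
    ("Application data", 50, 100, 200),
    ("Backups", 0, 500, 1024),
]

minimum = sum(row[1] for row in STORAGE)
rec_low = sum(row[2] for row in STORAGE)
rec_high = sum(row[3] for row in STORAGE)

print(minimum)            # 160
print(256 - minimum)      # 96 GiB not accounted for by the table
print(rec_low, rec_high)  # 1500 3148
```

The minimum column sums to 160 GiB, leaving 96 GiB of the stated 256 GiB unexplained by the table.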

- **Recommended Small**: 1.5-2 TiB
- **Recommended Large**: 3-6 TiB

### Storage Type Requirements

- **Premium SSD/NVMe**: Required for databases (PostgreSQL, ClickHouse)
- **Standard SSD**: Acceptable for application data, metrics
- **Network Storage**: Supported for shared volumes (NFS, Azure Files, EFS)

---

## Network Requirements

| Traffic Type | Minimum | Recommended | Notes |
|--------------|---------|-------------|-------|
| **Inter-Node** | 1 Gbps | 5 Gbps | Between cluster nodes |
| **Internet Ingress** | 5 Gbps | 10 Gbps | API traffic, model uploads |
| **Internet Egress** | 5 Gbps | 10 Gbps | Model downloads, webhooks |

---

## Prerequisites

### Required Software

- **Kubernetes**: Version 1.29+
- **Helm**: Version 3.10 or higher
- **Container Runtime**: containerd 1.6+
- **kubectl**: Matching Kubernetes version
- **Operating System**: Ubuntu 22.04+, RHEL 8+, or OpenShift 4.12+
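
If these minimums end up being checked by a script, note that version strings need a numeric comparison, since `3.9` sorts after `3.10` as plain text. A minimal sketch, assuming the version strings have already been collected from each tool; `MINIMUMS`, `parse_version` and `meets_minimum` are illustrative names, not an existing Bud-Stack utility.

```python
import re

# Minimum versions from the prerequisites list above.
MINIMUMS = {"kubernetes": "1.29", "helm": "3.10", "containerd": "1.6"}


def parse_version(text):
    """Extract numeric components from strings such as 'v1.29.3' or '3.10'."""
    match = re.search(r"(\d+(?:\.\d+)*)", text)
    if match is None:
        raise ValueError(f"no version number in {text!r}")
    return tuple(int(part) for part in match.group(1).split("."))


def meets_minimum(component, reported):
    """True if the reported version is at or above the documented minimum."""
    return parse_version(reported) >= parse_version(MINIMUMS[component])


print(meets_minimum("kubernetes", "v1.29.3"))  # True
print(meets_minimum("helm", "v3.9.4"))         # False
print(meets_minimum("containerd", "1.7.13"))   # True
```

Comparing tuples of integers is what makes Helm 3.9 fail the 3.10 minimum here; a string comparison would let it through.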