docs: add comprehensive hardware requirements documentation for CSPs #780
base: master
Changes from 1 commit
| Original file line number | Diff line number | Diff line change | ||||
|---|---|---|---|---|---|---|
| @@ -0,0 +1,129 @@ | ||||||
| # Hardware Requirements for Bud-Stack Platform | ||||||
|
|
||||||
| ## Executive Summary | ||||||
|
|
||||||
| Bud-Stack is a comprehensive multi-service platform for AI/ML model deployment and cluster management. This document provides infrastructure requirements for Cloud Service Providers (CSPs) and organizations planning to deploy the platform. | ||||||
|
|
||||||
| ### Platform Overview | ||||||
|
|
||||||
| The platform consists of: | ||||||
| - **14 Microservices** (Application, cluster management, ML optimization, model registry, etc.) | ||||||
| - **Core Infrastructure** (Databases, message queues, object storage, authentication) | ||||||
| - **Observability Stack** (Metrics, logging, distributed tracing) | ||||||
| - **High-Performance Gateway** (Rust-based API routing) | ||||||
|
|
||||||
| --- | ||||||
|
|
||||||
| ## Infrastructure Requirements Summary | ||||||
|
|
||||||
| ### Minimum Requirements (Development/Testing) | ||||||
|
|
||||||
| | Resource | Requirement | | ||||||
| |----------|-------------| | ||||||
| | **CPU Cores** | 32 cores | | ||||||
| | **Memory (RAM)** | 64 GiB | | ||||||
| | **Storage (SSD)** | 200 GiB | | ||||||
| | **Network Bandwidth** | 1 Gbps | | ||||||
| | **Operating System** | Linux (Ubuntu 22.04+, RHEL 8+, or OpenShift 4.12+) | | ||||||
| | **Kubernetes** | Version 1.29+ | | ||||||
|
|
||||||
| **Typical Configuration**: 3 nodes × (8 vCPU, 16GB RAM, 100GB SSD) | ||||||
|
|
||||||
| --- | ||||||
|
|
||||||
| ### Recommended Requirements (Staging/Small Production) | ||||||
|
|
||||||
| | Resource | Requirement | | ||||||
| |----------|-------------| | ||||||
| | **CPU Cores** | 60-80 cores | | ||||||
| | **Memory (RAM)** | 80-120 GiB | | ||||||
| | **Storage (SSD)** | 500-1,000 GiB | | ||||||
| | **Network Bandwidth** | 5-10 Gbps | | ||||||
| | **Operating System** | Linux (Ubuntu 20.04+, RHEL 8+, or OpenShift 4.12+) | | ||||||
|
||||||
| | **Operating System** | Linux (Ubuntu 20.04+, RHEL 8+, or OpenShift 4.12+) | | |
| | **Operating System** | Linux (Ubuntu 22.04+, RHEL 8+, or OpenShift 4.12+) | |
The Kubernetes version for 'Recommended Requirements' is listed as 1.25+, which is inconsistent with the 1.29+ requirement for 'Minimum' and 'Production' environments. To ensure consistency across the document, please update this to 1.29+.
| | **Kubernetes** | Version 1.25+ | | |
| | **Kubernetes** | Version 1.29+ | |
The upper bound of the 'Typical Configuration' (7 nodes) significantly exceeds the ranges specified in the 'Recommended Requirements' table. For instance, a 7-node cluster provides 112 vCPU (vs. 60-80 recommended) and 1.4 TB storage (vs. 500-1000 GiB recommended). This is confusing. Please revise the typical configuration to align better with the recommended ranges.
The resource requirements in this summary table are inconsistent with the totals derived from the 'Detailed Production Architecture' section below. For example, you specify 120-200 CPU cores here, but the detailed breakdown sums to 168-304 vCPU. Similar discrepancies exist for RAM and Storage. To avoid confusion for capacity planning, this summary table should be updated to accurately reflect the totals from the detailed breakdown (after correcting the calculation errors in that section).
The total storage calculation for production resources appears incorrect. Based on the 'Node Pool Breakdown' table, the storage range should be 5.7-9.8 TB, not 3-6 TB.
Calculation:
- Min: (3 * 500GB) + (5 * 200GB) + (3 * 1TB) + (2 * 100GB) = 5.7 TB
- Max: (5 * 500GB) + (10 * 200GB) + (5 * 1TB) + (3 * 100GB) = 9.8 TB
Please update this line to reflect the correct total.
| **Total Production Resources**: 168-304 vCPU, 480-848GB RAM, 3-6TB storage | |
| **Total Production Resources**: 168-304 vCPU, 480-848GB RAM, 5.7-9.8TB storage |
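A quick way to sanity-check the storage sums above is to recompute them from the node pool figures. This is only a minimal sketch: the node counts and per-node storage come from the calculation in this comment, while the pool names, the `NODE_POOLS` structure, and the `storage_range_tb` helper are hypothetical and not part of the PR. It assumes 1 TB = 1000 GB.

```python
# Node counts and per-node storage are taken from the review comment above.
# Pool names and this data structure are illustrative assumptions.
# Each entry: (min_nodes, max_nodes, storage_per_node_gb)
NODE_POOLS = {
    "control_plane": (3, 5, 500),
    "application": (5, 10, 200),
    "data": (3, 5, 1000),
    "gateway": (2, 3, 100),
}


def storage_range_tb(pools):
    """Return (min_tb, max_tb) total storage, assuming 1 TB = 1000 GB."""
    min_gb = sum(lo * gb for lo, _, gb in pools.values())
    max_gb = sum(hi * gb for _, hi, gb in pools.values())
    return min_gb / 1000, max_gb / 1000


low, high = storage_range_tb(NODE_POOLS)
print(f"{low}-{high} TB")  # 5.7-9.8 TB
```

Keeping the per-pool figures in one structure like this would also make it easier to regenerate the summary table whenever the node pool breakdown changes.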
Fix production storage totals to match node pool sums
The totals listed after the node pool table claim 3-6TB of storage, but summing the node specs just above gives roughly 5.7-9.8TB (e.g., 3–5 × 500GB control plane + 5–10 × 200GB application + 3–5 × 1TB data + 2–3 × 100GB gateway). The incorrect figure would understate disk requirements by roughly 39% to 47%, which can lead infrastructure planners to severely under-provision storage for production deployments.
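The size of the understatement can be checked with a few lines of arithmetic. This is a rough sketch, not part of the PR: the `understatement_pct` helper is a hypothetical name, and the figures are the 3-6TB claim and the 5.7-9.8TB sums quoted above, expressed in GB with 1 TB = 1000 GB.

```python
def understatement_pct(claimed_gb, actual_gb):
    """Percentage by which the claimed figure falls short of the actual one."""
    return (actual_gb - claimed_gb) / actual_gb * 100


# Low end of the range: 3 TB claimed vs. 5.7 TB summed from the node pools.
low_end = understatement_pct(3000, 5700)
# High end of the range: 6 TB claimed vs. 9.8 TB summed from the node pools.
high_end = understatement_pct(6000, 9800)

print(f"low end:  {low_end:.1f}%")   # low end:  47.4%
print(f"high end: {high_end:.1f}%")  # high end: 38.8%
```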
The 'Typical Configuration' does not meet the minimum requirements listed in the table above.
- 3 nodes × 8 vCPU = 24 vCPU, which is less than the required 32 cores.
- 3 nodes × 16GB RAM = 48GB RAM, which is less than the required 64 GiB.
This is misleading for users setting up a development environment. Please adjust the typical configuration to meet or exceed the minimums. For example, you could use 4 nodes.
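The same check can be scripted so that future edits to the table or the typical configuration do not drift apart again. A minimal sketch, under stated assumptions: the dictionary keys and the `shortfalls` and `nodes_needed` helpers are hypothetical names, the figures come from the 'Minimum Requirements' table and the 'Typical Configuration' line, and GB and GiB are treated as interchangeable for the comparison (strictly, 64 GiB is slightly more than 64 GB).

```python
import math

# Figures from the 'Minimum Requirements' table (GiB treated as GB here).
MINIMUM = {"vcpu": 32, "ram_gb": 64, "ssd_gb": 200}
# Per-node figures from the 'Typical Configuration' line.
NODE = {"vcpu": 8, "ram_gb": 16, "ssd_gb": 100}


def shortfalls(node_count, node=NODE, minimum=MINIMUM):
    """Return {resource: (cluster_total, required)} for each unmet minimum."""
    totals = {key: node_count * value for key, value in node.items()}
    return {
        key: (totals[key], minimum[key])
        for key in minimum
        if totals[key] < minimum[key]
    }


def nodes_needed(node=NODE, minimum=MINIMUM):
    """Smallest node count that satisfies every minimum."""
    return max(math.ceil(minimum[key] / node[key]) for key in minimum)


print(shortfalls(3))   # {'vcpu': (24, 32), 'ram_gb': (48, 64)}
print(nodes_needed())  # 4
```

With these per-node figures, 4 nodes meet the CPU and RAM minimums exactly, so a slightly larger node size may be preferable if any headroom is wanted.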