---
title: "Cluster Module (Bud Admin)"
description: "Managing CPU and GPU clusters from the Bud Admin console"
---

## Description

The Bud Admin cluster module gives platform, MLOps, and DevOps teams a single control plane to register, govern, and operate CPU and GPU clusters. It is designed for hybrid and multi-cloud footprints where GenAI workloads span inference APIs, training jobs, evaluations, and interactive playground traffic. The module pairs operational controls (quotas, autoscaling, scheduling) with governance (RBAC, audit trails) so that teams can move fast without risking runaway spend or compliance gaps.

Bud’s cluster experience mirrors the rest of the admin console: declarative defaults, safe self-service, and deep observability. GPU-first organizations can maximize utilization with pool-aware scheduling, while CPU clusters can handle supporting services, control-plane workloads, and cost-efficient inference.


## USPs (Unique Selling Propositions)

### 1. Unified control plane for CPU and GPU fleets

Operate heterogeneous clusters (CPU-only, GPU-only, mixed) from one console, with consistent policies for quotas, networking, security, and routing.

### 2. Enterprise governance baked in

Cluster actions respect Bud RBAC, project scoping, and audit logging. Every create, edit, and delete is tracked; permissions align with infra-admin roles and project boundaries.

### 3. Purpose-built for GenAI traffic

GPU-aware scheduling, pool-based allocations, and model/route affinity keep interactive agents and batch training predictable. Autoscaling and queueing policies are tuned for latency-sensitive inference and bursty workloads.

### 4. Multi-cloud and on-prem friendly

Register Kubernetes clusters from public clouds or on-premises; attach custom runtimes, registries, and CNI settings without rewriting your topology.

### 5. Safety rails for cost and reliability

Quotas, budget guards, health gates, and preflight checks reduce misconfiguration. Templates accelerate secure-by-default setups for production, staging, and sandbox environments.


## Features

### 3.1 Cluster registration

- Guided registration for CPU, GPU, or mixed clusters with configurable networking, logging, and storage.
- Support for cloud-managed and self-managed Kubernetes distributions.

### 3.2 Node pools & GPU-aware scheduling

- Define node pools by instance type, GPU SKU, and availability zone.
- Enable bin-packing and topology hints to maximize GPU occupancy.
- Reserve pools for model-serving, batch training, or control-plane services.
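
For instance, a minimal sketch with the official Kubernetes Python client shows how a serving pod can be pinned to a reserved GPU pool. The pool label `workload=model-serving`, the taint key `gpu-pool`, and the image name are illustrative assumptions, not Bud defaults.

```python
# Sketch: pin a model-serving pod to a reserved GPU pool (illustrative labels/taints).
from kubernetes import client, config

config.load_kube_config()  # or config.load_incluster_config() inside the cluster

pod = client.V1Pod(
    metadata=client.V1ObjectMeta(name="llm-serving-example"),
    spec=client.V1PodSpec(
        containers=[
            client.V1Container(
                name="server",
                image="example.registry/llm-server:latest",  # placeholder image
                resources=client.V1ResourceRequirements(
                    limits={"nvidia.com/gpu": "1"}  # request one GPU
                ),
            )
        ],
        # Schedule only onto nodes labelled for the model-serving GPU pool.
        node_selector={"workload": "model-serving"},
        # Tolerate the taint that keeps other workloads off the reserved pool.
        tolerations=[
            client.V1Toleration(key="gpu-pool", operator="Exists", effect="NoSchedule")
        ],
    ),
)

client.CoreV1Api().create_namespaced_pod(namespace="default", body=pod)
```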

### 3.3 Autoscaling & quotas

- Horizontal and vertical autoscaling presets per pool.
- Budget and quota controls per project/team with soft and hard limits.
- Scale-to-zero for bursty agents; warm pools for low-latency inference.
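
The soft and hard limits map onto standard Kubernetes primitives. A minimal sketch, assuming one namespace per project, of the kind of hard quota the module can enforce (namespace name and values are illustrative, not Bud defaults):

```python
# Sketch: hard resource quota for a project namespace (illustrative values).
from kubernetes import client, config

config.load_kube_config()

quota = client.V1ResourceQuota(
    metadata=client.V1ObjectMeta(name="team-a-gpu-quota"),
    spec=client.V1ResourceQuotaSpec(
        hard={
            "requests.cpu": "64",
            "requests.memory": "256Gi",
            "requests.nvidia.com/gpu": "8",  # hard cap on GPUs for the project
        }
    ),
)

client.CoreV1Api().create_namespaced_resource_quota(namespace="team-a", body=quota)
```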

### 3.4 Networking, security, and compliance

- CNI and ingress configuration with support for private endpoints.
- Namespace/project isolation with network policies and pod security standards.
- Secrets management and image-signature enforcement for registries.
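
As a concrete illustration of namespace isolation, a default-deny ingress NetworkPolicy can be applied with the Kubernetes Python client; the namespace name is an example:

```python
# Sketch: default-deny ingress policy for a project namespace.
from kubernetes import client, config

config.load_kube_config()

policy = client.V1NetworkPolicy(
    metadata=client.V1ObjectMeta(name="default-deny-ingress"),
    spec=client.V1NetworkPolicySpec(
        pod_selector=client.V1LabelSelector(),  # empty selector = all pods in the namespace
        policy_types=["Ingress"],               # no ingress rules listed => deny all ingress
    ),
)

client.NetworkingV1Api().create_namespaced_network_policy(namespace="team-a", body=policy)
```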

### 3.5 Observability & diagnostics

- Live health status (nodes, GPU readiness, control-plane components).
- Metrics and logs tabs with time-window filters and saved views.
- Event timeline for deployments, reschedules, failures, and admin actions.
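
Equivalent signals can also be pulled directly from the cluster. A small sketch that reports per-node readiness and allocatable GPUs with the Kubernetes Python client:

```python
# Sketch: report node readiness and allocatable GPUs.
from kubernetes import client, config

config.load_kube_config()

for node in client.CoreV1Api().list_node().items:
    ready = next(
        (c.status for c in node.status.conditions if c.type == "Ready"), "Unknown"
    )
    gpus = node.status.allocatable.get("nvidia.com/gpu", "0")
    print(f"{node.metadata.name}: Ready={ready}, allocatable GPUs={gpus}")
```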

### 3.6 Integrations & runtime controls

- Connect to model registries and OCI registries for runtime images.
- Attach storage classes for datasets, checkpoints, and artifacts.
- Webhooks for incident management, cost alerts, and guardrail violations.
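
Webhook payloads depend on your configuration; the sketch below assumes an illustrative JSON body with `event`, `cluster`, and `message` fields and simply logs incoming alerts:

```python
# Sketch: minimal webhook receiver for cost/guardrail alerts (payload fields are assumed).
from http.server import BaseHTTPRequestHandler, HTTPServer
import json

class AlertHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        body = self.rfile.read(int(self.headers.get("Content-Length", 0)))
        event = json.loads(body or b"{}")
        # Forward to incident management, chat, etc. Here we just log it.
        print(f"alert: {event.get('event')} on {event.get('cluster')}: {event.get('message')}")
        self.send_response(204)
        self.end_headers()

if __name__ == "__main__":
    HTTPServer(("0.0.0.0", 8080), AlertHandler).serve_forever()
```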


## How-to Guides

### 4.1 Accessing the cluster module

1. Log in to your Bud AI Foundry dashboard using SSO or your credentials.
2. Click **Clusters** in the side menu.
3. Click **+Cluster**.

### 4.2 Add a new cluster

1. Click **+Cluster**.
2. Choose **Create New Cluster**.
3. Choose a cloud provider.
4. Select cloud credentials and click **Next**.
5. The cluster is added and displayed on the listing page.

### 4.3 Add an existing cluster

1. Click **+Cluster**.
2. Choose **Connect to Existing Cluster**.
3. Provide the cluster name and ingress URL, and upload the configuration file.
4. Click **Next**.
5. The cluster is added and displayed on the listing page.
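
Before uploading the configuration file, a quick local preflight check helps confirm the kubeconfig actually reaches the cluster. A minimal sketch with the Kubernetes Python client (the file path is an example):

```python
# Sketch: verify a kubeconfig locally before uploading it in the console.
from kubernetes import client, config

config.load_kube_config(config_file="./existing-cluster.kubeconfig")
nodes = client.CoreV1Api().list_node().items
print(f"kubeconfig OK, cluster reports {len(nodes)} node(s)")
```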

### 4.4 Edit a cluster

1. Open the cluster detail page from the listing.
2. Click the edit icon and update the details (name, ingress URL).
3. Save changes to refresh the entry and any downstream references.

### 4.5 Delete a cluster

1. Open the cluster detail page.
2. Choose **Delete** from the Actions menu.
3. Confirm removal to detach the cluster.
4. Ensure dependent models or applications are redirected before finalizing deletion.
5. Bud decommissions workloads, drains nodes, and revokes credentials before final removal. Audit logs record the deletion.

### 4.6 General tab

1. Open the cluster detail page from the listing.
2. Review summary data including health, version, owners, tags, and environment.
3. Check recent events for configuration or deployment changes.
4. Copy identifiers such as cluster ID or kubeconfig context for support requests.

### 4.7 Deployments tab

1. Open the cluster detail page from the listing.
2. Go to **Deployments**.
3. View active workloads with status, routing, and rollout versions.
4. Click a deployment to see pods, rollout progress, and failure reasons.
5. Trigger a restart, pause, or rollback if a deployment is unhealthy.

### 4.8 Nodes tab

1. Open the cluster detail page from the listing.
2. Navigate to **Nodes** to see pools, capacity, and GPU SKUs.
3. Select a node pool to view desired, minimum, and maximum nodes.
4. Adjust autoscaling parameters or cordon/uncordon nodes before maintenance.
5. Drill into a node for labels, taints, and recent health signals.
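
The console's cordon action corresponds to marking the node unschedulable. A minimal sketch of the same operation with the Kubernetes Python client (the node name is an example):

```python
# Sketch: cordon a node before maintenance by marking it unschedulable.
from kubernetes import client, config

config.load_kube_config()

client.CoreV1Api().patch_node(
    name="gpu-pool-node-1",
    body={"spec": {"unschedulable": True}},  # set to False to uncordon
)
```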

### 4.9 Analytics tab

1. Open the cluster detail page from the listing.
2. Go to **Analytics**.
3. Choose a time window to review latency, throughput, and utilization metrics.
4. Filter charts by project, pool, or workload type to isolate anomalies.
5. Export dashboards or save views for recurring reviews.

### 4.10 Settings tab

1. Open the cluster detail page from the listing.
2. Navigate to **Settings**.
3. Update networking (ingress, endpoints), security (registries, secrets), and compliance gates.
4. Set quotas or budget alerts for projects mapped to the cluster.
5. Save changes and verify that health checks pass after updates.

### 4.11 Modify permissions for clusters

1. Open the user management page and navigate to the user’s detail view.
2. Assign view access for users who should only view the cluster listing.
3. Grant manage permissions to users who can add clusters and perform edits or deletions.
4. Save updates to enforce access across the catalog and all cluster actions.

## FAQ

### Q1. Which clusters are supported?

CPU, GPU, and mixed Kubernetes clusters from public clouds or on-premises are supported. GPU scheduling honors pool labels and SKUs so latency-sensitive routes stay predictable.


### Q2. Who can create or edit clusters?

Users with infra or platform admin permissions (per RBAC) can create/edit/delete clusters. Changes are scoped to their allowed projects and are fully audited.


### Q3. How does Bud prevent runaway GPU spend?

Quotas and budgets cap CPU/GPU/memory and cost per project; autoscaling policies can enforce scale-to-zero, warm pools, and max nodes per pool. Alerts fire when thresholds are crossed.


### Q4. Can I pin certain models or routes to GPU pools?

Yes. Label pools (e.g., `gpu=hopper`, `workload=model-serving`) and set affinity/taints in your model or route configuration. The scheduler honors these hints.


### Q5. What observability is available?

The detail page surfaces health, metrics, logs, and events. You can stream to external sinks, export diagnostics bundles, and set alert destinations for incidents or budget breaches.


### Q6. How are deletes handled safely?

Deletes require confirmation, drain workloads, revoke credentials, and capture an audit log entry. Dependent projects and routes are surfaced before final removal.


### Q7. Can we operate across multiple clouds?

Yes. Register clusters from different clouds or on-premises. Policies, quotas, and security templates remain consistent, and pools can be tagged by region/zone for routing and failover.


### Q8. How do GPU-first orgs benefit?

GPU-aware scheduling, pool-level bin-packing, and warm pools keep inference latency low while maximizing occupancy. Budget controls and alerts keep expensive SKUs in check.


### Q9. Does the module support compliance needs?

Yes. Pod security standards, network policies, signed images, secrets management, and full audit trails help align with enterprise security and regulatory requirements.


### Q10. What if networking settings change after creation?

Use the Networking tab to update ingress/TLS and IP policies. The module applies rolling updates when possible and surfaces disruptions (e.g., endpoint changes) before applying.
