Commit 19fb908

Add Team Documentation Plus the 5_16_25 release (#57)
* Update links in GETTING_STARTED_README.md and variables.tf to reflect the latest release version. Add new JSON blueprints for team creation and job submission, along with a comprehensive README for the Teams feature in OCI AI Blueprints.
* Add new release details for 2025-05-16 in QuickStartVersions.md, including Terraform, OCI AI Blueprints, Helm Chart, and Container versions.
1 parent 0a58fb6 commit 19fb908

6 files changed: +312 -3 lines changed

GETTING_STARTED_README.md

Lines changed: 2 additions & 2 deletions
@@ -24,7 +24,7 @@ This guide helps you install and use **OCI AI Blueprints** for the first time. Y

 Instead of creating an OKE cluster manually, you can deploy a **VCN + OKE cluster** in one click. Use the button below to open Oracle Cloud’s Resource Manager:

-[![Deploy to Oracle Cloud](https://oci-resourcemanager-plugin.plugins.oci.oraclecloud.com/latest/deploy-to-oracle-cloud.svg)](https://cloud.oracle.com/resourcemanager/stacks/create?region=home&zipUrl=https://github.com/oracle-quickstart/oci-ai-blueprints/releases/download/release-2025-04-22/cluster_release-2025-04-22.zip)
+[![Deploy to Oracle Cloud](https://oci-resourcemanager-plugin.plugins.oci.oraclecloud.com/latest/deploy-to-oracle-cloud.svg)](https://cloud.oracle.com/resourcemanager/stacks/create?region=home&zipUrl=https://github.com/oracle-quickstart/oci-ai-blueprints/releases/download/release-2025-05-16/release_2025_05_16_cluster.zip)

 1. Click **Deploy to Oracle Cloud** above.
 2. In **Create Stack**:
@@ -42,7 +42,7 @@ Now that your cluster is ready, follow these steps to install OCI AI Blueprints

 1. Click the **Deploy to Oracle Cloud** button below to open another Resource Manager stack—this one for OCI AI Blueprints:

-[![Deploy to Oracle Cloud](https://oci-resourcemanager-plugin.plugins.oci.oraclecloud.com/latest/deploy-to-oracle-cloud.svg)](https://cloud.oracle.com/resourcemanager/stacks/create?region=home&zipUrl=https://github.com/oracle-quickstart/oci-ai-blueprints/releases/download/release-2025-04-22/app_release-2025-04-22.zip)
+[![Deploy to Oracle Cloud](https://oci-resourcemanager-plugin.plugins.oci.oraclecloud.com/latest/deploy-to-oracle-cloud.svg)](https://cloud.oracle.com/resourcemanager/stacks/create?region=home&zipUrl=https://github.com/oracle-quickstart/oci-ai-blueprints/releases/download/release-2025-05-16/release_2025_05_16_app.zip)

 2. In the **Create Stack** wizard:
    - Provide a **name** (e.g., _oci-ai-blueprints-stack_).

cluster_creation_terraform/variables.tf

Lines changed: 1 addition & 1 deletion
@@ -2,7 +2,7 @@
 # Licensed under the Universal Permissive License v 1.0 as shown at http://oss.oracle.com/licenses/upl.
 #
 variable "oci_ai_blueprints_link_variable" {
-  default = "https://cloud.oracle.com/resourcemanager/stacks/create?region=home&zipUrl=https://github.com/oracle-quickstart/oci-ai-blueprints/releases/download/release-2025-04-22/app_release-2025-04-22.zip"
+  default = "https://cloud.oracle.com/resourcemanager/stacks/create?region=home&zipUrl=https://github.com/oracle-quickstart/oci-ai-blueprints/releases/download/release-2025-05-16/release_2025_05_16_app.zip"
 }

 # OKE Variables
Lines changed: 171 additions & 0 deletions
@@ -0,0 +1,171 @@
# Teams

The **Teams** feature in OCI AI Blueprints lets admins enforce resource quotas and fair sharing between teams. These rules decide when and where a job (batch, HPC, or AI/ML workload) waits or runs within the cluster.

Each bucket (a _team_) has hard _nominal quotas_, soft _borrowing_ / _lending_ limits, an optional _priority threshold_, and a friendly name you reference in any job blueprint.
Behind the scenes, the blueprint engine uses Kueue and wires up a `ClusterQueue`, `LocalQueue`, and a `Cohort` so workloads from different teams share idle capacity fairly while respecting their quotas. ([Kueue Docs](https://kueue.sigs.k8s.io/docs/overview/))

**Note: Make sure that your OCI AI Blueprints instance has been updated to the 5/16/25 release or later so that the Kueue operator is installed.**

---

## What is a “Team”?

Including `recipe_mode: team` and the `team` object in a blueprint creates a new team.
Submitting one:

1. Creates a **ClusterQueue** that owns the quotas for your team: CPU, memory, and GPU counts per shape. ([Kueue Docs](https://kueue.sigs.k8s.io/docs/overview/))
2. Creates a namespaced **LocalQueue** so jobs in that namespace can enqueue against the ClusterQueue. ([Kueue Docs](https://kueue.sigs.k8s.io/docs/overview/))
3. Joins that ClusterQueue to a **Cohort** so it can _borrow_ unused quota from sibling queues and _lend_ when idle. All teams share the same cohort across the entire blueprint engine. ([Kueue Docs](https://kueue.sigs.k8s.io/docs/overview/))
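
As a rough illustration of the objects listed above, the sketch below builds Kueue-style manifests as plain Python dictionaries. The `<team>-cluster-queue`, `<team>-local-queue`, and `<team>-namespace` names follow the "research" example in this document. The cohort name, the flavor name, and the helper `team_manifests` are hypothetical, and the manifests the blueprint engine actually generates may differ.

```python
import json


def team_manifests(team, gpus, borrow, lend,
                   cohort="blueprints-cohort",   # hypothetical cohort name
                   flavor="default-flavor"):     # hypothetical ResourceFlavor name
    """Return illustrative ClusterQueue and LocalQueue manifests for one team."""
    cluster_queue = {
        "apiVersion": "kueue.x-k8s.io/v1beta1",
        "kind": "ClusterQueue",
        "metadata": {"name": f"{team}-cluster-queue"},
        "spec": {
            "cohort": cohort,  # all teams share one cohort
            "namespaceSelector": {},
            "resourceGroups": [{
                "coveredResources": ["nvidia.com/gpu"],
                "flavors": [{
                    "name": flavor,
                    "resources": [{
                        "name": "nvidia.com/gpu",
                        "nominalQuota": gpus,
                        "borrowingLimit": borrow,
                        "lendingLimit": lend,
                    }],
                }],
            }],
        },
    }
    local_queue = {
        "apiVersion": "kueue.x-k8s.io/v1beta1",
        "kind": "LocalQueue",
        "metadata": {
            "name": f"{team}-local-queue",
            "namespace": f"{team}-namespace",
        },
        # The LocalQueue points at the team's ClusterQueue.
        "spec": {"clusterQueue": f"{team}-cluster-queue"},
    }
    return [cluster_queue, local_queue]


manifests = team_manifests("research", gpus=10, borrow=4, lend=4)
print(json.dumps(manifests, indent=2))
```

Only the GPU resource is shown; CPU and memory quotas would be additional entries in `coveredResources` and `resources`.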

---

## When should I use Teams?

- **Multi-tenant clusters** – isolate business units, research groups, or customers while still sharing idle GPU/CPU.
- **Fair-share batch environments** – let high-priority jobs pre-empt low-priority work within quota rules.
- **Capacity planning** – express org-level GPU budgets in code and track consumption per team.

---

## Core Concepts

**Team**

`ClusterQueue` + `LocalQueue` + `Namespace`

A **Team** is a logical grouping backed by a Kueue `ClusterQueue` (defining its `nominalQuota`, `lendingLimit`, and `borrowingLimit`) plus a corresponding `LocalQueue` in a dedicated Kubernetes `Namespace`. Together they guarantee each team’s reserved capacity and let it borrow or lend idle resources within the shared cluster.

- **Example:**
  If you create a team called **“research”** with a `nominalQuota` of 10 GPUs, a `borrowingLimit` of 4 GPUs, and a `lendingLimit` of 4 GPUs, OCI AI Blueprints will spin up a `ClusterQueue` named “research-cluster-queue” configured with those limits and a `LocalQueue` named “research-local-queue” in the “research-namespace” `Namespace`. Any job you submit in that namespace automatically enters the “research-local-queue” `LocalQueue`, giving it up to 10 GPUs guaranteed, the ability to borrow up to 4 GPUs when others are idle, and the willingness to lend up to 4 GPUs back to the cohort when it has idle capacity.

**Nominal Quota**
The `nominalQuota` is the guaranteed amount of resources reserved for a team that it can always use, independent of other teams’ activity.

- **Example:**
  If **Team A** has a `nominalQuota` of 10 GPUs, those 10 GPUs are always available to Team A before any borrowing or lending is considered.

**Borrowing Limit**

The `borrowingLimit` is the maximum extra resources a team may temporarily use beyond its nominal quota when there’s idle capacity in the cluster.

- **Example:**
  If **Team A** has a `nominalQuota` of 10 GPUs and a `borrowingLimit` of 4 GPUs, it can consume up to 14 GPUs whenever other teams aren’t using theirs, but no more.

**Lending Limit**

The `lendingLimit` is the maximum idle resources a team is willing to offer into the shared pool for other teams to borrow.

- **Example:**
  If **Team A** has a `nominalQuota` of 10 GPUs but is only using 6, and its `lendingLimit` is 4 GPUs, then up to 4 of its unused GPUs become available for others to borrow.
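
The arithmetic behind the three quota examples can be sketched in a few lines of Python. The `GpuQuota` class and its method names are hypothetical and exist only to restate the Team A numbers; they are not part of OCI AI Blueprints or Kueue.

```python
from dataclasses import dataclass


@dataclass
class GpuQuota:
    nominal: int    # guaranteed GPUs (nominalQuota)
    borrowing: int  # extra GPUs the team may borrow (borrowingLimit)
    lending: int    # idle GPUs the team may lend (lendingLimit)

    def max_usable(self) -> int:
        """Upper bound on GPUs the team can hold at once."""
        return self.nominal + self.borrowing

    def lendable(self, in_use: int) -> int:
        """Idle GPUs offered to the cohort, capped by the lending limit."""
        idle = max(self.nominal - in_use, 0)
        return min(idle, self.lending)


team_a = GpuQuota(nominal=10, borrowing=4, lending=4)
print(team_a.max_usable())        # 14
print(team_a.lendable(in_use=6))  # 4
```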

**Priority Threshold**

The `priorityThreshold` set at the team level assigns a single priority value to all of that team’s workloads and determines which teams’ jobs may exceed their nominal quotas when extra resources are available.

- **Example:**
  If **Team A** has `priorityThreshold: 100` and **Team B** has `priorityThreshold: 50`, then when idle GPUs exist, Team A’s workloads (priority 100) will be allowed to borrow first; Team B’s workloads (priority 50) can borrow only if resources remain after Team A has taken theirs.
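
One way to read the example above is as a greedy pass over teams in descending priority. The function below is a toy model of that description under this assumption, not Kueue's admission algorithm; `share_idle_gpus` and its arguments are hypothetical names.

```python
def share_idle_gpus(idle, requests, priorities):
    """Grant borrow requests in descending priority until idle GPUs run out."""
    grants = {}
    for team in sorted(requests, key=lambda t: priorities[t], reverse=True):
        grants[team] = min(requests[team], idle)
        idle -= grants[team]
    return grants


grants = share_idle_gpus(
    idle=5,
    requests={"team-a": 4, "team-b": 4},
    priorities={"team-a": 100, "team-b": 50},
)
print(grants)  # {'team-a': 4, 'team-b': 1}
```

With 5 idle GPUs, Team A (priority 100) receives its full request and Team B (priority 50) receives only what is left.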

**Cohort**

A **Cohort** is the single, cluster-wide sharing group that all teams belong to, enabling them to borrow from and lend to one another according to their configured borrowing and lending limits.

- **Example:**
  If **Team A** and **Team B** are both in the cluster cohort, then when Team A has idle GPUs it can lend up to its lending limit, and Team B can borrow from that shared pool (up to its borrowing limit), and vice versa, so idle resources can be put to use elsewhere in the cluster.
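
The cohort behaviour described above can be approximated as a shared pool: each lender contributes its idle quota up to its lending limit, and a borrower takes at most its borrowing limit from that pool. This is a simplified sketch with a hypothetical helper `borrowable`; it ignores priorities and preemption.

```python
def borrowable(borrower_limit, lenders):
    """Return how many GPUs a team could borrow from the cohort.

    lenders is a list of (nominal, in_use, lending_limit) tuples
    describing the other teams in the cohort.
    """
    pool = sum(
        min(max(nominal - in_use, 0), lending_limit)
        for nominal, in_use, lending_limit in lenders
    )
    return min(borrower_limit, pool)


# Team A: 10 nominal, 6 in use, lends up to 4. Team B may borrow up to 4.
print(borrowable(4, [(10, 6, 4)]))  # 4
# Team A now uses 8, so only 2 GPUs are idle.
print(borrowable(4, [(10, 8, 4)]))  # 2
```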

---

## Team Blueprint Schema (`recipe_mode: "team"`)

```json
{
  "recipe_mode": "team",
  "deployment_name": "team_creation",
  "team": {
    "team_name": "random_team",
    "priority_threshold": 100,
    "quotas": [
      {
        "shape_name": "BM.GPU.H100.8",
        "cpu_nominal_quota": "10",
        "cpu_borrowing_limit": "4",
        "cpu_lending_limit": "4",
        "mem_nominal_quota": "10",
        "mem_borrowing_limit": "4",
        "mem_lending_limit": "4",
        "gpu_nominal_quota": "10",
        "gpu_borrowing_limit": "4",
        "gpu_lending_limit": "4"
      }
    ]
  }
}
```
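
Before submitting a team blueprint, it can help to sanity-check its shape locally. The sketch below is a hypothetical pre-flight check, not the validation the blueprint engine performs. Which fields are mandatory is an assumption, and it accepts only plain integer strings such as `"10"`, so quantities with unit suffixes would need a looser check.

```python
import json

RESOURCES = ("cpu", "mem", "gpu")
KINDS = ("nominal_quota", "borrowing_limit", "lending_limit")
QUOTA_FIELDS = [f"{res}_{kind}" for res in RESOURCES for kind in KINDS]


def validate_team_blueprint(bp):
    """Return a list of problems found; an empty list means the checks passed."""
    problems = []
    if bp.get("recipe_mode") != "team":
        problems.append('recipe_mode must be "team"')
    team = bp.get("team", {})
    if not team.get("team_name"):
        problems.append("team.team_name is required")
    for i, quota in enumerate(team.get("quotas", [])):
        if "shape_name" not in quota:
            problems.append(f"quotas[{i}] is missing shape_name")
        for field in QUOTA_FIELDS:
            value = quota.get(field)
            if not (isinstance(value, str) and value.isdigit()):
                problems.append(f"quotas[{i}].{field} should be a numeric string")
    return problems


# Rebuild the schema example above programmatically.
quota = {"shape_name": "BM.GPU.H100.8"}
for res in RESOURCES:
    quota[f"{res}_nominal_quota"] = "10"
    quota[f"{res}_borrowing_limit"] = "4"
    quota[f"{res}_lending_limit"] = "4"

blueprint = {
    "recipe_mode": "team",
    "deployment_name": "team_creation",
    "team": {"team_name": "random_team", "priority_threshold": 100, "quotas": [quota]},
}

problems = validate_team_blueprint(blueprint)
print(json.dumps(blueprint, indent=2))
print(problems)  # []
```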

### Parameter Reference

| Field | Description |
| --- | --- |
| `team_name` | Friendly name; used to derive the `ClusterQueue` and `LocalQueue` names. |
| `priority_threshold` | Minimum workload priority allowed to borrow quota when nominal is exhausted. |
| `quotas[]` | List of shapes and per-resource quotas. Values are strings but parsed as quantities (e.g., `"10"` → `10`). |
| `*_nominal_quota` | Guaranteed amount always available to this team. |
| `*_borrowing_limit` | Extra units the team may _borrow_ from cohort siblings when idle capacity exists. |
| `*_lending_limit` | Idle units this team is willing to _lend_ to others. |

---

## Using a Team in a Job Blueprint

Once a team exists, reference it from any job:

```jsonc
{
  "recipe_mode": "job",
  "deployment_name": "job_deployment",
  "recipe_node_shape": "VM.GPU.A10.2",
  "recipe_team_info": { "team_name": "random_team" },
  ...
}
```

The blueprint engine:

1. Adds the `kueue.x-k8s.io/queue-name` label so Kueue enqueues the Workload into the correct `LocalQueue`.
2. Leaves pod scheduling to the default kube-scheduler once the Workload is **admitted** by Kueue.
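
The labeling in step 1 can be pictured with a small Python helper that operates on a manifest dictionary. `add_queue_label` is a hypothetical function, and the `<team>-local-queue` naming is assumed from the "research" example earlier in this document; the engine's actual implementation is not shown in this commit.

```python
import copy


def add_queue_label(job_manifest, team_name):
    """Return a copy of the manifest carrying the Kueue queue-name label."""
    labeled = copy.deepcopy(job_manifest)
    labels = labeled.setdefault("metadata", {}).setdefault("labels", {})
    # Queue name pattern assumed from the "research-local-queue" example.
    labels["kueue.x-k8s.io/queue-name"] = f"{team_name}-local-queue"
    return labeled


job = {
    "apiVersion": "batch/v1",
    "kind": "Job",
    "metadata": {"name": "create-job-with-team"},
}
labeled_job = add_queue_label(job, "randomteam")
print(labeled_job["metadata"]["labels"])
```

The input manifest is copied rather than mutated, so the unlabeled original stays available for comparison.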

---

## FAQ

**Q: Can a team’s `borrowing_limit` exceed its `nominal_quota`?**
A: Yes. Kueue allows `borrowingLimit` to be any non-negative quantity; it simply caps _how much_ the queue may exceed its nominal quota when idle resources are available.

**Q: What happens if multiple jobs exceed their team’s quota at the same priority?**
A: They queue FIFO inside the `LocalQueue`. When capacity frees up, Kueue admits them in order while honoring priority and borrowing rules.

**Q: How do I delete a team?**
A: Undeploy the deployment that created the team. Make sure to undeploy all workloads for that team first.
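
The queueing answer above (priority first, then FIFO among equals) can be modeled as a sort on two keys. This is a toy illustration with hypothetical names; real admission in Kueue also depends on available quota, borrowing limits, and preemption.

```python
from dataclasses import dataclass


@dataclass
class PendingJob:
    name: str
    priority: int
    submitted_at: int  # e.g. a Unix timestamp


def admission_order(jobs):
    """Higher priority first; ties are broken by earliest submission (FIFO)."""
    return sorted(jobs, key=lambda j: (-j.priority, j.submitted_at))


jobs = [
    PendingJob("b", priority=50, submitted_at=1),
    PendingJob("a", priority=100, submitted_at=3),
    PendingJob("c", priority=100, submitted_at=2),
]
print([j.name for j in admission_order(jobs)])  # ['c', 'a', 'b']
```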

---
Lines changed: 34 additions & 0 deletions
@@ -0,0 +1,34 @@
{
  "recipe_id": "healthcheck",
  "recipe_mode": "job",
  "deployment_name": "create_job_with_team",
  "recipe_image_uri": "iad.ocir.io/iduyx1qnmway/corrino-devops-repository:healthcheck_v0.3",
  "recipe_node_shape": "VM.GPU.A10.2",
  "recipe_use_shared_node_pool": true,
  "recipe_team_info": {
    "team_name": "randomteam"
  },
  "output_object_storage": [
    {
      "bucket_name": "healthcheck2",
      "mount_location": "/healthcheck_results",
      "volume_size_in_gbs": 20
    }
  ],
  "recipe_container_command_args": [
    "--dtype",
    "float16",
    "--output_dir",
    "/healthcheck_results",
    "--expected_gpus",
    "A10:2,A100:0,H100:0"
  ],
  "recipe_replica_count": 1,
  "recipe_nvidia_gpu_count": 2,
  "recipe_node_pool_size": 1,
  "recipe_node_boot_volume_size_in_gbs": 200,
  "recipe_ephemeral_storage_size": 100,
  "recipe_shared_memory_volume_size_limit_in_mb": 1000,
  "recipe_container_cpu_count": 4,
  "recipe_container_memory_size": 20
}
Lines changed: 34 additions & 0 deletions
@@ -0,0 +1,34 @@
{
  "recipe_mode": "team",
  "deployment_name": "create_team",
  "team": {
    "team_name": "randomteam",
    "priority_threshold": 100,
    "quotas": [
      {
        "shape_name": "BM.GPU.H100.8",
        "cpu_nominal_quota": "10",
        "cpu_borrowing_limit": "4",
        "cpu_lending_limit": "4",
        "mem_nominal_quota": "10",
        "mem_borrowing_limit": "4",
        "mem_lending_limit": "4",
        "gpu_nominal_quota": "10",
        "gpu_borrowing_limit": "4",
        "gpu_lending_limit": "4"
      },
      {
        "shape_name": "VM.GPU.A10.2",
        "cpu_nominal_quota": "10",
        "cpu_borrowing_limit": "4",
        "cpu_lending_limit": "4",
        "mem_nominal_quota": "10",
        "mem_borrowing_limit": "4",
        "mem_lending_limit": "4",
        "gpu_nominal_quota": "10",
        "gpu_borrowing_limit": "4",
        "gpu_lending_limit": "4"
      }
    ]
  }
}

docs/software_versions/QuickStartVersions.md

Lines changed: 70 additions & 0 deletions
@@ -4,6 +4,76 @@ The following table describes software versions for tagged releases of this quic

 This will be replaced as soon as we start tagging. Wanted framework in place.

+<details>
+<summary><strong>release-2025-05-16</strong></summary>
+
+## Cluster Creation Terraform
+
+### Terraform / Provider Versions
+
+| Component Type | Component Name | Component Source | Component Version |
+| :------------: | :------------: | :------------------: | :---------------: |
+| Language | Terraform | hashicorp | >=1.5 |
+| Provider | oci | oracle/oci | >=5 |
+| Provider | kubernetes | hashicorp/kubernetes | >=2.27 |
+| Provider | helm | hashicorp/helm | >=2.12 |
+| Provider | tls | hashicorp/tls | >=4 |
+| Provider | local | hashicorp/local | >=2.5 |
+| Provider | random | hashicorp/random | >=3.6 |
+
+### Oracle Services
+
+| Service | Version |
+| :----------------------: | :-----: |
+| Oracle Kubernetes Engine | v1.31.1 |
+
+---
+
+---
+
+## OCI AI Blueprints Terraform
+
+### Terraform / Provider Versions
+
+| Component Type | Component Name | Component Source | Component Version |
+| :------------: | :------------: | :------------------: | :---------------: |
+| Language | Terraform | hashicorp | >=1.1 |
+| Provider | oci | oracle/oci | 4 <= version < 5 |
+| Provider | kubernetes | hashicorp/kubernetes | >=2 |
+| Provider | helm | hashicorp/helm | >=2 |
+| Provider | tls | hashicorp/tls | >=4 |
+| Provider | local | hashicorp/local | >=2 |
+| Provider | random | hashicorp/random | >=3 |
+
+### Helm Chart Versions
+
+| Chart Name | Version | Chart URL |
+| :-----------------: | :-----: | :------------------------------------------------: |
+| Grafana | 6.47.1 | https://grafana.github.io/helm-charts |
+| Prometheus | 19.0.1 | https://prometheus-community.github.io/helm-charts |
+| Metrics Server | 3.8.3 | https://kubernetes-sigs.github.io/metrics-server |
+| Ingress Nginx | 4.4.0 | https://kubernetes.github.io/ingress-nginx |
+| MLFlow | 0.16.5 | https://community-charts.github.io/helm-charts |
+| NVIDIA GPU Operator | v25.3.0 | https://helm.ngc.nvidia.com/nvidia |
+| Keda | 2.17.0 | https://kedacore.github.io/charts |
+| LeaderWorkerSet | 0.1.0 | local |
+| Kueue | 0.11.4 | oci://registry.k8s.io/kueue/charts |
+
+### Container Versions
+
+| Container | Version | Repository |
+| :----------------------- | :-----: | :------------------------------------------------: |
+| oci-corrino-cp | latest | iad.ocir.io/iduyx1qnmway/corrino-devops-repository |
+| oci-ai-blueprints-portal | latest | iad.ocir.io/iduyx1qnmway/corrino-devops-repository |
+
+### Oracle Services
+
+| Service | Version |
+| :------------------------: | :-----: |
+| Oracle Autonomous Database | 19c |
+
+</details>
+
 <details>
 <summary><strong>release-2025-04-22</strong></summary>
