Add Team Documentation Plus the 5_16_25 release (#57)

grantneumanoracle · web-flow · commit 19fb90810b35 · 2025-05-16T14:43:58.000-07:00
* Update links in GETTING_STARTED_README.md and variables.tf to reflect the latest release version. Add new JSON blueprints for team creation and job submission, along with a comprehensive README for the Teams feature in OCI AI Blueprints.

* Add new release details for 2025-05-16 in QuickStartVersions.md, including Terraform, OCI AI Blueprints, Helm Chart, and Container versions.
diff --git a/GETTING_STARTED_README.md b/GETTING_STARTED_README.md
@@ -24,7 +24,7 @@ This guide helps you install and use **OCI AI Blueprints** for the first time. Y
 
 Instead of creating an OKE cluster manually, you can deploy a **VCN + OKE cluster** in one click. Use the button below to open Oracle Cloud’s Resource Manager:
 
-[![Deploy to Oracle Cloud](https://oci-resourcemanager-plugin.plugins.oci.oraclecloud.com/latest/deploy-to-oracle-cloud.svg)](https://cloud.oracle.com/resourcemanager/stacks/create?region=home&zipUrl=https://github.com/oracle-quickstart/oci-ai-blueprints/releases/download/release-2025-04-22/cluster_release-2025-04-22.zip)
+[![Deploy to Oracle Cloud](https://oci-resourcemanager-plugin.plugins.oci.oraclecloud.com/latest/deploy-to-oracle-cloud.svg)](https://cloud.oracle.com/resourcemanager/stacks/create?region=home&zipUrl=https://github.com/oracle-quickstart/oci-ai-blueprints/releases/download/release-2025-05-16/release_2025_05_16_cluster.zip)
 
 1. Click **Deploy to Oracle Cloud** above.
 2. In **Create Stack**:
@@ -42,7 +42,7 @@ Now that your cluster is ready, follow these steps to install OCI AI Blueprints
 
 1. Click the **Deploy to Oracle Cloud** button below to open another Resource Manager stack—this one for OCI AI Blueprints:
 
-   [![Deploy to Oracle Cloud](https://oci-resourcemanager-plugin.plugins.oci.oraclecloud.com/latest/deploy-to-oracle-cloud.svg)](https://cloud.oracle.com/resourcemanager/stacks/create?region=home&zipUrl=https://github.com/oracle-quickstart/oci-ai-blueprints/releases/download/release-2025-04-22/app_release-2025-04-22.zip)
+   [![Deploy to Oracle Cloud](https://oci-resourcemanager-plugin.plugins.oci.oraclecloud.com/latest/deploy-to-oracle-cloud.svg)](https://cloud.oracle.com/resourcemanager/stacks/create?region=home&zipUrl=https://github.com/oracle-quickstart/oci-ai-blueprints/releases/download/release-2025-05-16/release_2025_05_16_app.zip)
 
 2. In the **Create Stack** wizard:
    - Provide a **name** (e.g., _oci-ai-blueprints-stack_).
diff --git a/cluster_creation_terraform/variables.tf b/cluster_creation_terraform/variables.tf
@@ -2,7 +2,7 @@
 # Licensed under the Universal Permissive License v 1.0 as shown at http://oss.oracle.com/licenses/upl.
 # 
 variable "oci_ai_blueprints_link_variable" {
-  default = "https://cloud.oracle.com/resourcemanager/stacks/create?region=home&zipUrl=https://github.com/oracle-quickstart/oci-ai-blueprints/releases/download/release-2025-04-22/app_release-2025-04-22.zip"
+  default = "https://cloud.oracle.com/resourcemanager/stacks/create?region=home&zipUrl=https://github.com/oracle-quickstart/oci-ai-blueprints/releases/download/release-2025-05-16/release_2025_05_16_app.zip"
 }
 
 # OKE Variables
diff --git a/docs/sample_blueprints/teams/README.md b/docs/sample_blueprints/teams/README.md
@@ -0,0 +1,171 @@
+# Teams
+
+The **Teams** feature in OCI AI Blueprints lets admins enforce resource quotas and fair sharing between teams to decide when and where a job (batch, HPC, and AI/ML workloads) should wait or run within the cluster.
+
+Each _team_ has hard _nominal quotas_, soft _borrowing_ / _lending_ limits, an optional _priority threshold_, and a friendly name you reference in any job blueprint.  
+Behind the scenes, the blueprint engine uses Kueue and wires up a `ClusterQueue`, `LocalQueue`, and a `Cohort` so workloads from different teams share idle capacity fairly while respecting their quotas. ([Kueue Docs](https://kueue.sigs.k8s.io/docs/overview/))
+
+**Note: Make sure that your OCI AI Blueprints instance has been updated to the 5/16/25 release or later so that the Kueue operator is installed.**
+
+---
+
+## What is a “Team”?
+
+Adding `recipe_mode: team` and the `team` object to a blueprint creates a new team.  
+Submitting such a blueprint:
+
+1.  Creates a **ClusterQueue** that owns the quotas for your team — CPU, memory, GPU counts per shape. ([Kueue Docs](https://kueue.sigs.k8s.io/docs/overview/))
+2.  Creates a namespaced **LocalQueue** so jobs in that namespace can enqueue against the ClusterQueue. ([Kueue Docs](https://kueue.sigs.k8s.io/docs/overview/))
+3.  Joins that ClusterQueue to a **Cohort** so it can _borrow_ unused quota from sibling queues and _lend_ when idle. All teams share the same cohort across the entire blueprint engine. ([Kueue Docs](https://kueue.sigs.k8s.io/docs/overview/))
+
+---
+
+## When should I use Teams?
+
+- **Multi-tenant clusters** – isolate business units, research groups, or customers while still sharing idle GPU/CPU.
+- **Fair-share batch environments** – let high-priority jobs pre-empt low-priority work within quota rules.
+- **Capacity planning** – express org-level GPU budgets in code and track consumption per team.
+
+---
+
+## Core Concepts
+
+**Team**
+
+`ClusterQueue` + `LocalQueue` + `Namespace`
+
+A **Team** is a logical grouping backed by a Kueue `ClusterQueue` (defining its `nominalQuota`, `lendingLimit`, and `borrowingLimit`) plus a corresponding `LocalQueue` in a dedicated Kubernetes `Namespace`, which together guarantee each team’s reserved capacity and enable it to borrow or lend idle resources within the shared cluster.
+
+- **Example:**  
+  If you create a team called **“research”** with a `nominalQuota` of 10 GPUs, a `borrowingLimit` of 4 GPUs, and a `lendingLimit` of 4 GPUs, OCI AI Blueprints will spin up a `ClusterQueue` named “research-cluster-queue” configured with those limits and a `LocalQueue` named “research-local-queue” in the “research-namespace” `Namespace`. Any job you submit in that namespace automatically enters the “research-local-queue” `LocalQueue`, giving it up to 10 GPUs guaranteed, the ability to borrow up to 4 GPUs when others are idle, and the willingness to lend up to 4 GPUs back to the cohort when it has idle capacity.
+
+**Nominal Quota**
+
+The `nominalQuota` is the guaranteed amount of resources reserved for a team that it can always use, independent of other teams’ activity.
+
+- **Example:**  
+  If **Team A** has a `nominalQuota` of 10 GPUs, those 10 GPUs are always exclusively available to Team A before any borrowing or lending is considered.
+
+**Borrowing Limit**
+
+The `borrowingLimit` is the maximum extra resources a team may temporarily use beyond its nominal quota when there’s idle capacity in the cluster.
+
+- **Example:**  
+  If **Team A** has a `nominalQuota` of 10 GPUs and a `borrowingLimit` of 4 GPUs, it can consume up to 14 GPUs whenever other teams aren’t using theirs, but no more.
+
+**Lending Limit**
+
+The `lendingLimit` is the maximum idle resources a team is willing to offer into the shared pool for other teams to borrow.
+
+- **Example:**  
+  If **Team A** has a `nominalQuota` of 10 GPUs but is only using 6, and its `lendingLimit` is 4 GPUs, then up to 4 of its unused GPUs become available for others to borrow.
+
+**Priority Threshold**
+
+The `priorityThreshold` set at the team level assigns a single priority value to all of that team’s workloads and determines which teams’ jobs may exceed their nominal quotas when extra resources are available.
+
+- **Example:**  
+  If **Team A** has `priorityThreshold: 100` and **Team B** has `priorityThreshold: 50`, then when idle GPUs exist, Team A’s workloads (priority 100) will be allowed to borrow first; Team B’s workloads (priority 50) can borrow only if resources remain after Team A has taken theirs.
+
+**Cohort**
+
+A **Cohort** is the single, cluster-wide sharing group that all teams belong to, enabling them to borrow from and lend to one another according to their configured borrowing and lending limits.
+
+- **Example:**  
+  If **Team A** and **Team B** are both in the cluster cohort, then when Team A has idle GPUs it can lend up to its lending limit, and Team B can borrow from that shared pool (up to its borrowing limit), and vice versa — ensuring resources never sit idle in the cluster.
+
+---
+
+## Team Blueprint Schema (`recipe_mode: "team"`)
+
+```json
+{
+  "recipe_mode": "team",
+  "deployment_name": "team_creation",
+  "team": {
+    "team_name": "random_team",
+    "priority_threshold": 100,
+    "quotas": [
+      {
+        "shape_name": "BM.GPU.H100.8",
+        "cpu_nominal_quota": "10",
+        "cpu_borrowing_limit": "4",
+        "cpu_lending_limit": "4",
+        "mem_nominal_quota": "10",
+        "mem_borrowing_limit": "4",
+        "mem_lending_limit": "4",
+        "gpu_nominal_quota": "10",
+        "gpu_borrowing_limit": "4",
+        "gpu_lending_limit": "4"
+      }
+    ]
+  }
+}
+```
+
+### Parameter Reference
+
+| Field                | Description                                                                                               |
+| :------------------- | :-------------------------------------------------------------------------------------------------------- |
+| `team_name`          | Friendly name; becomes the `ClusterQueue` and `LocalQueue` name.                                          |
+| `priority_threshold` | Minimum workload priority allowed to borrow quota when nominal is exhausted.                              |
+| `quotas[]`           | List of shapes and per-resource quotas. Values are strings but parsed as quantities (e.g., `"10"` → `10`). |
+| `*_nominal_quota`    | Hard cap always available to this team.                                                                   |
+| `*_borrowing_limit`  | Extra units the team may _borrow_ from cohort siblings when idle capacity exists.                         |
+| `*_lending_limit`    | Idle units this team is willing to _lend_ to others.                                                      |
+
+---
+
+## Using a Team in a Job Blueprint
+
+Once a team exists, reference it from any job:
+
+```jsonc
+{
+  "recipe_mode": "job",
+  "deployment_name": "job_deployment",
+  "recipe_node_shape": "VM.GPU.A10.2",
+  "recipe_team_info": { "team_name": "random_team" },
+  ...
+}
+```
+
+The blueprint engine:
+
+1.  Adds the `kueue.x-k8s.io/queue-name` label so Kueue enqueues the Workload into the correct `LocalQueue`.
+2.  Leaves pod scheduling to the default kube-scheduler once the Workload is **admitted** by Kueue.
+
+---
+
+## FAQ
+
+**Q: Can a team’s `borrowing_limit` exceed its `nominal_quota`?**  
+A: Yes. Kueue allows `borrowingLimit` to be any non-negative quantity; it simply caps _how much_ the queue may exceed its nominal quota when idle resources are available.
+
+**Q: What happens if multiple jobs exceed their team’s quota at the same priority?**  
+A: They queue FIFO inside the `LocalQueue`. When capacity frees up, Kueue admits them in order while honoring priority and borrowing rules.
+
+**Q: How do I delete a team?**  
+A: Undeploy the deployment that created the team. Make sure to undeploy all workloads for that team first.
+
+---
diff --git a/docs/sample_blueprints/teams/create_job_with_team.json b/docs/sample_blueprints/teams/create_job_with_team.json
@@ -0,0 +1,34 @@
+{
+  "recipe_id": "healthcheck",
+  "recipe_mode": "job",
+  "deployment_name": "create_job_with_team",
+  "recipe_image_uri": "iad.ocir.io/iduyx1qnmway/corrino-devops-repository:healthcheck_v0.3",
+  "recipe_node_shape": "VM.GPU.A10.2",
+  "recipe_use_shared_node_pool": true,
+  "recipe_team_info": {
+    "team_name": "randomteam"
+  },
+  "output_object_storage": [
+    {
+      "bucket_name": "healthcheck2",
+      "mount_location": "/healthcheck_results",
+      "volume_size_in_gbs": 20
+    }
+  ],
+  "recipe_container_command_args": [
+    "--dtype",
+    "float16",
+    "--output_dir",
+    "/healthcheck_results",
+    "--expected_gpus",
+    "A10:2,A100:0,H100:0"
+  ],
+  "recipe_replica_count": 1,
+  "recipe_nvidia_gpu_count": 2,
+  "recipe_node_pool_size": 1,
+  "recipe_node_boot_volume_size_in_gbs": 200,
+  "recipe_ephemeral_storage_size": 100,
+  "recipe_shared_memory_volume_size_limit_in_mb": 1000,
+  "recipe_container_cpu_count": 4,
+  "recipe_container_memory_size": 20
+}
diff --git a/docs/sample_blueprints/teams/create_team.json b/docs/sample_blueprints/teams/create_team.json
@@ -0,0 +1,34 @@
+{
+  "recipe_mode": "team",
+  "deployment_name": "create_team",
+  "team": {
+    "team_name": "randomteam",
+    "priority_threshold": 100,
+    "quotas": [
+      {
+        "shape_name": "BM.GPU.H100.8",
+        "cpu_nominal_quota": "10",
+        "cpu_borrowing_limit": "4",
+        "cpu_lending_limit": "4",
+        "mem_nominal_quota": "10",
+        "mem_borrowing_limit": "4",
+        "mem_lending_limit": "4",
+        "gpu_nominal_quota": "10",
+        "gpu_borrowing_limit": "4",
+        "gpu_lending_limit": "4"
+      },
+      {
+        "shape_name": "VM.GPU.A10.2",
+        "cpu_nominal_quota": "10",
+        "cpu_borrowing_limit": "4",
+        "cpu_lending_limit": "4",
+        "mem_nominal_quota": "10",
+        "mem_borrowing_limit": "4",
+        "mem_lending_limit": "4",
+        "gpu_nominal_quota": "10",
+        "gpu_borrowing_limit": "4",
+        "gpu_lending_limit": "4"
+      }
+    ]
+  }
+}
diff --git a/docs/software_versions/QuickStartVersions.md b/docs/software_versions/QuickStartVersions.md
@@ -4,6 +4,76 @@ The following table describes software versions for tagged releases of this quic
 
 This will be replaced as soon as we start tagging. Wanted framework in place.
 
+<details>
+<summary><strong>release-2025-05-16</strong></summary>
+
+## Cluster Creation Terraform
+
+### Terraform / Provider Versions
+
+| Component Type | Component Name |   Component Source   | Component Version |
+| :------------: | :------------: | :------------------: | :---------------: |
+|    Language    |   Terraform    |      hashicorp       |       >=1.5       |
+|    Provider    |      oci       |      oracle/oci      |        >=5        |
+|    Provider    |   kubernetes   | hashicorp/kubernetes |      >=2.27       |
+|    Provider    |      helm      |    hashicorp/helm    |      >=2.12       |
+|    Provider    |      tls       |    hashicorp/tls     |        >=4        |
+|    Provider    |     local      |   hashicorp/local    |       >=2.5       |
+|    Provider    |     random     |   hashicorp/random   |       >=3.6       |
+
+### Oracle Services
+
+|         Service          | Version |
+| :----------------------: | :-----: |
+| Oracle Kubernetes Engine | v1.31.1 |
+
+---
+
+---
+
+## OCI AI Blueprints Terraform
+
+### Terraform / Provider Versions
+
+| Component Type | Component Name |   Component Source   | Component Version |
+| :------------: | :------------: | :------------------: | :---------------: |
+|    Language    |   Terraform    |      hashicorp       |       >=1.1       |
+|    Provider    |      oci       |      oracle/oci      | 4 <= version < 5  |
+|    Provider    |   kubernetes   | hashicorp/kubernetes |        >=2        |
+|    Provider    |      helm      |    hashicorp/helm    |        >=2        |
+|    Provider    |      tls       |    hashicorp/tls     |        >=4        |
+|    Provider    |     local      |   hashicorp/local    |        >=2        |
+|    Provider    |     random     |   hashicorp/random   |        >=3        |
+
+### Helm Chart Versions
+
+|     Chart Name      | Version |                     Chart URL                      |
+| :-----------------: | :-----: | :------------------------------------------------: |
+|       Grafana       | 6.47.1  |       https://grafana.github.io/helm-charts        |
+|     Prometheus      | 19.0.1  | https://prometheus-community.github.io/helm-charts |
+|   Metrics Server    |  3.8.3  |  https://kubernetes-sigs.github.io/metrics-server  |
+|    Ingress Nginx    |  4.4.0  |     https://kubernetes.github.io/ingress-nginx     |
+|       MLFlow        | 0.16.5  |   https://community-charts.github.io/helm-charts   |
+| NVIDIA GPU Operator | v25.3.0 |         https://helm.ngc.nvidia.com/nvidia         |
+|        Keda         | 2.17.0  |         https://kedacore.github.io/charts          |
+|   LeaderWorkerSet   |  0.1.0  |                       local                        |
+|        Kueue        | 0.11.4  |         oci://registry.k8s.io/kueue/charts         |
+
+### Container Versions
+
+| Container                | Version |                     Repository                     |
+| :----------------------- | :-----: | :------------------------------------------------: |
+| oci-corrino-cp           | latest  | iad.ocir.io/iduyx1qnmway/corrino-devops-repository |
+| oci-ai-blueprints-portal | latest  | iad.ocir.io/iduyx1qnmway/corrino-devops-repository |
+
+### Oracle Services
+
+|          Service           | Version |
+| :------------------------: | :-----: |
+| Oracle Autonomous Database |   19c   |
+
+</details>
+
 <details>
 <summary><strong>release-2025-04-22</strong></summary>
 
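
The quota arithmetic described in the Teams README added by this commit can be sketched in a few lines of Python. This is a minimal illustration only: `TeamQuota` and its methods are hypothetical helpers, not part of OCI AI Blueprints or Kueue, and the sketch ignores priorities, preemption, and per-shape resource flavors. It reproduces the README's Team A numbers (nominal 10, borrowing limit 4, lending limit 4).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TeamQuota:
    """Quota for one resource of one team (hypothetical helper for illustration)."""

    nominal: int          # guaranteed units, e.g. gpu_nominal_quota
    borrowing_limit: int  # extra units the team may borrow from the cohort
    lending_limit: int    # idle units the team is willing to lend

    def max_usable(self) -> int:
        """Upper bound on usage: nominal quota plus the borrowing limit."""
        return self.nominal + self.borrowing_limit

    def lendable(self, in_use: int) -> int:
        """Idle units offered to the cohort, capped by the lending limit."""
        idle = max(self.nominal - in_use, 0)
        return min(idle, self.lending_limit)


team_a = TeamQuota(nominal=10, borrowing_limit=4, lending_limit=4)

print(team_a.max_usable())         # 14: 10 guaranteed + 4 borrowed
print(team_a.lendable(in_use=6))   # 4: 4 idle units, all within the lending limit
print(team_a.lendable(in_use=9))   # 1: only 1 unit is idle
print(team_a.lendable(in_use=12))  # 0: the team is itself borrowing
```

The two caps are independent: the borrowing limit bounds how far a team can exceed its own nominal quota, while the lending limit bounds how much of its idle nominal quota other teams can see.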

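
A job blueprint refers to its team only by name, so a small pre-submission check can catch a misspelled `team_name` or a node shape the team has no quota entry for. The sketch below is an assumption-laden illustration: `find_mismatches` is a hypothetical helper, the blueprints are trimmed versions of `create_team.json` and `create_job_with_team.json`, and nothing here claims that the blueprint engine performs or requires this validation.

```python
import json

# Trimmed copies of the sample blueprints; most fields are left out for brevity.
team_blueprint = json.loads("""
{
  "recipe_mode": "team",
  "deployment_name": "create_team",
  "team": {
    "team_name": "randomteam",
    "priority_threshold": 100,
    "quotas": [
      {"shape_name": "BM.GPU.H100.8", "gpu_nominal_quota": "10"},
      {"shape_name": "VM.GPU.A10.2", "gpu_nominal_quota": "10"}
    ]
  }
}
""")

job_blueprint = json.loads("""
{
  "recipe_mode": "job",
  "deployment_name": "create_job_with_team",
  "recipe_node_shape": "VM.GPU.A10.2",
  "recipe_team_info": {"team_name": "randomteam"}
}
""")


def find_mismatches(team_bp: dict, job_bp: dict) -> list[str]:
    """Return human-readable problems; an empty list means the two blueprints agree."""
    problems = []
    team = team_bp["team"]

    # The job must reference the team by exactly the name used at creation time.
    job_team = job_bp.get("recipe_team_info", {}).get("team_name")
    if job_team != team["team_name"]:
        problems.append(
            f"job references team {job_team!r}, expected {team['team_name']!r}"
        )

    # The job's node shape should appear among the team's per-shape quotas.
    quota_shapes = {quota["shape_name"] for quota in team["quotas"]}
    shape = job_bp.get("recipe_node_shape")
    if shape not in quota_shapes:
        problems.append(f"team has no quota entry for shape {shape!r}")

    return problems


print(find_mismatches(team_blueprint, job_blueprint))  # []
```

Running the same check with a job that names `random_team` (the spelling used in the README schema example) would report one problem, since the sample JSON files use `randomteam`.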