| 1 | +# Teams |
| 2 | + |
| 3 | +The **Teams** feature in OCI AI Blueprints lets admins enforce resource quotas and fair sharing between teams, determining when and where a job (batch, HPC, or AI/ML workload) runs or waits within the cluster. |
| 4 | + |
| 5 | +Each bucket (a _team_) has hard _nominal quotas_, soft _borrowing_ / _lending_ limits, an optional _priority threshold_, and a friendly name you reference in any job blueprint. |
| 6 | +Behind the scenes, the blueprint engine uses Kueue and wires up a `ClusterQueue`, `LocalQueue`, and a `Cohort` so workloads from different teams share idle capacity fairly while respecting their quotas. ([Kueue Docs](https://kueue.sigs.k8s.io/docs/overview/)) |
| 7 | + |
| 8 | +**Note: Make sure that your OCI AI Blueprints instance has been updated on or after 5/16/25 so that the Kueue operator is installed.** |
| 9 | + |
| 10 | +--- |
| 11 | + |
| 12 | +## What is a “Team”? |
| 13 | + |
| 14 | +Including `recipe_mode: team` and the `team` object in a blueprint creates a new team. |
| 15 | +Submitting such a blueprint: |
| 16 | + |
| 17 | +1. Creates a **ClusterQueue** that owns the quotas for your team (CPU, memory, and GPU counts per shape). ([Kueue Docs](https://kueue.sigs.k8s.io/docs/overview/)) |
| 18 | +2. Creates a namespaced **LocalQueue** so jobs in that namespace can enqueue against the ClusterQueue. ([Kueue Docs](https://kueue.sigs.k8s.io/docs/overview/)) |
| 19 | +3. Joins that ClusterQueue to a **Cohort** so it can _borrow_ unused quota from sibling queues and _lend_ when idle. All teams share the same cohort across the entire blueprint engine. ([Kueue Docs](https://kueue.sigs.k8s.io/docs/overview/)) |
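
For orientation, the objects in the list above correspond roughly to Kueue resources like the ones sketched here. This is a hedged illustration, not the blueprint engine's actual output: the object names follow the "research" naming pattern used under Core Concepts, while the cohort name, the flavor name, the `nvidia.com/gpu` resource name, and the `kueue.x-k8s.io/v1beta1` API version are assumptions that may differ in your installation.

```jsonc
{
  "apiVersion": "v1",
  "kind": "List",
  "items": [
    {
      // Cluster-scoped queue that owns the team's quotas.
      "apiVersion": "kueue.x-k8s.io/v1beta1", // assumed API version
      "kind": "ClusterQueue",
      "metadata": { "name": "research-cluster-queue" },
      "spec": {
        "cohort": "shared-cohort", // hypothetical cohort name
        "namespaceSelector": {},
        "resourceGroups": [
          {
            "coveredResources": ["nvidia.com/gpu"], // assumed GPU resource name
            "flavors": [
              {
                "name": "bm-gpu-h100-8", // hypothetical ResourceFlavor for one shape
                "resources": [
                  {
                    "name": "nvidia.com/gpu",
                    "nominalQuota": "10",
                    "borrowingLimit": "4",
                    "lendingLimit": "4"
                  }
                ]
              }
            ]
          }
        ]
      }
    },
    {
      // Namespaced entry point that forwards workloads to the ClusterQueue.
      "apiVersion": "kueue.x-k8s.io/v1beta1",
      "kind": "LocalQueue",
      "metadata": {
        "name": "research-local-queue",
        "namespace": "research-namespace"
      },
      "spec": { "clusterQueue": "research-cluster-queue" }
    }
  ]
}
```

The comments make this block illustrative rather than directly appliable; you normally never write these objects yourself, because the team blueprint generates them.
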
| 20 | + |
| 21 | +--- |
| 22 | + |
| 23 | +## When should I use Teams? |
| 24 | + |
| 25 | +- **Multi-tenant clusters** – isolate business units, research groups, or customers while still sharing idle GPU/CPU. |
| 26 | +- **Fair-share batch environments** – let high-priority jobs pre-empt low-priority work within quota rules. |
| 27 | +- **Capacity planning** – express org-level GPU budgets in code and track consumption per team. |
| 28 | + |
| 29 | +--- |
| 30 | + |
| 31 | +## Core Concepts |
| 32 | + |
| 33 | +**Team** |
| 34 | + |
| 35 | +`ClusterQueue` + `LocalQueue` + `Namespace` |
| 36 | + |
| 37 | +A **Team** is a logical grouping backed by a Kueue `ClusterQueue` (defining its `nominalQuota`, `lendingLimit`, and `borrowingLimit`) plus a corresponding `LocalQueue` in a dedicated Kubernetes `Namespace`, which together guarantee each team’s reserved capacity and enable it to borrow or lend idle resources within the shared cluster. |
| 38 | + |
| 39 | +- **Example:** |
| 40 | + If you create a team called **“research”** with a `nominalQuota` of 10 GPUs, a `borrowingLimit` of 4 GPUs, and a `lendingLimit` of 4 GPUs, OCI AI Blueprints will spin up a `ClusterQueue` named “research-cluster-queue” configured with those limits and a `LocalQueue` named "research-local-queue" in the |
| 41 | + "research-namespace" `namespace`. Any job you submit in that namespace automatically enters the “research-local-queue” `LocalQueue`, giving it up to 10 GPUs guaranteed, the ability to borrow up to 4 GPUs when others are idle, and the willingness to lend up to 4 GPUs back to the cohort when it has idle capacity. |
| 42 | + |
| 43 | +**Nominal Quota** |
| 44 | +The `nominalQuota` is the guaranteed amount of resources reserved for a team that it can always use, independent of other teams’ activity. |
| 45 | + |
| 46 | +- **Example:** |
| 47 | + If **Team A** has a `nominalQuota` of 10 GPUs, those 10 GPUs are always exclusively available to Team A before any borrowing or lending is considered. |
| 48 | + |
| 49 | +**Borrowing Limit** |
| 50 | + |
| 51 | +The `borrowingLimit` is the maximum extra resources a team may temporarily use beyond its nominal quota when there’s idle capacity in the cluster. |
| 52 | + |
| 53 | +- **Example:** |
| 54 | + If **Team A** has a `nominalQuota` of 10 GPUs and a `borrowingLimit` of 4 GPUs, it can consume up to 14 GPUs whenever other teams aren’t using theirs, but no more. |
| 55 | + |
| 56 | +**Lending Limit** |
| 57 | + |
| 58 | +The `lendingLimit` is the maximum idle resources a team is willing to offer into the shared pool for other teams to borrow. |
| 59 | + |
| 60 | +- **Example:** |
| 61 | + If **Team A** has a `nominalQuota` of 10 GPUs but is only using 6, and its `lendingLimit` is 4 GPUs, then up to 4 of its unused GPUs become available for others to borrow. |
| 62 | + |
| 63 | +**Priority Threshold** |
| 64 | + |
| 65 | +The `priorityThreshold` set at the team level assigns a single priority value to all of that team’s workloads and determines which teams’ jobs may exceed their nominal quotas when extra resources are available. |
| 66 | + |
| 67 | +- **Example:** |
| 68 | + If **Team A** has `priorityThreshold: 100` and **Team B** has `priorityThreshold: 50`, then when idle GPUs exist, Team A’s workloads (priority 100) will be allowed to borrow first; Team B’s workloads (priority 50) can borrow only if resources remain after Team A has taken theirs. |
| 69 | + |
| 70 | +**Cohort** |
| 71 | + |
| 72 | +A **Cohort** is the single, cluster-wide sharing group that all teams belong to, enabling them to borrow from and lend to one another according to their configured borrowing and lending limits. |
| 73 | + |
| 74 | +- **Example:** |
| 75 | + If **Team A** and **Team B** are both in the cluster cohort, then when Team A has idle GPUs it can lend up to its lending limit, and Team B can borrow from that shared pool (up to its borrowing limit), and vice versa — ensuring resources never sit idle in the cluster. |
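
The borrowing and lending arithmetic for two teams in the cohort can be worked through as follows. This is a sketch only: the `team_a` / `team_b` wrapper keys are a hypothetical grouping for side-by-side comparison (each team is really created by its own team blueprint), and only the GPU quota fields are shown.

```jsonc
{
  // Illustrative grouping only; not a payload you can submit.
  "team_a": {
    "gpu_nominal_quota": "10",
    "gpu_borrowing_limit": "4",
    "gpu_lending_limit": "4"
    // Currently using 6 GPUs -> 4 unused.
    // Lendable to the cohort: min(4 unused, lending limit 4) = 4.
  },
  "team_b": {
    "gpu_nominal_quota": "10",
    "gpu_borrowing_limit": "4",
    "gpu_lending_limit": "4"
    // Can borrow: min(borrowing limit 4, 4 lendable from Team A) = 4.
    // Ceiling while Team A stays at 6 GPUs: 10 + 4 = 14.
  }
}
```

If Team A were using 9 of its 10 GPUs instead, only 1 GPU would be lendable, so Team B's ceiling would drop to 11 even though its borrowing limit is still 4.
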
| 76 | + |
| 77 | +--- |
| 78 | + |
| 79 | +## Team Blueprint Schema (`recipe_mode: "team"`) |
| 80 | + |
| 81 | +```json |
| 82 | +{ |
| 83 | + "recipe_mode": "team", |
| 84 | + "deployment_name": "team_creation", |
| 85 | + "team": { |
| 86 | + "team_name": "random_team", |
| 87 | + "priority_threshold": 100, |
| 88 | + "quotas": [ |
| 89 | + { |
| 90 | + "shape_name": "BM.GPU.H100.8", |
| 91 | + "cpu_nominal_quota": "10", |
| 92 | + "cpu_borrowing_limit": "4", |
| 93 | + "cpu_lending_limit": "4", |
| 94 | + "mem_nominal_quota": "10", |
| 95 | + "mem_borrowing_limit": "4", |
| 96 | + "mem_lending_limit": "4", |
| 97 | + "gpu_nominal_quota": "10", |
| 98 | + "gpu_borrowing_limit": "4", |
| 99 | + "gpu_lending_limit": "4" |
| 100 | + } |
| 101 | + ] |
| 102 | + } |
| 103 | +} |
| 104 | +``` |
| 105 | + |
| 106 | +### Parameter Reference |
| 107 | + |

| Field | Description |
| --- | --- |
| `team_name` | Friendly name; becomes the `ClusterQueue` and `LocalQueue` name. |
| `priority_threshold` | Minimum workload priority allowed to borrow quota when nominal is exhausted. |
| `quotas[]` | List of shapes and per-resource quotas. Values are strings but parsed as quantities (e.g., `"10"` → `10`). |
| `*_nominal_quota` | Hard cap always available to this team. |
| `*_borrowing_limit` | Extra units the team may _borrow_ from cohort siblings when idle capacity exists. |
| `*_lending_limit` | Idle units this team is willing to _lend_ to others. |

| 135 | + |
| 136 | +--- |
| 137 | + |
| 138 | +## Using a Team in a Job Blueprint |
| 139 | + |
| 140 | +Once a team exists, reference it from any job: |
| 141 | + |
| 142 | +```jsonc |
| 143 | +{ |
| 144 | + "recipe_mode": "job", |
| 145 | + "deployment_name": "job_deployment", |
| 146 | + "recipe_node_shape": "VM.GPU.A10.2", |
| 147 | + "recipe_team_info": { "team_name": "random_team" }, |
| 148 | + // ... |
| 149 | +} |
| 150 | + |
| 151 | +``` |
| 152 | + |
| 153 | +The blueprint engine: |
| 154 | + |
| 155 | +1. Adds the `kueue.x-k8s.io/queue-name` label so Kueue enqueues the Workload into the correct `LocalQueue`. |
| 156 | +2. Leaves pod scheduling to the default kube-scheduler once the Workload is **admitted** by Kueue. |
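
As a rough picture of step 1, the metadata of the generated workload might look like the fragment below. This is a sketch under assumptions: the workload kind (`batch/v1` `Job`), the object name, and the placeholder namespace and queue values are hypothetical, since the engine derives the real names from the team. Only the `kueue.x-k8s.io/queue-name` label key comes from the steps above.

```jsonc
{
  "apiVersion": "batch/v1", // assumed workload kind
  "kind": "Job",
  "metadata": {
    "name": "job-deployment", // hypothetical name
    "namespace": "<team-namespace>", // placeholder: the team's namespace
    "labels": {
      // Tells Kueue which LocalQueue the workload should enter.
      "kueue.x-k8s.io/queue-name": "<team-local-queue>" // placeholder
    }
  }
  // "spec" omitted; Kueue holds the Job suspended until it is admitted.
}
```
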
| 157 | + |
| 158 | +--- |
| 159 | + |
| 160 | +## FAQ |
| 161 | + |
| 162 | +**Q: Can a team’s `borrowing_limit` exceed its `nominal_quota`?** |
| 163 | +A: Yes. Kueue allows `borrowingLimit` to be any non-negative quantity; it simply caps _how much_ the queue may exceed its nominal quota when idle resources are available. |
| 164 | + |
| 165 | +**Q: What happens if multiple jobs exceed their team’s quota at the same priority?** |
| 166 | +A: They queue FIFO inside the `LocalQueue`. When capacity frees up, Kueue admits them in order while honoring priority and borrowing rules. |
| 167 | + |
| 168 | +**Q: How do I delete a team?** |
| 169 | +A: First undeploy all workloads that belong to the team, then undeploy the deployment that created the team. |
| 170 | + |
| 171 | +--- |