
Commit 2486ddd

[Doc][KubeRay] eliminate vale errors (#58429)
Fix some Vale errors and apply its suggestions on the kai-scheduler document. See #58161 (comment). Signed-off-by: fscnick <[email protected]>
1 parent: cb6a60d

3 files changed (+12, −8)

.vale/styles/Google/Acronyms.yml

Lines changed: 3 additions & 0 deletions
@@ -20,6 +20,7 @@ exceptions:
   - DEBUG
   - DOM
   - DPI
+  - DRF
   - ETL
   - FAQ
   - GCC
@@ -39,8 +40,10 @@ exceptions:
   - JSON
   - JSONL
   - JSX
+  - KAI
   - LESS
   - LLDB
+  - MPS
   - NET
   - NFS
   - NOTE

.vale/styles/config/vocabularies/General/accept.txt

Lines changed: 1 addition & 0 deletions
@@ -63,6 +63,7 @@ GPU(s)?
 hostfile
 http
 HTTP
+Karpenter
 KServe
 KTO
 kubectl

doc/source/cluster/kubernetes/k8s-ecosystem/kai-scheduler.md

Lines changed: 8 additions & 8 deletions
@@ -7,8 +7,8 @@ This guide demonstrates how to use KAI Scheduler for setting up hierarchical que
 ## KAI Scheduler
 
 [KAI Scheduler](https://github.com/NVIDIA/KAI-Scheduler) is a high-performance, scalable Kubernetes scheduler built for AI/ML workloads. Designed to orchestrate GPU clusters at massive scale, KAI optimizes GPU allocation and supports the full AI lifecycle - from interactive development to large distributed training and inference. Some of the key features are:
-- **Bin packing and spread scheduling**: Optimize node usage either by minimizing fragmentation (bin packing) or increasing resiliency and load balancing (spread scheduling)
-- **GPU sharing**: Allow KAI to pack multiple Ray workloads from across teams on the same GPU, letting your organization fit more work onto your existing hardware and reducing idle GPU time.
+- **Bin packing and spread scheduling**: Optimize node usage either by minimizing fragmentation using bin packing or increasing resiliency and load balancing using spread scheduling.
+- **GPU sharing**: Allow KAI to consolidate multiple Ray workloads from across teams on the same GPU, letting your organization fit more work onto your existing hardware and reducing idle GPU time.
 - **Workload autoscaling**: Scale Ray replicas or workers within min/max while respecting gang constraints
 - **Cluster autoscaling**: Compatible with dynamic cloud infrastructures (including auto-scalers like Karpenter)
 - **Workload priorities**: Prioritize Ray workloads effectively within queues
@@ -18,7 +18,7 @@ For more details and key features, see [the documentation](https://github.com/NV
 
 ### Core components
 
-1. **PodGroups**: PodGroups are atomic units for scheduling and represent one or more interdependent pods that the scheduler execute as a single unit, also known as gang scheduling. They are vital for distributed workloads. KAI Scheduler includes a **PodGrouper** that handles gang scheduling automatically.
+1. **PodGroups**: PodGroups are atomic units for scheduling and represent one or more interdependent pods that the scheduler execute as a single unit, also known as gang scheduling. They're vital for distributed workloads. KAI Scheduler includes a **PodGrouper** that handles gang scheduling automatically.
 
 **How PodGrouper works:**
 ```
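Editorial aside: the gang-scheduling idea in that hunk is that PodGrouper creates one PodGroup per RayCluster so the head and all workers are scheduled together or not at all. A minimal sketch of what such a generated PodGroup could look like follows; the apiVersion, field names, and all values are assumptions drawn from KAI Scheduler's Run:ai lineage, not part of this commit:

```yaml
# Hypothetical PodGroup that PodGrouper might generate for a RayCluster
# with one head pod and two workers. All names and values are illustrative.
apiVersion: scheduling.run.ai/v2alpha2  # assumed API group/version
kind: PodGroup
metadata:
  name: raycluster-sample-pg
  namespace: default
spec:
  minMember: 3           # head + 2 workers must be placeable together
  queue: team-a-queue    # queue the gang is admitted through
```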
@@ -44,7 +44,7 @@ You can arrange queues hierarchically for organizations with multiple teams, for
 * Kubernetes cluster with GPU nodes
 * NVIDIA GPU Operator
 * kubectl configured to access your cluster
-* Install KAI Scheduler with gpu-sharing enabled. Choose the desired release version from [KAI Scheduler releases](https://github.com/NVIDIA/KAI-Scheduler/releases) and replace the `<KAI_SCHEDULER_VERSION>` in the following command. It's recommended to choose v0.10.0 or higher version.
+* Install KAI Scheduler with GPU-sharing enabled. Choose the desired release version from [KAI Scheduler releases](https://github.com/NVIDIA/KAI-Scheduler/releases) and replace the `<KAI_SCHEDULER_VERSION>` in the following command. It's recommended to choose v0.10.0 or higher version.
 
 ```bash
 # Install KAI Scheduler
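The actual install command is cut off by the diff context above, so it stays as-is here. As a hedged sketch only: enabling GPU sharing in the KAI Scheduler Helm chart is commonly done through a values override like the one below, where the `global.gpuSharing` key is an assumption based on KAI's public install instructions rather than anything shown in this commit:

```yaml
# Hypothetical Helm values file (values-gpu-sharing.yaml), assuming the
# chart exposes the toggle as global.gpuSharing; verify against the
# KAI Scheduler release you install.
global:
  gpuSharing: true
```

You would pass it with `-f values-gpu-sharing.yaml`, or the equivalent `--set "global.gpuSharing=true"`, on the `helm` install command that the diff truncates here.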
@@ -107,7 +107,7 @@ spec:
 
 ```
 
-Note: To make this demo easier to follow, we combined these queue definitions with the RayCluster example in the next step. You can use the single combined YAML file and apply both queues and workloads at once.
+Note: To make this demo easier to follow, it combined these queue definitions with the RayCluster example in the next step. You can use the single combined YAML file and apply both queues and workloads at once.
 
 ## Step 3: Gang scheduling with KAI Scheduler
 
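The queue definitions that the note refers to sit outside this hunk's context window. For orientation only, a two-level hierarchy might look like the following sketch; the `scheduling.run.ai/v2` apiVersion, the `parentQueue` field, and both queue names are assumptions modeled on KAI Scheduler examples, not content from this commit:

```yaml
# Hypothetical parent/child queue pair (names illustrative).
apiVersion: scheduling.run.ai/v2  # assumed API group/version
kind: Queue
metadata:
  name: department-a
spec:
  resources:
    gpu:
      quota: -1           # -1: no fixed quota
      limit: -1
      overQuotaWeight: 1
---
apiVersion: scheduling.run.ai/v2
kind: Queue
metadata:
  name: team-a-queue
spec:
  parentQueue: department-a  # assumed field naming the parent queue
  resources:
    gpu:
      quota: -1
      limit: -1
      overQuotaWeight: 1
```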
@@ -163,7 +163,7 @@ KAI scheduler deployment comes with several predefined priority classes:
 - build (100) - use for build/interactive workloads (non-preemptible)
 - inference (125) - use for inference workloads (non-preemptible)
 
-You can submit the same workload above with a specific priority. Modify the above example into a build class workload:
+You can submit the same workload preceding with a specific priority. Modify the preceding example into a build class workload:
 
 ```yaml
 labels:
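The `labels:` snippet is truncated by the diff context. A hedged guess at what the completed example looks like, using the `kai.scheduler/queue` pod label and a `priorityClassName` label as documented by KAI Scheduler; the queue name is illustrative:

```yaml
# Hypothetical completion of the truncated snippet above.
metadata:
  labels:
    kai.scheduler/queue: team-a-queue  # queue to submit the workload into
    priorityClassName: build           # predefined non-preemptible class
```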
@@ -174,7 +174,7 @@ See the [documentation](https://github.com/NVIDIA/KAI-Scheduler/tree/main/docs/p
 
 ## Step 4: Submitting Ray workers with GPU sharing
 
-This example creates two workers that share a single GPU (0.5 each, with time-slicing) within a RayCluster. See the [YAML file](https://github.com/ray-project/kuberay/tree/master/ray-operator/config/samples/ray-cluster.kai-gpu-sharing.yaml)):
+This example creates two workers that share a single GPU, 0.5 each with time-slicing, within a RayCluster. See the [YAML file](https://github.com/ray-project/kuberay/tree/master/ray-operator/config/samples/ray-cluster.kai-gpu-sharing.yaml)):
 
 ```bash
 curl -LO https://raw.githubusercontent.com/ray-project/kuberay/master/ray-operator/config/samples/ray-cluster.kai-gpu-sharing.yaml
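For the half-GPU workers, KAI requests fractional GPUs through a pod annotation instead of an `nvidia.com/gpu` resource request. A minimal sketch of the relevant worker-group fragment follows; the `gpu-fraction` annotation comes from KAI's gpu-sharing docs, while the surrounding RayCluster field layout is illustrative rather than copied from the sample YAML:

```yaml
# Hypothetical fragment of the RayCluster spec: two worker replicas,
# each annotated to receive half of one GPU via time-slicing.
workerGroupSpecs:
  - groupName: shared-gpu-worker
    replicas: 2
    template:
      metadata:
        annotations:
          gpu-fraction: "0.5"  # KAI GPU-sharing annotation (assumed exact key)
```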
@@ -201,7 +201,7 @@ kubectl get pods -w
 # raycluster-half-gpu-shared-gpu-worker-98tzh 1/1 Running 0 31s
 ```
 
-Note: GPU sharing with time slicing in this example occurs only at the Kubernetes layer, allowing multiple pods to share a single GPU device. The scheduler doesn't enforce memory isolation, so applications must manage their own usage to prevent interference. For other GPU sharing approaches (e.g., MPS), see the [the KAI documentation](https://github.com/NVIDIA/KAI-Scheduler/tree/main/docs/gpu-sharing).
+Note: GPU sharing with time slicing in this example occurs only at the Kubernetes layer, allowing multiple pods to share a single GPU device. The scheduler doesn't enforce memory isolation, so applications must manage their own usage to prevent interference. For other GPU sharing approaches, for example, MPS, see [the KAI documentation](https://github.com/NVIDIA/KAI-Scheduler/tree/main/docs/gpu-sharing).
 
 ### Verify GPU sharing is working
 