---
layout: blog
title: "Introducing JobSet"
date: 2024-01-13
slug: introducing-jobset
---

**Authors**: Daniel Vega-Myhre (Google), Abdullah Gharaibeh (Google), Kevin Hannon (Red Hat)

In this article, we introduce [JobSet](https://jobset.sigs.k8s.io/), an open source API for
representing distributed jobs. The goal of JobSet is to provide a unified API for distributed ML
training and HPC workloads on Kubernetes.

## Why JobSet?

The Kubernetes community’s recent enhancements to the batch ecosystem have attracted ML engineers,
who have found Kubernetes to be a natural fit for the requirements of running distributed training
workloads.

Large ML models (particularly LLMs) which cannot fit into the memory of the GPU or TPU chips on a
single host are often distributed across tens of thousands of accelerator chips, which in turn may
span thousands of hosts.

As such, the model training code is often containerized and executed simultaneously on all these
hosts, performing distributed computations which often shard the model parameters and/or the
training dataset across the target accelerator chips, using collective communication primitives
like all-gather and all-reduce to synchronize gradients between hosts.

These characteristics make Kubernetes a great fit for this type of workload, as efficiently
scheduling and managing the lifecycle of containerized applications across a cluster of compute
resources is an area where it shines.

It is also very extensible, allowing developers to define their own Kubernetes APIs, objects, and
controllers which manage the behavior and life cycle of these objects, allowing engineers to
develop custom distributed training orchestration solutions to fit their needs.

However, as distributed ML training techniques continue to evolve, existing Kubernetes primitives
alone no longer model them adequately.

Furthermore, the landscape of Kubernetes distributed training orchestration APIs has become
fragmented, and each of the existing solutions in this fragmented landscape has certain limitations
that make it non-optimal for distributed ML training.

For example, the Kubeflow training operator defines custom APIs for different frameworks (e.g.
PyTorchJob, TFJob, MPIJob, etc.); however, each of these job types is in fact a solution fit
specifically to the target framework, each with different semantics and behavior.

On the other hand, the Job API fixed many gaps for running batch workloads, including Indexed
completion mode, higher scalability, Pod failure policies and Pod backoff policy, to mention a few
of the most recent enhancements. However, running ML training and HPC workloads using the upstream
Job API requires extra orchestration to fill the following gaps:

**Multi-template Pods**: Most HPC or ML training jobs include more than one type of Pod. The
different Pods are part of the same workload, but they need to run a different container, request
different resources, or have different failure policies. A common example is the driver-worker
pattern.

**Job groups**: Large scale training workloads span multiple network topologies, running across
multiple racks for example. Such workloads are network latency sensitive, and aim to localize
communication and minimize traffic crossing the higher-latency network links. To facilitate this,
the workload needs to be split into groups of Pods, each assigned to a network topology.

**Inter-Pod communication**: Create and manage the resources (e.g. [headless
Services](/docs/concepts/services-networking/service/#headless-services)) necessary to establish
communication between the Pods of a job.

**Startup sequencing**: Some jobs require a specific start sequence of pods; sometimes the driver
is expected to start first (like Ray or Spark), in other cases the workers are expected to be ready
before starting the driver (like MPI).

JobSet aims to address those gaps using the Job API as a building block to build a richer API for
large-scale distributed HPC and ML use cases.

## How JobSet works

JobSet models a distributed batch workload as a group of Kubernetes Jobs. This allows a user to
easily specify different pod templates for distinct groups of pods (e.g. a leader, workers,
parameter servers, etc.).

It uses the abstraction of a ReplicatedJob to manage child Jobs, where a ReplicatedJob is
essentially a Job template with some desired number of Job replicas specified. This provides a
declarative way to easily create identical child Jobs to run on different islands of accelerators,
without resorting to scripting or Helm charts to generate many versions of the same job but with
different names.

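The following is a minimal sketch of that structure; the JobSet name, container image, and replica
counts are placeholder values for illustration rather than a real workload:

```yaml
apiVersion: jobset.x-k8s.io/v1alpha2
kind: JobSet
metadata:
  name: example-jobset          # placeholder name
spec:
  replicatedJobs:
  - name: workers               # one Job template...
    replicas: 3                 # ...replicated as 3 identical child Jobs
    template:                   # a standard batch/v1 Job template
      spec:
        parallelism: 2
        completions: 2
        backoffLimit: 0
        template:
          spec:
            restartPolicy: Never
            containers:
            - name: worker
              image: busybox    # placeholder image
              command: ["echo", "Hello from a JobSet worker"]
```
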
<figure>
  <img src="jobset_diagram.svg" alt="JobSet architecture">
  <figcaption><h4>JobSet Architecture</h4></figcaption>
</figure>

Some other key JobSet features which address the problems described above include:

**Replicated Jobs**: In modern data centers, hardware accelerators like GPUs and TPUs are allocated
in islands of homogeneous accelerators, connected via specialized, high-bandwidth network links.
For example, a user might provision nodes containing a group of hosts co-located on a rack, each
with H100 GPUs, where GPU chips within each host are connected via NVLink, with an NVLink Switch
connecting the multiple NVLinks. TPU Pods are another example of this: TPU ViperLitePods consist of
64 hosts, each with 4 TPU v5e chips attached, all connected via ICI mesh. When running a
distributed training job across multiple of these islands, we often want to partition the workload
into a group of smaller identical jobs, one per island, where each pod primarily communicates with
the pods within the same island to do segments of distributed computation, keeping the gradient
synchronization over DCN (data center network, which is lower bandwidth than ICI) to a bare
minimum.

**Automatic headless service creation, configuration, and lifecycle management**: Pod-to-pod
communication via pod hostname is enabled by default, with automatic configuration and lifecycle
management of the headless service enabling this.

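If the defaults need tuning, the JobSet spec also exposes a `network` section for this. The excerpt
below is a sketch based on our reading of the JobSet API at the time of writing; the
`enableDNSHostnames` and `subdomain` fields are not described in this post, so check the current
API reference before relying on them:

```yaml
# Excerpt of a JobSet spec (not a complete manifest)
spec:
  network:
    enableDNSHostnames: true   # assumed to default to true; gives pods stable DNS hostnames
    subdomain: my-jobset-svc   # hypothetical name for the auto-created headless Service
```
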
**Configurable success policies**: JobSet has configurable success policies which target specific
ReplicatedJobs, with operators to target “Any” or “All” of their child jobs. For example, you can
configure the JobSet to be marked complete if and only if all pods that are part of the “worker”
ReplicatedJob are completed.

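A sketch of such a policy is shown below; the `successPolicy` fields reflect our understanding of
the JobSet API, and the ReplicatedJob name is hypothetical, so verify both against the API
reference:

```yaml
# Excerpt of a JobSet spec (not a complete manifest)
spec:
  successPolicy:
    operator: All            # require all targeted child Jobs to complete...
    targetReplicatedJobs:
    - worker                 # ...where the targets are the "worker" ReplicatedJob's child Jobs
```
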
**Configurable failure policies**: JobSet has configurable failure policies which allow the user to
specify a maximum number of times the JobSet should be restarted in the event of a failure. If any
job is marked failed, the entire JobSet will be recreated, allowing the workload to resume from the
last checkpoint. When no failure policy is specified, if any job fails, the JobSet simply fails.

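For instance, the failure policy used in the full TPU example later in this post tolerates up to
three restarts:

```yaml
# Excerpt of a JobSet spec (not a complete manifest)
spec:
  failurePolicy:
    maxRestarts: 3   # recreate the child Jobs up to 3 times before marking the JobSet failed
```
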
**Exclusive placement per topology domain**: JobSet allows users to express that child jobs have
1:1 exclusive assignment to a topology domain, typically an accelerator island like a rack. For
example, if the JobSet creates two child jobs, then this feature will enforce that the pods of each
child job will be co-located on the same island, and that only one child job is allowed to schedule
per island. This is useful for scenarios where we want to use a distributed data parallel (DDP)
training strategy to train a model using multiple islands of compute resources (GPU racks or TPU
slices), running one model replica in each accelerator island, ensuring the forward and backward
passes within each model replica occur over the high-bandwidth interconnect linking the accelerator
chips within the island, and that only the gradient synchronization between model replicas occurs
across accelerator islands over the lower-bandwidth data center network.

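At the time of writing this is requested with an annotation, as in the TPU example later in this
post; the topology key shown here is the GKE node pool label used in that example:

```yaml
# Excerpt of a JobSet manifest (not a complete spec)
metadata:
  annotations:
    # Assign each child Job exclusively to one topology domain (here, a GKE node pool)
    alpha.jobset.sigs.k8s.io/exclusive-topology: cloud.google.com/gke-nodepool
```
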
**Integration with Kueue**: Users can submit JobSets via [Kueue](https://kueue.sigs.k8s.io/) to
oversubscribe their clusters, queue workloads to run as capacity becomes available, prevent partial
scheduling and deadlocks, enable multi-tenancy, and more.

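As a sketch, assuming Kueue is installed with its JobSet integration enabled and a LocalQueue named
`user-queue` already exists in your namespace, submitting through Kueue typically just means
labeling the JobSet with the target queue:

```yaml
# Excerpt of a JobSet manifest (assumes Kueue and a LocalQueue named "user-queue")
metadata:
  labels:
    kueue.x-k8s.io/queue-name: user-queue   # Kueue admits the JobSet when quota is available
```
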
## Example use case

### Distributed ML training on multiple TPU slices with Jax

The following example is a JobSet spec for running a TPU Multislice workload on 4 TPU v5e
[slices](https://cloud.google.com/tpu/docs/system-architecture-tpu-vm#slices). To learn more about
TPU concepts and terminology, please refer to these
[docs](https://cloud.google.com/tpu/docs/system-architecture-tpu-vm).

This example uses [Jax](https://jax.readthedocs.io/en/latest/quickstart.html), an ML framework with
native support for Just-In-Time (JIT) compilation targeting TPU chips via
[OpenXLA](https://github.com/openxla). However, you can also use
[PyTorch/XLA](https://pytorch.org/xla/release/2.3/index.html) to do ML training on TPUs.

This example makes use of several JobSet features (both explicitly and implicitly) to support the
unique scheduling requirements of TPU multislice training out-of-the-box with very little
configuration required by the user.

```yaml
# Run a simple Jax workload on multiple TPU slices
apiVersion: jobset.x-k8s.io/v1alpha2
kind: JobSet
metadata:
  name: multislice
  annotations:
    # Give each child Job exclusive usage of a TPU slice
    alpha.jobset.sigs.k8s.io/exclusive-topology: cloud.google.com/gke-nodepool
spec:
  failurePolicy:
    maxRestarts: 3
  replicatedJobs:
  - name: workers
    replicas: 4 # Set to number of TPU slices
    template:
      spec:
        parallelism: 2 # Set to number of VMs per TPU slice
        completions: 2 # Set to number of VMs per TPU slice
        backoffLimit: 0
        template:
          spec:
            hostNetwork: true
            dnsPolicy: ClusterFirstWithHostNet
            nodeSelector:
              cloud.google.com/gke-tpu-accelerator: tpu-v5-lite-podslice
              cloud.google.com/gke-tpu-topology: 2x4
            containers:
            - name: jax-tpu
              image: python:3.8
              ports:
              - containerPort: 8471
              - containerPort: 8080
              securityContext:
                privileged: true
              command:
              - bash
              - -c
              - |
                pip install "jax[tpu]" -f https://storage.googleapis.com/jax-releases/libtpu_releases.html
                python -c 'import jax; print("Global device count:", jax.device_count())'
                sleep 60
              resources:
                limits:
                  google.com/tpu: 4
```

## Future work and getting involved

We have a number of features on the JobSet roadmap planned for development this year, which can be
found in the [JobSet roadmap](https://github.com/kubernetes-sigs/jobset?tab=readme-ov-file#roadmap).

Please feel free to reach out with feedback of any kind. We’re also open to additional contributors,
whether it is to fix or report bugs, help add new features, or write documentation.

You can get in touch with us via our [repo](http://sigs.k8s.io/jobset), [mailing
list](https://groups.google.com/a/kubernetes.io/g/wg-batch), or on
[Slack](https://kubernetes.slack.com/messages/wg-batch).

Last but not least, thanks to all [our
contributors](https://github.com/kubernetes-sigs/jobset/graphs/contributors) who made this project
possible!
