<!--
Copyright 2025 Google LLC

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

    https://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
-->
# Use Slurm-like commands in XPK to execute workloads on top of GKE

XPK enables intuitive workload scheduling for ML researchers by offering Slurm-like commands and usage patterns.

This document provides a guide to fine-tuning Large Language Models (LLMs) using XPK Slurm-like commands. By leveraging XPK and its familiar Slurm command structures, users can efficiently train and optimize LLMs for specific use cases.

Slurm to XPK command mapping:

| Slurm command | XPK command |
| --- | --- |
| Slurm login node | xpk shell |
| srun | xpk run |
| sbatch | xpk batch |
| squeue | xpk job ls |
| scancel | xpk job cancel |
| sacct | xpk job info |
| sinfo | xpk info |
| Array jobs | See [Array jobs](#array-jobs) |
| Options | See [Options](#options) |

## Set up the environment

To recreate a typical Slurm setup, first prepare your environment by provisioning the cluster and creating and attaching storage.

1. Export the variables to simplify the commands that follow:

    ```shell
    export CLUSTER_NAME="CLUSTER NAME"
    export COMPUTE_ZONE="COMPUTE ZONE"
    export PROJECT_ID="PROJECT ID"
    ```

    Replace the following variables:
    - `CLUSTER NAME`: name of your cluster
    - `COMPUTE ZONE`: compute zone the cluster is in
    - `PROJECT ID`: ID of your project

2. Create a cluster using the `xpk cluster create` command, providing the machine type and provisioning mode of your choice.

    ```shell
    python3 xpk.py cluster create \
    --cluster=$CLUSTER_NAME \
    --zone=$COMPUTE_ZONE \
    --project=$PROJECT_ID \
    --device-type=DEVICE_TYPE \
    --num-nodes=NUM_NODES \
    --PROVISIONING_MODE \
    --enable-workload-identity \
    --enable-gcpfilestore-csi-driver \
    --default-pool-cpu-num-nodes=2
    ```

    Replace the following variables:
    - `DEVICE_TYPE`: name of your machine type
    - `NUM_NODES`: number of worker nodes in the node pool
    - `PROVISIONING_MODE`: provisioning mode of your choice

    The `--enable-workload-identity` and `--enable-gcpfilestore-csi-driver` flags are not required, but they speed up shared file system creation in the next step.

3. Create storage using the `xpk storage create` command. XPK supports attaching existing GCS bucket and Filestore storage as well as creating new Filestore storage. If you already have storage, follow the instructions outlined in [Storage](https://github.com/AI-Hypercomputer/xpk/blob/main/README.md#storage).

    ```shell
    python3 xpk.py storage create STORAGE_NAME \
    --project=$PROJECT_ID \
    --zone=$COMPUTE_ZONE \
    --cluster=$CLUSTER_NAME \
    --type=gcpfilestore \
    --size=1024 \
    --access-mode=ReadWriteMany \
    --vol=home \
    --tier=REGIONAL \
    --mount-point /home \
    --auto-mount=true \
    --readonly=false
    ```

    Replace the following variables:
    - `STORAGE_NAME`: name of your storage

4. Initialize the XPK configuration. You can customize it based on your needs, as in the Llama 3 fine-tuning example below:

    ```shell
    python3 xpk.py config set shell-interactive-command /bin/bash
    python3 xpk.py config set shell-working-directory /home/llama3
    python3 xpk.py config set shell-image pytorch/pytorch:2.6.0-cuda12.6-cudnn9-devel
    python3 xpk.py config set batch-working-directory /home/llama3
    python3 xpk.py config set batch-image pytorch/pytorch:2.6.0-cuda12.6-cudnn9-runtime
    ```

## Prepare and upload scripts

### 1. Prepare scripts
This section describes the changes needed to adapt Slurm batch scripts for execution with XPK's Slurm-like commands.

Currently, `xpk batch` supports the following Slurm script cases:
1. Batch job with a single task and a single step per task.
2. Batch job with multiple parallel tasks and a single step per task.
3. Array job with a single task per job and a single step per task.

XPK validates each script to ensure it matches only the above use cases.

For the script to pass validation and later execute successfully, apply the following updates:
- Limit the job to one step: the script must contain exactly one `srun` invocation.
- Make that single `srun` invocation the final command in the script.
- Do not invoke other Slurm commands within the script (e.g. `scontrol`, `sinfo`).

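A minimal script that satisfies these rules might look like the sketch below; the script contents and program names are illustrative placeholders, not files from the XPK repository:

```bash
#!/bin/bash
#SBATCH --job-name=single_step_job
#SBATCH -n 1
#SBATCH -o single_step_job.out

# Any setup may precede the single step...
export DATA_DIR=/home/llama3

# ...but there must be exactly one srun invocation,
# and it must be the last command in the script.
srun -n 1 python3 train.py --data-dir $DATA_DIR
```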
### 2. xpk shell | Slurm login node - download scripts, models and datasets
Through `xpk shell` you can access the shared file system and edit files (e.g. when quick model changes are needed). It is the equivalent of a Slurm login node. To access the remote system, use the `xpk shell` command:
```shell
python3 xpk.py shell \
--project $PROJECT_ID \
--zone $COMPUTE_ZONE \
--cluster $CLUSTER_NAME
```

This opens a console on the cluster with `/home/llama3` set as the shell's working directory (per the configuration above). Run the subsequent commands in this section on the login node, inside the XPK shell.

### 3. Create and activate a Python virtual environment

While in the shell, run the commands below:
```shell
python3 -m venv ./llama3_env
source ./llama3_env/bin/activate
```
Alternatively, you can use [conda](https://docs.conda.io/projects/conda/en/latest/user-guide/tasks/manage-environments.html) to manage the environment.
### 4. Upload your training scripts and training data to the created storage
While in the shell, run the commands below:
```shell
python3 <<EOF
import urllib.request
urllib.request.urlretrieve("https://raw.githubusercontent.com/AI-Hypercomputer/xpk/refs/heads/slurm-fixes/examples/llama-3.1-finetuning/requirements.txt", "requirements.txt")
urllib.request.urlretrieve("https://raw.githubusercontent.com/AI-Hypercomputer/xpk/refs/heads/slurm-fixes/examples/llama-3.1-finetuning/train.py", "train.py")
urllib.request.urlretrieve("https://raw.githubusercontent.com/AI-Hypercomputer/xpk/refs/heads/slurm-fixes/examples/llama-3.1-finetuning/training_data.jsonl", "training_data.jsonl")
EOF
```

### 5. Install the necessary Python libraries
While in the shell, run the command below:
```shell
pip install -r requirements.txt
```

### 6. Download Llama 3.1 model weights
While in the shell, download the model from a model hub, e.g. Hugging Face:
```shell
pip install "huggingface_hub[cli]"
huggingface-cli download "meta-llama/Llama-3.1-8B-Instruct" \
--local-dir "meta-llama/Llama-3.1-8B-Instruct" \
--token [hf_token]
```
For this to work, you need to:
- create a Hugging Face account
- create a Hugging Face access token (`hf_token`)
- request access to the [Llama 3.1 models](https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct) and wait for the request to be approved

Now you can exit the shell to continue with the batch commands:
```shell
exit
```

## Submit jobs - run CUDA and the Llama 3 fine-tuning script
Just like in Slurm, you can submit jobs in XPK using the following methods: batch jobs, array jobs and interactive jobs.

### 1. xpk run | srun - run CUDA in interactive mode
The `xpk run` command runs a job interactively and blocks: the results are printed to the terminal, and no other commands can be executed until the job finishes.
```shell
python3 xpk.py run \
--project $PROJECT_ID \
--zone $COMPUTE_ZONE \
--cluster $CLUSTER_NAME \
--nodes 1 \
--gpus-per-task nvidia.com/gpu:8 \
examples/llama-3.1-finetuning/check_cuda.sh
```

The output should display the following:
```shell
CUDA available: True
Device count: 8
```

### 2. xpk batch | sbatch - run a training script in batch mode
Once your script is ready, run the `xpk batch` command, specifying the script to execute:
```shell
python3 xpk.py batch \
--project $PROJECT_ID \
--zone $COMPUTE_ZONE \
--cluster $CLUSTER_NAME \
examples/llama-3.1-finetuning/batch_script.sh
```
The command finishes by displaying the name of the created job:
```shell
[XPK] Job name: xpk-def-app-profile-slurm-9zm2g
```

The job itself may keep running longer, depending on your workload. The output from the script execution is written to the relevant folders in the attached storage, under the path determined by the `--mount-point` parameter of the `storage create` command. You can follow it by running the following command from within `xpk shell`:
```shell
tail -f example_script.out example_script.err
```
Once the execution has finished, you should see the following in the logs:
```shell
2025-02-21 13:02:08.431 GMT
{'train_runtime': 689.2645, 'train_samples_per_second': 0.048, 'train_steps_per_second': 0.004, 'train_loss': 2.037710189819336, 'epoch': 3.0}
```

## Cleanup
### 1. Stop the shell
```shell
python3 xpk.py shell stop \
--project $PROJECT_ID \
--zone $COMPUTE_ZONE \
--cluster $CLUSTER_NAME
```

### 2. Delete the shared storage
```shell
python3 xpk.py storage delete \
--project $PROJECT_ID \
--zone $COMPUTE_ZONE \
--cluster $CLUSTER_NAME
```

### 3. Delete the XPK cluster
```shell
python3 xpk.py cluster delete \
--project $PROJECT_ID \
--zone $COMPUTE_ZONE \
--cluster $CLUSTER_NAME
```
# More Slurm mode features

## Job management - check the status of your job
### 1. xpk job ls | squeue

Like the Slurm `squeue` command, the `xpk job ls` command lists the jobs in the queue that were scheduled through Slurm-like mode on a specific cluster, together with their task completion status, duration and age.
```shell
python3 xpk.py job ls \
--project $PROJECT_ID \
--zone $COMPUTE_ZONE \
--cluster $CLUSTER_NAME
```

The output should look like this:
```shell
NAME                              PROFILE               LOCAL QUEUE        COMPLETIONS   DURATION   AGE
xpk-def-app-profile-slurm-6s6ff   xpk-def-app-profile   multislice-queue   1/1           8s         66m
xpk-def-app-profile-slurm-fz5z8   xpk-def-app-profile   multislice-queue   1/1           4s         63m
```

### 2. xpk job cancel | scancel
If you want to cancel a job, use `xpk job cancel` and provide the name of the job you wish to cancel:
```shell
python3 xpk.py job cancel <job_name>
```

### 3. xpk job info | sacct
To see the details of a submitted job, use the `xpk job info` command:
```shell
python3 xpk.py job info JOB_NAME \
--project $PROJECT_ID \
--zone $COMPUTE_ZONE \
--cluster $CLUSTER_NAME
```

The expected output should look like this:
```
Job name: xpk-def-app-profile-slurm-6s6ff
Script name: ./job.sh
Profile: default_xpk-def-app-profile
Labels:
  kjobctl.x-k8s.io/mode: Slurm
  kjobctl.x-k8s.io/profile: xpk-def-app-profile
  kueue.x-k8s.io/queue-name: multislice-queue
Mounts:
- mountPath: /slurm/scripts
  name: slurm-scripts
- mountPath: /slurm/env
  name: slurm-env
Pods:
- Name: xpk-def-app-profile-slurm-6s6ff-0-kgtv8
  Status: Completed
Entrypoint environment variables template:
- SLURM_ARRAY_JOB_ID=1
- SLURM_ARRAY_TASK_COUNT=1
- SLURM_ARRAY_TASK_MAX=0
- SLURM_ARRAY_TASK_MIN=0
- SLURM_TASKS_PER_NODE=1
- SLURM_CPUS_PER_TASK=
- SLURM_CPUS_ON_NODE=
- SLURM_JOB_CPUS_PER_NODE=
- SLURM_CPUS_PER_GPU=
- SLURM_MEM_PER_CPU=
- SLURM_MEM_PER_GPU=
- SLURM_MEM_PER_NODE=
- SLURM_GPUS=
- SLURM_NTASKS=1
- SLURM_NTASKS_PER_NODE=1
- SLURM_NPROCS=1
- SLURM_NNODES=1
- SLURM_SUBMIT_DIR=/slurm/scripts
- SLURM_SUBMIT_HOST=$HOSTNAME
- SLURM_JOB_NODELIST=xpk-def-app-profile-slurm-6s6ff-0.xpk-def-app-profile-slurm-6s6ff
- SLURM_JOB_FIRST_NODE=xpk-def-app-profile-slurm-6s6ff-0.xpk-def-app-profile-slurm-6s6ff
- SLURM_JOB_ID=$(expr $JOB_COMPLETION_INDEX \* 1 + $i + 1)
- SLURM_JOBID=$(expr $JOB_COMPLETION_INDEX \* 1 + $i + 1)
- SLURM_ARRAY_TASK_ID=$container_index
- SLURM_JOB_FIRST_NODE_IP=${SLURM_JOB_FIRST_NODE_IP:-""}
```

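The `SLURM_JOB_ID` template above is evaluated per pod. As a sketch, the same arithmetic in plain Python, assuming the multiplier `1` corresponds to the tasks-per-node count and `$JOB_COMPLETION_INDEX`/`$i` are the pod and container indices:

```python
def slurm_job_id(job_completion_index: int, container_index: int,
                 ntasks_per_node: int = 1) -> int:
    # Mirrors $(expr $JOB_COMPLETION_INDEX \* 1 + $i + 1) from the
    # template above; job IDs are 1-based.
    return job_completion_index * ntasks_per_node + container_index + 1

print(slurm_job_id(0, 0))  # first pod, first container -> 1
print(slurm_job_id(2, 0))  # third pod -> 3
```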
### 4. xpk info | sinfo
To monitor the status of the queues, use the `xpk info` command, which provides an overview of the local and cluster queues together with their status:

```
[XPK] Local Queues usage
QUEUE              ADMITTED_WORKLOADS   PENDING_WORKLOADS   2xv4-8:google.com/tpu
multislice-queue   0                    0                   0/8
[XPK] Cluster Queues usage
QUEUE           ADMITTED_WORKLOADS   PENDING_WORKLOADS   2xv4-8:google.com/tpu
cluster-queue
```

## Array jobs

Slurm mode in XPK supports execution of array jobs, provided that the job scripts follow the requirements described in the Prepare scripts section.
The script example below defines an array job named `array_job` with ten jobs (indices 1 through 10), each job using one task with one CPU and 4 GB of memory. The job runs in the `compute` partition with a wall-time limit of one hour. Output for each job is directed to a file named `array_job_%A_%a.out`, where `%A` is the job ID and `%a` is the array index. Within each job, the `SLURM_ARRAY_TASK_ID` variable is used to construct an input file name, and `srun` executes `my_program` with one task and input from the corresponding file.

```bash
#!/bin/bash
#SBATCH --job-name=array_job
#SBATCH --array=1-10
#SBATCH -n 1
#SBATCH -c 1
#SBATCH --mem=4G
#SBATCH -p compute
#SBATCH -t 0-1
#SBATCH -o array_job_%A_%a.out

input_file=input_${SLURM_ARRAY_TASK_ID}.txt

srun -n 1 my_program < $input_file
```

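To make the array indexing concrete, the per-task input file names can be enumerated in plain Python; this is a sketch mirroring the shell substitution in the script above:

```python
def task_input_file(task_id: int) -> str:
    # Mirrors input_file=input_${SLURM_ARRAY_TASK_ID}.txt from the script above
    return f"input_{task_id}.txt"

# --array=1-10 yields task IDs 1 through 10
files = [task_input_file(i) for i in range(1, 11)]
print(files[0], files[-1])  # input_1.txt input_10.txt
```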
## Options
Slurm-like commands in XPK support the following Slurm options:

| Option | Description |
| --- | --- |
| -a, --array | array job |
| --cpus-per-task | how many CPUs a container inside a pod requires |
| -e, --error | where to redirect the standard error stream of a task; if not passed, it goes to stdout and is available via kubectl logs |
| --gpus-per-task | how many GPUs a container inside a pod requires |
| -i, --input | what to pipe into the script |
| -J, --job-name | the job name |
| --mem-per-cpu | how much memory a container requires; the number of requested CPUs per task is multiplied by mem-per-cpu |
| --mem-per-task | how much memory a container requires |
| -N, --nodes | number of pods to be used at a time - parallelism in indexed jobs |
| -n, --ntasks | number of identical containers inside of a pod, usually 1 |
| -o, --output | where to redirect the standard output stream of a task; if not passed, it goes to stdout and is available via kubectl logs |
| -D, --chdir | change directory before executing the script |
| --partition | local queue name |
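For example, with `--mem-per-cpu` the container's total memory request scales with `--cpus-per-task`. A minimal sketch of that arithmetic (the helper name is illustrative, not an XPK API):

```python
def total_memory_gb(cpus_per_task: int, mem_per_cpu_gb: float) -> float:
    # --mem-per-cpu is multiplied by the number of requested CPUs per task
    return cpus_per_task * mem_per_cpu_gb

# e.g. --cpus-per-task=4 --mem-per-cpu=2G -> an 8 GB memory request
print(total_memory_gb(4, 2.0))  # 8.0
```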

Flags can be passed on the command line or inside the script using the following format:
```
#SBATCH --job-name=array_job
#SBATCH --output=array_job_%A_%a.out
#SBATCH --error=array_job_%A_%a.err
#SBATCH --array=1-22
```
Inline parameters (the ones provided in the CLI command) override the parameters in the script.
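For instance, an inline flag takes precedence over the corresponding `#SBATCH` directive; in the hypothetical invocation below (the script name is a placeholder), the inline job name wins:

```shell
# my_script.sh sets #SBATCH --job-name=array_job, but the inline flag wins:
python3 xpk.py batch --job-name=my_renamed_job my_script.sh
```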
