<!--
 Copyright 2025 Google LLC

 Licensed under the Apache License, Version 2.0 (the "License");
 you may not use this file except in compliance with the License.
 You may obtain a copy of the License at

 https://www.apache.org/licenses/LICENSE-2.0

 Unless required by applicable law or agreed to in writing, software
 distributed under the License is distributed on an "AS IS" BASIS,
 WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 See the License for the specific language governing permissions and
 limitations under the License.
-->

# Use Slurm-like commands in XPK to execute workloads on top of GKE

XPK enables intuitive workload scheduling for ML researchers by offering Slurm-like commands and usage patterns.

This document is a guide to fine-tuning Large Language Models (LLMs) using XPK Slurm-like commands. By leveraging the power of XPK and adapting the familiar Slurm command structure, users can efficiently train and optimize LLMs for specific use cases.

Slurm to XPK command mapping:

| Slurm command | XPK command |
| --- | --- |
| Slurm login node | xpk shell |
| srun | xpk run |
| sbatch | xpk batch |
| squeue | xpk job ls |
| scancel | xpk job cancel |
| sacct | xpk job info |
| sinfo | xpk info |
| Array jobs | See [Array jobs](#array-jobs) |
| Options | See [Options](#options) |

## Set up the environment

To recreate a typical Slurm setup, first prepare your environment by provisioning the cluster and creating and attaching storage.

1. Export the variables for easier command manipulation:

    ```shell
    export CLUSTER_NAME="CLUSTER NAME"
    export COMPUTE_ZONE="COMPUTE ZONE"
    export PROJECT_ID="PROJECT ID"
    ```
    Replace the following variables:
    - `CLUSTER NAME` - name of your cluster
    - `COMPUTE ZONE` - compute zone the cluster is in
    - `PROJECT ID` - ID of your project
2. Create a cluster using the `xpk cluster create` command, providing the machine type and provisioning mode of your choice.

    ```shell
    python3 xpk.py cluster create \
    --cluster=$CLUSTER_NAME \
    --zone=$COMPUTE_ZONE \
    --project=$PROJECT_ID \
    --device-type=DEVICE_TYPE \
    --num-nodes=NUM_NODES \
    --PROVISIONING_MODE \
    --enable-workload-identity \
    --enable-gcpfilestore-csi-driver \
    --default-pool-cpu-num-nodes=2
    ```

    Replace the following variables:
    - `DEVICE_TYPE`: name of your machine type
    - `NUM_NODES`: number of worker nodes in the node pool
    - `PROVISIONING_MODE`: the provisioning-mode flag of your choice

    The `--enable-workload-identity` and `--enable-gcpfilestore-csi-driver` options are not required, but they speed up shared file system creation in the next step.

3. Create storage using the `xpk storage create` command. XPK supports attaching existing GCS buckets and Filestore instances, as well as creating new Filestore instances. If you already have storage, follow the instructions in [Storage](https://github.com/AI-Hypercomputer/xpk/blob/main/README.md#storage).

    ```shell
    xpk storage create STORAGE_NAME \
    --project=$PROJECT_ID \
    --zone=$COMPUTE_ZONE \
    --cluster=$CLUSTER_NAME \
    --type=gcpfilestore \
    --size=1024 \
    --access-mode=ReadWriteMany \
    --vol=home \
    --tier=REGIONAL \
    --mount-point /home \
    --auto-mount=true \
    --readonly=false
    ```

    Replace the following variables:
    - `STORAGE_NAME`: name of your storage

4. Initialize the XPK configuration. You can customize it to your needs, as in the Llama 3 fine-tuning example below:

    ```shell
    python3 xpk.py config set shell-interactive-command /bin/bash
    python3 xpk.py config set shell-working-directory /home/llama3
    python3 xpk.py config set shell-image pytorch/pytorch:2.6.0-cuda12.6-cudnn9-devel
    python3 xpk.py config set batch-working-directory /home/llama3
    python3 xpk.py config set batch-image pytorch/pytorch:2.6.0-cuda12.6-cudnn9-runtime
    ```

## Prepare and upload scripts

### 1. Prepare scripts
This section describes the changes needed for Slurm scripts to be used for batch execution with the Slurm-like commands in XPK.

Currently `xpk batch` supports the following Slurm script cases:
1. Batch job with a single task and a single step per task.
2. Batch job with multiple parallel tasks and a single step per task.
3. Array job with a single task per job and a single step per task.

XPK runs script validation to ensure it executes only the above use cases. For the script to pass validation and run successfully, apply the following rules:
- Include only one step per task, invoked by a single `srun` call.
- Make that `srun` invocation the final command in the script.
- Do not invoke other Slurm commands (e.g. `scontrol`, `sinfo`) within the script.
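
For example, a minimal script that passes this validation could look like the following sketch (the file and program names are illustrative placeholders, not part of the examples shipped with XPK):

```bash
#!/bin/bash
#SBATCH --job-name=single_step_job
#SBATCH --output=single_step_job.out

# Any non-Slurm setup may precede the single srun invocation.
export DATA_DIR=/home/data

# Exactly one srun invocation, and it is the final command in the script.
srun python my_program.py --data "$DATA_DIR"
```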

### 2. xpk shell | Slurm login node - download scripts, models and data sets
Through `xpk shell` you can access the shared file system and edit files (e.g. when quick model changes are needed). It is the equivalent of the Slurm login node. To access the remote system, use the `xpk shell` command:
```shell
python3 xpk.py shell \
--project $PROJECT_ID \
--zone $COMPUTE_ZONE \
--cluster $CLUSTER_NAME
```

This opens a console on the cluster with /home/llama3 as the current working directory, as set by the `shell-working-directory` configuration. The subsequent commands in this section should be run on the login node, inside this xpk shell session.

### 3. Create a Python virtual environment and activate it

While in the shell, run the following commands:
```shell
python3 -m venv ./llama3_env
source ./llama3_env/bin/activate
```
Alternatively, you can use [conda](https://docs.conda.io/projects/conda/en/latest/user-guide/tasks/manage-environments.html) to manage the environment.
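
A rough conda equivalent would be the following (the environment name mirrors the venv example above, and the Python version is only illustrative):
```shell
conda create --name llama3_env python=3.10
conda activate llama3_env
```
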
### 4. Upload your training scripts and training data to the created storage
While in the shell, run the following commands:
```shell
python3 <<EOF
import urllib.request
urllib.request.urlretrieve("https://raw.githubusercontent.com/AI-Hypercomputer/xpk/refs/heads/slurm-fixes/examples/llama-3.1-finetuning/requirements.txt", "requirements.txt")
urllib.request.urlretrieve("https://raw.githubusercontent.com/AI-Hypercomputer/xpk/refs/heads/slurm-fixes/examples/llama-3.1-finetuning/train.py", "train.py")
urllib.request.urlretrieve("https://raw.githubusercontent.com/AI-Hypercomputer/xpk/refs/heads/slurm-fixes/examples/llama-3.1-finetuning/training_data.jsonl", "training_data.jsonl")
EOF
```

### 5. Install the necessary Python libraries
While in the shell, run the following command:
```shell
pip install -r requirements.txt
```

### 6. Download Llama 3.1 model weights
While in the shell, download the model from a model hub, e.g. Hugging Face:
```shell
pip install "huggingface_hub[cli]"
huggingface-cli download "meta-llama/Llama-3.1-8B-Instruct" \
--local-dir "meta-llama/Llama-3.1-8B-Instruct" \
--token [hf_token]
```
For this to work you need to:
- create a Hugging Face account
- create a Hugging Face access token (`hf_token`)
- request access to the Llama 3.1 models and wait for the request to be approved: https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct

Now you can exit the shell to continue with the batch commands:
```shell
exit
```

## Submit jobs - run a CUDA check and the Llama 3 fine-tuning script
Just like in Slurm, you can submit jobs in XPK as batch jobs, array jobs, or interactive jobs.

### 1. xpk run | srun - run a CUDA check in interactive mode
The `xpk run` command runs a job interactively and blocks: results are streamed to the terminal, and no other commands can be executed until the job finishes.
```shell
python3 xpk.py run \
--project [project] \
--zone [zone] \
--cluster [cluster] \
--nodes 1 \
--gpus-per-task nvidia.com/gpu:8 \
examples/llama-3.1-finetuning/check_cuda.sh
```

The command should display the following output:
```shell
CUDA available: True
Device count: 8
```
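
For reference, a script producing output of this shape could be as simple as the sketch below; the actual check_cuda.sh in the repository may differ:

```bash
#!/bin/bash
#SBATCH --job-name=check_cuda
#SBATCH --output=check_cuda.out

# Single srun invocation as the final command, printing basic CUDA information via PyTorch.
srun python -c "import torch; print('CUDA available:', torch.cuda.is_available()); print('Device count:', torch.cuda.device_count())"
```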

### 2. xpk batch | sbatch - run the training script in batch mode
Once your script is ready, run the `xpk batch` command, passing the script to execute as the workload:
```shell
python3 xpk.py batch \
--project [project] \
--zone [zone] \
--cluster [cluster] \
examples/llama-3.1-finetuning/batch_script.sh
```
The command finishes by displaying the name of the created job:
```shell
[XPK] Job name: xpk-def-app-profile-slurm-9zm2g
```

The job itself keeps running and may take longer, depending on your workload.
The output of the script execution is written to files in the attached storage, under the path set by the `--mount-point` parameter of the `storage create` command. You can follow their content by running the following command from within an xpk shell session:
```shell
tail -f example_script.out example_script.err
```
Once the execution has finished, you should see the following in the logs:
```shell
2025-02-21 13:02:08.431 GMT
{'train_runtime': 689.2645, 'train_samples_per_second': 0.048, 'train_steps_per_second': 0.004, 'train_loss': 2.037710189819336, 'epoch': 3.0}
```
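
For reference, a batch script driving this fine-tuning could look roughly like the sketch below. The actual examples/llama-3.1-finetuning/batch_script.sh may differ; the output and error file names here are chosen to match the tail command above:

```bash
#!/bin/bash
#SBATCH --job-name=llama3-finetune
#SBATCH --output=example_script.out
#SBATCH --error=example_script.err

# Activate the environment prepared earlier in the shared /home/llama3 directory.
source ./llama3_env/bin/activate

# Exactly one srun invocation, as the final command, running the downloaded training script.
srun python train.py
```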

## Cleanup
### 1. Stop the shell
```shell
python3 xpk.py shell stop \
--project $PROJECT_ID \
--zone $COMPUTE_ZONE \
--cluster $CLUSTER_NAME
```

### 2. Delete the shared storage
Provide the name of the storage created earlier:
```shell
python3 xpk.py storage delete STORAGE_NAME \
--project $PROJECT_ID \
--zone $COMPUTE_ZONE \
--cluster $CLUSTER_NAME
```

### 3. Delete the XPK cluster
```shell
python3 xpk.py cluster delete \
--project $PROJECT_ID \
--zone $COMPUTE_ZONE \
--cluster $CLUSTER_NAME
```
# More Slurm mode features

## Job management - check the status of your job
### 1. xpk job ls | squeue

Like the Slurm `squeue` command, `xpk job ls` lists the jobs that were scheduled through the Slurm-like mode on a specific cluster, together with their task completion status, duration, and age:
```shell
python3 xpk.py job ls \
--project $PROJECT_ID \
--zone $COMPUTE_ZONE \
--cluster $CLUSTER_NAME
```

The output should look like this:
```shell
NAME                              PROFILE               LOCAL QUEUE        COMPLETIONS   DURATION   AGE
xpk-def-app-profile-slurm-6s6ff   xpk-def-app-profile   multislice-queue   1/1           8s         66m
xpk-def-app-profile-slurm-fz5z8   xpk-def-app-profile   multislice-queue   1/1           4s         63m
```

### 2. xpk job cancel | scancel
If you want to cancel a job, use `xpk job cancel` and provide the name of the job you wish to cancel:
```shell
python3 xpk.py job cancel <job_name> \
--project $PROJECT_ID \
--zone $COMPUTE_ZONE \
--cluster $CLUSTER_NAME
```

### 3. xpk job info | sacct
To see the details of a submitted job, use the `xpk job info` command:
```shell
python3 xpk.py job info JOB_NAME \
--project $PROJECT_ID \
--zone $COMPUTE_ZONE \
--cluster $CLUSTER_NAME
```
Replace `JOB_NAME` with the name of your job, as listed by `xpk job ls`.

The expected output should look like this:
```
Job name: xpk-def-app-profile-slurm-6s6ff
Script name: ./job.sh
Profile: default_xpk-def-app-profile
Labels:
  kjobctl.x-k8s.io/mode: Slurm
  kjobctl.x-k8s.io/profile: xpk-def-app-profile
  kueue.x-k8s.io/queue-name: multislice-queue
Mounts:
- mountPath: /slurm/scripts
  name: slurm-scripts
- mountPath: /slurm/env
  name: slurm-env
Pods:
- Name: xpk-def-app-profile-slurm-6s6ff-0-kgtv8
  Status: Completed
Entrypoint environment variables template:
- SLURM_ARRAY_JOB_ID=1
- SLURM_ARRAY_TASK_COUNT=1
- SLURM_ARRAY_TASK_MAX=0
- SLURM_ARRAY_TASK_MIN=0
- SLURM_TASKS_PER_NODE=1
- SLURM_CPUS_PER_TASK=
- SLURM_CPUS_ON_NODE=
- SLURM_JOB_CPUS_PER_NODE=
- SLURM_CPUS_PER_GPU=
- SLURM_MEM_PER_CPU=
- SLURM_MEM_PER_GPU=
- SLURM_MEM_PER_NODE=
- SLURM_GPUS=
- SLURM_NTASKS=1
- SLURM_NTASKS_PER_NODE=1
- SLURM_NPROCS=1
- SLURM_NNODES=1
- SLURM_SUBMIT_DIR=/slurm/scripts
- SLURM_SUBMIT_HOST=$HOSTNAME
- SLURM_JOB_NODELIST=xpk-def-app-profile-slurm-6s6ff-0.xpk-def-app-profile-slurm-6s6ff
- SLURM_JOB_FIRST_NODE=xpk-def-app-profile-slurm-6s6ff-0.xpk-def-app-profile-slurm-6s6ff
- SLURM_JOB_ID=$(expr $JOB_COMPLETION_INDEX \* 1 + $i + 1)
- SLURM_JOBID=$(expr $JOB_COMPLETION_INDEX \* 1 + $i + 1)
- SLURM_ARRAY_TASK_ID=$container_index
- SLURM_JOB_FIRST_NODE_IP=${SLURM_JOB_FIRST_NODE_IP:-""}
```

### 4. xpk info | sinfo
To monitor the status of the queues, use the `xpk info` command, which provides an overview of the local and cluster queues and their status.

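A typical invocation follows the same flag pattern as the other commands in this guide (shown here as a sketch):
```shell
python3 xpk.py info \
--project $PROJECT_ID \
--zone $COMPUTE_ZONE \
--cluster $CLUSTER_NAME
```

The output looks similar to this: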
```
[XPK] Local Queues usage
QUEUE              ADMITTED_WORKLOADS   PENDING_WORKLOADS   2xv4-8:google.com/tpu
multislice-queue   0                    0                   0/8
[XPK] Cluster Queues usage
QUEUE           ADMITTED_WORKLOADS   PENDING_WORKLOADS   2xv4-8:google.com/tpu
cluster-queue
```

## Array jobs

Slurm mode in XPK supports the execution of array jobs, provided that the job scripts follow the requirements described in the Prepare scripts section above.

The example below defines an array job named `array_job` with ten jobs (indices 1 through 10), each consisting of one task with one CPU and 4 GB of memory. The job runs in the `compute` partition with a wall-time limit of one hour. Output for each job is directed to a file named `array_job_%A_%a.out`, where `%A` is the job ID and `%a` is the array index. Within each job, the `SLURM_ARRAY_TASK_ID` variable is used to construct an input file name, and `srun` executes `my_program` with one task and input from the corresponding file.

```bash
#!/bin/bash
#SBATCH --job-name=array_job
#SBATCH --array=1-10
#SBATCH -n 1
#SBATCH -c 1
#SBATCH --mem=4G
#SBATCH -p compute
#SBATCH -t 0-1
#SBATCH -o array_job_%A_%a.out

input_file=input_${SLURM_ARRAY_TASK_ID}.txt

srun -n 1 my_program < $input_file
```
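
An array job script like this is submitted the same way as any other batch script, for example (here `array_job.sh` is the file containing the script above):
```shell
python3 xpk.py batch \
--project [project] \
--zone [zone] \
--cluster [cluster] \
array_job.sh
```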

## Options
Slurm-like commands support the following Slurm options:

| Option | Description |
| --- | --- |
| -a, --array | array job specification |
| --cpus-per-task | number of CPUs a container inside a pod requires |
| -e, --error | where to redirect the standard error stream of a task; if not set, it goes to stdout and is available via kubectl logs |
| --gpus-per-task | number of GPUs a container inside a pod requires |
| -i, --input | what to pipe into the script |
| -J, --job-name=<jobname> | name of the job |
| --mem-per-cpu | memory a container requires, computed as the number of requested CPUs per task multiplied by mem-per-cpu |
| --mem-per-task | memory a container requires |
| -N, --nodes | number of pods to be used at a time - parallelism in indexed jobs |
| -n, --ntasks | number of identical containers inside of a pod, usually 1 |
| -o, --output | where to redirect the standard output stream of a task; if not set, it goes to stdout and is available via kubectl logs |
| -D, --chdir | change directory before executing the script |
| --partition | local queue name |

Flags can be passed on the command line or inside the script using the following format:
```
#SBATCH --job-name=array_job
#SBATCH --output=array_job_%A_%a.out
#SBATCH --error=array_job_%A_%a.err
#SBATCH --array=1-22
```
Inline parameters (the ones provided on the CLI command line) override the parameters in the script.
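
For example, options could be supplied inline at submission time, following the same flag pattern as the xpk run example earlier in this guide (a sketch; the exact set of inline flags accepted by each command may vary, and `array_job.sh` is a placeholder script name):
```shell
python3 xpk.py batch \
--project [project] \
--zone [zone] \
--cluster [cluster] \
--nodes 2 \
--ntasks 1 \
array_job.sh
```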