
Commit 3fa16b4

gciepawloch00 and pawloch00 authored

A4 support in readme (#460)

Co-authored-by: pawloch00 <[email protected]>

1 parent 1be6dc7 commit 3fa16b4

File tree

1 file changed: +43 -28 lines changed


README.md

Lines changed: 43 additions & 28 deletions
@@ -50,8 +50,9 @@ xpk supports the following TPU types:
 and the following GPU types:
 * A100
 * A3-Highgpu (h100)
-* A3-Mega (h100-mega) - [Create cluster](#provisioning-a3-ultra-and-a3-mega-clusters-gpu-machines), [Create workloads](#workloads-for-a3-ultra-and-a3-mega-clusters-gpu-machines)
-* A3-Ultra (h200) - [Create cluster](#provisioning-a3-ultra-and-a3-mega-clusters-gpu-machines), [Create workloads](#workloads-for-a3-ultra-and-a3-mega-clusters-gpu-machines)
+* A3-Mega (h100-mega) - [Create cluster](#provisioning-a3-ultra-a3-mega-and-a4-clusters-gpu-machines), [Create workloads](#workloads-for-a3-ultra-a3-mega-and-a4-clusters-gpu-machines)
+* A3-Ultra (h200) - [Create cluster](#provisioning-a3-ultra-a3-mega-and-a4-clusters-gpu-machines), [Create workloads](#workloads-for-a3-ultra-a3-mega-and-a4-clusters-gpu-machines)
+* A4 (b200) - [Create cluster](#provisioning-a3-ultra-a3-mega-and-a4-clusters-gpu-machines), [Create workloads](#workloads-for-a3-ultra-a3-mega-and-a4-clusters-gpu-machines)

 and the following CPU types:
 * n2-standard-32
@@ -425,24 +426,32 @@ will fail the cluster creation process because Vertex AI Tensorboard is not supp
 --tpu-type=v5litepod-16
 ```

-## Provisioning A3-Ultra and A3-Mega clusters (GPU machines)
-To create a cluster with A3 machines, run the below command. To create workloads on these clusters see [here](#workloads-for-a3-ultra-and-a3-mega-clusters-gpu-machines).
-* For A3-Ultra: --device-type=h200-141gb-8
-* For A3-Mega: --device-type=h100-mega-80gb-8
+## Provisioning A3 Ultra, A3 Mega and A4 clusters (GPU machines)
+To create a cluster with A3 or A4 machines, run the command below with the selected device type. To create workloads on these clusters see [here](#workloads-for-a3-ultra-a3-mega-and-a4-clusters-gpu-machines).

-```shell
-python3 xpk.py cluster create \
---cluster CLUSTER_NAME --device-type=h200-141gb-8 \
+**Note:** Creating A3 Ultra, A3 Mega and A4 clusters is currently supported **only** on the linux/amd64 architecture.
+
+Machine | Device type
+:- | :-
+A3 Mega | `h100-mega-80gb-8`
+A3 Ultra | `h200-141gb-8`
+A4 | `b200-8`
+
+
+```shell
+python3 xpk.py cluster create \
+--cluster CLUSTER_NAME --device-type DEVICE_TYPE \
 --zone=$COMPUTE_ZONE --project=$PROJECT_ID \
---num-nodes=4 --reservation=$RESERVATION_ID
-```
-Currently, the below flags/arguments are supported for A3-Mega and A3-Ultra machines:
-* --num-nodes
-* --default-pool-cpu-machine-type
-* --default-pool-cpu-num-nodes
-* --reservation
-* --spot
-* --on-demand (only A3-Mega)
+--num-nodes=$NUM_NODES --reservation=$RESERVATION_ID
+```
+
+Currently, the flags/arguments below are supported for A3 Mega, A3 Ultra and A4 machines:
+* `--num-nodes`
+* `--default-pool-cpu-machine-type`
+* `--default-pool-cpu-num-nodes`
+* `--reservation`
+* `--spot`
+* `--on-demand` (A3 Mega only)


 ## Storage
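
For illustration, a concrete invocation of the new cluster-create template for an A4 machine might look like the sketch below. The command, flags, and the `b200-8` device type come from the diff above; the cluster name, zone, project, node count, and reservation values are hypothetical placeholders.

```shell
# Minimal sketch (hypothetical values): create a small A4 (b200-8) cluster
# from a reservation, following the template added in this commit.
export COMPUTE_ZONE=us-central1-b        # placeholder zone
export PROJECT_ID=my-gcp-project         # placeholder project
export RESERVATION_ID=my-a4-reservation  # placeholder reservation

python3 xpk.py cluster create \
  --cluster my-a4-cluster --device-type b200-8 \
  --zone=$COMPUTE_ZONE --project=$PROJECT_ID \
  --num-nodes=2 --reservation=$RESERVATION_ID
```

Per the flag list above, `--spot` (or `--on-demand`, A3 Mega only) could stand in for `--reservation` when reserved capacity is not used.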
@@ -662,21 +671,27 @@ increase this to a large number, say 50. Real jobs can be interrupted due to
 hardware failures and software updates. We assume your job has implemented
 checkpointing so the job restarts near where it was interrupted.

-### Workloads for A3-Ultra and A3-Mega clusters (GPU machines)
-To submit jobs on a cluster with A3 machines, run the below command. To create a cluster with A3 machines see [here](#provisioning-a3-ultra-and-a3-mega-clusters-gpu-machines).
-* For A3-Ultra: --device-type=h200-141gb-8
-* For A3-Mega: --device-type=h100-mega-80gb-8
+### Workloads for A3 Ultra, A3 Mega and A4 clusters (GPU machines)
+To submit jobs on a cluster with A3 or A4 machines, run the command below with the selected device type. To create a cluster with A3 or A4 machines see [here](#provisioning-a3-ultra-a3-mega-and-a4-clusters-gpu-machines).

-```shell
-python3 xpk.py workload create \
+
+Machine | Device type
+:- | :-
+A3 Mega | `h100-mega-80gb-8`
+A3 Ultra | `h200-141gb-8`
+A4 | `b200-8`
+
+```shell
+python3 xpk.py workload create \
 --workload=$WORKLOAD_NAME --command="echo goodbye" \
---cluster=$CLUSTER_NAME --device-type=h200-141gb-8 \
+--cluster=$CLUSTER_NAME --device-type DEVICE_TYPE \
 --zone=$COMPUTE_ZONE --project=$PROJECT_ID \
 --num-nodes=$WORKLOAD_NUM_NODES
-```
-> The docker image flags/arguments introduced in [workloads section](#workload-create) can be used with A3 machines as well.
+```
+
+> The docker image flags/arguments introduced in the [workloads section](#workload-create) can be used with A3 or A4 machines as well.

-In order to run NCCL test on A3 Ultra machines check out [this guide](/examples/nccl/nccl.md).
+To run an NCCL test on A3 machines, check out [this guide](/examples/nccl/nccl.md).

 ### Workload Priority and Preemption
 * Set the priority level of your workload with `--priority=LEVEL`
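
Likewise, a workload could be submitted to such a cluster with the new workload-create template. This is a minimal sketch: the workload name, command, cluster name, and node count are hypothetical placeholders, while the flags mirror the diff above.

```shell
# Minimal sketch (hypothetical values): submit a trivial workload to the
# A4 cluster from the previous sketch.
python3 xpk.py workload create \
  --workload=hello-a4 --command="echo hello from A4" \
  --cluster=my-a4-cluster --device-type b200-8 \
  --zone=$COMPUTE_ZONE --project=$PROJECT_ID \
  --num-nodes=2
```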
