Commit 062b037

Reduced workload name

Shortened the workload name to "$USER-a4-llama3-1-70b" to fit the 32-character limit.

1 parent fa418f9 commit 062b037

File tree

1 file changed: +9 −9 lines changed

  • training/a4/llama3-1-70b/nemo-pretraining-gke/32node-bf16-seq8192-gbs2048/recipe

training/a4/llama3-1-70b/nemo-pretraining-gke/32node-bf16-seq8192-gbs2048/recipe/README.md

Lines changed: 9 additions & 9 deletions
@@ -1,7 +1,7 @@
 <!-- mdformat global-off -->
-# Pretrain llama3-1-70b-seq8192-gbs2048-mbs1-gpus256 workloads on a4 GKE Node pools with Nvidia NeMo Framework
+# Pretrain $USER-a4-llama3-1-70b workloads on a4 GKE Node pools with Nvidia NeMo Framework
 
-This recipe outlines the steps for running a llama3-1-70b-seq8192-gbs2048-mbs1-gpus256 pretraining
+This recipe outlines the steps for running a $USER-a4-llama3-1-70b pretraining
 workload on [a4 GKE Node pools](https://cloud.google.com/kubernetes-engine) by using the
 [NVIDIA NeMo framework](https://github.com/NVIDIA/nemo).
 
@@ -89,7 +89,7 @@ your client:
 
 ```bash
 cd $RECIPE_ROOT
-export WORKLOAD_NAME=$USER-a4-llama3-1-70b-seq8192-gbs2048-mbs1-gpus256-32node
+export WORKLOAD_NAME=$USER-a4-llama3-1-70b
 helm install $WORKLOAD_NAME . -f values.yaml \
 --set-file workload_launcher=launcher.sh \
 --set-file workload_config=llama3-1-70b-seq8192-gbs2048-mbs1-gpus256.py \
@@ -107,7 +107,7 @@ your client:
 
 ```bash
 cd $RECIPE_ROOT
-export WORKLOAD_NAME=$USER-a4-llama3-1-70b-seq8192-gbs2048-mbs1-gpus256-32node
+export WORKLOAD_NAME=$USER-a4-llama3-1-70b
 helm install $WORKLOAD_NAME . -f values.yaml \
 --set-file workload_launcher=launcher.sh \
 --set-file workload_config=llama3-1-70b-seq8192-gbs2048-mbs1-gpus256.py \
@@ -124,12 +124,12 @@ your client:
 To check the status of pods in your job, run the following command:
 
 ```
-kubectl get pods | grep $USER-a4-llama3-1-70b-seq8192-gbs2048-mbs1-gpus256-32node
+kubectl get pods | grep $USER-a4-llama3-1-70b
 ```
 
 Replace the following:
 
-- JOB_NAME_PREFIX - your job name prefix. For example $USER-a4-llama3-1-70b-seq8192-gbs2048-mbs1-gpus256-32node.
+- JOB_NAME_PREFIX - your job name prefix. For example $USER-a4-llama3-1-70b.
 
 To get the logs for one of the pods, run the following command:
 
@@ -141,13 +141,13 @@ Information about the training job's progress, including crucial details such as
 loss, step count, and step time, is generated by the rank 0 process.
 This process runs on the pod whose name begins with
 `JOB_NAME_PREFIX-workload-0-0`.
-For example: `$USER-a4-llama3-1-70b-seq8192-gbs2048-mbs1-gpus256-32node-workload-0-0-s9zrv`.
+For example: `$USER-a4-llama3-1-70b-workload-0-0-s9zrv`.
 
 ### Uninstall the Helm release
 
 You can delete the job and other resources created by the Helm chart. To
 uninstall Helm, run the following command from your client:
 
 ```bash
-helm uninstall $USER-a4-llama3-1-70b-seq8192-gbs2048-mbs1-gpus256-32node
-```
+helm uninstall $USER-a4-llama3-1-70b
+```
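As a quick sanity check on the rename, a shell one-liner can compare the old and new workload name lengths against the 32-character limit cited in the commit message. This is an illustrative sketch, not part of the recipe; the username `alice` is an assumption standing in for `$USER`:

```shell
#!/bin/sh
# Illustrative check: "alice" is a stand-in for $USER.
USER=alice
OLD_NAME=$USER-a4-llama3-1-70b-seq8192-gbs2048-mbs1-gpus256-32node
NEW_NAME=$USER-a4-llama3-1-70b

# Print the length of each name in characters.
echo "old: ${#OLD_NAME} chars"   # 57 chars: over the 32-character limit
echo "new: ${#NEW_NAME} chars"   # 21 chars: fits, with headroom for longer usernames
```

With a 5-character username the old name is 57 characters, so it exceeded the limit even before suffixes like `-workload-0-0-s9zrv` were appended; the shortened name leaves room for both longer usernames and those suffixes.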

0 commit comments