Commit 9908589

Update README for workload name and paths
1 parent ceb8871 commit 9908589

1 file changed: +8 -8 lines changed
training/a4/llama3-1-70b/nemo-pretraining-gke/8node-bf16-seq8192-gbs256/README.md

Lines changed: 8 additions & 8 deletions
@@ -71,7 +71,7 @@ Clone the `gpu-recipes` repository and set a reference to the recipe folder.
 git clone https://github.com/ai-hypercomputer/gpu-recipes.git
 cd gpu-recipes
 export REPO_ROOT=`git rev-parse --show-toplevel`
-export RECIPE_ROOT=$REPO_ROOT/training/a4/llama3-1-70b/nemo-pretraining-gke/8_nodes
+export RECIPE_ROOT=$REPO_ROOT/training/a4/llama3-1-70b/nemo-pretraining-gke/8node-bf16-seq8192-gbs256
 cd $RECIPE_ROOT
 ```

@@ -89,7 +89,7 @@ your client:

 ```bash
 cd $RECIPE_ROOT
-export WORKLOAD_NAME=$USER-a4-llama3-1-70b-8node
+export WORKLOAD_NAME=$USER-a4-llama3-1-70b
 helm install $WORKLOAD_NAME . -f values.yaml \
 --set-file workload_launcher=launcher.sh \
 --set-file workload_config=llama3-1-70b-bf16-seq8192-gbs256-gpus64.py \
@@ -107,7 +107,7 @@ your client:

 ```bash
 cd $RECIPE_ROOT
-export WORKLOAD_NAME=$USER-a4-llama3-1-70b-8node
+export WORKLOAD_NAME=$USER-a4-llama3-1-70b
 helm install $WORKLOAD_NAME . -f values.yaml \
 --set-file workload_launcher=launcher.sh \
 --set-file workload_config=llama3-1-70b-bf16-seq8192-gbs256-gpus64.py \
@@ -124,12 +124,12 @@ your client:
 To check the status of pods in your job, run the following command:

 ```
-kubectl get pods | grep $USER-a4-llama3-1-70b-8node
+kubectl get pods | grep $USER-a4-llama3-1-70b
 ```

 Replace the following:

-- JOB_NAME_PREFIX - your job name prefix. For example $USER-a4-llama3-1-70b-8node.
+- JOB_NAME_PREFIX - your job name prefix. For example $USER-a4-llama3-1-70b

 To get the logs for one of the pods, run the following command:

@@ -141,13 +141,13 @@ Information about the training job's progress, including crucial details such as
 loss, step count, and step time, is generated by the rank 0 process.
 This process runs on the pod whose name begins with
 `JOB_NAME_PREFIX-workload-0-0`.
-For example: `$USER-a4-llama3-1-70b-8node-workload-0-0-s9zrv`.
+For example: `$USER-a4-llama3-1-70b-workload-0-0-s9zrv`.

 ### Uninstall the Helm release

 You can delete the job and other resources created by the Helm chart. To
 uninstall Helm, run the following command from your client:

 ```bash
-helm uninstall $USER-a4-llama3-1-70b-8node
-```
+helm uninstall $USER-a4-llama3-1-70b
+```
