Merged
@@ -95,7 +95,7 @@ spec:
--model-size ${MODEL_SIZE} \
--huggingface-checkpoint True

-gsutil -m cp -r ${CHECKPOINT_TPU_UNSCANNED} /gcs/{{ .Values.model.name }}/output/unscanned_ckpt/checkpoints/
+gcloud storage cp --recursive ${CHECKPOINT_TPU_UNSCANNED} /gcs/{{ .Values.model.name }}/output/unscanned_ckpt/checkpoints/

echo "Conversion Job Complete. Unscanned checkpoints should be at ${CHECKPOINT_TPU_UNSCANNED}"

@@ -106,7 +106,7 @@ spec:

echo "Conversion Job Complete. Unscanned checkpoints should be at ${CHECKPOINT_TPU_UNSCANNED}"
echo "Copying unscanned checkpoints to GCS bucket..."
-gsutil -m cp -r ${CHECKPOINT_TPU_UNSCANNED} gs://${GCS_FUSE_BUCKET}/{{ .Values.model.name }}/output/unscanned_ckpt/
+gcloud storage cp --recursive ${CHECKPOINT_TPU_UNSCANNED} gs://${GCS_FUSE_BUCKET}/{{ .Values.model.name }}/output/unscanned_ckpt/
echo "Finished copying unscanned checkpoints to gs://${GCS_FUSE_BUCKET}/{{ .Values.model.name }}/output/unscanned_ckpt/"

volumeMounts:
4 changes: 2 additions & 2 deletions inference/trillium/JetStream-Maxtext/Llama2-7B/README.md
@@ -42,7 +42,7 @@ bash download.sh # When prompted, choose 7B. This should create a directory llam
export CHKPT_BUCKET=gs://...
export MAXTEXT_BUCKET_SCANNED=gs://...
export MAXTEXT_BUCKET_UNSCANNED=gs://...
-gsutil cp -r llama/llama-2-7b/* ${CHKPT_BUCKET}
+gcloud storage cp --recursive llama/llama-2-7b/* ${CHKPT_BUCKET}


# Checkpoint conversion
@@ -117,4 +117,4 @@ Mean TPOT: 5052.76 ms
Median TPOT: 164.01 ms
P99 TPOT: 112171.56 ms

-```
+```
4 changes: 2 additions & 2 deletions inference/v5e/JetStream-Maxtext/Llama2-7B/README.md
@@ -42,7 +42,7 @@ bash download.sh # When prompted, choose 7B. This should create a directory llam
export CHKPT_BUCKET=gs://...
export MAXTEXT_BUCKET_SCANNED=gs://...
export MAXTEXT_BUCKET_UNSCANNED=gs://...
-gsutil cp -r llama/llama-2-7b ${CHKPT_BUCKET}
+gcloud storage cp --recursive llama/llama-2-7b ${CHKPT_BUCKET}


# Checkpoint conversion
@@ -117,4 +117,4 @@ Mean TPOT: 5052.76 ms
Median TPOT: 164.01 ms
P99 TPOT: 112171.56 ms

-```
+```
6 changes: 3 additions & 3 deletions microbenchmarks/trillium/collectives/README.md
@@ -33,15 +33,15 @@ psum_ici: Matrix size: 17408x17408, dtype=<class 'jax.numpy.bfloat16'>, matrix_s

Results will be printed out and also stored at `/tmp/microbenchmarks/collectives`. You can save the stored results to GCS by adding the following to `--command` in the XPK command:
```
-gsutil cp -r /tmp/microbenchmarks/collectives gs://<your-gcs-bucket>
+gcloud storage cp --recursive /tmp/microbenchmarks/collectives gs://<your-gcs-bucket>
```

### Run with a custom yaml config
If you would like to run with a custom defined yaml with modified configurations (e.g. warmup_tries, tries, matrix_dim_range) you may do so by uploading it to a GCS bucket, pulling the yaml file from GCS in the workload, and then referencing the yaml file in the benchmark command.

Start by creating a yaml file `your_config.yaml`. Take a look at [1x_v6e_256.yaml](https://github.com/AI-Hypercomputer/accelerator-microbenchmarks/blob/35c10a42e8cfab7593157327dd3ad3150e4c001d/configs/1x_v6e_256.yaml) for an example yaml config. Then upload it to your GCS bucket:
```
-gsutil cp your_config.yaml gs://<your-gcs-bucket>
+gcloud storage cp your_config.yaml gs://<your-gcs-bucket>
```

Then use a modified launch command that pulls the yaml file from GCS and references it in the benchmark command:
@@ -51,7 +51,7 @@
python3 ~/xpk/xpk.py workload create \
--project=${PROJECT} \
--zone=${ZONE} \
--device-type=v6e-256 \
-  --command="git clone https://github.com/AI-Hypercomputer/accelerator-microbenchmarks.git && cd accelerator-microbenchmarks && git checkout trillium-collectives && pip install -r requirements.txt && echo '4096 41943040 314572800' > /proc/sys/net/ipv4/tcp_rmem && export LIBTPU_INIT_ARGS='--megascale_grpc_premap_memory_bytes=17179869184 --xla_tpu_enable_sunk_dcn_allreduce_done_with_host_reduction=true' && gsutil cp gs://<your-gcs-bucket>/your_config.yaml configs/ && python src/run_benchmark.py --config=configs/your_config.yaml" \
+  --command="git clone https://github.com/AI-Hypercomputer/accelerator-microbenchmarks.git && cd accelerator-microbenchmarks && git checkout trillium-collectives && pip install -r requirements.txt && echo '4096 41943040 314572800' > /proc/sys/net/ipv4/tcp_rmem && export LIBTPU_INIT_ARGS='--megascale_grpc_premap_memory_bytes=17179869184 --xla_tpu_enable_sunk_dcn_allreduce_done_with_host_reduction=true' && gcloud storage cp gs://<your-gcs-bucket>/your_config.yaml configs/ && python src/run_benchmark.py --config=configs/your_config.yaml" \
--num-slices=1 \
--docker-image=us-docker.pkg.dev/cloud-tpu-images/jax-stable-stack/tpu:jax0.5.2-rev1 \
--workload=${WORKLOAD_NAME}
5 changes: 1 addition & 4 deletions training/archive/trillium/Llama3.0-70B-PyTorch/XPK/README.md
@@ -1,5 +1,3 @@
-
-
# Instructions for training Llama 3.0 70B on Trillium TPU on multipod using XPK

## Environment Setup
@@ -88,7 +86,7 @@ You can use the profile
export PROFILE_SCRIPT_PATH=../../../../utils/

# download the profile from gcp bucket to local
-gsutil cp -r $PROFILE_LOG_DIR ./
+gcloud storage cp --recursive $PROFILE_LOG_DIR ./

# locate the xplane.pb file and process
PYTHONPATH=$PROFILE_SCRIPT_PATH:$PYTHONPATH python $PROFILE_SCRIPT_PATH/profile_convert.py xplane.pb
@@ -112,4 +110,3 @@ Plane ID: 2, Name: /device:TPU:0
Got 10 iterations
1.8454
```
-
5 changes: 1 addition & 4 deletions training/archive/trillium/Llama3.0-8B-PyTorch/XPK/README.md
@@ -1,5 +1,3 @@
-
-
# Instructions for training Llama 3.0 8B on Trillium TPU on multipod using XPK

## Environment Setup
@@ -87,7 +85,7 @@ You can use the profile
export PROFILE_SCRIPT_PATH=../../../../utils/

# download the profile from gcp bucket to local
-gsutil cp -r $PROFILE_LOG_DIR ./
+gcloud storage cp --recursive $PROFILE_LOG_DIR ./

# locate the xplane.pb file and process
PYTHONPATH=$PROFILE_SCRIPT_PATH:$PYTHONPATH python $PROFILE_SCRIPT_PATH/profile_convert.py xplane.pb
@@ -111,4 +109,3 @@ Plane ID: 2, Name: /device:TPU:0
Got 10 iterations
1.8454
```
-
@@ -99,7 +99,7 @@ You can use the profile
export PROFILE_SCRIPT_PATH=../../../../utils/

# download the profile from gcp bucket to local
-gsutil cp -r $PROFILE_LOG_DIR ./
+gcloud storage cp --recursive $PROFILE_LOG_DIR ./

# locate the profile output ending with ".pb".
# Name it xplane.pb file, and process it
5 changes: 1 addition & 4 deletions training/archive/trillium/Mixtral-8x7B-Pytorch/XPK/README.md
@@ -1,5 +1,3 @@
-
-
# Instructions for training Mixtral 8x7B on Trillium TPU on multipod using XPK

## Environment Setup
@@ -87,7 +85,7 @@ You can use the profile
# this is the place we place the profile processing script
export PROFILE_SCRIPT_PATH=../../../../utils/
# download the profile from gcp bucket to local
-gsutil cp -r $PROFILE_LOG_DIR ./
+gcloud storage cp --recursive $PROFILE_LOG_DIR ./
# locate the xplane.pb file and process
PYTHONPATH=$PROFILE_SCRIPT_PATH:$PYTHONPATH python $PROFILE_SCRIPT_PATH/profile_convert.py xplane.pb
```
@@ -110,4 +108,3 @@ Plane ID: 2, Name: /device:TPU:0
Got 10 iterations
1.8454
```
-
@@ -243,7 +243,7 @@ huggingface-cli download RaphaelLiu/PusaV1_training --repo-type dataset --local-
python src/maxdiffusion/data_preprocessing/wan_pusav1_to_tfrecords.py src/maxdiffusion/configs/base_wan_14b.yml train_data_dir=${HF_DATASET_DIR} tfrecords_dir=${TFRECORDS_DATASET_DIR} no_records_per_shard=10 skip_jax_distributed_system=True

# Upload to gcs
-gsutil -m cp -r ${TFRECORDS_DATASET_DIR} ${DATASET_DIR}
+gcloud storage cp --recursive ${TFRECORDS_DATASET_DIR} ${DATASET_DIR}
```

## Run the recipe
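Every change in this PR applies the same mechanical rewrite: `gsutil [-m] cp -r` becomes `gcloud storage cp --recursive`, and the `-m` parallelism flag is simply dropped because `gcloud storage` parallelizes transfers by default. A minimal Python sketch of that rewrite rule, useful for migrating any remaining scripts (the function name and structure are illustrative, not part of this PR):

```python
def gsutil_to_gcloud(cmd: str) -> str:
    """Rewrite a gsutil copy command to its gcloud storage equivalent.

    Handles the patterns changed in this PR:
      gsutil [-m] cp [-r] SRC DST  ->  gcloud storage cp [--recursive] SRC DST
    The -m flag is dropped: gcloud storage parallelizes by default.
    """
    tokens = cmd.split()
    assert tokens[0] == "gsutil", "not a gsutil command"
    tokens = tokens[1:]
    if tokens and tokens[0] == "-m":  # global parallelism flag: no replacement needed
        tokens = tokens[1:]
    assert tokens and tokens[0] == "cp", "only cp is handled in this sketch"
    # Map the short recursive flag to its gcloud storage spelling.
    rest = ["--recursive" if t == "-r" else t for t in tokens[1:]]
    return " ".join(["gcloud", "storage", "cp"] + rest)


print(gsutil_to_gcloud("gsutil -m cp -r ${TFRECORDS_DATASET_DIR} ${DATASET_DIR}"))
# gcloud storage cp --recursive ${TFRECORDS_DATASET_DIR} ${DATASET_DIR}
```

Commands that already lack `-m` or `-r` (such as the single-file `your_config.yaml` upload above) pass through with only the `gsutil` prefix swapped.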