Merged
6 changes: 3 additions & 3 deletions docs/ai-testbed/cerebras/csl.md
Original file line number Diff line number Diff line change
@@ -89,7 +89,7 @@ Example script to forward port 8000 to localhost 8008:
export SDK_PORT=8000
export LOCAL_PORT=8008
export ALCFUserID=<your alcf username>
ssh -L $LOCAL_PORT:localhost:$LOCAL_PORT $ALCFUserID@cer-login-04.ai.alcf.anl.gov -t ssh -L $LOCAL_PORT:localhost:$SDK_PORT -N cer-anl-net001-us-sr01
ssh -L $LOCAL_PORT:localhost:$LOCAL_PORT $ALCFUserID@cerebras.alcf.anl.gov -t ssh -L $LOCAL_PORT:localhost:$SDK_PORT -N cer-anl-net001-us-sr01
```

Then open the following URL in your web browser: `http://localhost:8008/sdk-gui/`
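Before opening the browser, you can check that something is answering on the local end of the tunnel. The probe below is only a sketch: it starts a stand-in local server to illustrate the pattern; in practice the real forwarded sdk-gui port would already be listening.

```bash
LOCAL_PORT=8008
# Stand-in for the forwarded service; with a live tunnel, skip this server.
python3 -m http.server "$LOCAL_PORT" --bind 127.0.0.1 >/dev/null 2>&1 &
srv=$!
sleep 1
# An HTTP 200 means something is listening on the forwarded port.
code=$(curl -s -o /dev/null -w '%{http_code}' "http://127.0.0.1:${LOCAL_PORT}/")
kill "$srv"
echo "$code"
```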
@@ -114,8 +114,8 @@ pip install --upgrade pip

**Install SDK Packages:** Install the `cerebras_appliance` and `cerebras_sdk` Python packages in the virtual environment, specifying the appropriate Cerebras Software release:
```bash linenums="1"
pip install cerebras_appliance==2.6.0
pip install cerebras_sdk==2.6.0
pip install cerebras_appliance==2.9.0
pip install cerebras_sdk==2.9.0
```

### Examples
14 changes: 7 additions & 7 deletions docs/ai-testbed/cerebras/customizing-environment.md
@@ -4,16 +4,16 @@

#### To make a PyTorch virtual environment for Cerebras

Clone the Cerebras modelzoo, if it is not already cloned. Check out the R 2.6.0 release.
Clone the Cerebras modelzoo, if it is not already cloned. Check out the R 2.9.0 release.

```console
mkdir ~/R_2.6.0
cd ~/R_2.6.0
mkdir ~/R_2.9.0
cd ~/R_2.9.0
export HTTPS_PROXY=http://proxy.alcf.anl.gov:3128
git clone https://github.com/Cerebras/modelzoo.git
cd modelzoo
git tag
git checkout Release_2.6.0
git checkout Release_2.9.0
```
Note: a `git pull` will not update the tags; if `modelzoo/setup.py` does not exist after tag checkout, please re-clone `modelzoo`.
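The behavior in the note can be reproduced in a throwaway repository: a tag created upstream after your clone can be retrieved with an explicit `git fetch --tags` (re-cloning, as noted above, is the sure fallback).

```bash
# Throwaway demo: retrieve a tag created upstream after the initial clone.
tmp=$(mktemp -d)
git init -q "$tmp/upstream"
git -C "$tmp/upstream" -c user.name=demo -c user.email=demo@example.com \
    commit -q --allow-empty -m "init"
git clone -q "$tmp/upstream" "$tmp/clone"
git -C "$tmp/upstream" tag Release_2.9.0   # tag appears only after the clone
git -C "$tmp/clone" fetch -q --tags        # refresh the clone's tag list
git -C "$tmp/clone" checkout -q Release_2.9.0
```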

@@ -26,8 +26,8 @@ export https_proxy=http://proxy.alcf.anl.gov:3128
Then build the virtual environment

```console
mkdir ~/R_2.6.0
cd ~/R_2.6.0
mkdir ~/R_2.9.0
cd ~/R_2.9.0
# Note: "deactivate" does not actually work in scripts.
deactivate
rm -r venv_cerebras_pt
@@ -46,7 +46,7 @@ pip install -e modelzoo
To activate a virtual environment

```console
source ~/R_2.6.0/venv_cerebras_pt/bin/activate
source ~/R_2.9.0/venv_cerebras_pt/bin/activate
```

To deactivate a virtual environment,
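The deactivation command itself is collapsed in the diff above; it is presumably just the `deactivate` shell function that `activate` installs. A self-contained sketch with a throwaway venv:

```bash
# Sketch: `activate` defines a `deactivate` shell function that restores
# the previous environment when called.
python3 -m venv --without-pip /tmp/demo_venv
source /tmp/demo_venv/bin/activate
echo "$VIRTUAL_ENV"              # the venv path while active
deactivate
echo "${VIRTUAL_ENV:-unset}"     # unset again after deactivation
```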
56 changes: 28 additions & 28 deletions docs/ai-testbed/cerebras/example-programs.md
@@ -4,13 +4,13 @@
Make a working directory and a local copy of the Cerebras **modelzoo** repository, if not previously done, as follows.

```bash
mkdir ~/R_2.6.0
cd ~/R_2.6.0
mkdir ~/R_2.9.0
cd ~/R_2.9.0
export HTTPS_PROXY=http://proxy.alcf.anl.gov:3128
git clone https://github.com/Cerebras/modelzoo.git
cd modelzoo
git tag
git checkout Release_2.6.0
git checkout Release_2.9.0
```

Note: to access any external web resources from a Cerebras user node, you will need to have a proxy environment variable set (or equivalent). `wget` needs the lower-case proxy environment variable.
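Concretely (with the proxy host used elsewhere on this page), setting both cases covers `wget` and tools that read the upper-case names:

```bash
export HTTPS_PROXY=http://proxy.alcf.anl.gov:3128
export https_proxy="$HTTPS_PROXY"   # wget reads the lower-case form
export http_proxy="$HTTPS_PROXY"
```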
@@ -43,17 +43,17 @@ To run Unet with the <a href="https://www.kaggle.com/c/severstal-steel-defect-de
First, source a Cerebras PyTorch virtual environment.

```console
source ~/R_2.6.0/venv_cerebras_pt/bin/activate
source ~/R_2.9.0/venv_cerebras_pt/bin/activate
```

Then

```console
cd ~/R_2.6.0/modelzoo/src/cerebras/modelzoo/models/nlp/bert
cd ~/R_2.9.0/modelzoo/src/cerebras/modelzoo/models/nlp/bert
cp /software/cerebras/dataset/severstal-steel-defect-detection/params_severstal_binary_rawds.yaml configs/params_severstal_binary_rawds.yaml
export MODEL_DIR=model_dir_unet
if [ -d "$MODEL_DIR" ]; then rm -Rf $MODEL_DIR; fi
python run.py CSX --job_labels name=unet_pt --params configs/params_severstal_binary_rawds.yaml --model_dir $MODEL_DIR --mode train --mount_dirs /home/ /software --python_paths /home/$(whoami)/R_2.6.0/modelzoo/ --compile_dir $(whoami) |& tee mytest.log
python run.py CSX --job_labels name=unet_pt --params configs/params_severstal_binary_rawds.yaml --model_dir $MODEL_DIR --mode train --mount_dirs /home/ /software --python_paths /home/$(whoami)/R_2.9.0/modelzoo/ --compile_dir $(whoami) |& tee mytest.log
```
--->

@@ -68,7 +68,7 @@ The BraggNN model has two versions:<br>

```console
TODO
cd ~/R_2.6.0/anl_shared/braggnn/tf
cd ~/R_2.9.0/anl_shared/braggnn/tf
# This yaml has a correct path to a BraggNN dataset
cp /software/cerebras/dataset/BraggN/params_bragg_nonlocal_sampleds.yaml configs/params_bragg_nonlocal_sampleds.yaml
export MODEL_DIR=model_dir_braggnn
@@ -88,23 +88,23 @@ source /software/cerebras/venvs/venv_cerebras_pt/bin/activate
# or your personal venv
--->
```console
source ~/R_2.6.0/venv_cerebras_pt/bin/activate
source ~/R_2.9.0/venv_cerebras_pt/bin/activate
```

Then

```console
cd ~/R_2.6.0/modelzoo/src/cerebras/modelzoo/models/nlp/bert
cd ~/R_2.9.0/modelzoo/src/cerebras/modelzoo/models/nlp/bert
cp /software/cerebras/dataset/bert_large/bert_large_MSL128_sampleds.yaml configs/bert_large_MSL128_sampleds.yaml
export MODEL_DIR=model_dir_bert_large_pytorch
if [ -d "$MODEL_DIR" ]; then rm -Rf $MODEL_DIR; fi
cszoo fit configs/bert_large_MSL128_sampleds.yaml --job_labels name=bert_pt --model_dir $MODEL_DIR |& tee mytest.log
```
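Two shell idioms recur in these run commands: the `if [ -d ... ]` guard clears any stale model directory so each run starts clean, and `|& tee` both displays the run output and saves it to a log. A minimal illustration:

```bash
MODEL_DIR=model_dir_demo
mkdir -p "$MODEL_DIR"                                   # pretend a stale run exists
if [ -d "$MODEL_DIR" ]; then rm -Rf "$MODEL_DIR"; fi    # start from a clean slate
# `|&` (bash) pipes stdout and stderr together; `tee` echoes and logs them.
echo "training..." |& tee mytest.log
```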
<!---
previously,
python run.py CSX --job_labels name=bert_pt --params configs/bert_large_MSL128_sampleds.yaml --num_workers_per_csx=1 --mode train --model_dir $MODEL_DIR --mount_dirs /home/ /software/ --python_paths /home/$(whoami)/R_2.6.0/modelzoo/src --compile_dir $(whoami) |& tee mytest.log
python run.py CSX --job_labels name=bert_pt --params configs/bert_large_MSL128_sampleds.yaml --num_workers_per_csx=1 --mode train --model_dir $MODEL_DIR --mount_dirs /home/ /software/ --python_paths /home/$(whoami)/R_2.9.0/modelzoo/src --compile_dir $(whoami) |& tee mytest.log
--->
Note: the vocabulary file referenced in `/software/cerebras/dataset/bert_large/bert_large_MSL128_sampleds.yaml` is the same as the one at `/home/$(whoami)/R_2.6.0/modelzoo/src/cerebras/modelzoo/models/vocab/google_research_uncased_L-12_H-768_A-12.txt`.
Note: the vocabulary file referenced in `/software/cerebras/dataset/bert_large/bert_large_MSL128_sampleds.yaml` is the same as the one at `/home/$(whoami)/R_2.9.0/modelzoo/src/cerebras/modelzoo/models/vocab/google_research_uncased_L-12_H-768_A-12.txt`.

The last parts of the output should resemble the following; messages about cuda (not shown here) should be ignored.

@@ -130,13 +130,13 @@ This PyTorch GPT-J 6B parameter pretraining sample uses 1 CS3.
First, source a Cerebras PyTorch virtual environment.

```console
source ~/R_2.6.0/venv_cerebras_pt/bin/activate
source ~/R_2.9.0/venv_cerebras_pt/bin/activate
```

Then

```console
cd ~/R_2.6.0/modelzoo/src/cerebras/modelzoo/models/nlp/gptj
cd ~/R_2.9.0/modelzoo/src/cerebras/modelzoo/models/nlp/gptj
cp /software/cerebras/dataset/gptj/params_gptj_6B_sampleds.yaml configs/params_gptj_6B_sampleds.yaml
export MODEL_DIR=model_dir_gptj
if [ -d "$MODEL_DIR" ]; then rm -Rf $MODEL_DIR; fi
@@ -147,7 +147,7 @@ Note: the validation has been commented out of the yaml to decrease the run time

<!---
Previously,
python run.py CSX --job_labels name=gptj_pt --params configs/params_gptj_6B_sampleds.yaml --num_csx=1 --mode train --model_dir $MODEL_DIR --mount_dirs /home/ /software --python_paths /home/$(whoami)/R_2.6.0/modelzoo/src --compile_dir $(whoami) |& tee mytest.log
python run.py CSX --job_labels name=gptj_pt --params configs/params_gptj_6B_sampleds.yaml --num_csx=1 --mode train --model_dir $MODEL_DIR --mount_dirs /home/ /software --python_paths /home/$(whoami)/R_2.9.0/modelzoo/src --compile_dir $(whoami) |& tee mytest.log
--->

The last parts of the output should resemble the following:
@@ -162,7 +162,7 @@ The last parts of the output should resemble the following:
2025-10-10 20:20:51,668 INFO: Saved checkpoint model_dir_gptj/checkpoint_200.mdl
2025-10-10 20:21:14,280 INFO: Training completed successfully!
2025-10-10 20:21:14,286 INFO: Processed 24000 training sample(s) in 1443.67300221 seconds.
/home/arnoldw/R_2.6.0/venv_cerebras_pt/lib/python3.8/site-packages/pydantic/_internal/_gener
/home/arnoldw/R_2.9.0/venv_cerebras_pt/lib/python3.8/site-packages/pydantic/_internal/_gener
```

## Llama2-7B
@@ -171,11 +171,11 @@ The Cerebras llama2 7B model implementation can be found at modelzoo/modelzoo/tr

First, source a Cerebras PyTorch virtual environment.
```bash
source ~/R_2.6.0/venv_cerebras_pt/bin/activate
source ~/R_2.9.0/venv_cerebras_pt/bin/activate
```
Instructions for training:
```bash
cd ~/R_2.6.0/modelzoo/src/cerebras/modelzoo/models/nlp/llama
cd ~/R_2.9.0/modelzoo/src/cerebras/modelzoo/models/nlp/llama
cp /software/cerebras/dataset/params_llama2_7b.yaml configs/params_llama2_7b.yaml
export MODEL_DIR=model_dir_llama2_7b
if [ -d "$MODEL_DIR" ]; then rm -Rf $MODEL_DIR; fi
@@ -185,7 +185,7 @@ cszoo fit configs/params_llama2_7b.yaml --job_labels name=llama2_7b --model_dir
Note: the validation has been commented out of the yaml to decrease the run time of this sample. To run validation, uncomment the validation sections at the end of `configs/params_llama2_7b.yaml`.
<!--
Formerly,
python run.py CSX --job_labels name=llama2_7b --params configs/params_llama2_7b.yaml --num_csx=1 --mode train --model_dir $MODEL_DIR --mount_dirs /projects /home/ /software --python_paths /home/$(whoami)/R_2.6.0/modelzoo/src --compile_dir $(whoami) |& tee mytest.log
python run.py CSX --job_labels name=llama2_7b --params configs/params_llama2_7b.yaml --num_csx=1 --mode train --model_dir $MODEL_DIR --mount_dirs /projects /home/ /software --python_paths /home/$(whoami)/R_2.9.0/modelzoo/src --compile_dir $(whoami) |& tee mytest.log
-->

A sample of the output follows.
@@ -230,11 +230,11 @@ The Cerebras ESM-2 model implementation can be found at `modelzoo/src/cerebras/m

First, source a Cerebras PyTorch virtual environment.
```bash
source ~/R_2.6.0/venv_cerebras_pt/bin/activate
source ~/R_2.9.0/venv_cerebras_pt/bin/activate
```
Instructions for training (for 400 steps):
```bash
cd ~/R_2.6.0/modelzoo/src/cerebras/modelzoo/models/nlp/esm2
cd ~/R_2.9.0/modelzoo/src/cerebras/modelzoo/models/nlp/esm2
cp /software/cerebras/dataset/ESM-2/params_esm2_t12_35M_UR50D_modified.yaml configs/params_esm2_t12_35M_UR50D_modified.yaml
export MODEL_DIR=model_dir_esm2
if [ -d "$MODEL_DIR" ]; then rm -Rf $MODEL_DIR; fi
@@ -243,7 +243,7 @@ cszoo fit configs/params_esm2_t12_35M_UR50D_modified.yaml --job_labels name=esm2

<!--
Formerly,
python run.py CSX --job_labels name=esm2_t12_35m --params configs/params_esm2_t12_35M_UR50D_modified.yaml --num_csx=1 --mode train --model_dir $MODEL_DIR --mount_dirs /home/$(whoami)/ /software --python_paths /home/$(whoami)/R_2.6.0/modelzoo/src --compile_dir /$(whoami) |& tee mytest.log
python run.py CSX --job_labels name=esm2_t12_35m --params configs/params_esm2_t12_35M_UR50D_modified.yaml --num_csx=1 --mode train --model_dir $MODEL_DIR --mount_dirs /home/$(whoami)/ /software --python_paths /home/$(whoami)/R_2.9.0/modelzoo/src --compile_dir /$(whoami) |& tee mytest.log
-->

Note: the validation has been commented out of the yaml to decrease the run time of this sample. To run validation, uncomment the validation sections at the end of `configs/params_esm2_t12_35M_UR50D_modified.yaml`.
@@ -273,27 +273,27 @@ Saving checkpoint: 100%|██████████████████
Saving checkpoint: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1321/1321 [00:08<00:00, 154.35 tensors/s]
2025-10-10 23:45:54,994 INFO: Saved checkpoint model_dir_esm2/checkpoint_400.mdl
2025-10-10 23:46:01,812 INFO: Training completed successfully!
2025-10-10 23:46:01,861 INFO: Processed 819200 training sample(s) in 4049.286902367 seconds.
2025-10-10 23:46:01,861 INFO: Processed 819200 training sample(s) in 4049.286902367 seconds
```
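As a quick sanity check on the final log line, the reported figures work out to roughly 202 samples per second:

```bash
# 819200 samples / 4049.286902367 s ≈ 202.3 samples/s
awk 'BEGIN { printf "%.1f samples/s\n", 819200 / 4049.286902367 }'
```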

## Vision Transformer
The Cerebras transformer-based vision classifier model implementation can be found at `modelzoo/models/vision/vision_transformer`. Configs for the base and huge variants of the Vision Transformer can be found at `modelzoo/models/vision/vision_transformer/configs`. This example uses the ImageNet dataset preprocessed at `/software/datasets/imagenet/`.

First, source a Cerebras PyTorch virtual environment.
```bash
source ~/R_2.6.0/venv_cerebras_pt/bin/activate
source ~/R_2.9.0/venv_cerebras_pt/bin/activate
```
Instructions for training (for 400 steps):
```bash
cd ~/R_2.6.0/modelzoo/src/cerebras/modelzoo/models/vision/vision_transformer
cd ~/R_2.9.0/modelzoo/src/cerebras/modelzoo/models/vision/vision_transformer
cp /software/cerebras/dataset/vision_transformer/params_vit_base_patch_16_imagenet_1k.yaml configs/params_vit_base_patch_16_imagenet_1k.yaml
export MODEL_DIR=model_dir_vit
if [ -d "$MODEL_DIR" ]; then rm -Rf $MODEL_DIR; fi
cszoo fit configs/params_vit_base_patch_16_imagenet_1k.yaml --job_labels name=vision_transformer --model_dir $MODEL_DIR |& tee mytest.log
```
<!--
Formerly,
python run.py CSX --job_labels name=vision_transformer --params configs/params_vit_base_patch_16_imagenet_1k.yaml --num_csx=1 --mode train --model_dir $MODEL_DIR --mount_dirs /home/$(whoami)/ /software --python_paths /home/$(whoami)/R_2.6.0/modelzoo/src --compile_dir /$(whoami) |& tee mytest.log
python run.py CSX --job_labels name=vision_transformer --params configs/params_vit_base_patch_16_imagenet_1k.yaml --num_csx=1 --mode train --model_dir $MODEL_DIR --mount_dirs /home/$(whoami)/ /software --python_paths /home/$(whoami)/R_2.9.0/modelzoo/src --compile_dir /$(whoami) |& tee mytest.log
-->

Note: the validation has been commented out of the yaml to decrease the run time of this sample. To run validation, uncomment the validation sections at the end of `configs/params_vit_base_patch_16_imagenet_1k.yaml`.
@@ -345,20 +345,20 @@ The Cerebras Diffusion Transformer[[1](https://arxiv.org/pdf/2212.09748.pdf)] mo

First, source a Cerebras PyTorch virtual environment.
```bash
source ~/R_2.6.0/venv_cerebras_pt/bin/activate
source ~/R_2.9.0/venv_cerebras_pt/bin/activate
```

Instructions for training (for 400 steps):
```bash
cd ~/R_2.6.0/modelzoo/src/cerebras/modelzoo/models/vision/dit
cd ~/R_2.9.0/modelzoo/src/cerebras/modelzoo/models/vision/dit
cp /software/cerebras/dataset/params_dit_2B_patchsize_2x2_modified.yaml configs/params_dit_2B_patchsize_2x2_modified.yaml
export MODEL_DIR=model_dir_dit
if [ -d "$MODEL_DIR" ]; then rm -Rf $MODEL_DIR; fi
cszoo fit configs/params_dit_2B_patchsize_2x2_modified.yaml --job_labels name=DiT --model_dir $MODEL_DIR |& tee mytest.log
```
<!---
Formerly:
python run.py CSX --job_labels name=DiT --mode train --params configs/params_dit_2B_patchsize_2x2_modified.yaml --python_paths /home/$(whoami)/R_2.6.0/modelzoo/src --model_dir ${MODEL_DIR} |& tee mytest.log
python run.py CSX --job_labels name=DiT --mode train --params configs/params_dit_2B_patchsize_2x2_modified.yaml --python_paths /home/$(whoami)/R_2.9.0/modelzoo/src --model_dir ${MODEL_DIR} |& tee mytest.log
--->

???+ example "Example output:"
4 changes: 2 additions & 2 deletions docs/ai-testbed/cerebras/index.md
@@ -6,15 +6,15 @@ The ALCF CS-3 Cerebras Wafer-Scale Cluster is designed to support large-scale m

The Cerebras Wafer-Scale cluster is run as an appliance: a user submits a job to the appliance, which manages data preprocessing and streaming, IO, and device orchestration internally. It is programmed via PyTorch. This installation supports Weight Streaming execution for models being pre-trained or fine-tuned.

The public Cerebras documentation is available [here](https://training-docs.cerebras.ai/rel-2.6.0/getting-started/overview).
The public Cerebras documentation is available [here](https://training-docs.cerebras.ai/rel-2.9.0/getting-started/overview).

A typical Cerebras Wafer-Scale Cluster is shown in the figure below. Users connect via SSH to the login node, `cerebras.alcf.anl.gov`, and then ssh to a user node, either `cer-usn-01` or `cer-usn-02`.
<!--- The rest of the nodes in the cluster infrastructure are not directly accessible, except by admins.-->
The trees `/home`, `/projects`, and `/software` are shared across the login nodes and user nodes, the relevant cluster infrastructure nodes, and all ALCF AI testbed platforms.

![CS-3 cluster figure](files/topology-of-weight-streaming-on-wsc.png)
/// caption
Figure: topology of CS-3 cluster ([source](https://training-docs.cerebras.ai/rel-2.6.0/concepts/cerebras-wafer-scale-cluster))
Figure: topology of CS-3 cluster ([source](https://training-docs.cerebras.ai/rel-2.9.0/concepts/cerebras-wafer-scale-cluster))
///

As indicated in the figure, which represents a CS-3 cluster with 4 CS-3 WSEs, each of the CS-3 engines (at the right edge of the figure) is responsible only for running and accelerating the computations for training and prediction with the model. The other work, including compilation, is performed on the input nodes; the MemoryX nodes are used for weight storage and broadcast, and the SwarmX nodes for gradient accumulation.
4 changes: 2 additions & 2 deletions docs/ai-testbed/cerebras/miscellaneous.md
@@ -3,12 +3,12 @@
## Porting applications to the CS-3

Cerebras documentation for porting code to run on a Cerebras CS-3 system:<br>
[Port Pytorch Models to Cerebras](https://training-docs.cerebras.ai/rel-2.6.0/model-zoo/migration/porting-pytorch-models-to-cerebras#port-pytorch-models-to-cerebras)
[Port Pytorch Models to Cerebras](https://training-docs.cerebras.ai/rel-2.9.0/model-zoo/migration/porting-pytorch-models-to-cerebras#port-pytorch-models-to-cerebras)

## Finetuning a model using CS-3s

The Cerebras tutorial for finetuning a model:<br>
[Fine-Tune Your First Model](https://training-docs.cerebras.ai/rel-2.6.0/getting-started/fine-tune-your-first-model)
[Fine-Tune Your First Model](https://training-docs.cerebras.ai/rel-2.9.0/getting-started/fine-tune-your-first-model)

The tutorial covers how to:
