
Commit e2c1eb7

Merge branch 'main' into FilippoSimini-patch-1
2 parents 4705007 + 965d2a8 commit e2c1eb7

13 files changed: +77, -70 lines changed


docs/CODEOWNERS

Lines changed: 1 addition & 1 deletion
@@ -95,7 +95,7 @@ aurora/aurora-pe.md @koysean
 **/filesystem-and-storage/ @kevin-harms

 # All container documentation
-**/containers/ @bcote-anl # ?
+#### **/containers/ @bcote-anl # ?

 # All debugger documentation
 **/debugging*/* @jkwack

docs/ai-testbed/cerebras/csl.md

Lines changed: 3 additions & 3 deletions
@@ -89,7 +89,7 @@ Example script to forward port 8000 to localhost 8008:
 export SDK_PORT=8000
 export LOCAL_PORT=8008
 export ALCFUserID=<your alcf username>
-ssh -L $LOCAL_PORT:localhost:$LOCAL_PORT $ALCFUserID@cer-login-04.ai.alcf.anl.gov -t ssh -L $LOCAL_PORT:localhost:$SDK_PORT -N cer-anl-net001-us-sr01
+ssh -L $LOCAL_PORT:localhost:$LOCAL_PORT $ALCFUserID@cerebras.alcf.anl.gov -t ssh -L $LOCAL_PORT:localhost:$SDK_PORT -N cer-anl-net001-us-sr01
 ```

 Then open the following URL in your web browser: `http://localhost:8008/sdk-gui/`
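The updated command chains two `ssh` hops by hand: local port 8008 to the login node, then the login node to the SDK port on `cer-anl-net001-us-sr01`. As a sketch only (not part of the patched docs, and assuming the same ALCF credentials are accepted on both hops), OpenSSH's `ProxyJump` collapses this into one invocation:

```bash
# Sketch: single-command equivalent of the two-hop tunnel above (OpenSSH 7.3+).
# -J jumps through the login node; -L forwards local $LOCAL_PORT to the SDK
# port on the final host; -N opens no remote shell.
ssh -J $ALCFUserID@cerebras.alcf.anl.gov \
    -L $LOCAL_PORT:localhost:$SDK_PORT \
    -N $ALCFUserID@cer-anl-net001-us-sr01
```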
@@ -114,8 +114,8 @@ pip install --upgrade pip

 **Install SDK Packages:** Install the `cerebras_appliance` and `cerebras_sdk` Python packages in the virtual environment, specifying the appropriate Cerebras Software release:
 ```bash linenums="1"
-pip install cerebras_appliance==2.6.0
-pip install cerebras_sdk==2.6.0
+pip install cerebras_appliance==2.9.0
+pip install cerebras_sdk==2.9.0
 ```

 ### Examples
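A quick way to confirm the bumped pins took effect in the active virtual environment (a sketch; `pip show` output formatting may vary slightly across pip versions):

```bash
# Both packages should report Version: 2.9.0 after the upgrade.
pip show cerebras_appliance cerebras_sdk | grep -E '^(Name|Version):'
```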

docs/ai-testbed/cerebras/customizing-environment.md

Lines changed: 7 additions & 7 deletions
@@ -4,16 +4,16 @@

 #### To make a PyTorch virtual environment for Cerebras

-Clone the Cerebras modelzoo, if it is not already cloned. Check out the R 2.6.0 release.
+Clone the Cerebras modelzoo, if it is not already cloned. Check out the R 2.9.0 release.

 ```console
-mkdir ~/R_2.6.0
-cd ~/R_2.6.0
+mkdir ~/R_2.9.0
+cd ~/R_2.9.0
 export HTTPS_PROXY=http://proxy.alcf.anl.gov:3128
 git clone https://github.com/Cerebras/modelzoo.git
 cd modelzoo
 git tag
-git checkout Release_2.6.0
+git checkout Release_2.9.0
 ```
 Note: a `git pull` will not update the tags; if `modelzoo/setup.py` does not exist after tag checkout, please re-clone `modelzoo`.

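The note about stale tags can be verified directly. A hedged sketch of the check (not part of the patched file):

```bash
# Confirm the working tree is actually at the release tag; if setup.py is
# missing after checkout, the note above says to re-clone modelzoo.
cd ~/R_2.9.0/modelzoo
git describe --tags                # expect: Release_2.9.0
test -f setup.py || echo "setup.py missing; re-clone modelzoo"
```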
@@ -26,8 +26,8 @@ export https_proxy=http://proxy.alcf.anl.gov:3128
 Then build the virtual environment

 ```console
-mkdir ~/R_2.6.0
-cd ~/R_2.6.0
+mkdir ~/R_2.9.0
+cd ~/R_2.9.0
 # Note: "deactivate" does not actually work in scripts.
 deactivate
 rm -r venv_cerebras_pt
@@ -46,7 +46,7 @@ pip install -e modelzoo
 To activate a virtual environment

 ```console
-source ~/R_2.6.0/venv_cerebras_pt/bin/activate
+source ~/R_2.9.0/venv_cerebras_pt/bin/activate
 ```

 To deactivate a virtual environment,
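Pieced together across the three hunks, the end-to-end venv rebuild looks roughly like the sketch below. The `python -m venv` line is an assumption (the creation step itself falls outside the changed lines), so treat this as illustrative only:

```bash
# Hedged sketch: rebuild the R 2.9.0 virtual environment from scratch.
cd ~/R_2.9.0
deactivate 2>/dev/null || true     # per the note above, unreliable in scripts
rm -rf venv_cerebras_pt
python -m venv venv_cerebras_pt    # assumed; not shown in the changed lines
source venv_cerebras_pt/bin/activate
pip install -e modelzoo            # from the "@@ -46" hunk context
```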

docs/ai-testbed/cerebras/example-programs.md

Lines changed: 28 additions & 28 deletions
@@ -4,13 +4,13 @@
 Make a working directory and a local copy of the Cerebras **modelzoo** repository, if not previously done, as follows.

 ```bash
-mkdir ~/R_2.6.0
-cd ~/R_2.6.0
+mkdir ~/R_2.9.0
+cd ~/R_2.9.0
 export HTTPS_PROXY=http://proxy.alcf.anl.gov:3128
 git clone https://github.com/Cerebras/modelzoo.git
 cd modelzoo
 git tag
-git checkout Release_2.6.0
+git checkout Release_2.9.0
 ```

 Note: to access any external web resources from a Cerebras user node, you will need to have a proxy environment variable set (or equivalent). `wget` needs the lower-case proxy environment variable.
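Because some tools read only one casing of the variable (the note calls out `wget` wanting lower-case), a safe sketch is to export both forms before fetching anything external:

```bash
# Both casings of the proxy variable, so curl, wget, git, etc. all pick one up.
export HTTPS_PROXY=http://proxy.alcf.anl.gov:3128
export https_proxy=http://proxy.alcf.anl.gov:3128
```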
@@ -43,17 +43,17 @@ To run Unet with the <a href="https://www.kaggle.com/c/severstal-steel-defect-de
 First, source a Cerebras PyTorch virtual environment.

 ```console
-source ~/R_2.6.0/venv_cerebras_pt/bin/activate
+source ~/R_2.9.0/venv_cerebras_pt/bin/activate
 ```

 Then

 ```console
-cd ~/R_2.6.0/modelzoo/src/cerebras/modelzoo/models/nlp/bert
+cd ~/R_2.9.0/modelzoo/src/cerebras/modelzoo/models/nlp/bert
 cp /software/cerebras/dataset/severstal-steel-defect-detection/params_severstal_binary_rawds.yaml configs/params_severstal_binary_rawds.yaml
 export MODEL_DIR=model_dir_unet
 if [ -d "$MODEL_DIR" ]; then rm -Rf $MODEL_DIR; fi
-python run.py CSX --job_labels name=unet_pt --params configs/params_severstal_binary_rawds.yaml --model_dir $MODEL_DIR --mode train --mount_dirs /home/ /software --python_paths /home/$(whoami)/R_2.6.0/modelzoo/ --compile_dir $(whoami) |& tee mytest.log
+python run.py CSX --job_labels name=unet_pt --params configs/params_severstal_binary_rawds.yaml --model_dir $MODEL_DIR --mode train --mount_dirs /home/ /software --python_paths /home/$(whoami)/R_2.9.0/modelzoo/ --compile_dir $(whoami) |& tee mytest.log
 ```
 --->

@@ -68,7 +68,7 @@ The BraggNN model has two versions:<br>

 ```console
 TODO
-cd ~/R_2.6.0/anl_shared/braggnn/tf
+cd ~/R_2.9.0/anl_shared/braggnn/tf
 # This yaml has a correct path to a BraggNN dataset
 cp /software/cerebras/dataset/BraggN/params_bragg_nonlocal_sampleds.yaml configs/params_bragg_nonlocal_sampleds.yaml
 export MODEL_DIR=model_dir_braggnn
@@ -88,23 +88,23 @@ source /software/cerebras/venvs/venv_cerebras_pt/bin/activate
 # or your personal venv
 --->
 ```console
-source ~/R_2.6.0/venv_cerebras_pt/bin/activate
+source ~/R_2.9.0/venv_cerebras_pt/bin/activate
 ```

 Then

 ```console
-cd ~/R_2.6.0/modelzoo/src/cerebras/modelzoo/models/nlp/bert
+cd ~/R_2.9.0/modelzoo/src/cerebras/modelzoo/models/nlp/bert
 cp /software/cerebras/dataset/bert_large/bert_large_MSL128_sampleds.yaml configs/bert_large_MSL128_sampleds.yaml
 export MODEL_DIR=model_dir_bert_large_pytorch
 if [ -d "$MODEL_DIR" ]; then rm -Rf $MODEL_DIR; fi
 cszoo fit configs/bert_large_MSL128_sampleds.yaml --job_labels name=bert_pt --model_dir $MODEL_DIR |& tee mytest.log
 ```
 <!---
 previously,
-python run.py CSX --job_labels name=bert_pt --params configs/bert_large_MSL128_sampleds.yaml --num_workers_per_csx=1 --mode train --model_dir $MODEL_DIR --mount_dirs /home/ /software/ --python_paths /home/$(whoami)/R_2.6.0/modelzoo/src --compile_dir $(whoami) |& tee mytest.log
+python run.py CSX --job_labels name=bert_pt --params configs/bert_large_MSL128_sampleds.yaml --num_workers_per_csx=1 --mode train --model_dir $MODEL_DIR --mount_dirs /home/ /software/ --python_paths /home/$(whoami)/R_2.9.0/modelzoo/src --compile_dir $(whoami) |& tee mytest.log
 --->
-Note: the vocabulary file referenced in `/software/cerebras/dataset/bert_large/bert_large_MSL128_sampleds.yaml` is the same as the one at `/home/$(whoami)/R_2.6.0/modelzoo/src/cerebras/modelzoo/models/vocab/google_research_uncased_L-12_H-768_A-12.txt`.
+Note: the vocabulary file referenced in `/software/cerebras/dataset/bert_large/bert_large_MSL128_sampleds.yaml` is the same as the one at `/home/$(whoami)/R_2.9.0/modelzoo/src/cerebras/modelzoo/models/vocab/google_research_uncased_L-12_H-768_A-12.txt`.

 The last parts of the output should resemble the following, with messages about cuda that should be ignored and are not shown.

@@ -130,13 +130,13 @@ This PyTorch GPT-J 6B parameter pretraining sample uses 1 CS3.
 First, source a Cerebras PyTorch virtual environment.

 ```console
-source ~/R_2.6.0/venv_cerebras_pt/bin/activate
+source ~/R_2.9.0/venv_cerebras_pt/bin/activate
 ```

 Then

 ```console
-cd ~/R_2.6.0/modelzoo/src/cerebras/modelzoo/models/nlp/gptj
+cd ~/R_2.9.0/modelzoo/src/cerebras/modelzoo/models/nlp/gptj
 cp /software/cerebras/dataset/gptj/params_gptj_6B_sampleds.yaml configs/params_gptj_6B_sampleds.yaml
 export MODEL_DIR=model_dir_gptj
 if [ -d "$MODEL_DIR" ]; then rm -Rf $MODEL_DIR; fi
@@ -147,7 +147,7 @@ Note: the validation has been commented out of the yaml to decrease the run time

 <!---
 Previously,
-python run.py CSX --job_labels name=gptj_pt --params configs/params_gptj_6B_sampleds.yaml --num_csx=1 --mode train --model_dir $MODEL_DIR --mount_dirs /home/ /software --python_paths /home/$(whoami)/R_2.6.0/modelzoo/src --compile_dir $(whoami) |& tee mytest.log
+python run.py CSX --job_labels name=gptj_pt --params configs/params_gptj_6B_sampleds.yaml --num_csx=1 --mode train --model_dir $MODEL_DIR --mount_dirs /home/ /software --python_paths /home/$(whoami)/R_2.9.0/modelzoo/src --compile_dir $(whoami) |& tee mytest.log
 --->

 The last parts of the output should resemble the following:
@@ -162,7 +162,7 @@ The last parts of the output should resemble the following:
 2025-10-10 20:20:51,668 INFO: Saved checkpoint model_dir_gptj/checkpoint_200.mdl
 2025-10-10 20:21:14,280 INFO: Training completed successfully!
 2025-10-10 20:21:14,286 INFO: Processed 24000 training sample(s) in 1443.67300221 seconds.
-/home/arnoldw/R_2.6.0/venv_cerebras_pt/lib/python3.8/site-packages/pydantic/_internal/_gener
+/home/arnoldw/R_2.9.0/venv_cerebras_pt/lib/python3.8/site-packages/pydantic/_internal/_gener
 ```

 ## Llama2-7B
@@ -171,11 +171,11 @@ The Cerebras llama2 7B model implementation can be found at modelzoo/modelzoo/tr

 First, source a Cerebras PyTorch virtual environment.
 ```bash
-source ~/R_2.6.0/venv_cerebras_pt/bin/activate
+source ~/R_2.9.0/venv_cerebras_pt/bin/activate
 ```
 Instructions for training:
 ```bash
-cd ~/R_2.6.0/modelzoo/src/cerebras/modelzoo/models/nlp/llama
+cd ~/R_2.9.0/modelzoo/src/cerebras/modelzoo/models/nlp/llama
 cp /software/cerebras/dataset/params_llama2_7b.yaml configs/params_llama2_7b.yaml
 export MODEL_DIR=model_dir_llama2_7b
 if [ -d "$MODEL_DIR" ]; then rm -Rf $MODEL_DIR; fi
@@ -185,7 +185,7 @@ cszoo fit configs/params_llama2_7b.yaml --job_labels name=llama2_7b --model_dir
 Note: the validation has been commented out of the yaml to decrease the run time of this sample. To run validation, uncomment the validation sections at the end of `configs/params_llama2_7b.yaml`.
 <!--
 Formerly,
-python run.py CSX --job_labels name=llama2_7b --params configs/params_llama2_7b.yaml --num_csx=1 --mode train --model_dir $MODEL_DIR --mount_dirs /projects /home/ /software --python_paths /home/$(whoami)/R_2.6.0/modelzoo/src --compile_dir $(whoami) |& tee mytest.log
+python run.py CSX --job_labels name=llama2_7b --params configs/params_llama2_7b.yaml --num_csx=1 --mode train --model_dir $MODEL_DIR --mount_dirs /projects /home/ /software --python_paths /home/$(whoami)/R_2.9.0/modelzoo/src --compile_dir $(whoami) |& tee mytest.log
 -->

 Please find a sample output
@@ -230,11 +230,11 @@ The Cerebras ESM-2 model implementation can be found at `modelzoo/src/cerebras/m

 First, source a Cerebras PyTorch virtual environment.
 ```bash
-source ~/R_2.6.0/venv_cerebras_pt/bin/activate
+source ~/R_2.9.0/venv_cerebras_pt/bin/activate
 ```
 Instructions for training (for 400 steps):
 ```bash
-cd ~/R_2.6.0/modelzoo/src/cerebras/modelzoo/models/nlp/esm2
+cd ~/R_2.9.0/modelzoo/src/cerebras/modelzoo/models/nlp/esm2
 cp /software/cerebras/dataset/ESM-2/params_esm2_t12_35M_UR50D_modified.yaml configs/params_esm2_t12_35M_UR50D_modified.yaml
 export MODEL_DIR=model_dir_esm2
 if [ -d "$MODEL_DIR" ]; then rm -Rf $MODEL_DIR; fi
@@ -243,7 +243,7 @@ cszoo fit configs/params_esm2_t12_35M_UR50D_modified.yaml --job_labels name=esm2

 <!--
 Formerly,
-python run.py CSX --job_labels name=esm2_t12_35m --params configs/params_esm2_t12_35M_UR50D_modified.yaml --num_csx=1 --mode train --model_dir $MODEL_DIR --mount_dirs /home/$(whoami)/ /software --python_paths /home/$(whoami)/R_2.6.0/modelzoo/src --compile_dir /$(whoami) |& tee mytest.log
+python run.py CSX --job_labels name=esm2_t12_35m --params configs/params_esm2_t12_35M_UR50D_modified.yaml --num_csx=1 --mode train --model_dir $MODEL_DIR --mount_dirs /home/$(whoami)/ /software --python_paths /home/$(whoami)/R_2.9.0/modelzoo/src --compile_dir /$(whoami) |& tee mytest.log
 -->

 Note: the validation has been commented out of the yaml to decrease the run time of this sample. To run validation, uncomment the validation sections at the end of `configs/params_esm2_t12_35M_UR50D_modified.yaml`.
@@ -273,27 +273,27 @@ Saving checkpoint: 100%|██████████████████
 Saving checkpoint: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1321/1321 [00:08<00:00, 154.35 tensors/s]
 2025-10-10 23:45:54,994 INFO: Saved checkpoint model_dir_esm2/checkpoint_400.mdl
 2025-10-10 23:46:01,812 INFO: Training completed successfully!
-2025-10-10 23:46:01,861 INFO: Processed 819200 training sample(s) in 4049.286902367 seconds.
+2025-10-10 23:46:01,861 INFO: Processed 819200 training sample(s) in 4049.286902367 seconds
 ```

 ## Vision Transformer
 The cerebras transformer based vision classifier model implementation can be found at `modelzoo/models/vision/vision_transformer`. Configs for base and huge model of the vision transformer can be found at `modelzoo/models/vision/vision_transformer/configs`. This examples uses the ImageNet dataset preprocessed at path `/software/datasets/imagenet/`.

 First, source a Cerebras PyTorch virtual environment.
 ```bash
-source ~/R_2.6.0/venv_cerebras_pt/bin/activate
+source ~/R_2.9.0/venv_cerebras_pt/bin/activate
 ```
 Instructions for training (for 400 steps):
 ```bash
-cd ~/R_2.6.0/modelzoo/src/cerebras/modelzoo/models/vision/vision_transformer
+cd ~/R_2.9.0/modelzoo/src/cerebras/modelzoo/models/vision/vision_transformer
 cp /software/cerebras/dataset/vision_transformer/params_vit_base_patch_16_imagenet_1k.yaml configs/params_vit_base_patch_16_imagenet_1k.yaml
 export MODEL_DIR=model_dir_vit
 if [ -d "$MODEL_DIR" ]; then rm -Rf $MODEL_DIR; fi
 cszoo fit configs/params_vit_base_patch_16_imagenet_1k.yaml --job_labels name=vision_transformer --model_dir $MODEL_DIR |& tee mytest.log
 ```
 <!--
 Formerly,
-python run.py CSX --job_labels name=vision_transformer --params configs/params_vit_base_patch_16_imagenet_1k.yaml --num_csx=1 --mode train --model_dir $MODEL_DIR --mount_dirs /home/$(whoami)/ /software --python_paths /home/$(whoami)/R_2.6.0/modelzoo/src --compile_dir /$(whoami) |& tee mytest.log
+python run.py CSX --job_labels name=vision_transformer --params configs/params_vit_base_patch_16_imagenet_1k.yaml --num_csx=1 --mode train --model_dir $MODEL_DIR --mount_dirs /home/$(whoami)/ /software --python_paths /home/$(whoami)/R_2.9.0/modelzoo/src --compile_dir /$(whoami) |& tee mytest.log
 -->

 Note: the validation has been commented out of the yaml to decrease the run time of this sample. To run validation, uncomment the validation sections at the end of `configs/params_vit_base_patch_16_imagenet_1k.yaml`.
@@ -345,20 +345,20 @@ The Cerebras Diffusion Transformer[[1](https://arxiv.org/pdf/2212.09748.pdf)] mo

 First, source a Cerebras PyTorch virtual environment.
 ```bash
-source ~/R_2.6.0/venv_cerebras_pt/bin/activate
+source ~/R_2.9.0/venv_cerebras_pt/bin/activate
 ```

 Instructions for training (for 400 steps):
 ```bash
-cd ~/R_2.6.0/modelzoo/src/cerebras/modelzoo/models/vision/dit
+cd ~/R_2.9.0/modelzoo/src/cerebras/modelzoo/models/vision/dit
 cp /software/cerebras/dataset/params_dit_2B_patchsize_2x2_modified.yaml configs/params_dit_2B_patchsize_2x2_modified.yaml
 export MODEL_DIR=model_dir_dit
 if [ -d "$MODEL_DIR" ]; then rm -Rf $MODEL_DIR; fi
 cszoo fit configs/params_dit_2B_patchsize_2x2_modified.yaml --job_labels name=DiT --model_dir $MODEL_DIR |& tee mytest.log
 ```
 <!---
 Formerly:
-python run.py CSX --job_labels name=DiT --mode train --params configs/params_dit_2B_patchsize_2x2_modified.yaml --python_paths /home/$(whoami)/R_2.6.0/modelzoo/src --model_dir ${MODEL_DIR} |& tee mytest.log
+python run.py CSX --job_labels name=DiT --mode train --params configs/params_dit_2B_patchsize_2x2_modified.yaml --python_paths /home/$(whoami)/R_2.9.0/modelzoo/src --model_dir ${MODEL_DIR} |& tee mytest.log
 --->

 ???+ example "Example output:"
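All of the model walkthroughs touched by this diff share one four-step shape; the sketch below generalizes it, with `<area>`, `<model>`, and `<params>` as placeholders rather than real paths:

```bash
# Generalized run pattern used by every example above (placeholders in <...>).
source ~/R_2.9.0/venv_cerebras_pt/bin/activate                      # 1. activate the venv
cd ~/R_2.9.0/modelzoo/src/cerebras/modelzoo/models/<area>/<model>   # 2. enter the model dir
cp /software/cerebras/dataset/<model>/<params>.yaml configs/        # 3. stage the config
export MODEL_DIR=model_dir_<model>                                  # 4. clean model dir, then train
if [ -d "$MODEL_DIR" ]; then rm -Rf $MODEL_DIR; fi
cszoo fit configs/<params>.yaml --job_labels name=<model> --model_dir $MODEL_DIR |& tee mytest.log
```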

docs/ai-testbed/cerebras/index.md

Lines changed: 2 additions & 2 deletions
@@ -6,15 +6,15 @@ The ALCF CS-3 Cerebras Wafer-Scale Cluster, is designed to support large-scale m

 The Cerebras Wafer-Scale cluster is run as an appliance: a user submits a job to the appliance, and the appliance manages preprocessing and streaming of the data, IO, and device orchestration within the appliance. It provides programming via PyTorch. This installation supports Weight Streaming execution for models being pre-trained or fine-tuned.

-The public Cerebras documentation is available [here](https://training-docs.cerebras.ai/rel-2.6.0/getting-started/overview).
+The public Cerebras documentation is available [here](https://training-docs.cerebras.ai/rel-2.9.0/getting-started/overview).

 A typical Cerebras Wafer-Scale Cluster is shown in the figure below. Users connect via SSH to the login node, `cerebras.alcf.anl.gov` and then ssh to a user node, using either `cer-usn-01` or `cer-usn-02`.
 <!--- The rest of the nodes in the cluster infrastructure are not directly accessible, except by admins.-->
 The trees `/home`, `/projects`, and `/software` are shared across the login nodes and user nodes, the relevant cluster infrastructure nodes, and all ALCF AI testbed platforms.

 ![CS-3 cluster figure](files/topology-of-weight-streaming-on-wsc.png)
 /// caption
-Figure: topology of CS-3 cluster ([source](https://training-docs.cerebras.ai/rel-2.6.0/concepts/cerebras-wafer-scale-cluster))
+Figure: topology of CS-3 cluster ([source](https://training-docs.cerebras.ai/rel-2.9.0/concepts/cerebras-wafer-scale-cluster))
 ///

 As indicated in the figure, which represent a CS-3 cluster with 4 CS-3 WSE, each of the CS-3 engines (marked at the right end corner of the figure) is responsible only for running and accelerating the computations for training and predictions with the model. The other work, including compilation, is performed on the input nodes, and the MemoryX nodes are used for weight storage and broadcast, and SwarmX nodes are used for gradient accumulation.
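The two-hop login described in this page would look like the following in practice (a sketch; either user node works):

```bash
# Hop 1: the shared login node; hop 2: a user node.
ssh <your-alcf-username>@cerebras.alcf.anl.gov
ssh cer-usn-01        # or cer-usn-02
```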

docs/ai-testbed/cerebras/miscellaneous.md

Lines changed: 2 additions & 2 deletions
@@ -3,12 +3,12 @@
 ## Porting applications to the CS-3

 Cerebras documentation for porting code to run on a Cerebras CS-3 system:<br>
-[Port Pytorch Models to Cerebras](https://training-docs.cerebras.ai/rel-2.6.0/model-zoo/migration/porting-pytorch-models-to-cerebras#port-pytorch-models-to-cerebras)
+[Port Pytorch Models to Cerebras](https://training-docs.cerebras.ai/rel-2.9.0/model-zoo/migration/porting-pytorch-models-to-cerebras#port-pytorch-models-to-cerebras)

 ## Finetuning a model using CS-3s

 The Cerebras tutorial for finetuning a model:<br>
-[Fine-Tune Your First Model](https://training-docs.cerebras.ai/rel-2.6.0/getting-started/fine-tune-your-first-model)
+[Fine-Tune Your First Model](https://training-docs.cerebras.ai/rel-2.9.0/getting-started/fine-tune-your-first-model)

 The tutorial covers how to:
