diff --git a/docs/ai-testbed/cerebras/csl.md b/docs/ai-testbed/cerebras/csl.md index 4e1511aea6..d4c0f2490c 100644 --- a/docs/ai-testbed/cerebras/csl.md +++ b/docs/ai-testbed/cerebras/csl.md @@ -89,7 +89,7 @@ Example script to forward port 8000 to localhost 8008: export SDK_PORT=8000 export LOCAL_PORT=8008 export ALCFUserID= -ssh -L $LOCAL_PORT:localhost:$LOCAL_PORT $ALCFUserID@cer-login-04.ai.alcf.anl.gov -t ssh -L $LOCAL_PORT:localhost:$SDK_PORT -N cer-anl-net001-us-sr01 +ssh -L $LOCAL_PORT:localhost:$LOCAL_PORT $ALCFUserID@cerebras.alcf.anl.gov -t ssh -L $LOCAL_PORT:localhost:$SDK_PORT -N cer-anl-net001-us-sr01 ``` Then open the following URL in your web browser: `http://localhost:8008/sdk-gui/` @@ -114,8 +114,8 @@ pip install --upgrade pip **Install SDK Packages:** Install the `cerebras_appliance` and `cerebras_sdk` Python packages in the virtual environment, specifying the appropriate Cerebras Software release: ```bash linenums="1" -pip install cerebras_appliance==2.6.0 -pip install cerebras_sdk==2.6.0 +pip install cerebras_appliance==2.9.0 +pip install cerebras_sdk==2.9.0 ``` ### Examples diff --git a/docs/ai-testbed/cerebras/customizing-environment.md b/docs/ai-testbed/cerebras/customizing-environment.md index 63a36f7f39..aca4687ca9 100644 --- a/docs/ai-testbed/cerebras/customizing-environment.md +++ b/docs/ai-testbed/cerebras/customizing-environment.md @@ -4,16 +4,16 @@ #### To make a PyTorch virtual environment for Cerebras -Clone the Cerebras modelzoo, if it is not already cloned. Check out the R 2.6.0 release. +Clone the Cerebras modelzoo, if it is not already cloned. Check out the R 2.9.0 release. 
```console -mkdir ~/R_2.6.0 -cd ~/R_2.6.0 +mkdir ~/R_2.9.0 +cd ~/R_2.9.0 export HTTPS_PROXY=http://proxy.alcf.anl.gov:3128 git clone https://github.com/Cerebras/modelzoo.git cd modelzoo git tag -git checkout Release_2.6.0 +git checkout Release_2.9.0 ``` Note: a `git pull` will not update the tags; if `modelzoo/setup.py` does not exist after tag checkout, please re-clone `modelzoo`. @@ -26,8 +26,8 @@ export https_proxy=http://proxy.alcf.anl.gov:3128 Then build the virtual environment ```console -mkdir ~/R_2.6.0 -cd ~/R_2.6.0 +mkdir ~/R_2.9.0 +cd ~/R_2.9.0 # Note: "deactivate" does not actually work in scripts. deactivate rm -r venv_cerebras_pt @@ -46,7 +46,7 @@ pip install -e modelzoo To activate a virtual environment ```console -source ~/R_2.6.0/venv_cerebras_pt/bin/activate +source ~/R_2.9.0/venv_cerebras_pt/bin/activate ``` To deactivate a virtual environment, diff --git a/docs/ai-testbed/cerebras/example-programs.md b/docs/ai-testbed/cerebras/example-programs.md index d556fe657a..90eb00d5d5 100644 --- a/docs/ai-testbed/cerebras/example-programs.md +++ b/docs/ai-testbed/cerebras/example-programs.md @@ -4,13 +4,13 @@ Make a working directory and a local copy of the Cerebras **modelzoo** repository, if not previously done, as follows. ```bash -mkdir ~/R_2.6.0 -cd ~/R_2.6.0 +mkdir ~/R_2.9.0 +cd ~/R_2.9.0 export HTTPS_PROXY=http://proxy.alcf.anl.gov:3128 git clone https://github.com/Cerebras/modelzoo.git cd modelzoo git tag -git checkout Release_2.6.0 +git checkout Release_2.9.0 ``` Note: to access any external web resources from a Cerebras user node, you will need to have a proxy environment variable set (or equivalent). `wget` needs the lower-case proxy environment variable. @@ -43,17 +43,17 @@ To run Unet with the @@ -68,7 +68,7 @@ The BraggNN model has two versions:
```console TODO -cd ~/R_2.6.0/anl_shared/braggnn/tf +cd ~/R_2.9.0/anl_shared/braggnn/tf # This yaml has a correct path to a BraggNN dataset cp /software/cerebras/dataset/BraggN/params_bragg_nonlocal_sampleds.yaml configs/params_bragg_nonlocal_sampleds.yaml export MODEL_DIR=model_dir_braggnn @@ -88,13 +88,13 @@ source /software/cerebras/venvs/venv_cerebras_pt/bin/activate # or your personal venv ---> ```console -source ~/R_2.6.0/venv_cerebras_pt/bin/activate +source ~/R_2.9.0/venv_cerebras_pt/bin/activate ``` Then ```console -cd ~/R_2.6.0/modelzoo/src/cerebras/modelzoo/models/nlp/bert +cd ~/R_2.9.0/modelzoo/src/cerebras/modelzoo/models/nlp/bert cp /software/cerebras/dataset/bert_large/bert_large_MSL128_sampleds.yaml configs/bert_large_MSL128_sampleds.yaml export MODEL_DIR=model_dir_bert_large_pytorch if [ -d "$MODEL_DIR" ]; then rm -Rf $MODEL_DIR; fi @@ -102,9 +102,9 @@ cszoo fit configs/bert_large_MSL128_sampleds.yaml --job_labels name=bert_pt --mo ``` -Note: the vocabulary file referenced in `/software/cerebras/dataset/bert_large/bert_large_MSL128_sampleds.yaml` is the same as the one at `/home/$(whoami)/R_2.6.0/modelzoo/src/cerebras/modelzoo/models/vocab/google_research_uncased_L-12_H-768_A-12.txt`. +Note: the vocabulary file referenced in `/software/cerebras/dataset/bert_large/bert_large_MSL128_sampleds.yaml` is the same as the one at `/home/$(whoami)/R_2.9.0/modelzoo/src/cerebras/modelzoo/models/vocab/google_research_uncased_L-12_H-768_A-12.txt`. The last parts of the output should resemble the following, with messages about cuda that should be ignored and are not shown. @@ -130,13 +130,13 @@ This PyTorch GPT-J 6B parameter pretraining sample uses 1 CS3. First, source a Cerebras PyTorch virtual environment. 
```console -source ~/R_2.6.0/venv_cerebras_pt/bin/activate +source ~/R_2.9.0/venv_cerebras_pt/bin/activate ``` Then ```console -cd ~/R_2.6.0/modelzoo/src/cerebras/modelzoo/models/nlp/gptj +cd ~/R_2.9.0/modelzoo/src/cerebras/modelzoo/models/nlp/gptj cp /software/cerebras/dataset/gptj/params_gptj_6B_sampleds.yaml configs/params_gptj_6B_sampleds.yaml export MODEL_DIR=model_dir_gptj if [ -d "$MODEL_DIR" ]; then rm -Rf $MODEL_DIR; fi @@ -147,7 +147,7 @@ Note: the validation has been commented out of the yaml to decrease the run time The last parts of the output should resemble the following: @@ -162,7 +162,7 @@ The last parts of the output should resemble the following: 2025-10-10 20:20:51,668 INFO: Saved checkpoint model_dir_gptj/checkpoint_200.mdl 2025-10-10 20:21:14,280 INFO: Training completed successfully! 2025-10-10 20:21:14,286 INFO: Processed 24000 training sample(s) in 1443.67300221 seconds. -/home/arnoldw/R_2.6.0/venv_cerebras_pt/lib/python3.8/site-packages/pydantic/_internal/_gener +/home/arnoldw/R_2.9.0/venv_cerebras_pt/lib/python3.8/site-packages/pydantic/_internal/_gener ``` ## Llama2-7B @@ -171,11 +171,11 @@ The Cerebras llama2 7B model implementation can be found at modelzoo/modelzoo/tr First, source a Cerebras PyTorch virtual environment. ```bash -source ~/R_2.6.0/venv_cerebras_pt/bin/activate +source ~/R_2.9.0/venv_cerebras_pt/bin/activate ``` Instructions for training: ```bash -cd ~/R_2.6.0/modelzoo/src/cerebras/modelzoo/models/nlp/llama +cd ~/R_2.9.0/modelzoo/src/cerebras/modelzoo/models/nlp/llama cp /software/cerebras/dataset/params_llama2_7b.yaml configs/params_llama2_7b.yaml export MODEL_DIR=model_dir_llama2_7b if [ -d "$MODEL_DIR" ]; then rm -Rf $MODEL_DIR; fi @@ -185,7 +185,7 @@ cszoo fit configs/params_llama2_7b.yaml --job_labels name=llama2_7b --model_dir Note: the validation has been commented out of the yaml to decrease the run time of this sample. 
To run validation, uncomment the validation sections at the end of `configs/params_llama2_7b.yaml`. Please find a sample output @@ -230,11 +230,11 @@ The Cerebras ESM-2 model implementation can be found at `modelzoo/src/cerebras/m First, source a Cerebras PyTorch virtual environment. ```bash -source ~/R_2.6.0/venv_cerebras_pt/bin/activate +source ~/R_2.9.0/venv_cerebras_pt/bin/activate ``` Instructions for training (for 400 steps): ```bash -cd ~/R_2.6.0/modelzoo/src/cerebras/modelzoo/models/nlp/esm2 +cd ~/R_2.9.0/modelzoo/src/cerebras/modelzoo/models/nlp/esm2 cp /software/cerebras/dataset/ESM-2/params_esm2_t12_35M_UR50D_modified.yaml configs/params_esm2_t12_35M_UR50D_modified.yaml export MODEL_DIR=model_dir_esm2 if [ -d "$MODEL_DIR" ]; then rm -Rf $MODEL_DIR; fi @@ -243,7 +243,7 @@ cszoo fit configs/params_esm2_t12_35M_UR50D_modified.yaml --job_labels name=esm2 Note: the validation has been commented out of the yaml to decrease the run time of this sample. To run validation, uncomment the validation sections at the end of `configs/params_esm2_t12_35M_UR50D_modified.yaml`. @@ -273,7 +273,7 @@ Saving checkpoint: 100%|██████████████████ Saving checkpoint: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1321/1321 [00:08<00:00, 154.35 tensors/s] 2025-10-10 23:45:54,994 INFO: Saved checkpoint model_dir_esm2/checkpoint_400.mdl 2025-10-10 23:46:01,812 INFO: Training completed successfully! -2025-10-10 23:46:01,861 INFO: Processed 819200 training sample(s) in 4049.286902367 seconds. +2025-10-10 23:46:01,861 INFO: Processed 819200 training sample(s) in 4049.286902367 seconds ``` ## Vision Transformer @@ -281,11 +281,11 @@ The cerebras transformer based vision classifier model implementation can be fou First, source a Cerebras PyTorch virtual environment. 
```bash -source ~/R_2.6.0/venv_cerebras_pt/bin/activate +source ~/R_2.9.0/venv_cerebras_pt/bin/activate ``` Instructions for training (for 400 steps): ```bash -cd ~/R_2.6.0/modelzoo/src/cerebras/modelzoo/models/vision/vision_transformer +cd ~/R_2.9.0/modelzoo/src/cerebras/modelzoo/models/vision/vision_transformer cp /software/cerebras/dataset/vision_transformer/params_vit_base_patch_16_imagenet_1k.yaml configs/params_vit_base_patch_16_imagenet_1k.yaml export MODEL_DIR=model_dir_vit if [ -d "$MODEL_DIR" ]; then rm -Rf $MODEL_DIR; fi @@ -293,7 +293,7 @@ cszoo fit configs/params_vit_base_patch_16_imagenet_1k.yaml --job_labels name=vi ``` Note: the validation has been commented out of the yaml to decrease the run time of this sample. To run validation, uncomment the validation sections at the end of `configs/params_vit_base_patch_16_imagenet_1k.yaml`. @@ -345,12 +345,12 @@ The Cerebras Diffusion Transformer[[1](https://arxiv.org/pdf/2212.09748.pdf)] mo First, source a Cerebras PyTorch virtual environment. 
```bash -source ~/R_2.6.0/venv_cerebras_pt/bin/activate +source ~/R_2.9.0/venv_cerebras_pt/bin/activate ``` Instructions for training (for 400 steps): ```bash -cd ~/R_2.6.0/modelzoo/src/cerebras/modelzoo/models/vision/dit +cd ~/R_2.9.0/modelzoo/src/cerebras/modelzoo/models/vision/dit cp /software/cerebras/dataset/params_dit_2B_patchsize_2x2_modified.yaml configs/params_dit_2B_patchsize_2x2_modified.yaml export MODEL_DIR=model_dir_dit if [ -d "$MODEL_DIR" ]; then rm -Rf $MODEL_DIR; fi @@ -358,7 +358,7 @@ cszoo fit configs/params_dit_2B_patchsize_2x2_modified.yaml --job_labels name=Di ``` ???+ example "Example output:" diff --git a/docs/ai-testbed/cerebras/index.md b/docs/ai-testbed/cerebras/index.md index 706b9e3b79..324d5a4808 100644 --- a/docs/ai-testbed/cerebras/index.md +++ b/docs/ai-testbed/cerebras/index.md @@ -6,7 +6,7 @@ The ALCF CS-3 Cerebras Wafer-Scale Cluster, is designed to support large-scale m The Cerebras Wafer-Scale cluster is run as an appliance: a user submits a job to the appliance, and the appliance manages preprocessing and streaming of the data, IO, and device orchestration within the appliance. It provides programming via PyTorch. This installation supports Weight Streaming execution for models being pre-trained or fine-tuned. -The public Cerebras documentation is available [here](https://training-docs.cerebras.ai/rel-2.6.0/getting-started/overview). +The public Cerebras documentation is available [here](https://training-docs.cerebras.ai/rel-2.9.0/getting-started/overview). A typical Cerebras Wafer-Scale Cluster is shown in the figure below. Users connect via SSH to the login node, `cerebras.alcf.anl.gov` and then ssh to a user node, using either `cer-usn-01` or `cer-usn-02`. 
@@ -14,7 +14,7 @@ The trees `/home`, `/projects`, and `/software` are shared across the login node ![CS-3 cluster figure](files/topology-of-weight-streaming-on-wsc.png) /// caption -Figure: topology of CS-3 cluster ([source](https://training-docs.cerebras.ai/rel-2.6.0/concepts/cerebras-wafer-scale-cluster)) +Figure: topology of CS-3 cluster ([source](https://training-docs.cerebras.ai/rel-2.9.0/concepts/cerebras-wafer-scale-cluster)) /// As indicated in the figure, which represents a CS-3 cluster with four CS-3 WSEs, each of the CS-3 engines (shown at the right of the figure) is responsible only for running and accelerating the computations for training and predictions with the model. The other work, including compilation, is performed on the input nodes; the MemoryX nodes are used for weight storage and broadcast, and the SwarmX nodes for gradient accumulation. diff --git a/docs/ai-testbed/cerebras/miscellaneous.md b/docs/ai-testbed/cerebras/miscellaneous.md index 53122f9ef4..ef8494354e 100644 --- a/docs/ai-testbed/cerebras/miscellaneous.md +++ b/docs/ai-testbed/cerebras/miscellaneous.md @@ -3,12 +3,12 @@ ## Porting applications to the CS-3 Cerebras documentation for porting code to run on a Cerebras CS-3 system:
-[Port Pytorch Models to Cerebras](https://training-docs.cerebras.ai/rel-2.6.0/model-zoo/migration/porting-pytorch-models-to-cerebras#port-pytorch-models-to-cerebras) +[Port Pytorch Models to Cerebras](https://training-docs.cerebras.ai/rel-2.9.0/model-zoo/migration/porting-pytorch-models-to-cerebras#port-pytorch-models-to-cerebras) ## Finetuning a model using CS-3s The Cerebras tutorial for finetuning a model:
-[Fine-Tune Your First Model](https://training-docs.cerebras.ai/rel-2.6.0/getting-started/fine-tune-your-first-model) +[Fine-Tune Your First Model](https://training-docs.cerebras.ai/rel-2.9.0/getting-started/fine-tune-your-first-model) The tutorial covers how to: diff --git a/docs/ai-testbed/cerebras/running-a-model-or-program.md b/docs/ai-testbed/cerebras/running-a-model-or-program.md index dc53129c8b..c41bae05b7 100644 --- a/docs/ai-testbed/cerebras/running-a-model-or-program.md +++ b/docs/ai-testbed/cerebras/running-a-model-or-program.md @@ -26,9 +26,9 @@ Follow these instructions to compile and train a small (111m parameters) GPT3 mo First, make a virtual environment for Cerebras for PyTorch. See [Customizing Environments](./customizing-environment.md) for the procedures for making PyTorch virtual environments for Cerebras. -If an environment is made in ```~/R_2.6.0/```, it would be activated as follows: +If an environment is made in ```~/R_2.9.0/```, it would be activated as follows: ```console -source ~/R_2.6.0/venv_cerebras_pt/bin/activate +source ~/R_2.9.0/venv_cerebras_pt/bin/activate ``` Note: to access any external web resources from a Cerebras user node, you will need to have a proxy environment variable set (or equivalent). `wget` needs the lower-case proxy environment variable. @@ -39,24 +39,24 @@ export https_proxy=http://proxy.alcf.anl.gov:3128 ### Clone the Cerebras modelzoo -If you have not already cloned the Cerebras modelzoo repo and checked out the Release_2.6.0 tag, do so. +If you have not already cloned the Cerebras modelzoo repo and checked out the Release_2.9.0 tag, do so. 
```console -mkdir ~/R_2.6.0 -cd ~/R_2.6.0 +mkdir ~/R_2.9.0 +cd ~/R_2.9.0 export HTTPS_PROXY=http://proxy.alcf.anl.gov:3128 git clone https://github.com/Cerebras/modelzoo.git cd modelzoo git tag -git checkout Release_2.6.0 +git checkout Release_2.9.0 ``` ## Running a Pytorch sample ### Activate your PyTorch virtual environment, and change to the working directory ```console -source ~/R_2.6.0/venv_cerebras_pt/bin/activate -cd ~/R_2.6.0/modelzoo/src/cerebras/modelzoo/models/nlp/gpt3 +source ~/R_2.9.0/venv_cerebras_pt/bin/activate +cd ~/R_2.9.0/modelzoo/src/cerebras/modelzoo/models/nlp/gpt3 ``` Next, copy a sample config file. This is for a small GPT3 model, modified to use a preprocessed dataset and to run for fewer steps. @@ -78,7 +78,7 @@ cszoo fit configs/Cerebras_GPT/111m_modified.yaml --job_labels name=gpt3_111m -- A successful GPT3 (111m parameters) PyTorch training/validation run should finish with output resembling the following: @@ -100,7 +100,7 @@ A successful GPT3 (111m parameters) PyTorch training/validation run should finis As the console output shows, for this sample, the run framework starts three jobs (two compiles and one execute) as part of a single workflow: ```text -(venv_cerebras_pt) username@cer-anl-net001-us-sr01:~/R_2.6.0/modelzoo/src/cerebras/modelzoo/models/nlp/gpt3$ grep -B1 "Job id:" mytest.log +(venv_cerebras_pt) username@cer-anl-net001-us-sr01:~/R_2.9.0/modelzoo/src/cerebras/modelzoo/models/nlp/gpt3$ grep -B1 "Job id:" mytest.log 2025-10-30 18:10:39,460 INFO: Initiating a new compile wsjob against the cluster server. 2025-10-30 18:10:39,479 INFO: Job id: wsjob-acxb4mqan53ppiffvdaafq, workflow id: wflow-ocjyqlrf5szhpecphsq3x8, namespace: job-operator, remote log path: /n1/wsjob/workdir/job-operator/wsjob-acxb4mqan53ppiffvdaafq -- @@ -109,7 +109,7 @@ As the console output shows, for this sample, the run framework starts three job -- 2025-10-30 18:21:33,099 INFO: Initiating a new compile wsjob against the cluster server. 
2025-10-30 18:21:33,118 INFO: Job id: wsjob-6mvjwjqovjprbibbpi3w43, workflow id: wflow-ocjyqlrf5szhpecphsq3x8, namespace: job-operator, remote log path: /n1/wsjob/workdir/job-operator/wsjob-6mvjwjqovjprbibbpi3w43 -(venv_cerebras_pt) username@cer-anl-net001-us-sr01:~/R_2.6.0/modelzoo/src/cerebras/modelzoo/models/nlp/gpt3$ +(venv_cerebras_pt) username@cer-anl-net001-us-sr01:~/R_2.9.0/modelzoo/src/cerebras/modelzoo/models/nlp/gpt3$ ``` The jobs can be seen with `csctl get jobs` from another console session on a user node.
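Every training example touched by this patch uses the same clean-run idiom before `cszoo fit`: remove any previous `model_dir` so the run does not silently resume from stale checkpoints. A minimal, self-contained sketch of that idiom (using a throwaway temporary directory as a stand-in for a real model directory, so it is safe to run anywhere):

```shell
#!/bin/sh
# Sketch of the clean-run idiom used before each `cszoo fit` invocation.
# MODEL_DIR here is a throwaway stand-in, not a real training directory.
MODEL_DIR="$(mktemp -d)/model_dir_demo"

# Simulate artifacts left behind by a previous run.
mkdir -p "$MODEL_DIR"
touch "$MODEL_DIR/checkpoint_0.mdl"

# Remove the old model directory, if present, so training starts fresh.
if [ -d "$MODEL_DIR" ]; then rm -Rf "$MODEL_DIR"; fi

if [ ! -d "$MODEL_DIR" ]; then echo "model dir is clean"; fi
```

Skipping this step is not an error, but the run framework will then pick up existing checkpoints in `model_dir`, which is usually not what a from-scratch sample run intends.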