|
1 | | -## ML Training how-to guide for users and developers |
| 1 | +# ML Training |
2 | 2 |
|
| 3 | +The ML training (implemented in ``train_model.py``) can be run in two ways: |
3 | 4 |
|
4 | | -### Prerequisites |
5 | | -- Ensure you have [Conda](https://conda-forge.org/download/) installed. |
6 | | -- Ensure you have Docker installed (for deployment) |
| 5 | +- In your local Python environment, for testing/debugging: ``python train_model.py ...`` |
7 | 6 |
|
| 7 | +- Through the GUI, by clicking the ``Train`` button, or through SLURM by running ``sbatch training_pm.sbatch``. |
| 8 | +In both cases, the training runs in a Docker container at NERSC. This Docker container |
| 9 | +is pulled from the NERSC registry (https://registry.nersc.gov) and does not reflect any local changes |
| 10 | +you may have made to ``train_model.py``, unless you re-build and re-deploy the container. |
8 | 11 |
|
9 | | -### How to set up the conda environment |
| 12 | +Both methods are described in more detail below. |
10 | 13 |
|
11 | | -#### Local development |
| 14 | +## Training in a local Python environment (testing/debugging) |
12 | 15 |
|
13 | | -For local development, create and activate the conda environment: |
14 | | -```bash |
15 | | -conda env create -f environment.yml |
16 | | -conda activate ml-training |
17 | | -``` |
18 | | -#### At NERSC |
| 16 | +### On your local computer |
19 | 17 |
|
20 | | -At NERSC, use the following instead to install the environment |
21 | | -```bash |
22 | | -module load python |
23 | | -conda env create --prefix /global/cfs/cdirs/m558/$(whoami)/sw/perlmutter/ml-training -f environment.yml |
24 | | -``` |
25 | | -and subsequently use this to activate the environment |
26 | | -```bash |
27 | | -module load python |
28 | | -conda activate /global/cfs/cdirs/m558/$(whoami)/sw/perlmutter/ml-training |
29 | | -``` |
| 18 | +For local development, ensure you have [Conda](https://conda-forge.org/download/) installed. Then: |
30 | 19 |
|
31 | | -### How to build and run the Docker container |
| 20 | +1. Create the conda environment (this only needs to be done once): |
| 21 | + ```bash |
| 22 | + conda env create -f environment.yml |
| 23 | + ``` |
| 24 | + |
| 25 | +2. Open a separate terminal and keep it open: |
| 26 | + ```bash |
| 27 | + ssh -L 27017:mongodb05.nersc.gov:27017 <username>@dtn03.nersc.gov -N |
| 28 | + ``` |
| 29 | + |
| 30 | +3. Activate the conda environment and set up database read-write access: |
| 31 | + ```bash |
| 32 | + conda activate ml-training |
| 33 | + export SF_DB_ADMIN_PASSWORD='your_password_here' # Use SINGLE quotes around the password! |
| 34 | + ``` |
| 35 | + |
| 36 | +4. Run the training script in test mode: |
| 37 | + ```console |
| 38 | + python train_model.py --test --model <NN/GP> --config_file <your_test_yaml_file> |
| 39 | + ``` |
| 40 | + |
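Step 2 above forwards local port 27017 to the MongoDB host, and the training script cannot reach the database unless that tunnel is up. The following is a minimal sketch of a pre-flight check, assuming the tunnel listens on `localhost:27017` as in the `ssh -L` command; the helper name `tunnel_is_open` is hypothetical and not part of the repository.

```python
import socket


def tunnel_is_open(host="localhost", port=27017, timeout=2.0):
    """Return True if a TCP connection to host:port can be established."""
    try:
        # create_connection raises OSError (e.g. ConnectionRefusedError,
        # timeout) when nothing is listening on the forwarded port.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

Calling `tunnel_is_open()` before `python train_model.py --test ...` only confirms that something accepts connections on the port; it does not verify the database credentials.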
| 41 | +### At NERSC |
| 42 | + |
| 43 | +1. Create the conda environment (this only needs to be done once): |
| 44 | + ```bash |
| 45 | + module load python |
| 46 | + conda env create --prefix /global/cfs/cdirs/m558/$(whoami)/sw/perlmutter/ml-training -f environment.yml |
| 47 | + ``` |
| 48 | + |
| 49 | +2. Activate the environment and set up database read-write access: |
| 50 | + ```bash |
| 51 | + module load python |
| 52 | + conda activate /global/cfs/cdirs/m558/$(whoami)/sw/perlmutter/ml-training |
| 53 | + export SF_DB_ADMIN_PASSWORD='your_password_here' # Use SINGLE quotes around the password! |
| 54 | + ``` |
| 55 | + |
| 56 | +3. Run the training script in test mode: |
| 57 | + ```console |
| 58 | + python train_model.py --test --model <NN/GP> --config_file <your_test_yaml_file> |
| 59 | + ``` |
| 60 | + |
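The single-quote advice in the steps above matters because a POSIX shell expands characters such as `$` inside double quotes. As a minimal sketch (not part of the repository), Python's standard `shlex.quote` can build a safely quoted `export` line for an arbitrary password; the helper name `export_line` and the sample password are hypothetical.

```python
import shlex


def export_line(name, value):
    """Build an `export NAME=value` line with the value quoted for a POSIX shell."""
    return f"export {name}={shlex.quote(value)}"


# Hypothetical password containing characters the shell would otherwise expand.
print(export_line("SF_DB_ADMIN_PASSWORD", "pa$$word!"))
```

The printed line wraps the value in single quotes, so it can be pasted into the terminal without the shell altering the password.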
| 61 | +## Training through the GUI or through SLURM |
| 62 | + |
| 63 | +> **Warning:** |
| 64 | +> |
| 65 | +> Pushing a new Docker container affects training jobs launched from your locally deployed GUI |
| 66 | +> as well as from the production GUI (deployed on NERSC Spin), since in both cases the training |
| 67 | +> runs in a Docker container at NERSC, which is pulled from the NERSC registry (https://registry.nersc.gov). |
| 68 | +> |
| 69 | +> However, this is currently the only way to test the end-to-end integration of the GUI with the training workflow. |
32 | 70 |
|
33 | 71 | 1. Move to the root directory of the repository. |
34 | 72 |
|
@@ -74,17 +112,20 @@ conda activate /global/cfs/cdirs/m558/$(whoami)/sw/perlmutter/ml-training |
74 | 112 | ```console |
75 | 113 | salloc -N 1 --ntasks-per-node=1 -t 1:00:00 -q interactive -C gpu --gpu-bind=single:1 -c 32 -G 1 -A m558 |
76 | 114 |
|
77 | | - podman-hpc run --gpu -v /etc/localtime:/etc/localtime -v $HOME/db.profile:/root/db.profile --rm -it registry.nersc.gov/m558/superfacility/ml-training:latest python -u /app/ml/train_model.py --experiment <experiment_name> --model NN |
| 115 | + podman-hpc run --gpu -v /etc/localtime:/etc/localtime -v $HOME/db.profile:/root/db.profile -v /path/to/config.yaml:/app/ml/config.yaml --rm -it registry.nersc.gov/m558/superfacility/ml-training:latest python -u /app/ml/train_model.py --test --config_file /app/ml/config.yaml --model NN |
78 | 116 | ``` |
79 | 117 | Note that `-v /etc/localtime:/etc/localtime` is necessary to synchronize the time zone in the container with the host machine. |
80 | 118 |
|
81 | | -Note: for our interactive dashboard, we run ML training jobs via the NERSC superfacility using the collaboration account `sf558`. |
82 | | -Since this is a non-interactive, non-user account, we also use a custom user to pull the image from https://registry.nersc.gov to perlmutter. |
83 | | -The registry login credentials need to be prepared (once) in the `$HOME` of `sf558` (`/global/homes/s/sf558/`) in a file named `registry.profile` with the following content: |
84 | | -```bash |
85 | | -export REGISTRY_USER="robot\$m558+perlmutter-nersc-gov" |
86 | | -export REGISTRY_PASSWORD="..." # request this from Remi/Axel |
87 | | -``` |
| 119 | + |
| 120 | +> **Note:** |
| 121 | +> |
| 122 | +> When we run ML training jobs through the GUI, we use NERSC's Superfacility API with the collaboration account `sf558`. |
| 123 | +> Since this is a non-interactive, non-user account, we also use a custom user to pull the image from https://registry.nersc.gov to Perlmutter. |
| 124 | +> The registry login credentials need to be prepared (once) in the `$HOME` of `sf558` (`/global/homes/s/sf558/`) in a file named `registry.profile` with the following content: |
| 125 | +> ```bash |
| 126 | +> export REGISTRY_USER="robot\$m558+perlmutter-nersc-gov" |
| 127 | +> export REGISTRY_PASSWORD="..." |
| 128 | +> ``` |
88 | 129 |
|
89 | 130 | ## References |
90 | 131 |
|
|