Commit dec4386

Authored by RemiLehe, pre-commit-ci[bot], and EZoni

Pass path to config file when running train_model.py, update README (#372)

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Edoardo Zoni <59625522+EZoni@users.noreply.github.com>

1 parent e2723b8, commit dec4386

5 files changed, +95 −60 lines changed


dashboard/README.md
Lines changed: 1 addition & 1 deletion

````diff
@@ -47,7 +47,7 @@ Here are a few how-to guides on how to develop and use the dashboard.
    conda activate gui
    ```

-2. Set the database settings (read+write):
+2. Set the database settings (read only):
    ```console
    export SF_DB_HOST='127.0.0.1'
    export SF_DB_READONLY_PASSWORD='your_password_here' # Use SINGLE quotes around the password!
````
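The dashboard reads these settings from the environment. As a minimal sketch of how such variables are typically consumed (the helper `get_db_settings` is hypothetical, not part of the dashboard code):

```python
import os


def get_db_settings():
    """Collect read-only database settings from the environment.

    Hypothetical helper: fails loudly when the password is missing, since an
    unset variable would otherwise surface later as an obscure connection error.
    """
    password = os.environ.get("SF_DB_READONLY_PASSWORD")
    if password is None:
        raise RuntimeError("SF_DB_READONLY_PASSWORD is not set")
    return {
        "host": os.environ.get("SF_DB_HOST", "127.0.0.1"),
        "password": password,
    }
```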

dashboard/model_manager.py
Lines changed: 4 additions & 4 deletions

````diff
@@ -221,13 +221,13 @@ async def training_kernel(self):
         if training_script is None:
             raise RuntimeError("Could not find training_pm.sbatch")

-        # replace the --experiment command line argument in the batch script
-        # with the current experiment in the state
+        # replace the --model argument in the python command with the current model type from the state
         training_script = re.sub(
-            pattern=r"--experiment (.*)",
-            repl=rf"--experiment {state.experiment} --model {model_type_tag_dict[state.model_type]}",
+            pattern=r"--model \$\{model\}",
+            repl=rf"--model {model_type_tag_dict[state.model_type]}",
             string=training_script,
         )
+
         # submit the training job through the Superfacility API
         sfapi_job = await perlmutter.submit_job(training_script)
         state.model_training_status = "Submitted"
````
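The substitution above can be sanity-checked in isolation: the batch script now contains a literal `--model ${model}` placeholder, so the pattern must escape `$`, `{`, and `}`. A minimal sketch using a shortened, hypothetical stand-in for the script contents:

```python
import re

# Shortened stand-in for the real training_pm.sbatch contents (hypothetical).
training_script = (
    "python -u /app/ml/train_model.py --config_file /app/ml/config.yaml --model ${model}"
)

# Mirrors the substitution in training_kernel(): the placeholder is matched
# literally, so the regex metacharacters $, { and } are escaped.
patched = re.sub(
    pattern=r"--model \$\{model\}",
    repl="--model NN",
    string=training_script,
)
```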

ml/README.md
Lines changed: 72 additions & 31 deletions

````diff
@@ -1,34 +1,72 @@
-## ML Training how-to guide for users and developers
+# ML Training

+The ML training (implemented in ``train_model.py``) can be run in two ways:

-### Prerequisites
-- Ensure you have [Conda](https://conda-forge.org/download/) installed.
-- Ensure you have Docker installed (for deployment)
+- In your local Python environment, for testing/debugging: ``python train_model.py ...``

+- Through the GUI, by clicking the ``Train`` button, or through SLURM by running ``sbatch training_pm.sbatch``.
+  In both cases, the training runs in a Docker container at NERSC. This Docker container
+  is pulled from the NERSC registry (https://registry.nersc.gov) and does not reflect any local changes
+  you may have made to ``train_model.py``, unless you re-build and re-deploy the container.

-### How to set up the conda environment
+Both methods are described in more detail below.

-#### Local development
+## Training in a local Python environment (testing/debugging)

-For local development, create and activate the conda environment:
-```bash
-conda env create -f environment.yml
-conda activate ml-training
-```
-#### At NERSC
+### On your local computer

-At NERSC, use the following instead to install the environment
-```bash
-module load python
-conda env create --prefix /global/cfs/cdirs/m558/$(whoami)/sw/perlmutter/ml-training -f environment.yml
-```
-and subsequently use this to activate the environment
-```bash
-module load python
-conda activate /global/cfs/cdirs/m558/$(whoami)/sw/perlmutter/ml-training
-```
+For local development, ensure you have [Conda](https://conda-forge.org/download/) installed. Then:

-### How to build and run the Docker container
+1. Create the conda environment (this only needs to be done once):
+   ```bash
+   conda env create -f environment.yml
+   ```
+
+2. Open a separate terminal and keep it open:
+   ```bash
+   ssh -L 27017:mongodb05.nersc.gov:27017 <username>@dtn03.nersc.gov -N
+   ```
+
+3. Activate the conda environment and setup database read-write access:
+   ```bash
+   conda activate ml-training
+   export SF_DB_ADMIN_PASSWORD='your_password_here' # Use SINGLE quotes around the password!
+   ```
+
+4. Run the training script in test mode:
+   ```console
+   python train_model.py --test --model <NN/GP> --config_file <your_test_yaml_file>
+   ```
+
+### At NERSC
+
+1. Create the conda environment (this only needs to be done once):
+   ```bash
+   module load python
+   conda env create --prefix /global/cfs/cdirs/m558/$(whoami)/sw/perlmutter/ml-training -f environment.yml
+   ```
+
+2. Activate the environment and setup database read-write access:
+   ```bash
+   module load python
+   conda activate /global/cfs/cdirs/m558/$(whoami)/sw/perlmutter/ml-training
+   export SF_DB_ADMIN_PASSWORD='your_password_here' # Use SINGLE quotes around the password!
+   ```
+
+3. Run the training script in test mode:
+   ```console
+   python train_model.py --test --model <NN/GP> --config_file <your_test_yaml_file>
+   ```
+
+## Training through the GUI or through SLURM
+
+> **Warning:**
+>
+> Pushing a new Docker container affects training jobs launched from your locally-deployed GUI,
+> but also from the production GUI (deployed on NERSC Spin), since in both cases, the training
+> runs in a Docker container at NERSC, which is pulled from the NERSC registry (https://registry.nersc.gov).
+>
+> Yet, currently, this is the only way to test the end-to-end integration of the GUI with the training workflow.

 1. Move to the root directory of the repository.

@@ -74,17 +112,20 @@ conda activate /global/cfs/cdirs/m558/$(whoami)/sw/perlmutter/ml-training
 ```console
 salloc -N 1 --ntasks-per-node=1 -t 1:00:00 -q interactive -C gpu --gpu-bind=single:1 -c 32 -G 1 -A m558

-podman-hpc run --gpu -v /etc/localtime:/etc/localtime -v $HOME/db.profile:/root/db.profile --rm -it registry.nersc.gov/m558/superfacility/ml-training:latest python -u /app/ml/train_model.py --experiment <experiment_name> --model NN
+podman-hpc run --gpu -v /etc/localtime:/etc/localtime -v $HOME/db.profile:/root/db.profile -v /path/to/config.yaml:/app/ml/config.yaml --rm -it registry.nersc.gov/m558/superfacility/ml-training:latest python -u /app/ml/train_model.py --test --config_file /app/ml/config.yaml --model NN
 ```
 Note that `-v /etc/localtime:/etc/localtime` is necessary to synchronize the time zone in the container with the host machine.

-Note: for our interactive dashboard, we run ML training jobs via the NERSC superfacility using the collaboration account `sf558`.
-Since this is a non-interactive, non-user account, we also use a custom user to pull the image from https://registry.nersc.gov to perlmutter.
-The registry login credentials need to be prepared (once) in the `$HOME` of `sf558` (`/global/homes/s/sf558/`) in a file named `registry.profile` with the following content:
-```bash
-export REGISTRY_USER="robot\$m558+perlmutter-nersc-gov"
-export REGISTRY_PASSWORD="..." # request this from Remi/Axel
-```
+
+> **Note:**
+>
+> When we run ML training jobs through the GUI, we use NERSC's Superfacility API with the collaboration account `sf558`.
+> Since this is a non-interactive, non-user account, we also use a custom user to pull the image from https://registry.nersc.gov to Perlmutter.
+> The registry login credentials need to be prepared (once) in the `$HOME` of `sf558` (`/global/homes/s/sf558/`) in a file named `registry.profile` with the following content:
+> ```bash
+> export REGISTRY_USER="robot\$m558+perlmutter-nersc-gov"
+> export REGISTRY_PASSWORD="..."
+> ```

 ## References
````
ml/train_model.py
Lines changed: 16 additions & 17 deletions

````diff
@@ -46,8 +46,8 @@ def parse_arguments():
     # Parse command line arguments
     parser = argparse.ArgumentParser()
     parser.add_argument(
-        "--experiment",
-        help="name/tag of the experiment",
+        "--config_file",
+        help="path to the configuration file",
         type=str,
         required=True,
     )
@@ -63,27 +63,23 @@ def parse_arguments():
         default=False,
     )
     args = parser.parse_args()
-    experiment = args.experiment
+    config_file = args.config_file
     model_type = args.model
     test_mode = args.test
-    print(f"Experiment: {experiment}, Model type: {model_type}, Test mode: {test_mode}")
+    print(
+        f"Config file path: {config_file}, Model type: {model_type}, Test mode: {test_mode}"
+    )
     if model_type not in ["NN", "ensemble_NN", "GP"]:
         raise ValueError(f"Invalid model type: {model_type}")
-    return experiment, model_type, test_mode
+    return config_file, model_type, test_mode


-def load_config(experiment):
-    # Extract configurations of experiments & models
-    possible_config_file_paths = [
-        f"{os.path.dirname(os.path.abspath(__file__))}config.yaml",
-        "./config.yaml",
-        f"../experiments/synapse-{experiment}/config.yaml",
-    ]
-    for config_file_path in possible_config_file_paths:
-        if os.path.exists(config_file_path):
-            with open(config_file_path) as f:
-                return yaml.safe_load(f.read())
-    raise RuntimeError("File config.yaml not found.")
+def load_config(config_file):
+    # Load configuration from the specified file path
+    if not os.path.exists(config_file):
+        raise RuntimeError(f"Configuration file not found: {config_file}")
+    with open(config_file) as f:
+        return yaml.safe_load(f.read())


 def connect_to_db(config_dict):
@@ -431,6 +427,9 @@ def write_model(model, model_type, experiment, db):
 # Parse command line arguments and load config
 experiment, model_type, test_mode = parse_arguments()
 config_dict = load_config(experiment)
+# Extract experiment name from config file
+experiment = config_dict["experiment"]
+print(f"Experiment: {experiment}")
 # Extract input and output variables from the config file
 input_variables = config_dict["inputs"]
 input_names = [v["name"] for v in input_variables.values()]
````
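The new `load_config` can be exercised end-to-end with a throwaway file. This sketch reproduces the function as shown in the diff and round-trips a one-line config through it (assumes PyYAML is installed, as `train_model.py` already requires):

```python
import os
import tempfile

import yaml  # PyYAML, as used by train_model.py


def load_config(config_file):
    # Load configuration from the specified file path
    if not os.path.exists(config_file):
        raise RuntimeError(f"Configuration file not found: {config_file}")
    with open(config_file) as f:
        return yaml.safe_load(f.read())


# Write a throwaway one-line config and round-trip it through load_config.
with tempfile.NamedTemporaryFile("w", suffix=".yaml", delete=False) as f:
    f.write("experiment: demo\n")
    path = f.name
config_dict = load_config(path)
os.remove(path)
```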

ml/training_pm.sbatch
Lines changed: 2 additions & 7 deletions

````diff
@@ -12,12 +12,7 @@
 #SBATCH -o /global/cfs/cdirs/m558/superfacility/model_training/logs/sf.o%j
 #SBATCH -e /global/cfs/cdirs/m558/superfacility/model_training/logs/sf.e%j

-experiment=${1} # e.g., "experiment_name"
-model=${2} # e.g., "NN", "GP", etc.
-
-if [[ -z "${experiment}" ]]; then
-    echo "Must pass the experiment name/tag as a command line argument"
-fi
+model=${1} # e.g., "NN", "GP", etc.

 # login to the registry, update if needed
 # Note: If you encounter issues, note that we compare image
@@ -59,4 +54,4 @@ srun podman-hpc run --gpu \
     -v $HOME/db.profile:/root/db.profile \
     -v /global/cfs/cdirs/m558/superfacility/model_training/config.yaml:/app/ml/config.yaml \
     --rm -it ${REGISTRY_NAME}/${IMAGE_NAME}:${IMAGE_VERSION} \
-    python -u /app/ml/train_model.py --experiment ${experiment} --model ${model}
+    python -u /app/ml/train_model.py --config_file /app/ml/config.yaml --model ${model}
````
