Commit dec4386

Authored by RemiLehe, pre-commit-ci[bot], and EZoni

Pass path to config file when running train_model.py, update README (#372)

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Edoardo Zoni <59625522+EZoni@users.noreply.github.com>

1 parent e2723b8, commit dec4386

5 files changed, +95 −60 lines changed


dashboard/README.md
Lines changed: 1 addition & 1 deletion

````diff
@@ -47,7 +47,7 @@ Here are a few how-to guides on how to develop and use the dashboard.
    conda activate gui
    ```

-2. Set the database settings (read+write):
+2. Set the database settings (read only):
    ```console
    export SF_DB_HOST='127.0.0.1'
    export SF_DB_READONLY_PASSWORD='your_password_here' # Use SINGLE quotes around the password!
````
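The dashboard reads these settings from the environment. As a minimal sketch of how such variables are typically consumed (the helper `get_db_settings` is hypothetical, not part of the dashboard code):

```python
import os


def get_db_settings():
    """Collect read-only database settings from the environment.

    Hypothetical helper: fails loudly when the password is missing, since an
    unset variable would otherwise surface later as an obscure connection error.
    """
    password = os.environ.get("SF_DB_READONLY_PASSWORD")
    if password is None:
        raise RuntimeError("SF_DB_READONLY_PASSWORD is not set")
    return {
        "host": os.environ.get("SF_DB_HOST", "127.0.0.1"),
        "password": password,
    }
```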

dashboard/model_manager.py
Lines changed: 4 additions & 4 deletions

````diff
@@ -221,13 +221,13 @@ async def training_kernel(self):
         if training_script is None:
             raise RuntimeError("Could not find training_pm.sbatch")

-        # replace the --experiment command line argument in the batch script
-        # with the current experiment in the state
+        # replace the --model argument in the python command with the current model type from the state
         training_script = re.sub(
-            pattern=r"--experiment (.*)",
-            repl=rf"--experiment {state.experiment} --model {model_type_tag_dict[state.model_type]}",
+            pattern=r"--model \$\{model\}",
+            repl=rf"--model {model_type_tag_dict[state.model_type]}",
             string=training_script,
         )
+
         # submit the training job through the Superfacility API
         sfapi_job = await perlmutter.submit_job(training_script)
         state.model_training_status = "Submitted"
````
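The substitution above can be sanity-checked in isolation: the batch script now contains a literal `--model ${model}` placeholder, so the pattern must escape `$`, `{`, and `}`. A minimal sketch using a shortened, hypothetical stand-in for the script contents:

```python
import re

# Shortened stand-in for the real training_pm.sbatch contents (hypothetical).
training_script = (
    "python -u /app/ml/train_model.py --config_file /app/ml/config.yaml --model ${model}"
)

# Mirrors the substitution in training_kernel(): the placeholder is matched
# literally, so the regex metacharacters $, { and } are escaped.
patched = re.sub(
    pattern=r"--model \$\{model\}",
    repl="--model NN",
    string=training_script,
)
```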

ml/README.md
Lines changed: 72 additions & 31 deletions

````diff
@@ -1,34 +1,72 @@
-## ML Training how-to guide for users and developers
+# ML Training

+The ML training (implemented in ``train_model.py``) can be run in two ways:

-### Prerequisites
-- Ensure you have [Conda](https://conda-forge.org/download/) installed.
-- Ensure you have Docker installed (for deployment)
+- In your local Python environment, for testing/debugging: ``python train_model.py ...``

+- Through the GUI, by clicking the ``Train`` button, or through SLURM by running ``sbatch training_pm.sbatch``.
+  In both cases, the training runs in a Docker container at NERSC. This Docker container
+  is pulled from the NERSC registry (https://registry.nersc.gov) and does not reflect any local changes
+  you may have made to ``train_model.py``, unless you re-build and re-deploy the container.

-### How to set up the conda environment
+Both methods are described in more detail below.

-#### Local development
+## Training in a local Python environment (testing/debugging)

-For local development, create and activate the conda environment:
-```bash
-conda env create -f environment.yml
-conda activate ml-training
-```
-#### At NERSC
+### On your local computer

-At NERSC, use the following instead to install the environment
-```bash
-module load python
-conda env create --prefix /global/cfs/cdirs/m558/$(whoami)/sw/perlmutter/ml-training -f environment.yml
-```
-and subsequently use this to activate the environment
-```bash
-module load python
-conda activate /global/cfs/cdirs/m558/$(whoami)/sw/perlmutter/ml-training
-```
+For local development, ensure you have [Conda](https://conda-forge.org/download/) installed. Then:

-### How to build and run the Docker container
+1. Create the conda environment (this only needs to be done once):
+   ```bash
+   conda env create -f environment.yml
+   ```
+
+2. Open a separate terminal and keep it open:
+   ```bash
+   ssh -L 27017:mongodb05.nersc.gov:27017 <username>@dtn03.nersc.gov -N
+   ```
+
+3. Activate the conda environment and setup database read-write access:
+   ```bash
+   conda activate ml-training
+   export SF_DB_ADMIN_PASSWORD='your_password_here' # Use SINGLE quotes around the password!
+   ```
+
+4. Run the training script in test mode:
+   ```console
+   python train_model.py --test --model <NN/GP> --config_file <your_test_yaml_file>
+   ```
+
+### At NERSC
+
+1. Create the conda environment (this only needs to be done once):
+   ```bash
+   module load python
+   conda env create --prefix /global/cfs/cdirs/m558/$(whoami)/sw/perlmutter/ml-training -f environment.yml
+   ```
+
+2. Activate the environment and setup database read-write access:
+   ```bash
+   module load python
+   conda activate /global/cfs/cdirs/m558/$(whoami)/sw/perlmutter/ml-training
+   export SF_DB_ADMIN_PASSWORD='your_password_here' # Use SINGLE quotes around the password!
+   ```
+
+3. Run the training script in test mode:
+   ```console
+   python train_model.py --test --model <NN/GP> --config_file <your_test_yaml_file>
+   ```
+
+## Training through the GUI or through SLURM
+
+> **Warning:**
+>
+> Pushing a new Docker container affects training jobs launched from your locally-deployed GUI,
+> but also from the production GUI (deployed on NERSC Spin), since in both cases, the training
+> runs in a Docker container at NERSC, which is pulled from the NERSC registry (https://registry.nersc.gov).
+>
+> Yet, currently, this is the only way to test the end-to-end integration of the GUI with the training workflow.

 1. Move to the root directory of the repository.

@@ -74,17 +112,20 @@ conda activate /global/cfs/cdirs/m558/$(whoami)/sw/perlmutter/ml-training
 ```console
 salloc -N 1 --ntasks-per-node=1 -t 1:00:00 -q interactive -C gpu --gpu-bind=single:1 -c 32 -G 1 -A m558

-podman-hpc run --gpu -v /etc/localtime:/etc/localtime -v $HOME/db.profile:/root/db.profile --rm -it registry.nersc.gov/m558/superfacility/ml-training:latest python -u /app/ml/train_model.py --experiment <experiment_name> --model NN
+podman-hpc run --gpu -v /etc/localtime:/etc/localtime -v $HOME/db.profile:/root/db.profile -v /path/to/config.yaml:/app/ml/config.yaml --rm -it registry.nersc.gov/m558/superfacility/ml-training:latest python -u /app/ml/train_model.py --test --config_file /app/ml/config.yaml --model NN
 ```
 Note that `-v /etc/localtime:/etc/localtime` is necessary to synchronize the time zone in the container with the host machine.

-Note: for our interactive dashboard, we run ML training jobs via the NERSC superfacility using the collaboration account `sf558`.
-Since this is a non-interactive, non-user account, we also use a custom user to pull the image from https://registry.nersc.gov to perlmutter.
-The registry login credentials need to be prepared (once) in the `$HOME` of `sf558` (`/global/homes/s/sf558/`) in a file named `registry.profile` with the following content:
-```bash
-export REGISTRY_USER="robot\$m558+perlmutter-nersc-gov"
-export REGISTRY_PASSWORD="..." # request this from Remi/Axel
-```
+
+> **Note:**
+>
+> When we run ML training jobs through the GUI, we use NERSC's Superfacility API with the collaboration account `sf558`.
+> Since this is a non-interactive, non-user account, we also use a custom user to pull the image from https://registry.nersc.gov to Perlmutter.
+> The registry login credentials need to be prepared (once) in the `$HOME` of `sf558` (`/global/homes/s/sf558/`) in a file named `registry.profile` with the following content:
+> ```bash
+> export REGISTRY_USER="robot\$m558+perlmutter-nersc-gov"
+> export REGISTRY_PASSWORD="..."
+> ```

 ## References
````
ml/train_model.py
Lines changed: 16 additions & 17 deletions

````diff
@@ -46,8 +46,8 @@ def parse_arguments():
     # Parse command line arguments
     parser = argparse.ArgumentParser()
     parser.add_argument(
-        "--experiment",
-        help="name/tag of the experiment",
+        "--config_file",
+        help="path to the configuration file",
         type=str,
         required=True,
     )
@@ -63,27 +63,23 @@ def parse_arguments():
         default=False,
     )
     args = parser.parse_args()
-    experiment = args.experiment
+    config_file = args.config_file
     model_type = args.model
     test_mode = args.test
-    print(f"Experiment: {experiment}, Model type: {model_type}, Test mode: {test_mode}")
+    print(
+        f"Config file path: {config_file}, Model type: {model_type}, Test mode: {test_mode}"
+    )
     if model_type not in ["NN", "ensemble_NN", "GP"]:
         raise ValueError(f"Invalid model type: {model_type}")
-    return experiment, model_type, test_mode
+    return config_file, model_type, test_mode


-def load_config(experiment):
-    # Extract configurations of experiments & models
-    possible_config_file_paths = [
-        f"{os.path.dirname(os.path.abspath(__file__))}config.yaml",
-        "./config.yaml",
-        f"../experiments/synapse-{experiment}/config.yaml",
-    ]
-    for config_file_path in possible_config_file_paths:
-        if os.path.exists(config_file_path):
-            with open(config_file_path) as f:
-                return yaml.safe_load(f.read())
-    raise RuntimeError("File config.yaml not found.")
+def load_config(config_file):
+    # Load configuration from the specified file path
+    if not os.path.exists(config_file):
+        raise RuntimeError(f"Configuration file not found: {config_file}")
+    with open(config_file) as f:
+        return yaml.safe_load(f.read())


 def connect_to_db(config_dict):
@@ -431,6 +427,9 @@ def write_model(model, model_type, experiment, db):
 # Parse command line arguments and load config
 experiment, model_type, test_mode = parse_arguments()
 config_dict = load_config(experiment)
+# Extract experiment name from config file
+experiment = config_dict["experiment"]
+print(f"Experiment: {experiment}")
 # Extract input and output variables from the config file
 input_variables = config_dict["inputs"]
 input_names = [v["name"] for v in input_variables.values()]
````
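The new `load_config` can be exercised end-to-end with a throwaway file. This sketch reproduces the function as shown in the diff and round-trips a one-line config through it (assumes PyYAML is installed, as `train_model.py` already requires):

```python
import os
import tempfile

import yaml  # PyYAML, as used by train_model.py


def load_config(config_file):
    # Load configuration from the specified file path
    if not os.path.exists(config_file):
        raise RuntimeError(f"Configuration file not found: {config_file}")
    with open(config_file) as f:
        return yaml.safe_load(f.read())


# Write a throwaway one-line config and round-trip it through load_config.
with tempfile.NamedTemporaryFile("w", suffix=".yaml", delete=False) as f:
    f.write("experiment: demo\n")
    path = f.name
config_dict = load_config(path)
os.remove(path)
```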

ml/training_pm.sbatch
Lines changed: 2 additions & 7 deletions

````diff
@@ -12,12 +12,7 @@
 #SBATCH -o /global/cfs/cdirs/m558/superfacility/model_training/logs/sf.o%j
 #SBATCH -e /global/cfs/cdirs/m558/superfacility/model_training/logs/sf.e%j

-experiment=${1} # e.g., "experiment_name"
-model=${2} # e.g., "NN", "GP", etc.
-
-if [[ -z "${experiment}" ]]; then
-    echo "Must pass the experiment name/tag as a command line argument"
-fi
+model=${1} # e.g., "NN", "GP", etc.

 # login to the registry, update if needed
 # Note: If you encounter issues, note that we compare image
@@ -59,4 +54,4 @@ srun podman-hpc run --gpu \
     -v $HOME/db.profile:/root/db.profile \
     -v /global/cfs/cdirs/m558/superfacility/model_training/config.yaml:/app/ml/config.yaml \
     --rm -it ${REGISTRY_NAME}/${IMAGE_NAME}:${IMAGE_VERSION} \
-    python -u /app/ml/train_model.py --experiment ${experiment} --model ${model}
+    python -u /app/ml/train_model.py --config_file /app/ml/config.yaml --model ${model}
````
