Execute GEOS via calling gcm_run.j and create new model specific tasks. #677

Dooruk · 2025-12-15T20:14:07Z

Following the comments on my previous PR, I created a more "lightweight" PR. I realize this is not an easy PR for others who never executed GEOS within SWELL to review. Changes to R2D2 and GEOS model version hasn't helped in terms of simplifying this either. This is still a work in progress, and I don't even think it will work in this form, but if #626 goes in first I could simplify this a little bit more.

I added files to ignore at the end. I realize I should be more selective in my commits in terms of what goes in. Also better docstrings are required (I see @ashiklom's comments in other PRs). With those in mind..

There are two main things happening here (read the section about `gcm_run.j` at the end of the description for model execution task):

Cylc (flow.cylc) calls gcm_run.j directly. To faciliate this a forecast directory was created under {swell_exp_dir}/GEOSgcm/forecast. This forecast folder is a replication of a GEOS experiment folder, with only a few changes regarding where HOMDIR, EXPDIR are defined. Model execution happens under {swell_exp_dir}/GEOSgcm/forecast/scratch similar to typical GEOS model runs.

Why was this change necessary:

We want SWELL to be less involved in GEOS model execution task(s). The previous method required lots of file manipulation (in particular due to the boundary condition files, AKA /RC files) in the forecast directory. This creates incompatibility while running/testing different GEOSgcm versions.
In Cylc templating forecast dir can't be updated in final flow.cylc if it is templated in a time dependent way.
subprocess simply didn't run with GEOSv12 on Milan nodes, I tried many combinations, didn't pass beyond initialization.
Defining sufficient nodes, MPI layouts etc. is handled by gcm_run.j. If users make mistake in terms of requesting sufficient SLURM nodes, GEOS tries submitting hundreds of instances to compensate lack of compute resources, then NCCS will yell at you.
(long term relevancy) gcm_run.j and gcm_setup.j scripts are being or will be modernized. This is work underway but might take a long time (especially gcm_run.j).

⚠️ Which means to use a gcm_run.j in SWELL, some parts should be erased or commented out. Or, my idea is that there could be conditional sections in gcm_run.j say SWELL_active, then gcm_run.j can skip those sections, which are mainly postprocessing anyway.

Model-differentiated tasks were created in this PR. There is a new suite and new tasks that are only executed with coupled mode. A new suite called 3dfgat_coupled_cycle (not the best name) is added.

⚠️ @mranst introduced this in some capacity already: #626, so if we merge the model-differentiated tasks PR first I can make necessary changes to my PR and it might be less confusing. It is my mistake to make this change in this PR.

Ignore these changes (has no impact or related to grid change):

src/swell/configuration/jedi/interfaces/geos_marine/model/background_error.yaml
src/swell/tasks/generate_b_climatology.py

Finally, little primer on `gcm_run.j`

Let's consider gcm_run.j in 4 stages:

SLURM & node assignment
Preprocessing
Execution
Postprocessing

In the current implementation, SWELL handles 2 & 3 via python and subprocess and 1 is assumed to be set properly by the user, which caused trouble with the NCCS. For DA purposes 4, postprocessing is explicitly handled by SWELL but that is not the focus of this PR.

In this proposed implementation, the main difference is that we rely on gcm_run.j for 2 and 3 by conducting surgical edits via PrepCoupledGeosRundir at few locations and running gcm_run.j directly from Cylc (which doesn't capture failed exit status):

    [[RunGeos]]
        script = "{{experiment_path}}/forecast/gcm_run.j"
        platform = {{platform}}
        [[[job]]]
            shell = /bin/csh
        [[[directives]]]
        {%- for key, value in scheduling["RunGeos"]["directives"]["all"].items() %}
            --{{key}} = {{value}}
        {%- endfor %}

I created the 3dfgat_coupled_cycle suite for testing, should work by default if anyone has time to check it out.

…via Cylc

Copilot

Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.

ashiklom · 2025-12-16T11:01:53Z

src/swell/tasks/base/task_base.py

+        '''
+        Method to provide "forecast" directory to geos class
+        If paths are provided, it is combined with the forecast directory and returned
+        '''


Better! But, FWIW, where I'd like to get with these is something like this:

Suggested change

'''

Method to provide "forecast" directory to geos class

If paths are provided, it is combined with the forecast directory and returned

'''

"""Set structure of 'forecast' directory for GEOS

Args:

paths: One or more paths to concatenate into a nested forecast directory structure.

Returns:

The full forecast directory path, appended to `self.cycle_forecast_dir`, as a single string.

Examples:

For example, if `obj.cycle_forecast_dir = "/path/to/cycle"` then:

>>> obj.forecast_dir("forecast")

"/path/to/cycle/forecast"

>>> obj.forecast_dir(["nested", "dir", "structure"])

"/path/to/cycle/nested/dir/structure"

"""

For more examples of Google-style docstrings, see here:

https://sphinxcontrib-napoleon.readthedocs.io/en/latest/example_google.html

Also, you don't have to change it here (because we use forecast_dir a lot, and this PR touches a lot of things!), but just flagging that it's best to avoid functions that can take multiple types whenever possible. In this case, this should either always take a string and literally concatenate it, or (perhaps, better) it should always take a list of strings (even if there's just one path). That requires much less code with, IMHO, much clearer and more consistent behavior.

def forecast_dir(self, paths: list[str]) -> str: os.makedirs(self.cycle_forecast_dir, 0o755, exist_ok=true) return os.path.join(self.cycle_forecast_dir, *paths)

src/swell/tasks/generate_b_climatology.py

src/swell/tasks/get_coupled_geos_restart.py

ashiklom · 2025-12-16T11:30:12Z

src/swell/tasks/link_coupled_geos_output.py

+        ds = ds.rename({'aice': 'aice_h', 'hi': 'hi_h', 'hs': 'hs_h'})
+
+        # Save as a new file
+        ds.to_netcdf(dst_history, mode='w')


Suggest being explicit about format here.

Suggested change

ds.to_netcdf(dst_history, mode='w')

ds.to_netcdf(dst_history, mode='w', format='NETCDF4')

This would write a *.nc4 file, correct?

format="NETCDF4" has nothing to do with the file extension. It will write to whatever the full value of dst_history is. If dst_history = "somefile.crazy", it will write a file called somefile.crazy.

The point here is to (1) make it clear to other/future programmers that we're writing NetCDF4 files; and (2) make this code agnostic to the current dependencies installed and to xarray.to_netcdf's internal rules about how it decides what kinds of "netcdf" files to write by default (e.g., right now, I think the rule is something like: Use NetCDF4 if h5netcdf or netcdf4 is installed; otherwise, fall back to netcdf3...but that might change in the future).

src/swell/tasks/move_erase_da_restart.py

src/swell/tasks/link_coupled_geos_output.py

src/swell/tasks/prepare_analysis.py

ashiklom · 2025-12-16T12:19:05Z

src/swell/utilities/file_system_operations.py


-    except Exception:
-        logger.abort('Copying failed, see if source files exists')
+    except Exception as e:


In general, avoid catch-all except Exception.

If shutil is most likely to throw an OSError, we can catch that explicitly and let all other errors bubble up naturally.

except OSError as e: logger.abort("...")

Better yet, in general, we should start avoiding this pattern altogether and just let exceptions get raised natively by Python and bubble up through the call stack.

I understand OSError is a subset of Exception. Are you suggesting also to avoid as e part?

Yes, OSError is a subclass of Exception, but that's exactly the point: It's bad engineering practice to be overly indiscriminate in handling exceptions (i.e., except Exception). Basically, the problem is that SystemExit (triggered by some system error calls), KeyboardInterrupt (the exception triggered by pressing Control-C), and lots of other things are all rolled up under Exception. In most cases, you do not want SystemExit and KeyboardInterrupt to be handled in the same ways as other things (and usually, you don't want these handled at all but raised immediately).

As a specific example: As written right now, if you try to interrupt this code with Control-C while it happens to be inside this try-except block, you will not interrupt immediately but rather keep executing via the logger.abort statement.

Here is a more detailed writeup: https://jerrynsh.com/stop-using-exceptions-like-this-in-python/

All that said: logger.abort will just re-raise the same exception with a logging message, so the practical consequences of this are relatively minor. But we could simplify a lot of our code by just not bothering with logger.abort in the first place and just letting exceptions raise themselves --- Python is perfectly good at printing out backtraces and other useful information when it raises errors without it needing to be wrapped in a logger.

ashiklom · 2025-12-16T12:19:17Z

src/swell/utilities/file_system_operations.py


-    except Exception:
-        logger.abort('Moving failed, see if source files exist')
+    except Exception as e:


Ditto to above.

src/swell/utilities/question_defaults.py

mranst · 2025-12-17T21:37:57Z

src/swell/suites/3dfgat_coupled_cycle/suite_config.py

+                "sea_ice_snow_thickness"
+            ]),
+            qd.window_length("PT6H"),
+            qd.window_offset("PT3H"),


The window_offset causes failures since its been removed from swell

Yeah, I had that change in my full PR. I don't think this PR will work without the remainder of the parts.

Dooruk added 9 commits December 15, 2025 13:26

write out exceptions

cb9c947

minor changes, make TUI message clearer

c1f7119

changes in moving files between forecast cycles

3ca20c4

take out the linking task

10ff584

introduce simplified coupled tasks assuming direct execution of GEOS …

79e6153

…via Cylc

forecast suite changes

b573f10

revamp files, more changes underway

a1fd030

define new coupled model only suite

55aeb11

change forecast location and make forecast a static folder

4c8b9a3

Dooruk mentioned this pull request Dec 15, 2025

A different way of executing GEOS from SWELL #643

Closed

Dooruk requested a review from Copilot December 15, 2025 20:14

Copilot started reviewing on behalf of Dooruk December 15, 2025 20:17 View session

Dooruk requested review from ashiklom, jeromebarre, mer-a-o, mranst and rtodling December 15, 2025 20:37

Copilot AI reviewed Dec 16, 2025

View reviewed changes