Skip to content

Conversation

@Dooruk
Copy link
Collaborator

@Dooruk Dooruk commented Dec 15, 2025

Following the comments on my previous PR, I created a more "lightweight" PR. I realize this is not an easy PR for others who never executed GEOS within SWELL to review. Changes to R2D2 and GEOS model version hasn't helped in terms of simplifying this either. This is still a work in progress, and I don't even think it will work in this form, but if #626 goes in first I could simplify this a little bit more.

I added files to ignore at the end. I realize I should be more selective in my commits in terms of what goes in. Also better docstrings are required (I see @ashiklom's comments in other PRs). With those in mind..

There are two main things happening here (read the section about gcm_run.j at the end of the description for model execution task):

  1. Cylc (flow.cylc) calls gcm_run.j directly. To faciliate this a forecast directory was created under {swell_exp_dir}/GEOSgcm/forecast. This forecast folder is a replication of a GEOS experiment folder, with only a few changes regarding where HOMDIR, EXPDIR are defined. Model execution happens under {swell_exp_dir}/GEOSgcm/forecast/scratch similar to typical GEOS model runs.

Why was this change necessary:

  • We want SWELL to be less involved in GEOS model execution task(s). The previous method required lots of file manipulation (in particular due to the boundary condition files, AKA /RC files) in the forecast directory. This creates incompatibility while running/testing different GEOSgcm versions.
  • In Cylc templating forecast dir can't be updated in final flow.cylc if it is templated in a time dependent way.
  • subprocess simply didn't run with GEOSv12 on Milan nodes, I tried many combinations, didn't pass beyond initialization.
  • Defining sufficient nodes, MPI layouts etc. is handled by gcm_run.j. If users make mistake in terms of requesting sufficient SLURM nodes, GEOS tries submitting hundreds of instances to compensate lack of compute resources, then NCCS will yell at you.
  • (long term relevancy) gcm_run.j and gcm_setup.j scripts are being or will be modernized. This is work underway but might take a long time (especially gcm_run.j).

⚠️ Which means to use a gcm_run.j in SWELL, some parts should be erased or commented out. Or, my idea is that there could be conditional sections in gcm_run.j say SWELL_active, then gcm_run.j can skip those sections, which are mainly postprocessing anyway.

  1. Model-differentiated tasks were created in this PR. There is a new suite and new tasks that are only executed with coupled mode. A new suite called 3dfgat_coupled_cycle (not the best name) is added.

⚠️ @mranst introduced this in some capacity already: #626, so if we merge the model-differentiated tasks PR first I can make necessary changes to my PR and it might be less confusing. It is my mistake to make this change in this PR.

Ignore these changes (has no impact or related to grid change):

src/swell/configuration/jedi/interfaces/geos_marine/model/background_error.yaml
src/swell/tasks/generate_b_climatology.py

Finally, little primer on gcm_run.j

Let's consider gcm_run.j in 4 stages:

  1. SLURM & node assignment
  2. Preprocessing
  3. Execution
  4. Postprocessing

In the current implementation, SWELL handles 2 & 3 via python and subprocess and 1 is assumed to be set properly by the user, which caused trouble with the NCCS. For DA purposes 4, postprocessing is explicitly handled by SWELL but that is not the focus of this PR.

In this proposed implementation, the main difference is that we rely on gcm_run.j for 2 and 3 by conducting surgical edits via PrepCoupledGeosRundir at few locations and running gcm_run.j directly from Cylc (which doesn't capture failed exit status):

    [[RunGeos]]
        script = "{{experiment_path}}/forecast/gcm_run.j"
        platform = {{platform}}
        [[[job]]]
            shell = /bin/csh
        [[[directives]]]
        {%- for key, value in scheduling["RunGeos"]["directives"]["all"].items() %}
            --{{key}} = {{value}}
        {%- endfor %}

I created the 3dfgat_coupled_cycle suite for testing, should work by default if anyone has time to check it out.

Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.

Comment on lines +206 to +209
'''
Method to provide "forecast" directory to geos class
If paths are provided, it is combined with the forecast directory and returned
'''
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Better! But, FWIW, where I'd like to get with these is something like this:

Suggested change
'''
Method to provide "forecast" directory to geos class
If paths are provided, it is combined with the forecast directory and returned
'''
"""Set structure of 'forecast' directory for GEOS
Args:
paths: One or more paths to concatenate into a nested forecast directory structure.
Returns:
The full forecast directory path, appended to `self.cycle_forecast_dir`, as a single string.
Examples:
For example, if `obj.cycle_forecast_dir = "/path/to/cycle"` then:
>>> obj.forecast_dir("forecast")
"/path/to/cycle/forecast"
>>> obj.forecast_dir(["nested", "dir", "structure"])
"/path/to/cycle/nested/dir/structure"
"""

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For more examples of Google-style docstrings, see here:

https://sphinxcontrib-napoleon.readthedocs.io/en/latest/example_google.html

Also, you don't have to change it here (because we use forecast_dir a lot, and this PR touches a lot of things!), but just flagging that it's best to avoid functions that can take multiple types whenever possible. In this case, this should either always take a string and literally concatenate it, or (perhaps, better) it should always take a list of strings (even if there's just one path). That requires much less code with, IMHO, much clearer and more consistent behavior.

def forecast_dir(self, paths: list[str]) -> str:
  os.makedirs(self.cycle_forecast_dir, 0o755, exist_ok=true)
  return os.path.join(self.cycle_forecast_dir, *paths)

ds = ds.rename({'aice': 'aice_h', 'hi': 'hi_h', 'hs': 'hs_h'})

# Save as a new file
ds.to_netcdf(dst_history, mode='w')
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggest being explicit about format here.

Suggested change
ds.to_netcdf(dst_history, mode='w')
ds.to_netcdf(dst_history, mode='w', format='NETCDF4')

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This would write a *.nc4 file, correct?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

format="NETCDF4" has nothing to do with the file extension. It will write to whatever the full value of dst_history is. If dst_history = "somefile.crazy", it will write a file called somefile.crazy.

The point here is to (1) make it clear to other/future programmers that we're writing NetCDF4 files; and (2) make this code agnostic to the current dependencies installed and to xarray.to_netcdf's internal rules about how it decides what kinds of "netcdf" files to write by default (e.g., right now, I think the rule is something like: Use NetCDF4 if h5netcdf or netcdf4 is installed; otherwise, fall back to netcdf3...but that might change in the future).


except Exception:
logger.abort('Copying failed, see if source files exists')
except Exception as e:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In general, avoid catch-all except Exception.

If shutil is most likely to throw an OSError, we can catch that explicitly and let all other errors bubble up naturally.

except OSError as e:
  logger.abort("...")

Better yet, in general, we should start avoiding this pattern altogether and just let exceptions get raised natively by Python and bubble up through the call stack.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I understand OSError is a subset of Exception. Are you suggesting also to avoid as e part?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, OSError is a subclass of Exception, but that's exactly the point: It's bad engineering practice to be overly indiscriminate in handling exceptions (i.e., except Exception). Basically, the problem is that SystemExit (triggered by some system error calls), KeyboardInterrupt (the exception triggered by pressing Control-C), and lots of other things are all rolled up under Exception. In most cases, you do not want SystemExit and KeyboardInterrupt to be handled in the same ways as other things (and usually, you don't want these handled at all but raised immediately).

As a specific example: As written right now, if you try to interrupt this code with Control-C while it happens to be inside this try-except block, you will not interrupt immediately but rather keep executing via the logger.abort statement.

Here is a more detailed writeup: https://jerrynsh.com/stop-using-exceptions-like-this-in-python/

All that said: logger.abort will just re-raise the same exception with a logging message, so the practical consequences of this are relatively minor. But we could simplify a lot of our code by just not bothering with logger.abort in the first place and just letting exceptions raise themselves --- Python is perfectly good at printing out backtraces and other useful information when it raises errors without it needing to be wrapped in a logger.


except Exception:
logger.abort('Moving failed, see if source files exist')
except Exception as e:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ditto to above.

"sea_ice_snow_thickness"
]),
qd.window_length("PT6H"),
qd.window_offset("PT3H"),
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The window_offset causes failures since its been removed from swell

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I had that change in my full PR. I don't think this PR will work without the remainder of the parts.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants