Allow grouping input cubes by date (instead of filename) for `fix_metadata` #2551

schlunma · 2024-10-11T08:56:27Z

Description

This PR allows grouping the input cubes for our fix_metadata functions by date (i.e., all files with the same date range are passed to the fix simultaneously). This can be enabled by setting the class variable GROUP_CUBES_BY_DATE = True in the corresponding fix class. This allows implementing fixes where variables from multiple input files are necessary (for example, to derive rsut for ERA5).
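As a rough illustration of what such a fix class could look like, here is a minimal sketch. It does not use the real esmvalcore or iris classes: `FakeCube`, the variable names `tisr` and `tsr`, and the derivation (upwelling = incident minus net, with identical units) are all assumptions made for the example. Only the `GROUP_CUBES_BY_DATE` attribute and the `fix_metadata` method name come from this PR.

```python
from dataclasses import dataclass


@dataclass
class FakeCube:
    """Minimal stand-in for an iris cube (hypothetical, illustration only)."""

    var_name: str
    data: list[float]


class Rsut:
    """Hypothetical fix that needs two input variables at once."""

    # Opt in to date-based grouping; it is off unless explicitly enabled.
    GROUP_CUBES_BY_DATE = True

    def fix_metadata(self, cubes):
        """Derive rsut from all cubes that share one date range."""
        by_name = {cube.var_name: cube for cube in cubes}
        incident = by_name["tisr"]  # assumed name: TOA incident solar radiation
        net = by_name["tsr"]  # assumed name: TOA net shortwave radiation
        # Assumption: upwelling = incident - net, both in the same units.
        rsut = [i - n for i, n in zip(incident.data, net.data)]
        return [FakeCube("rsut", rsut)]


# Both cubes for the year 2000 arrive in the same call.
cubes_2000 = [
    FakeCube("tisr", [340.0, 350.0]),
    FakeCube("tsr", [240.0, 245.0]),
]
result = Rsut().fix_metadata(cubes_2000)
```

With the default grouping by filename, each call would only see one of the two cubes, so the lookup of the second variable would fail.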

This solution only works for projects where the input files are located in the same directory, and the input file pattern is flexible enough to find all files. This is fine for the native ERA5 data in netCDF format (that we need to manually download and put into the corresponding directories). For other projects where files are stored in different directories, further changes are necessary (potentially in local.py). However, this PR is a prerequisite to make these other cases work.

By default, input cubes are grouped by filename for fix_metadata (i.e., each fix_metadata call operates only on a single file):

ESMValCore/esmvalcore/cmor/fix.py

Lines 197 to 204 in fd82b43

    
    by_file = defaultdict(list)
    for cube in cubes:
        by_file[cube.attributes.get("source_file", "")].append(cube)
    for cube_list in by_file.values():
        cube_list = CubeList(cube_list)
        for fix in fixes:
            cube_list = fix.fix_metadata(cube_list)

Note that this is fully backwards-compatible since the new functionality needs to be explicitly enabled.

Closes #1806

Link to documentation: TBA

Before you get started

☝ Create an issue to discuss what you are going to do

Checklist

It is the responsibility of the author to make sure the pull request is ready to review. The icons indicate whether the item will be subject to the 🛠 Technical or 🧪 Scientific review.

🧪 The new functionality is relevant and scientifically sound
🛠 This pull request has a descriptive title and labels
🛠 Code is written according to the code quality guidelines
🧪 and 🛠 Documentation is available
🛠 Unit tests have been added
🛠 Changes are backward compatible
🛠 Any changed dependencies have been added or removed correctly
🛠 The list of authors is up to date
🛠 All checks below this pull request were successful

To help with the number of pull requests:

🙏 We kindly ask you to review two other open pull requests in this repository

schlunma · 2024-10-11T08:59:25Z

Here's a small recipe to test this for rsut (I also included rsdt to ensure that this still works fine):

# ESMValTool
---
documentation:
  title: test
  description: test
  authors:
    - schlund_manuel

datasets:
  - {project: native6, dataset: ERA5, type: reanaly, version: v1, tier: 3, timerange: 2000/2001}

diagnostics:

  test:
    variables:
      rsut:
        mip: Amon
      rsdt:
        mip: Amon
    scripts:
      null

Input files need to be arranged like this:

.
└── Tier3
    └── ERA5
        └── v1
            └── mon
                ├── rsdt
                │   ├── era5_toa_incident_solar_radiation_2000_monthly.nc
                │   └── era5_toa_incident_solar_radiation_2001_monthly.nc
                └── rsut
                    ├── era5_mean_top_net_short_wave_radiation_flux_2000_monthly.nc
                    ├── era5_mean_top_net_short_wave_radiation_flux_2001_monthly.nc
                    ├── era5_toa_incident_solar_radiation_2000_monthly.nc
                    └── era5_toa_incident_solar_radiation_2001_monthly.nc
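To make the effect of the two grouping modes concrete for the rsut directory above, here is a small sketch. The function `group_source_files` and the way the date is read from the file name (a 4-digit year between underscores) are assumptions for illustration and are not taken from the PR's implementation.

```python
import re
from collections import defaultdict


def group_source_files(source_files, by_date=False):
    """Group file names individually (default) or by the year in the name."""
    groups = defaultdict(list)
    for name in source_files:
        if by_date:
            # Assumption: the date range is a 4-digit year between underscores.
            match = re.search(r"_(\d{4})_", name)
            key = match.group(1) if match else ""
        else:
            key = name
        groups[key].append(name)
    return dict(groups)


files = [
    "era5_mean_top_net_short_wave_radiation_flux_2000_monthly.nc",
    "era5_mean_top_net_short_wave_radiation_flux_2001_monthly.nc",
    "era5_toa_incident_solar_radiation_2000_monthly.nc",
    "era5_toa_incident_solar_radiation_2001_monthly.nc",
]
by_file = group_source_files(files)  # four groups, one file each
by_year = group_source_files(files, by_date=True)  # two groups, two files each
```

Grouping by date yields one group per year, each holding both variables needed for the derivation.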

@bouweandela do you think this approach is a reasonable solution to this problem? As mentioned in the description, it doesn't solve the problem for all cases, but some grouping other than by filename will be necessary for all of them. And it is fully sufficient for the ERA5 netCDF case.

codecov · 2024-10-11T09:45:28Z

Codecov Report

Attention: Patch coverage is 57.14286% with 12 lines in your changes missing coverage. Please review.

Project coverage is 95.12%. Comparing base (0dc146c) to head (cabc32c).

Files with missing lines	Patch %	Lines
esmvalcore/cmor/_fixes/native6/era5.py	36.36%	7 Missing ⚠️
esmvalcore/cmor/fix.py	66.66%	5 Missing ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #2551      +/-   ##
==========================================
- Coverage   95.20%   95.12%   -0.08%     
==========================================
  Files         259      259              
  Lines       15211    15232      +21     
==========================================
+ Hits        14481    14489       +8     
- Misses        730      743      +13

bouweandela · 2025-05-28T08:33:37Z

esmvalcore/cmor/fix.py

    fixed_cubes = CubeList()

-    # Group cubes by input file and apply all fixes to each group element
-    # (i.e., each file) individually
-    by_file = defaultdict(list)
-    for cube in cubes:
-        by_file[cube.attributes.get("source_file", "")].append(cube)
-
-    for cube_list in by_file.values():
-        cube_list = CubeList(cube_list)
+    # Group cubes and apply all fixes to each group element individually. There
+    # are two options for grouping:
+    # (1) By input file name (default).
+    # (2) By time range (can be enabled by setting the attribute
+    #     GROUP_CUBES_BY_DATE=True for the fix class; see
+    #     _fixes.native6.era5.Rsut for an example).
+    grouped_cubes = _group_cubes(fixes, cubes)
+    for cube_list in grouped_cubes.values():
        for fix in fixes:
            cube_list = fix.fix_metadata(cube_list)


It may be nicer to define the grouping operation on the Fix object, so this code would look like:

    fixed_cubes = CubeList(cubes)
    for fix in fixes:
        fixed_cubes = CubeList(
            cube
            for group in fix.group_input_for_fix_metadata(fixed_cubes)
            for cube in fix.fix_metadata(group)
        )
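A self-contained sketch of this suggestion, using stand-in classes rather than the real esmvalcore `Fix` and iris `CubeList`, might look as follows. The `Cube` class, the `timerange` attribute, the `group_key` class variable, and `DateGroupedFix` are hypothetical; only the method names `group_input_for_fix_metadata` and `fix_metadata` come from the comment above.

```python
from collections import defaultdict


class Cube:
    """Minimal stand-in for an iris cube (hypothetical)."""

    def __init__(self, source_file, timerange):
        # "timerange" is an illustrative attribute, not a real iris one.
        self.attributes = {"source_file": source_file, "timerange": timerange}


class Fix:
    """Stand-in fix that groups its input by file and records group sizes."""

    group_key = "source_file"  # hypothetical knob used only in this sketch

    def __init__(self):
        self.group_sizes = []

    def group_input_for_fix_metadata(self, cubes):
        """Return a list of cube groups, one per distinct key value."""
        groups = defaultdict(list)
        for cube in cubes:
            groups[cube.attributes.get(self.group_key, "")].append(cube)
        return list(groups.values())

    def fix_metadata(self, cubes):
        """Record how many cubes arrived together; return them unchanged."""
        self.group_sizes.append(len(cubes))
        return cubes


class DateGroupedFix(Fix):
    """Hypothetical fix that wants all cubes of one time range at once."""

    group_key = "timerange"


cubes = [
    Cube("tisr_2000.nc", "2000"),
    Cube("tsr_2000.nc", "2000"),
    Cube("tisr_2001.nc", "2001"),
    Cube("tsr_2001.nc", "2001"),
]
per_file, per_date = Fix(), DateGroupedFix()

fixed_cubes = list(cubes)
for fix in (per_file, per_date):
    fixed_cubes = [
        cube
        for group in fix.group_input_for_fix_metadata(fixed_cubes)
        for cube in fix.fix_metadata(group)
    ]
```

One difference from the loop in the diff is worth noting: here each fix processes all of its groups before the next fix runs, whereas the diff applies every fix to one group before moving to the next group.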

bouweandela · 2025-05-28T08:36:01Z

> do you think this approach is a reasonable solution to this problem?

If it works for you, it should be fine

schlunma added 2 commits October 11, 2024 10:44

Allow grouping cubes by data for fix_metadata

0884afe

Added ERA5 fix for rsut

168c9f2

schlunma added the "fix for dataset" label (related to dataset-specific fix files) Oct 11, 2024

schlunma added this to the v2.12.0 milestone Oct 11, 2024

schlunma requested review from axel-lauer and bouweandela October 11, 2024 08:56

schlunma self-assigned this Oct 11, 2024

schlunma mentioned this pull request Oct 11, 2024

Variable derivation for ERA5 on-the-fly CMORizer #1806

Open

Fix test

73450fa

schlunma and others added 2 commits October 11, 2024 14:46

Merge remote-tracking branch 'origin/main' into grouping_for_fix_metadata

859619b

Merge branch 'main' into grouping_for_fix_metadata

06ff0ac

schlunma mentioned this pull request Oct 30, 2024

Load esmvalcore.dataset.Dataset objects in parallel using Dask #2517

Open

9 tasks

schlunma modified the milestones: v2.12.0, v2.13.0 Feb 6, 2025

Merge branch 'main' into grouping_for_fix_metadata

ee82368

bouweandela reviewed May 28, 2025

schlunma added 6 commits May 28, 2025 13:00

Merge commit 'd3eb1d159d443a3ff96cbda0a550d5d983094a15' into grouping_for_fix_metadata

5d94cb4

Merge commit '22a1a8561d419ef3556891dad7dde28cdcd82a59' into grouping_for_fix_metadata

17dbbc3

Safe fixes

8027304

Unsafe fixes

7e41997

Manual fixes

eae214e

Merge remote-tracking branch 'origin/main' into grouping_for_fix_metadata

cabc32c

schlunma removed this from the v2.13.0 milestone Jun 16, 2025
