Chunking of data to follow new CMIP7 directives #137
Conversation
…ecessary build configuration
* Fix and restructure MOM supergrid logic for OM2 and OM3
* Add support for staggered variables on B-grid
* Update conditional logic
* Fix format

Co-authored-by: rhaegar325 <[email protected]>
…write

- Implement DatasetChunker class with rules for optimal chunking:
  * Time coordinates: single chunk (no time dimension chunking)
  * Time bounds: single chunk for all dimensions
  * Data variables: chunked to at least 4 MB blocks
- Add chunking integration to CMIP6_CMORiser workflow
- Optimize write() method with two-phase approach:
  * Phase 1: Create all variables and metadata (B-tree fragments)
  * Phase 2: Write data chunks after metadata sync
- Ensures optimal NetCDF4/HDF5 file layout for read performance
- Configurable via enable_chunking and chunk_size_mb parameters
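As a rough illustration of these rules, here is a minimal sketch of such a chunking helper. The actual DatasetChunker internals are not shown in this thread, so the function name and the `time`/`time_bnds` special-casing are assumptions:

```python
import numpy as np

def chunk_shape(name, shape, dtype, chunk_size_mb=4):
    """Hypothetical sketch of the rules described above: time coordinates
    and bounds get a single chunk; data variables are chunked into blocks
    of at least ``chunk_size_mb`` megabytes."""
    if name in ("time", "time_bnds") or not shape:
        return shape  # one chunk spanning the whole (or scalar) variable
    target = chunk_size_mb * 1024 ** 2
    step_bytes = np.dtype(dtype).itemsize * int(np.prod(shape[1:], dtype=np.int64))
    # Take enough steps along the leading (time) axis to reach the target size.
    steps = min(shape[0], max(1, -(-target // step_bytes)))
    return (steps,) + tuple(shape[1:])
```

For example, `chunk_shape("tas", (1200, 180, 360), "float32")` gives `(17, 180, 360)`, i.e. roughly 4.4 MB per chunk.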
Codecov Report

❌ Patch coverage is
Additional details and impacted files

```
@@            Coverage Diff             @@
##             main     #137      +/-   ##
==========================================
+ Coverage   56.51%   57.44%   +0.92%
==========================================
  Files          18       18
  Lines        2293     2399     +106
==========================================
+ Hits         1296     1378      +82
- Misses        997     1021      +24
```

☔ View full report in Codecov by Sentry.
Hey @rhaegar325, would you mind having a look at this?
src/access_moppy/base.py (Outdated)
```python
# Now all B-tree metadata is written, data chunks come after
for var in self.ds.variables:
    vdat = self.ds[var]
    created_vars[var][:] = vdat.values
```
vdat.values will trigger dask.compute() even though the Dask client has already been closed beforehand. In that case, Dask still attempts to load the entire dataset into memory on a single worker, which results in a MemoryError. A better solution is to write the data in chunks instead of forcing a full in-memory materialization.
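A sketch of the chunk-by-chunk write being suggested here, assuming `self.ds` is a dask-backed xarray Dataset and `created_vars` maps variable names to the already-created NetCDF4 variables (as in the snippet above); `da.store` streams one dask block at a time:

```python
import dask.array as da

for var in self.ds.variables:
    vdat = self.ds[var]
    if isinstance(vdat.data, da.Array):
        # Stream the dask blocks straight into the NetCDF variable
        # instead of materializing the whole array with .values.
        da.store(vdat.data, created_vars[var], lock=True)
    else:
        created_vars[var][:] = vdat.values
```

`lock=True` keeps the writes serialized, which HDF5 requires when writing to a single file.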
I will try to find a way to solve this.
Good point. Thanks for pointing that out!
```python
# Force NetCDF to write all metadata/B-tree information
dst.sync()

# PHASE 2: Write actual data chunks
```
The new Phase 2 uses chunked writing to save the data when it is too large to hold in memory.
- Added a DatasetChunker class, which applies CMIP7 chunking rules: time coordinates and bounds are single-chunked, while data variables are chunked into blocks of at least 4 MB. (src/access_moppy/base.py)
- Added the constructor parameters enable_chunking, chunk_size_mb, enable_compression, and compression_level, allowing users to control chunking and compression behavior. (__init__ in src/access_moppy/base.py) [1] [2]
- Optimized the NetCDF write process: restructured the write method to ensure all variable metadata (B-tree fragments) is written before any data chunks, and applied HDF5 optimizations (shuffle, zlib compression, fletcher32 checksum) to time-dependent data variables for improved read/write performance and data integrity. (src/access_moppy/base.py) [1] [2]
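Pulling these pieces together, here is a condensed sketch of the two-phase write described above, using netCDF4-python; the `chunker.chunk_shape` helper and the exact variable-creation details are illustrative assumptions, not the actual implementation:

```python
import netCDF4

def two_phase_write(path, ds, chunker, compression_level=4):
    """Sketch: define all variables first, sync the metadata, then write data."""
    with netCDF4.Dataset(path, "w", format="NETCDF4") as dst:
        for name, size in ds.sizes.items():
            dst.createDimension(name, size)

        created_vars = {}
        # PHASE 1: create all variables and attributes (B-tree fragments).
        for name in ds.variables:
            v = ds[name]
            filters = {}
            if "time" in v.dims:
                # HDF5 optimizations for time-dependent data variables.
                filters = dict(zlib=True, complevel=compression_level,
                               shuffle=True, fletcher32=True)
            created_vars[name] = dst.createVariable(
                name, v.dtype, v.dims,
                chunksizes=chunker.chunk_shape(name, v.shape, v.dtype) or None,
                **filters,
            )
            created_vars[name].setncatts(dict(v.attrs))

        # Force NetCDF to write all metadata/B-tree information up front.
        dst.sync()

        # PHASE 2: write the actual data chunks after the metadata.
        for name in ds.variables:
            created_vars[name][:] = ds[name].values
```

Note that the final loop still uses .values; per the review discussion above, it would need to be combined with the chunked da.store approach for large dask-backed datasets.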