Conversation


@rbeucher rbeucher commented Dec 8, 2025

  • Added the DatasetChunker class, which applies CMIP7 chunking rules: time coordinates and bounds are single-chunked, while data variables are chunked into blocks of at least 4 MB (see the sketch after this list). (src/access_moppy/base.py)
  • Introduced new configuration options to the CMORiser base class: enable_chunking, chunk_size_mb, enable_compression, and compression_level, allowing users to control chunking and compression behavior. (__init__ in src/access_moppy/base.py) [1] [2]
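
A rough sketch of the 4 MB rule; the function name, the target_mb default, and the grow-the-time-dimension-first strategy are illustrative assumptions, not the PR's actual implementation:

import numpy as np

def chunk_shape(shape, dims, dtype, target_mb=4):
    """Pick a chunk shape that grows the time dimension until ~target_mb."""
    if "time" not in dims:
        return tuple(shape)  # coordinates and bounds: one single chunk
    itemsize = np.dtype(dtype).itemsize
    spatial = int(np.prod([s for d, s in zip(dims, shape) if d != "time"]))
    # Ceiling division: smallest number of time steps whose chunk
    # reaches the target size, capped at the full time length.
    target_bytes = target_mb * 1024 ** 2
    tlen = shape[dims.index("time")]
    tsteps = min(tlen, max(1, -(-target_bytes // (spatial * itemsize))))
    return tuple(tsteps if d == "time" else s for d, s in zip(dims, shape))

# float32 on a 180x360 grid is ~0.25 MB per time step,
# so chunks of 17 steps just exceed 4 MB:
print(chunk_shape((1200, 180, 360), ("time", "lat", "lon"), "float32"))
# -> (17, 180, 360)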

Optimized NetCDF write process:

  • Refactored the write method to ensure all variable metadata (B-tree fragments) is written before any data chunks, and applied HDF5 optimizations (shuffle, zlib compression, fletcher32 checksum) to time-dependent data variables for improved read/write performance and data integrity. (src/access_moppy/base.py) [1] [2]
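
A minimal sketch of that two-phase layout, assuming a netCDF4 target and xarray-style variables; dst, ds, and chunk_shapes are illustrative names rather than the PR's API:

import netCDF4

with netCDF4.Dataset("out.nc", "w", format="NETCDF4") as dst:
    for dim, size in ds.sizes.items():
        dst.createDimension(dim, size)

    created = {}
    for name, var in ds.variables.items():
        kwargs = {}
        if "time" in var.dims and name not in ("time", "time_bnds"):
            # HDF5 optimizations for time-dependent data variables
            kwargs = dict(zlib=True, complevel=4, shuffle=True,
                          fletcher32=True, chunksizes=chunk_shapes[name])
        created[name] = dst.createVariable(name, var.dtype, var.dims, **kwargs)
        # _FillValue can only be set at creation time, so skip it here
        created[name].setncatts(
            {k: v for k, v in var.attrs.items() if k != "_FillValue"}
        )

    # PHASE 1 ends here: flush all metadata (B-tree fragments) to disk
    dst.sync()

    # PHASE 2: data chunks land after the metadata in the file
    for name, var in ds.variables.items():
        created[name][:] = var.values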

rhaegar325 and others added 24 commits November 10, 2025 13:05
* Fix and restructure MOM supergrid logic for OM2 and OM3

* Add support for staggered variables on B-grid

* update conditional logic

* fix format

---------

Co-authored-by: rhaegar325 <[email protected]>
…write

- Implement DatasetChunker class with rules for optimal chunking:
  * Time coordinates: single chunk (no time dimension chunking)
  * Time bounds: single chunk for all dimensions
  * Data variables: chunked to at least 4MB blocks
- Add chunking integration to CMIP6_CMORiser workflow
- Optimize write() method with two-phase approach:
  * Phase 1: Create all variables and metadata (B-tree fragments)
  * Phase 2: Write data chunks after metadata sync
- Ensures optimal NetCDF4/HDF5 file layout for read performance
- Configurable via enable_chunking and chunk_size_mb parameters
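
Hypothetical wiring of the new options; only the four option names come from this PR, the values and comments are illustrative:

options = dict(
    enable_chunking=True,     # apply the DatasetChunker rules on write
    chunk_size_mb=4,          # minimum block size for data variables
    enable_compression=True,  # zlib deflate on time-dependent variables
    compression_level=4,      # HDF5 deflate level (1-9)
)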

codecov bot commented Dec 8, 2025

Codecov Report

❌ Patch coverage is 69.02655% with 35 lines in your changes missing coverage. Please review.
✅ Project coverage is 57.44%. Comparing base (54ec175) to head (cc4dc23).
⚠️ Report is 3 commits behind head on main.

Files with missing lines Patch % Lines
src/access_moppy/base.py 69.02% 35 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #137      +/-   ##
==========================================
+ Coverage   56.51%   57.44%   +0.92%     
==========================================
  Files          18       18              
  Lines        2293     2399     +106     
==========================================
+ Hits         1296     1378      +82     
- Misses        997     1021      +24     

☔ View full report in Codecov by Sentry.

@rbeucher rbeucher requested a review from rhaegar325 December 8, 2025 23:48

rbeucher commented Dec 8, 2025

Hey @rhaegar325 , would you mind having a look at this?

# Now all B-tree metadata is written, data chunks come after
for var in self.ds.variables:
    vdat = self.ds[var]
    created_vars[var][:] = vdat.values
A collaborator commented:

vdat.values will trigger dask.compute() even though the Dask client has already been closed beforehand. In that case, Dask still attempts to load the entire dataset into memory on a single worker, which results in a MemoryError. A better solution is to write the data in chunks instead of forcing a full in-memory materialization.
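
One way to do that, sketched on the assumption that the variable is dask-backed (slices_from_chunks yields one index tuple per stored block):

import dask.array as da
from dask.array.core import slices_from_chunks

data = self.ds[var].data
if isinstance(data, da.Array):
    # Compute and write one block at a time instead of calling .values
    for block in slices_from_chunks(data.chunks):
        created_vars[var][block] = data[block].compute()
else:
    created_vars[var][:] = data  # already in memory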

A collaborator replied:

I will try to find a way to solve this.

@rbeucher
Copy link
Member Author

rbeucher commented Dec 9, 2025

Good point. Thanks for pointing that out!

# Force NetCDF to write all metadata/B-tree information
dst.sync()

# PHASE 2: Write actual data chunks
A collaborator commented:

The new Phase 2 now uses chunked writes to save the data when it is too large to fit in memory.
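
An alternative sketch for that Phase 2, assuming dask-backed variables: da.store computes and writes one block at a time, so the array is never fully materialized:

import dask.array as da

data = self.ds[var].data
if isinstance(data, da.Array):
    # netCDF4 variables support slice assignment, so they can be
    # used directly as da.store targets; the lock serializes writes.
    da.store(data, created_vars[var], lock=True)
else:
    created_vars[var][:] = data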

@rhaegar325 rhaegar325 self-requested a review December 9, 2025 12:00
@rbeucher rbeucher merged commit a4f53c8 into main Dec 10, 2025
4 checks passed