Chunking of data to follow new CMIP7 directives #137
Conversation
…ecessary build configuration
* Fix and restructure MOM supergrid logic for OM2 and OM3
* Add support for staggered variables on B-grid
* Update conditional logic
* Fix format

Co-authored-by: rhaegar325 <[email protected]>
…write

- Implement DatasetChunker class with rules for optimal chunking:
  * Time coordinates: single chunk (no time dimension chunking)
  * Time bounds: single chunk for all dimensions
  * Data variables: chunked to at least 4 MB blocks
- Add chunking integration to CMIP6_CMORiser workflow
- Optimize write() method with two-phase approach:
  * Phase 1: Create all variables and metadata (B-tree fragments)
  * Phase 2: Write data chunks after metadata sync
- Ensures optimal NetCDF4/HDF5 file layout for read performance
- Configurable via enable_chunking and chunk_size_mb parameters
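As a rough illustration of these rules, here is a minimal sketch of such a chunking helper. The actual DatasetChunker internals are not shown in this thread, so the function name and the `time`/`time_bnds` special-casing are assumptions:

```python
import numpy as np

def chunk_shape(name, shape, dtype, chunk_size_mb=4):
    """Hypothetical sketch of the rules described above: time coordinates
    and bounds get a single chunk; data variables are chunked into blocks
    of at least ``chunk_size_mb`` megabytes."""
    if name in ("time", "time_bnds") or not shape:
        return shape  # one chunk spanning the whole (or scalar) variable
    target = chunk_size_mb * 1024 ** 2
    step_bytes = np.dtype(dtype).itemsize * int(np.prod(shape[1:], dtype=np.int64))
    # Take enough steps along the leading (time) axis to reach the target size.
    steps = min(shape[0], max(1, -(-target // step_bytes)))
    return (steps,) + tuple(shape[1:])
```

For example, `chunk_shape("tas", (1200, 180, 360), "float32")` gives `(17, 180, 360)`, i.e. roughly 4.4 MB per chunk.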
Codecov Report

❌ Patch coverage is
Additional details and impacted files

```
@@            Coverage Diff             @@
##             main     #137      +/-   ##
==========================================
+ Coverage   56.51%   57.44%   +0.92%
==========================================
  Files          18       18
  Lines        2293     2399     +106
==========================================
+ Hits         1296     1378      +82
- Misses        997     1021      +24
```

☔ View full report in Codecov by Sentry.
Hey @rhaegar325, would you mind having a look at this?
src/access_moppy/base.py (Outdated)
```python
# Now all B-tree metadata is written, data chunks come after
for var in self.ds.variables:
    vdat = self.ds[var]
    created_vars[var][:] = vdat.values
```
vdat.values will trigger dask.compute() even though the Dask client has already been closed beforehand. In that case, Dask still attempts to load the entire dataset into memory on a single worker, which results in a MemoryError. A better solution is to write the data in chunks instead of forcing a full in-memory materialization.
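A sketch of the chunk-by-chunk write being suggested here, assuming `self.ds` is a dask-backed xarray Dataset and `created_vars` maps variable names to the already-created NetCDF4 variables (as in the snippet above); `da.store` streams one dask block at a time:

```python
import dask.array as da

for var in self.ds.variables:
    vdat = self.ds[var]
    if isinstance(vdat.data, da.Array):
        # Stream the dask blocks straight into the NetCDF variable
        # instead of materializing the whole array with .values.
        da.store(vdat.data, created_vars[var], lock=True)
    else:
        created_vars[var][:] = vdat.values
```

`lock=True` keeps the writes serialized, which HDF5 requires when writing to a single file.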
I will try to find a way to solve this.
Good point. Thanks for pointing that out!
```python
# Force NetCDF to write all metadata/B-tree information
dst.sync()

# PHASE 2: Write actual data chunks
```
The new Phase 2 uses chunked writing to save the data when it is too large to hold in memory.
- Added a DatasetChunker class, which applies CMIP7 chunking rules: time coordinates and bounds are single-chunked, while data variables are chunked into blocks of at least 4 MB. (src/access_moppy/base.py)
- Added the constructor parameters enable_chunking, chunk_size_mb, enable_compression, and compression_level, allowing users to control chunking and compression behavior. (__init__ in src/access_moppy/base.py) [1] [2]
- Optimized the NetCDF write process: restructured the write method to ensure all variable metadata (B-tree fragments) is written before any data chunks, and applied HDF5 optimizations (shuffle, zlib compression, fletcher32 checksum) to time-dependent data variables for improved read/write performance and data integrity. (src/access_moppy/base.py) [1] [2]
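Pulling these pieces together, here is a condensed sketch of the two-phase write described above, using netCDF4-python; the `chunker.chunk_shape` helper and the exact variable-creation details are illustrative assumptions, not the actual implementation:

```python
import netCDF4

def two_phase_write(path, ds, chunker, compression_level=4):
    """Sketch: define all variables first, sync the metadata, then write data."""
    with netCDF4.Dataset(path, "w", format="NETCDF4") as dst:
        for name, size in ds.sizes.items():
            dst.createDimension(name, size)

        created_vars = {}
        # PHASE 1: create all variables and attributes (B-tree fragments).
        for name in ds.variables:
            v = ds[name]
            filters = {}
            if "time" in v.dims:
                # HDF5 optimizations for time-dependent data variables.
                filters = dict(zlib=True, complevel=compression_level,
                               shuffle=True, fletcher32=True)
            created_vars[name] = dst.createVariable(
                name, v.dtype, v.dims,
                chunksizes=chunker.chunk_shape(name, v.shape, v.dtype) or None,
                **filters,
            )
            created_vars[name].setncatts(dict(v.attrs))

        # Force NetCDF to write all metadata/B-tree information up front.
        dst.sync()

        # PHASE 2: write the actual data chunks after the metadata.
        for name in ds.variables:
            created_vars[name][:] = ds[name].values
```

Note that the final loop still uses .values; per the review discussion above, it would need to be combined with the chunked da.store approach for large dask-backed datasets.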