
Conversation

@pp-mo
Member

@pp-mo pp-mo commented Oct 7, 2025

Closes #6727

@pp-mo pp-mo changed the title from "Dataless netcdf load+save; plus tests." to "Dataless netcdf load+save." Oct 7, 2025
@codecov

codecov bot commented Oct 8, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 90.25%. Comparing base (37f4547) to head (7f6a032).
⚠️ Report is 1 commit behind head on main.

Additional details and impacted files
@@           Coverage Diff           @@
##             main    #6739   +/-   ##
=======================================
  Coverage   90.24%   90.25%           
=======================================
  Files          91       91           
  Lines       24613    24630   +17     
  Branches     4604     4609    +5     
=======================================
+ Hits        22212    22229   +17     
  Misses       1624     1624           
  Partials      777      777           


@pp-mo pp-mo marked this pull request as ready for review October 9, 2025 09:05
Contributor

@ukmo-ccbunney ukmo-ccbunney left a comment


Looks sensible to me. Just a couple of observations:

  • Is it documented somewhere that the underlying netCDF library will not write any data for a netCDF variable if all the data is masked? Or are we relying on undocumented/assumed behaviour?
  • Should we make some warning on an empty cube load? It is possible (but not sensible) that someone could actually set data in the "dataless" variable (outside Iris), and this would be lost on load if they didn't remove the iris_dataless_cube attribute.

@pp-mo
Member Author

pp-mo commented Oct 9, 2025

* Is it documented somewhere that the underlying netCDF library will not write any data for a netCDF variable if all the data is masked? Or are we relying on undocumented/assumed behaviour?

Hmm. I thought so, but when I checked it is a bit sketchy.

In detail, the docs here do say that a variable extends along an 'unlimited' dimension when written :
"""netCDF Variable objects with unlimited dimensions will grow along those dimensions if you assign data outside the currently defined range of indices."""

In practice, that means a chunk is created when written to
(chunking being a standard concept for netCDF variables -- nothing to do with Dask).

My assumption is that a fixed-size variable (i.e. one with no unlimited dimension) is a single chunk, and that it still doesn't get created until it is (at least partially) written to.
But I guess the docs don't precisely guarantee that.

However, I have now added a test which demonstrates that a large variable does not take up space.

@pp-mo
Member Author

pp-mo commented Oct 9, 2025

Should we make some warning on an empty cube load? It is possible (but not sensible) that someone could actually set data in the "dataless" variable (outside Iris) and this would be lost on load if they didn't remove the iris_dataless_cube attribute.

TBH I don't have much appetite for this. I'm doubtful that creating a file with Iris and then post-modifying it would be a common thing.
I did now at least document the mechanism.

@pp-mo
Member Author

pp-mo commented Oct 9, 2025

NOTE: this is using "load_raw" to check loading back, only because merging dataless cubes is not yet supported.
But see #6741, which will fix that.

@ukmo-ccbunney
Contributor

ukmo-ccbunney commented Oct 10, 2025

In detail, the docs here do say that a variable extends along an 'unlimited' dimension when written : """netCDF Variable objects with unlimited dimensions will grow along those dimensions if you assign data outside the currently defined range of indices."""

In practice, that means a chunk is created when written to

OK. That certainly seems to be the case when I look at the file size of a netCDF file with a "dataless" variable.

I was thinking that the other way we could do this is by writing to a scalar variable of the correct dtype and storing the dimension data as a special "iris" attribute (that would be handled accordingly by the loader). Something like this:

netcdf empty_test {
dimensions:
	latitude = 1000 ;
	longitude = 1000 ;
variables:
	float empty ;
		empty:iris_dataless_cube = "true" ;
		empty:iris_original_dims = "latitude,longitude" ;
}
That way there would never be any danger of the variable being more than a handful of bytes in the file.

Just throwing it out there...
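For illustration only, the loader-side handling of such an attribute could be as simple as the following sketch (the attribute name iris_original_dims and the helper function are this thread's invention, not an existing Iris mechanism):

```python
# Hypothetical sketch: recover a dataless variable's shape from the
# "iris_original_dims" attribute proposed above. Names are invented
# in this discussion, not part of any existing Iris API.
def dataless_shape(dims_attr, file_dimensions):
    """Map a comma-separated dimension-name string to a shape tuple."""
    names = [name.strip() for name in dims_attr.split(",")]
    return tuple(file_dimensions[name] for name in names)

# The dimensions as they would appear in the CDL example above.
file_dimensions = {"latitude": 1000, "longitude": 1000}
print(dataless_shape("latitude,longitude", file_dimensions))  # (1000, 1000)
```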

@pp-mo pp-mo mentioned this pull request Oct 10, 2025
@pp-mo
Member Author

pp-mo commented Oct 10, 2025

the other way we could do this is by writing to a scalar variable

@trexfeathers + @stephenworsley + I discussed this option a bit at the sprint standup today.

I am concerned that this alternative is rather nonstandard CF, and wouldn't be understood by other CF compliant software (e.g. xarray, cf-python, or whatever ?)

From a strict CF point of view, I think there could be a problem in that this type of encoding might risk not correctly identifying any attached aux-coords, cell-measures or ancillaries : either it would not be perceived as a data-variable, or the references to those other variables would not be valid because the main variable doesn't map the correct dimensions.

At the very least, the variable wouldn't be treated as a "normal" data variable by other software, whereas with the current scheme it should be.

I think it's also a fair bet that the changes to support the alternative encoding will be more complicated than what I've done here already!

@pp-mo
Member Author

pp-mo commented Oct 10, 2025

Also, @stephenworsley has suggested that perhaps we should allow saving dataless cubes only when the user activates a special control to assert that this was intended. (e.g. with iris.allow_save_dataless(): ...).

I think that would make more sense in the case that we adopted your alternate "scalar variable encoding" idea,
since in that case the results are less intelligible to other software.

Do you think this has merit ?

@ukmo-ccbunney
Contributor

@trexfeathers + @stephenworsley + I discussed this option a bit at the sprint standup today.

I am concerned that this alternative is rather nonstandard CF, and wouldn't be understood by other CF compliant software (e.g. xarray, cf-python, or whatever ?)

Yes - valid points. I guess it would be a very Iris-specific approach and would risk problems with other software.

I think my concern was more that the dataless cube gets saved, then is modified outside Iris to add some data, then loaded back into Iris, but the data is lost because the iris_dataless_cube attribute is still set.

However, as already discussed in the comments above, this is maybe a bit of a contrived scenario, so I am happy to go with the original approach.

Also, @stephenworsley has suggested that perhaps we should allow saving dataless cubes only when the user activates a special control to assert that this was intended. (e.g. with iris.allow_save_dataless(): ...).
I think that would make more sense in the case that we adopted your alternate "scalar variable encoding" idea, since in that case the results are less intelligible to other software.
Do you think this has merit ?

Potentially. Although surely all that would actually do is control whether the iris_dataless_cube attribute is set on the variable; there would still be no actual data written to the variable (assuming you passed a fully masked array).

@pp-mo
Member Author

pp-mo commented Oct 10, 2025

... perhaps we should allow saving dataless cubes only when the user activates a special control
Potentially. Although surely all that would actually do is control whether the iris_dataless_cube attribute is set on the variable; there would still be no actual data written to the variable (assuming you passed a fully masked array).

I think the idea is that, without the control, you would just get a "can't save dataless cubes" error.

@ukmo-ccbunney
Contributor

* Is it documented somewhere that the underlying netCDF library will not write any data for a netCDF variable if all the data is masked? Or are we relying on undocumented/assumed behaviour?

Hmm. I thought so, but when I checked it is a bit sketchy.

In detail, the docs here do say that a variable extends along an 'unlimited' dimension when written : """netCDF Variable objects with unlimited dimensions will grow along those dimensions if you assign data outside the currently defined range of indices."""

In practice, that means a chunk is created when written to (chunking being a standard concept for netCDF variables -- nothing to do with Dask).

My assumption is that a fixed-size variable (i.e. one with no unlimited dimension) is a single chunk, and that it still doesn't get created until it is (at least partially) written to. But I guess the docs don't precisely guarantee that.

However, I have now added a test which demonstrates that a large variable does not take up space.

... perhaps we should allow saving dataless cubes only when the user activates a special control
Potentially. Although surely all that would actually do is control whether the iris_dataless_cube attribute is set on the variable; there would still be no actual data written to the variable (assuming you passed a fully masked array).

I think the idea is that, without the control, you would just get a "can't save dataless cubes" error.

Ah OK - I understand. In that case I guess it makes sense.

@pp-mo
Member Author

pp-mo commented Oct 21, 2025

Hi @ukmo-ccbunney , I know I said to hold on this 'til I was happy I had thought a bit more about the possible downsides...

My remaining concern was that if the cunning netcdf 'feature' I'm relying on here was to stop working (i.e. they start taking up space in the file) ...

  • then we would probably want to switch to the alternative 'scalar variable' approach outlined above,
  • and then, if anyone is relying on the existing form it could cause problems.
  • ... and I guess we'd have to enable the change with an iris.FUTURE flag, too, so a fair bit of complexity to it.

However, although the alternative scheme is non-standard and switching would require careful documentation, that's also true if we were to adopt it now.
So on balance, I still think the proposal here of using an (apparently) "normal" CF data variable wins out, for compatibility purposes, ease of implementation and simplicity of explanation.

So I propose we adopt this approach + hope it doesn't get broken!

@ukmo-ccbunney
Contributor

ukmo-ccbunney commented Oct 22, 2025

Hi @ukmo-ccbunney , I know I said to hold on this 'til I was happy I had thought a bit more about the possible downsides...

My remaining concern was that if the cunning netcdf 'feature' I'm relying on here was to stop working (i.e. they start taking up space in the file) ...

  • then we would probably want to switch to the alternative 'scalar variable' approach outlined above,
  • and then, if anyone is relying on the existing form it could cause problems.
  • ... and I guess we'd have to enable the change with an iris.FUTURE flag, too, so a fair bit of complexity to it.

However, although the alternative scheme is non-standard and switching would require careful documentation, that's also true if we were to adopt it now. So on balance, I still think the proposal here of using an (apparently) "normal" CF data variable wins out, for compatibility purposes, ease of implementation and simplicity of explanation.

So I propose we adopt this approach + hope it doesn't get broken!

Agreed. The solution as it stands is neat and efficient.

If the future behaviour of netCDF changed and it did write out a full data array of missing values to the file, the implementation as it stands would still work fine in Iris (albeit with the indirect risk of causing the write to fail due to unexpectedly filling up a disk).

Contributor

@ukmo-ccbunney ukmo-ccbunney left a comment


LGTM! 🚀

@ukmo-ccbunney ukmo-ccbunney merged commit 24e258f into SciTools:main Oct 22, 2025
22 checks passed
pp-mo added a commit to pp-mo/iris that referenced this pull request Oct 22, 2025
stephenworsley pushed a commit that referenced this pull request Oct 30, 2025
* Initial WIP for dataless merges -- cannot yet merge datafull+dataless.

* Starting tests.

* Functioning backstop: merge can pass-through dataless, but not actually merge them.

* Dataless merge, combine dataless with/without dataful.

* Tidy awkward layout in test.

* Ensure that cube.shape can only be a tuple (or None).

* Make test_merge check against dataless input in all its tests.

* Improve tests, and test for lazy merge result.

* Fix typo.

* Expand documentation.

* Fix broken ref + tweak whatsnew.

* Fixes following implementation of dataless save-and-load (#6739).

* Remove redundant checks.

* Make make_gridcube() dataless, and improve documentation cross-refs.

* Review changes: small fixes to docs.

* Use the intended dtype for data of all-masked arrays.