
Conversation

@pp-mo
Member

@pp-mo pp-mo commented Oct 7, 2025

Closes #6727

@pp-mo pp-mo changed the title from "Dataless netcdf load+save; plus tests." to "Dataless netcdf load+save." Oct 7, 2025
@codecov

codecov bot commented Oct 8, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 90.25%. Comparing base (37f4547) to head (7f6a032).
⚠️ Report is 1 commit behind head on main.

Additional details and impacted files
@@           Coverage Diff           @@
##             main    #6739   +/-   ##
=======================================
  Coverage   90.24%   90.25%           
=======================================
  Files          91       91           
  Lines       24613    24630   +17     
  Branches     4604     4609    +5     
=======================================
+ Hits        22212    22229   +17     
  Misses       1624     1624           
  Partials      777      777           


@pp-mo pp-mo marked this pull request as ready for review October 9, 2025 09:05
Contributor

@ukmo-ccbunney ukmo-ccbunney left a comment


Looks sensible to me. Just a couple of observations:

  • Is it documented somewhere that the underlying netCDF library will not write any data for a netCDF variable if all the data is masked? Or are we relying on undocumented/assumed behaviour?
  • Should we make some warning on an empty cube load? It is possible (but not sensible) that someone could actually set data in the "dataless" variable (outside Iris), and this would be lost on load if they didn't remove the iris_dataless_cube attribute.

@pp-mo
Member Author

pp-mo commented Oct 9, 2025

* Is it documented somewhere that the underlying netCDF library will not write any data for a netCDF variable if all the data is masked? Or are we relying on undocumented/assumed behaviour?

Hmm. I thought so, but when I checked it is a bit sketchy.

In detail, the docs here do say that a variable extends along an 'unlimited' dimension when written :
"""netCDF Variable objects with unlimited dimensions will grow along those dimensions if you assign data outside the currently defined range of indices."""

In practice, that means a chunk is created when written to
(chunking being a standard concept for netCDF variables -- nothing to do with Dask).

My assumption is that a fixed-size variable (i.e. one with no unlimited dimension) is a single chunk, and that it still doesn't get created until it is (at least partially) written to.
But I guess the docs don't precisely guarantee that.

However, I have now added a test which demonstrates that a large variable does not take up space.

@pp-mo
Member Author

pp-mo commented Oct 9, 2025

Should we make some warning on an empty cube load? It is possible (but not sensible) that someone could actually set data in the "dataless" variable (outside Iris) and this would be lost on load if they didn't remove the iris_dataless_cube attribute.

TBH I don't have much appetite for this. I'm doubtful that creating a file with Iris and then post-modifying it would be a common thing.
I did now at least document the mechanism.

@pp-mo
Member Author

pp-mo commented Oct 9, 2025

NOTE: this is using "load_raw" to check loading back, only because merging dataless cubes is not yet supported.
But see #6741, which will fix that.

@ukmo-ccbunney
Contributor

ukmo-ccbunney commented Oct 10, 2025

In detail, the docs here do say that a variable extends along an 'unlimited' dimension when written : """netCDF Variable objects with unlimited dimensions will grow along those dimensions if you assign data outside the currently defined range of indices."""

In practice, that means a chunk is created when written to

OK. That certainly seems to be the case when I look at the file size of a netCDF file with a "dataless" variable.

I was thinking that the other way we could do this is by writing to a scalar variable of the correct dtype and storing the dimension data as a special "iris" attribute (that would be handled accordingly by the loader). Something like this:

netcdf empty_test {
dimensions:
	latitude = 1000 ;
	longitude = 1000 ;
variables:
	float empty ;
		empty:iris_dataless_cube = "true" ;
		empty:iris_original_dims = "latitude,longitude" ;
}
That way there would never be any danger of the variable being more than a handful of bytes in the file.

Just throwing it out there...
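For illustration only, the loader-side handling of such an attribute could be as simple as the following sketch (the attribute name iris_original_dims and the helper function are this thread's invention, not an existing Iris mechanism):

```python
# Hypothetical sketch: recover a dataless variable's shape from the
# "iris_original_dims" attribute proposed above. Names are invented
# in this discussion, not part of any existing Iris API.
def dataless_shape(dims_attr, file_dimensions):
    """Map a comma-separated dimension-name string to a shape tuple."""
    names = [name.strip() for name in dims_attr.split(",")]
    return tuple(file_dimensions[name] for name in names)

# The dimensions as they would appear in the CDL example above.
file_dimensions = {"latitude": 1000, "longitude": 1000}
print(dataless_shape("latitude,longitude", file_dimensions))  # (1000, 1000)
```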

@pp-mo pp-mo mentioned this pull request Oct 10, 2025
@pp-mo
Member Author

pp-mo commented Oct 10, 2025

the other way we could do this is by writing to a scalar variable

@trexfeathers + @stephenworsley + I discussed this option a bit at the sprint standup today.

I am concerned that this alternative is rather nonstandard CF, and wouldn't be understood by other CF compliant software (e.g. xarray, cf-python, or whatever ?)

From a strict CF point of view, I think there could be a problem in that this type of encoding might risk not correctly identifying any attached aux-coords, cell-measures or ancillaries : either it would not be perceived as a data-variable, or the references to those other variables would not be valid because the main variable doesn't map the correct dimensions.

At the very least, the variable wouldn't be treated as a "normal" data variable by other software, whereas with the current scheme it should be.

I think it's also a fair bet that the changes to support the alternative encoding will be more complicated than what I've done here already!

@pp-mo
Member Author

pp-mo commented Oct 10, 2025

Also, @stephenworsley has suggested that perhaps we should allow saving dataless cubes only when the user activates a special control to assert that this was intended. (e.g. with iris.allow_save_dataless(): ...).

I think that would make more sense in the case that we adopted your alternate "scalar variable encoding" idea,
since in that case the results are less intelligible to other software.

Do you think this has merit ?

@ukmo-ccbunney
Contributor

@trexfeathers + @stephenworsley + I discussed this option a bit at the sprint standup today.

I am concerned that this alternative is rather nonstandard CF, and wouldn't be understood by other CF compliant software (e.g. xarray, cf-python, or whatever ?)

Yes - valid points. I guess it would be a very Iris-specific approach and would risk problems with other software.

I think my concern was more that the dataless cube gets saved, then is modified outside Iris to add some data, then loaded back into Iris, but the data is lost because the iris_dataless_cube attribute is still set.

However, as already discussed in the comments above, this is maybe a bit of a contrived scenario, so I am happy to go with the original approach.

Also, @stephenworsley has suggested that perhaps we should allow saving dataless cubes only when the user activates a special control to assert that this was intended. (e.g. with iris.allow_save_dataless(): ...).
I think that would make more sense in the case that we adopted your alternate "scalar variable encoding" idea, since in that case the results are less intelligible to other software.
Do you think this has merit ?

Potentially. Although surely all that would actually do is control whether the iris_dataless_cube attribute is set on the variable; there would still be no actual data written to the variable (assuming you passed a fully masked array).

@pp-mo
Member Author

pp-mo commented Oct 10, 2025

... perhaps we should allow saving dataless cubes only when the user activates a special control
Potentially. Although surely all that would actually do is control whether the iris_dataless_cube attribute is set on the variable; there would still be no actual data written to the variable (assuming you passed a fully masked array).

I think the idea is that, without the control, you would just get a "can't save dataless cubes" error.

@ukmo-ccbunney
Contributor

* Is it documented somewhere that the underlying netCDF library will not write any data for a netCDF variable if all the data is masked? Or are we relying on undocumented/assumed behaviour?

Hmm. I thought so, but when I checked it is a bit sketchy.

In detail, the docs here do say that a variable extends along an 'unlimited' dimension when written : """netCDF Variable objects with unlimited dimensions will grow along those dimensions if you assign data outside the currently defined range of indices."""

In practice, that means a chunk is created when written to (chunking being a standard concept for netCDF variables -- nothing to do with Dask).

My assumption is that a fixed-size variable (i.e. one with no unlimited dimension) is a single chunk, and that it still doesn't get created until it is (at least partially) written to. But I guess the docs don't precisely guarantee that.

However, I have now added a test which demonstrates that a large variable does not take up space.

... perhaps we should allow saving dataless cubes only when the user activates a special control
Potentially. Although surely all that would actually do is control whether the iris_dataless_cube attribute is set on the variable; there would still be no actual data written to the variable (assuming you passed a fully masked array).

I think the idea is that, without the control, you would just get a "can't save dataless cubes" error.

Ah OK - I understand. In that case I guess it makes sense.

@pp-mo
Member Author

pp-mo commented Oct 21, 2025

Hi @ukmo-ccbunney , I know I said to hold on this 'til I was happy I had thought a bit more about the possible downsides...

My remaining concern was that if the cunning netcdf 'feature' I'm relying on here was to stop working (i.e. they start taking up space in the file) ...

  • then we would probably want to switch to the alternative 'scalar variable' approach outlined above,
  • and then, if anyone is relying on the existing form it could cause problems.
  • ... and I guess we'd have to enable the change with an iris.FUTURE flag, too, so a fair bit of complexity to it.

However, although the alternative scheme is non-standard and switching would require careful documentation, that's also true if we were to adopt it now.
So on balance, I still think the proposal here of using an (apparently) "normal" CF data variable wins out, for compatibility purposes, ease of implementation and simplicity of explanation.

So I propose we adopt this approach + hope it doesn't get broken!

@ukmo-ccbunney
Contributor

ukmo-ccbunney commented Oct 22, 2025

Hi @ukmo-ccbunney , I know I said to hold on this 'til I was happy I had thought a bit more about the possible downsides...

My remaining concern was that if the cunning netcdf 'feature' I'm relying on here was to stop working (i.e. they start taking up space in the file) ...

  • then we would probably want to switch to the alternative 'scalar variable' approach outlined above,
  • and then, if anyone is relying on the existing form it could cause problems.
  • ... and I guess we'd have to enable the change with an iris.FUTURE flag, too, so a fair bit of complexity to it.

However, although the alternative scheme is non-standard and switching would require careful documentation, that's also true if we were to adopt it now. So on balance, I still think the proposal here of using an (apparently) "normal" CF data variable wins out, for compatibility purposes, ease of implementation and simplicity of explanation.

So I propose we adopt this approach + hope it doesn't get broken!

Agreed. The solution as it stands is neat and efficient.

If the future behaviour of netCDF changed and it did write out a full data array of missing values to the file, the implementation as it stands would still work fine in Iris (albeit with the indirect risk of causing the write to fail due to unexpectedly filling up a disk).

Contributor

@ukmo-ccbunney ukmo-ccbunney left a comment


LGTM! 🚀

@ukmo-ccbunney ukmo-ccbunney merged commit 24e258f into SciTools:main Oct 22, 2025
22 checks passed
pp-mo added a commit to pp-mo/iris that referenced this pull request Oct 22, 2025
stephenworsley pushed a commit that referenced this pull request Oct 30, 2025
* Initial WIP for dataless merges -- cannot yet merge datafull+dataless.

* Starting tests.

* Functioning backstop: merge can pass-through dataless, but not actually merge them.

* Dataless merge, combine dataless with/without dataful.

* Tidy awkward layout in test.

* Ensure that cube.shape can only be a tuple (or None).

* Make test_merge check against dataless input in all its tests.

* Improve tests, and test for lazy merge result.

* Fix typo.

* Expand documentation.

* Fix broken ref + tweak whatsnew.

* Fixes following implementation of dataless save-and-load (#6739).

* Remove redundant checks.

* Make make_gridcube() dataless, and improve documentation cross-refs.

* Review changes: small fixes to docs.

* Use the intended dtype for data of all-masked arrays.