Dataless netcdf load+save. #6739
Conversation
Codecov Report ✅ All modified and coverable lines are covered by tests.
Coverage Diff (main vs. #6739):
- Coverage: 90.24% → 90.25%
- Files: 91 → 91
- Lines: 24613 → 24630 (+17)
- Branches: 4604 → 4609 (+5)
- Hits: 22212 → 22229 (+17)
- Misses: 1624 → 1624
- Partials: 777 → 777
Looks sensible to me. Just a couple of observations:
- Is it documented somewhere that the underlying netCDF library will not write any data for a netCDF variable if all the data is masked? Or are we relying on undocumented/assumed behaviour?
- Should we raise some warning on an empty cube load? It is possible (but not sensible) that someone could actually set data in the "dataless" variable (outside Iris), and this would be lost on load if they didn't remove the `iris_dataless_cube` attribute.
Hmm. I thought so, but when I checked it is a bit sketchy. In detail, the docs here do say that a variable extends along an 'unlimited' dimension when written to: in practice, that means a chunk is created when it is written to. My assumption is that a fixed-size variable (i.e. one with no unlimited dimension) is a single chunk, and it still doesn't get created until it is (at least partially) written to. However, I have now added a test which demonstrates that a large variable does not take up space.
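For reference, here is a minimal standalone sketch, not the test added in this PR, of that assumption: using the netCDF4 library directly, create a large fixed-size variable, never write to it, and check that the file stays small. The file name and the sizes in the comments are illustrative.

```python
# Minimal check of the behaviour discussed above: a large fixed-size variable
# that is never written to should not noticeably increase the file size.
import os

import netCDF4
import numpy as np

path = "dataless_check.nc"
ds = netCDF4.Dataset(path, "w")
ds.createDimension("y", 2000)
ds.createDimension("x", 2000)
# 2000 x 2000 float64 would be ~32 MB if space were actually allocated on disk.
var = ds.createVariable("data", np.float64, ("y", "x"))
var.long_name = "never written to"
ds.close()

print(os.path.getsize(path), "bytes")  # expected: a few KB of header, not ~32 MB
```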
TBH I don't have much appetite for this. I'm doubtful that creating a file with Iris and then post-modifying it would be a common thing.
NOTE: this is using "load_raw" to check loading back, only because merging dataless cubes is not yet supported.
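As a rough illustration of that round trip, here is a hedged sketch, not the PR's actual test code; the dataless Cube constructor (a shape with no data payload) follows the recent dataless-cube work, and exact signatures may differ.

```python
# Sketch of the save + load_raw round trip; assumes the dataless-cube API
# (a Cube built from a shape with no data payload) from the related Iris work.
import iris
from iris.cube import Cube

cube = Cube(shape=(10, 20), long_name="placeholder")  # dataless: no data array

iris.save(cube, "dataless.nc")

# Merging dataless cubes is not yet supported, so use load_raw rather than load.
(loaded,) = iris.load_raw("dataless.nc")
print(loaded.shape)  # expected: (10, 20), loaded back with no data attached
```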
OK. That certainly seems to be the case when I look at the file size of a netCDF file with a "dataless" variable. I was thinking, the other way we could do this is by writing to a scalar variable of the correct dtype. That way there would never be any danger of the variable being more than a handful of bytes in the file. Just throwing it out there...
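For concreteness, a rough sketch, using the netCDF4 library directly, of what that alternative encoding might look like: a scalar variable carrying the `iris_dataless_cube` marker plus an extra attribute recording the intended dimensions. The `iris_dataless_dimensions` attribute name is invented here purely for illustration.

```python
# Hypothetical "scalar variable" encoding of a dataless cube -- NOT the scheme
# adopted in this PR; the shape-recording attribute below is illustrative only.
import netCDF4
import numpy as np

ds = netCDF4.Dataset("dataless_scalar.nc", "w")
ds.createDimension("y", 2000)
ds.createDimension("x", 2000)

# A scalar variable: only a handful of bytes, whatever the cube's real shape.
var = ds.createVariable("data", np.float64)  # no dimensions => scalar
var.setncattr("iris_dataless_cube", "true")       # marker attribute, as in the PR
var.setncattr("iris_dataless_dimensions", "y x")  # invented shape record
ds.close()
```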
@trexfeathers + @stephenworsley + I discussed this option a bit at the sprint standup today. I am concerned that this alternative is rather nonstandard CF, and wouldn't be understood by other CF-compliant software (e.g. xarray, cf-python, or whatever?).
From a strict CF point of view, I think this type of encoding might risk not correctly identifying any attached aux-coords, cell-measures or ancillaries: either the variable would not be perceived as a data-variable, or the references to those other variables would not be valid, because the main variable doesn't map the correct dimensions. At the very least, the variable wouldn't be treated as a "normal" data variable by other software, whereas with the current scheme it should be.
I think it's also a fair bet that the changes to support the alternative encoding would be more complicated than what I've done here already!
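To make that contrast concrete, a small netCDF4 sketch of the adopted scheme: the never-written data variable still maps its dimensions, so a standard CF `coordinates` reference to a 2-D auxiliary coordinate remains well-formed. Variable names here are illustrative.

```python
# Adopted scheme, sketched directly with netCDF4: the data variable is never
# written to, but it still maps ("y", "x"), so CF references stay valid.
import netCDF4
import numpy as np

ds = netCDF4.Dataset("dataless_cf.nc", "w")
ds.createDimension("y", 4)
ds.createDimension("x", 5)

aux = ds.createVariable("surface_altitude", np.float32, ("y", "x"))
aux[:] = np.zeros((4, 5), dtype=np.float32)

data = ds.createVariable("air_temperature", np.float64, ("y", "x"))  # unwritten
data.setncattr("iris_dataless_cube", "true")   # marker attribute from this PR
data.coordinates = "surface_altitude"          # valid: dims match the aux coord
ds.close()
```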
Also, @stephenworsley has suggested that perhaps we should allow saving dataless cubes only when the user activates a special control to assert that this was intended. I think that would make more sense in the case that we adopted your alternate "scalar variable encoding" idea. Do you think this has merit?
Yes - valid points. I guess it would be a very Iris-specific approach and would risk problems with other software. I think my concern was more that the dataless cube gets saved, then is modified outside Iris to add some data, then loaded back into Iris, but the data is lost because we still have an `iris_dataless_cube` attribute. However, as already discussed in the comments above, this is maybe a bit of a contrived scenario, so I am happy to go with the original approach.
Potentially. Although surely all that would actually do is control whether the …
I think the idea is that, without the control, you would just get a "can't save dataless cubes" error.
Ah OK - I understand. In that case I guess it makes sense.
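Purely for illustration, here is what such an opt-in control might look like as a save keyword. Neither the `allow_dataless` name nor the error message is part of this PR; both are hypothetical.

```python
# Hypothetical sketch only: "allow_dataless" is an invented keyword, not an
# existing Iris option, and the error text is made up.
import iris
from iris.cube import Cube

cube = Cube(shape=(10, 20))  # a dataless cube, as sketched earlier

try:
    iris.save(cube, "out.nc")  # without the control: saving would be refused
except ValueError as err:
    print(err)  # e.g. "can't save dataless cubes"

iris.save(cube, "out.nc", allow_dataless=True)  # explicit assertion of intent
```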
Hi @ukmo-ccbunney, I know I said to hold on this 'til I was happy that I had thought a bit more about the possible downsides... My remaining concern was that the cunning netcdf 'feature' I'm relying on here might stop working (i.e. unwritten variables start taking up space in the file)...
However, although the alternative scheme is non-standard and switching to it would require careful documentation, that would equally be true if we adopted it now. So I propose we adopt this approach + hope it doesn't get broken!
Agreed. The solution as it stands is neat and efficient. If the future behaviour of netCDF changed and it did write out a full data array of missing values to the file, the implementation as it stands would still work fine in Iris (albeit with the indirect risk of causing the write to fail due to unexpectedly filling up a disk).
LGTM! 🚀
* Initial WIP for dataless merges -- cannot yet merge datafull+dataless.
* Starting tests.
* Functioning backstop: merge can pass-through dataless, but not actually merge them.
* Dataless merge, combine dataless with/without dataful.
* Tidy awkward layout in test.
* Ensure that cube.shape can only be a tuple (or None).
* Make test_merge check against dataless input in all its tests.
* Improve tests, and test for lazy merge result.
* Fix typo.
* Expand documentation.
* Fix broken ref + tweak whatsnew.
* Fixes following implementation of dataless save-and-load (#6739).
* Remove redundant checks.
* Make make_gridcube() dataless, and improve documentation cross-refs.
* Review changes: small fixes to docs.
* Use the intended dtype for data of all-masked arrays.
Closes #6727