Commit 37f4547

Stricter array summary in CML output (SciTools#6743)
* - New array_summary utility to format numpy arrays in CML.
  - Add `checksum` to _DimensionalMetadata xml.
  - New `masked_count` attr added to cube and _DimensionalMetadata xml.
* Add data stats to CML
* Pass keyword options through xml call hierarchy
* Replaced extra keywords in xml_element functions with settings in context manager
* Added context manager for controlling CML output and formatting
* Reinstated "no-masked-elements" crc output
* Added docstring for `array_checksum`
* Added `coord_order` option. Tidied up.
* Updated CMLSettings.set keyword defaults to None. Now only updates the setting if the keyword is non-None. This allows for nested context managers without losing previous settings.
* Only strip trailing zeros for floats. Only output stats for arrays of length > 1
* Turn off numpy-formatting for all CML output in tests
* Added some CML formatting keywords to `_shared_utils.assert_CML`
* Updated test results for new default CML formatting. Note modification of `unit/pandas/test_pandas.py` to disable coord checksum
* New cube XML tests to cover new formatting options
* Update docstring for Cube.xml and CubeList.xml. Also make CML_Settings public class so it appears in docs
* Added whatsnew
* Typo in doctest
* Fix doc tests (switched to using code-block)
* Update docs/src/whatsnew/latest.rst Fix typo Co-authored-by: Bill Little <[email protected]>
* Update lib/iris/tests/_shared_utils.py Fix docstring link Co-authored-by: Bill Little <[email protected]>
* Update lib/iris/util.py Co-authored-by: Bill Little <[email protected]>
* Update lib/iris/tests/_shared_utils.py Co-authored-by: Bill Little <[email protected]>
* Added typing for CMLSettings data attributes
* Update lib/iris/util.py Co-authored-by: Bill Little <[email protected]>
* Update lib/iris/util.py Co-authored-by: Bill Little <[email protected]>
* Put `numpy.typing.ArrayLike` in `TYPE_CHECKING` block
* Factored out `fixed_std` nested function in cube.py and coords.py
* Fix broken username link
---------
Co-authored-by: Bill Little <[email protected]>
1 parent fa6d61d commit 37f4547
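
The commit message notes that the `set` keywords default to None so that nested context managers do not lose earlier settings. A minimal sketch of that pattern follows. The class `SettingsSketch`, its two attributes, and their initial values are hypothetical placeholders chosen for illustration; this is not the Iris `CMLSettings` implementation.

```python
import contextlib


class SettingsSketch:
    """Hypothetical settings holder (illustration only, not the Iris class)."""

    def __init__(self):
        # Placeholder initial values, assumed for this sketch.
        self.numpy_formatting = True
        self.coord_checksum = False

    @contextlib.contextmanager
    def set(self, numpy_formatting=None, coord_checksum=None):
        # Remember the current values so they can be restored on exit.
        saved = dict(vars(self))
        updates = {
            "numpy_formatting": numpy_formatting,
            "coord_checksum": coord_checksum,
        }
        try:
            for name, value in updates.items():
                # A keyword left as None does not touch the existing setting.
                if value is not None:
                    setattr(self, name, value)
            yield self
        finally:
            for name, value in saved.items():
                setattr(self, name, value)


SETTINGS = SettingsSketch()

with SETTINGS.set(coord_checksum=True):
    with SETTINGS.set(numpy_formatting=False):
        # The inner block still sees coord_checksum=True from the outer block.
        inner = (SETTINGS.numpy_formatting, SETTINGS.coord_checksum)
    outer = (SETTINGS.numpy_formatting, SETTINGS.coord_checksum)
after = (SETTINGS.numpy_formatting, SETTINGS.coord_checksum)
```

If the keywords defaulted to concrete values instead of None, the inner `set` call would overwrite `coord_checksum` with its own default, which is the problem the commit message describes.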

554 files changed: +6432 -31080 lines changed


docs/src/whatsnew/latest.rst

Lines changed: 9 additions & 2 deletions
Original file line number | Diff line number | Diff line change
@@ -40,6 +40,10 @@ This document explains the changes made to Iris for this release
4040
:func:`~iris.fileformats.netcdf.saver.save_mesh` also supports ``zlib``
4141
compression. (:issue:`6565`, :pull:`6728`)
4242

43+
#. `@ukmo-ccbunney`_ added a new :class:`~iris.util.CMLSettings` class to control
44+
the formatting of Cube CML output via a context manager.
45+
(:issue:`6244`, :pull:`6743`)
46+
4347

4448
🐛 Bugs Fixed
4549
=============
@@ -109,9 +113,12 @@ This document explains the changes made to Iris for this release
109113
#. `@melissaKG`_ upgraded Iris' tests to no longer use the deprecated
110114
``git whatchanged`` command. (:pull:`6672`)
111115

112-
#. `@ukmo-ccbunney` merged functionality of ``assert_CML_approx_data`` into
116+
#. `@ukmo-ccbunney`_ merged functionality of ``assert_CML_approx_data`` into
113117
``assert_CML`` via the use of a new ``approx_data`` keyword. (:pull:`6713`)
114118

119+
#. `@ukmo-ccbunney`_ changed ``assert_CML`` to use stricter array formatting to avoid
120+
changes in tests due to Numpy version changes. (:pull:`6743`)
121+
115122

116123
.. comment
117124
Whatsnew author names (@github name) in alphabetical order. Note that,
@@ -124,4 +131,4 @@ This document explains the changes made to Iris for this release
124131
.. comment
125132
Whatsnew resources in alphabetical order:
126133
127-
.. _netcdf-c#3183: https://github.com/Unidata/netcdf-c/issues/3183
134+
.. _netcdf-c#3183: https://github.com/Unidata/netcdf-c/issues/3183

lib/iris/coords.py

Lines changed: 70 additions & 3 deletions
@@ -32,6 +32,7 @@
3232
import iris.exceptions
3333
import iris.time
3434
import iris.util
35+
from iris.util import CML_SETTINGS
3536
import iris.warnings
3637

3738
#: The default value for ignore_axis which controls guess_coord_axis' behaviour
@@ -853,10 +854,45 @@ def xml_element(self, doc):
853854
if self.coord_system:
854855
element.appendChild(self.coord_system.xml_element(doc))
855856

857+
is_masked_array = np.ma.isMaskedArray(self._values)
858+
856859
# Add the values
857860
element.setAttribute("value_type", str(self._value_type_name()))
858861
element.setAttribute("shape", str(self.shape))
859862

863+
# data checksum
864+
if CML_SETTINGS.coord_checksum:
865+
crc = iris.util.array_checksum(self._values)
866+
element.setAttribute("checksum", crc)
867+
868+
if is_masked_array:
869+
# Add a checksum of the mask, or a marker if nothing is masked
870+
if np.ma.is_masked(self._values):
871+
crc = iris.util.array_checksum(self._values.mask)
872+
else:
873+
crc = "no-masked-elements"
874+
element.setAttribute("mask_checksum", crc)
875+
876+
# array ordering:
877+
def _order(array):
878+
order = ""
879+
if array.flags["C_CONTIGUOUS"]:
880+
order = "C"
881+
elif array.flags["F_CONTIGUOUS"]:
882+
order = "F"
883+
return order
884+
885+
if CML_SETTINGS.coord_order:
886+
element.setAttribute("order", _order(self._values))
887+
if is_masked_array:
888+
element.setAttribute("mask_order", _order(self._values.mask))
889+
890+
# masked element count:
891+
if CML_SETTINGS.masked_value_count and is_masked_array:
892+
element.setAttribute(
893+
"masked_count", str(np.count_nonzero(self._values.mask))
894+
)
895+
860896
# The values are referred to "points" of a coordinate and "data"
861897
# otherwise.
862898
if isinstance(self, Coord):
@@ -865,7 +901,31 @@ def xml_element(self, doc):
865901
values_term = "indices"
866902
else:
867903
values_term = "data"
868-
element.setAttribute(values_term, self._xml_array_repr(self._values))
904+
element.setAttribute(
905+
values_term,
906+
self._xml_array_repr(self._values),
907+
)
908+
909+
if iris.util.CML_SETTINGS.coord_data_array_stats and len(self._values) > 1:
910+
data = self._values
911+
912+
if np.issubdtype(data.dtype.type, np.number):
913+
data_min = data.min()
914+
data_max = data.max()
915+
if data_min == data_max:
916+
# When data is constant, std() can return rounding noise, so report 0.
917+
data_std = 0
918+
else:
919+
data_std = data.std()
920+
921+
stats_xml_element = doc.createElement("stats")
922+
stats_xml_element.setAttribute("std", str(data_std))
923+
stats_xml_element.setAttribute("min", str(data_min))
924+
stats_xml_element.setAttribute("max", str(data_max))
925+
stats_xml_element.setAttribute("masked", str(ma.is_masked(data)))
926+
stats_xml_element.setAttribute("mean", str(data.mean()))
927+
928+
element.appendChild(stats_xml_element)
869929

870930
return element
871931

@@ -896,7 +956,11 @@ def _xml_array_repr(data):
896956
if hasattr(data, "to_xml_attr"):
897957
result = data._values.to_xml_attr()
898958
else:
899-
result = iris.util.format_array(data)
959+
edgeitems = CML_SETTINGS.array_edgeitems
960+
if CML_SETTINGS.numpy_formatting:
961+
result = iris.util.format_array(data, edgeitems=edgeitems)
962+
else:
963+
result = iris.util.array_summary(data, edgeitems=edgeitems)
900964
return result
901965

902966
def _value_type_name(self):
@@ -2565,7 +2629,10 @@ def xml_element(self, doc):
25652629

25662630
# Add bounds, points are handled by the parent class.
25672631
if self.has_bounds():
2568-
element.setAttribute("bounds", self._xml_array_repr(self.bounds))
2632+
element.setAttribute(
2633+
"bounds",
2634+
self._xml_array_repr(self.bounds),
2635+
)
25692636

25702637
return element
25712638
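
The coords.py hunks call `iris.util.array_checksum`, whose body is not shown on this page. The sketch below is reconstructed from the inline code that the same commit removes from cube.py: CRC32 over a C-contiguous, little-endian copy of the array, with masked points filled with 0. The function name `checksum_sketch` is hypothetical, and `astype` is swapped in here for the original `byteswap` plus dtype reassignment, so treat it as an approximation and not the library function.

```python
import zlib

import numpy as np


def checksum_sketch(data):
    """Return a CRC32 hex string that ignores memory layout and byte order."""
    if np.ma.isMaskedArray(data):
        # Fill masked points with a fixed value so the result does not
        # depend on whatever numbers sit under the mask or on fill_value.
        data = data.filled(0)
    # Force C-contiguous memory, then little-endian byte order.
    data = np.ascontiguousarray(data)
    data = data.astype(data.dtype.newbyteorder("<"), copy=False)
    return "0x%08x" % (zlib.crc32(data.tobytes()) & 0xFFFFFFFF)


big_endian = np.arange(6, dtype=">i4").reshape(2, 3)
little_endian = big_endian.astype("<i4")
fortran_order = np.asfortranarray(little_endian)

masked_a = np.ma.masked_array([1, 2, 3], mask=[False, True, False])
masked_b = np.ma.masked_array([1, 99, 3], mask=[False, True, False])
```

Under these assumptions, the three integer arrays share one checksum, and the two masked arrays share another because the value under the mask is replaced before hashing.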

lib/iris/cube.py

Lines changed: 69 additions & 30 deletions
@@ -22,7 +22,6 @@
2222
from typing import TYPE_CHECKING, Any, Optional, TypeGuard
2323
import warnings
2424
from xml.dom.minidom import Document
25-
import zlib
2625

2726
from cf_units import Unit
2827
import dask.array as da
@@ -56,6 +55,7 @@
5655
from iris.mesh import MeshCoord
5756
import iris.exceptions
5857
import iris.util
58+
from iris.util import CML_SETTINGS
5959
import iris.warnings
6060

6161
__all__ = ["Cube", "CubeAttrsDict", "CubeList"]
@@ -171,7 +171,10 @@ def insert(self, index, cube):
171171
super(CubeList, self).insert(index, cube)
172172

173173
def xml(self, checksum=False, order=True, byteorder=True):
174-
"""Return a string of the XML that this list of cubes represents."""
174+
"""Return a string of the XML that this list of cubes represents.
175+
176+
See :func:`iris.util.CML_SETTINGS.set` for controlling the XML output formatting.
177+
"""
175178
with np.printoptions(legacy=NP_PRINTOPTIONS_LEGACY):
176179
doc = Document()
177180
cubes_xml_element = doc.createElement("cubes")
@@ -3902,12 +3905,29 @@ def xml(
39023905
order: bool = True,
39033906
byteorder: bool = True,
39043907
) -> str:
3905-
"""Return a fully valid CubeML string representation of the Cube."""
3908+
"""Return a fully valid CubeML string representation of the Cube.
3909+
3910+
The format of the generated XML can be controlled using the
3911+
``iris.util.CML_SETTINGS.set`` method as a context manager.
3912+
3913+
For example, to include array statistics for the coordinate data:
3914+
3915+
.. code-block:: python
3916+
3917+
with CML_SETTINGS.set(coord_data_array_stats=True):
3918+
print(cube.xml())
3919+
3920+
See :func:`iris.util.CML_SETTINGS.set` for more details.
3921+
3922+
"""
39063923
with np.printoptions(legacy=NP_PRINTOPTIONS_LEGACY):
39073924
doc = Document()
39083925

39093926
cube_xml_element = self._xml_element(
3910-
doc, checksum=checksum, order=order, byteorder=byteorder
3927+
doc,
3928+
checksum=checksum,
3929+
order=order,
3930+
byteorder=byteorder,
39113931
)
39123932
cube_xml_element.setAttribute("xmlns", XML_NAMESPACE_URI)
39133933
doc.appendChild(cube_xml_element)
@@ -3916,7 +3936,13 @@ def xml(
39163936
doc = self._sort_xml_attrs(doc)
39173937
return iris.util._print_xml(doc)
39183938

3919-
def _xml_element(self, doc, checksum=False, order=True, byteorder=True):
3939+
def _xml_element(
3940+
self,
3941+
doc,
3942+
checksum=False,
3943+
order=True,
3944+
byteorder=True,
3945+
):
39203946
cube_xml_element = doc.createElement("cube")
39213947

39223948
if self.standard_name:
@@ -4006,39 +4032,46 @@ def dimmeta_xml_element(element, typename, dimscall):
40064032
data_xml_element = doc.createElement("data")
40074033
data_xml_element.setAttribute("shape", str(self.shape))
40084034

4009-
# NB. Getting a checksum triggers any deferred loading,
4035+
# NB. Getting a checksum or data stats triggers any deferred loading,
40104036
# in which case it also has the side-effect of forcing the
40114037
# byte order to be native.
4038+
40124039
if checksum:
40134040
data = self.data
4014-
4015-
# Ensure consistent memory layout for checksums.
4016-
def normalise(data):
4017-
data = np.ascontiguousarray(data)
4018-
if data.dtype.newbyteorder("<") != data.dtype:
4019-
data = data.byteswap(False)
4020-
data.dtype = data.dtype.newbyteorder("<")
4021-
return data
4022-
4041+
crc = iris.util.array_checksum(data)
4042+
data_xml_element.setAttribute("checksum", crc)
40234043
if ma.isMaskedArray(data):
4024-
# Fill in masked values to avoid the checksum being
4025-
# sensitive to unused numbers. Use a fixed value so
4026-
# a change in fill_value doesn't affect the
4027-
# checksum.
4028-
crc = "0x%08x" % (zlib.crc32(normalise(data.filled(0))) & 0xFFFFFFFF,)
4029-
data_xml_element.setAttribute("checksum", crc)
40304044
if ma.is_masked(data):
4031-
crc = "0x%08x" % (zlib.crc32(normalise(data.mask)) & 0xFFFFFFFF,)
4045+
crc = iris.util.array_checksum(data.mask)
40324046
else:
40334047
crc = "no-masked-elements"
40344048
data_xml_element.setAttribute("mask_checksum", crc)
4049+
4050+
if CML_SETTINGS.data_array_stats:
4051+
data = self.data
4052+
data_min = data.min()
4053+
data_max = data.max()
4054+
if data_min == data_max:
4055+
# When data is constant, std() can return rounding noise, so report 0.
4056+
data_std = 0
40354057
else:
4036-
crc = "0x%08x" % (zlib.crc32(normalise(data)) & 0xFFFFFFFF,)
4037-
data_xml_element.setAttribute("checksum", crc)
4038-
elif self.has_lazy_data():
4039-
data_xml_element.setAttribute("state", "deferred")
4040-
else:
4041-
data_xml_element.setAttribute("state", "loaded")
4058+
data_std = data.std()
4059+
4060+
stats_xml_element = doc.createElement("stats")
4061+
stats_xml_element.setAttribute("std", str(data_std))
4062+
stats_xml_element.setAttribute("min", str(data_min))
4063+
stats_xml_element.setAttribute("max", str(data_max))
4064+
stats_xml_element.setAttribute("masked", str(ma.is_masked(data)))
4065+
stats_xml_element.setAttribute("mean", str(data.mean()))
4066+
4067+
data_xml_element.appendChild(stats_xml_element)
4068+
4069+
# We only print the "state" if we have not output checksum or data stats:
4070+
if not (checksum or CML_SETTINGS.data_array_stats):
4071+
if self.has_lazy_data():
4072+
data_xml_element.setAttribute("state", "deferred")
4073+
else:
4074+
data_xml_element.setAttribute("state", "loaded")
40424075

40434076
# Add the dtype, and also the array and mask orders if the
40444077
# data is loaded.
@@ -4065,8 +4098,14 @@ def _order(array):
40654098
if array_byteorder is not None:
40664099
data_xml_element.setAttribute("byteorder", array_byteorder)
40674100

4068-
if order and ma.isMaskedArray(data):
4069-
data_xml_element.setAttribute("mask_order", _order(data.mask))
4101+
if ma.isMaskedArray(data):
4102+
if CML_SETTINGS.masked_value_count:
4103+
data_xml_element.setAttribute(
4104+
"masked_count", str(np.count_nonzero(data.mask))
4105+
)
4106+
if order:
4107+
data_xml_element.setAttribute("mask_order", _order(data.mask))
4108+
40704109
else:
40714110
dtype = self.lazy_data().dtype
40724111
data_xml_element.setAttribute("dtype", dtype.name)
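
The statistics block added to `_xml_element` can be read in isolation as a small function. The sketch below mirrors the logic visible in the hunk (min, max, mean, a masked flag, and a standard deviation forced to 0 for constant data) but returns a plain dictionary instead of building an XML element. The name `array_stats_sketch` is hypothetical.

```python
import numpy as np


def array_stats_sketch(data):
    """Collect the same summary values that the diff writes to <stats>."""
    data_min = data.min()
    data_max = data.max()
    if data_min == data_max:
        # Constant data: report exactly 0 instead of whatever tiny value
        # floating-point rounding might produce in std().
        data_std = 0
    else:
        data_std = data.std()
    return {
        "std": str(data_std),
        "min": str(data_min),
        "max": str(data_max),
        "masked": str(np.ma.is_masked(data)),
        "mean": str(data.mean()),
    }


constant = array_stats_sketch(np.full(5, 0.1))
partly_masked = array_stats_sketch(
    np.ma.masked_array([1.0, 2.0, 3.0], mask=[False, True, False])
)
```

Writing every value through `str()` keeps the output suitable for XML attributes, and pinning the constant case to the string `"0"` removes one source of platform-dependent differences in reference files.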

lib/iris/tests/_shared_utils.py

Lines changed: 34 additions & 10 deletions
@@ -366,8 +366,10 @@ def assert_CML(
366366
request: pytest.FixtureRequest,
367367
cubes,
368368
reference_filename=None,
369-
checksum=True,
370369
approx_data=False,
370+
checksum=True,
371+
coord_checksum=None,
372+
numpy_formatting=None,
371373
**kwargs,
372374
):
373375
"""Test that the CML for the given cubes matches the contents of
@@ -379,6 +381,9 @@ def assert_CML(
379381
The data payload of individual cubes is not compared unless ``checksum``
380382
or ``approx_data`` are True.
381383
384+
Further control of the CML formatting is available through the
385+
:data:`iris.util.CML_SETTINGS` context manager.
386+
382387
Notes
383388
-----
384389
The ``approx_data`` keyword provides functionality equivalent to the
@@ -393,20 +398,28 @@ def assert_CML(
393398
A pytest ``request`` fixture passed down from the calling test. Is
394399
required by :func:`result_path`. See :func:`result_path` Examples
395400
for how to access the ``request`` fixture.
396-
cubes :
401+
cubes : iris.cube.Cube or iris.cube.CubeList
397402
Either a Cube or a sequence of Cubes.
398403
reference_filename : optional, default=None
399404
The relative path (relative to the test results directory).
400405
If omitted, the result is generated from the calling
401406
method's name, class, and module using
402407
:meth:`iris.tests.IrisTest.result_path`.
403-
checksum : bool, optional
404-
When True, causes the CML to include a checksum for each
405-
Cube's data. Defaults to True.
406408
approx_data : bool, optional, default=False
407409
When True, the cube's data will be compared with the reference
408410
data and asserted to be within a specified tolerance. Implies
409411
``checksum=False``.
412+
checksum : bool, optional, default=True
413+
When True, causes the CML to include a checksum for each
414+
Cube's data. Defaults to True.
415+
coord_checksum : bool, optional, default=True
416+
When True, causes the CML to include a checksum for each
417+
Cube's coordinate data. Defaults to True.
418+
numpy_formatting : bool, optional, default=False
419+
When True, causes the CML to use numpy-style formatting for
420+
array data. When False, uses simplified array formatting
421+
that doesn't rely on Numpy's ``array2string`` formatter.
422+
Defaults to False.
410423
411424
"""
412425
_check_for_request_fixture(request, "assert_CML")
@@ -417,20 +430,31 @@ def assert_CML(
417430
reference_filename = result_path(request, None, "cml")
418431
# Note: reference_path could be a tuple of path parts
419432
reference_path = get_result_path(reference_filename)
433+
434+
# default CML output options for tests:
435+
extra_format_options = {"numpy_formatting": False, "coord_checksum": True}
436+
# update formatting opts with keywords passed into this function:
437+
for k in extra_format_options.keys():
438+
if (user_opt := locals()[k]) is not None:
439+
extra_format_options[k] = user_opt
440+
420441
if approx_data:
421-
# compare data payload stats against known good stats
422-
checksum = False # ensure we are not comparing data checksums
442+
# compare data payload stats against known good stats.
443+
# Make sure options that compare exact data are disabled:
444+
checksum = False
445+
extra_format_options["data_array_stats"] = False
446+
423447
for i, cube in enumerate(cubes):
424448
# Build the json stats filename based on CML file path:
425449
fname = reference_path.removesuffix(".cml")
426450
fname += f".data.{i}.json"
427451
assert_data_almost_equal(cube.data, fname, **kwargs)
428-
if isinstance(cubes, (list, tuple)):
452+
453+
with iris.util.CML_SETTINGS.set(**extra_format_options):
429454
cml = iris.cube.CubeList(cubes).xml(
430455
checksum=checksum, order=False, byteorder=False
431456
)
432-
else:
433-
cml = cubes.xml(checksum=checksum, order=False, byteorder=False)
457+
434458
_check_same(cml, reference_path)
435459

436460
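
In `assert_CML`, the test-level defaults are overridden only by keywords that the caller actually supplied, with None meaning "not given". The diff does this by looking names up in `locals()`. The sketch below shows the same idea with explicit keyword arguments; `resolve_options` and `TEST_DEFAULTS` are hypothetical names introduced for illustration and do not exist in Iris.

```python
def resolve_options(defaults, **overrides):
    """Merge caller overrides into defaults, treating None as 'not given'."""
    unknown = set(overrides) - set(defaults)
    if unknown:
        raise TypeError(f"unexpected option(s): {sorted(unknown)}")
    # Copy so that the shared defaults are never modified in place.
    resolved = dict(defaults)
    for name, value in overrides.items():
        if value is not None:
            resolved[name] = value
    return resolved


# The same default values that the diff assigns to extra_format_options.
TEST_DEFAULTS = {"numpy_formatting": False, "coord_checksum": True}

options = resolve_options(
    TEST_DEFAULTS, numpy_formatting=None, coord_checksum=False
)
```

Passing the options explicitly avoids depending on local variable names, at the cost of listing each keyword at the call site.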
