
Commit 84305e4

Merge branch 'main' into pydap4_scale
2 parents 5f2adfb + ef180b8

16 files changed: +411 −28 lines

doc/user-guide/hierarchical-data.rst

Lines changed: 2 additions & 0 deletions

@@ -453,6 +453,8 @@ The result is a new tree, containing only the nodes matching the condition.
 
 (Yes, under the hood :py:meth:`~xarray.DataTree.filter` is just syntactic sugar for the pattern we showed you in :ref:`iterating over trees` !)
 
+If you want to filter out empty nodes you can use :py:meth:`~xarray.DataTree.prune`.
+
 .. _Tree Contents:
 
 Tree Contents

doc/whats-new.rst

Lines changed: 14 additions & 0 deletions

@@ -12,6 +12,9 @@ v2025.07.2 (unreleased)
 
 New Features
 ~~~~~~~~~~~~
+- Added :py:meth:`DataTree.prune` method to remove empty nodes while preserving tree structure.
+  Useful for cleaning up DataTree after time-based filtering operations (:issue:`10590`, :pull:`10598`).
+  By `Alfonso Ladino <https://github.com/aladinor>`_.
 
 
 - :py:meth:`DataTree.to_netcdf` can now write to a file-like object, or return bytes if called without a filepath. (:issue:`10570`)
   By `Matthew Willson <https://github.com/mjwillson>`_.
@@ -24,6 +27,13 @@ New Features
 Breaking changes
 ~~~~~~~~~~~~~~~~
 
+- When writing to NetCDF files with groups, Xarray no longer redefines dimensions
+  that have the same size in parent groups (:issue:`10241`). This conforms with
+  `CF Conventions for group scope <https://cfconventions.org/cf-conventions/cf-conventions.html#_scope>`_
+  but may require adjustments for code that consumes NetCDF files produced by
+  Xarray.
+  By `Stephan Hoyer <https://github.com/shoyer>`_.
+
 
 Deprecations
 ~~~~~~~~~~~~
@@ -60,6 +70,10 @@ Bug fixes
 Documentation
 ~~~~~~~~~~~~~
 
+- Clarify lazy behaviour and eager loading for ``chunks=None`` in :py:func:`~xarray.open_dataset`, :py:func:`~xarray.open_dataarray`, :py:func:`~xarray.open_datatree`, :py:func:`~xarray.open_groups` and :py:func:`~xarray.open_zarr` (:issue:`10612`, :pull:`10627`).
+  By `Kai Mühlbauer <https://github.com/kmuehlbauer>`_.
+
+
 
 Internal Changes
 ~~~~~~~~~~~~~~~~

xarray/backends/api.py

Lines changed: 16 additions & 8 deletions

@@ -578,8 +578,10 @@ def open_dataset(
 
     - ``chunks="auto"`` will use dask ``auto`` chunking taking into account the
       engine preferred chunks.
-    - ``chunks=None`` skips using dask, which is generally faster for
-      small arrays.
+    - ``chunks=None`` skips using dask. This uses xarray's internally private
+      :ref:`lazy indexing classes <internal design.lazy indexing>`,
+      but data is eagerly loaded into memory as numpy arrays when accessed.
+      This can be more efficient for smaller arrays or when large arrays are sliced before computation.
     - ``chunks=-1`` loads the data with dask using a single chunk for all arrays.
     - ``chunks={}`` loads the data with dask using the engine's preferred chunk
       size, generally identical to the format's chunk size. If not available, a
@@ -819,8 +821,10 @@ def open_dataarray(
 
     - ``chunks='auto'`` will use dask ``auto`` chunking taking into account the
       engine preferred chunks.
-    - ``chunks=None`` skips using dask, which is generally faster for
-      small arrays.
+    - ``chunks=None`` skips using dask. This uses xarray's internally private
+      :ref:`lazy indexing classes <internal design.lazy indexing>`,
+      but data is eagerly loaded into memory as numpy arrays when accessed.
+      This can be more efficient for smaller arrays, though results may vary.
     - ``chunks=-1`` loads the data with dask using a single chunk for all arrays.
     - ``chunks={}`` loads the data with dask using engine preferred chunks if
       exposed by the backend, otherwise with a single chunk for all arrays.
@@ -1044,8 +1048,10 @@ def open_datatree(
 
     - ``chunks="auto"`` will use dask ``auto`` chunking taking into account the
       engine preferred chunks.
-    - ``chunks=None`` skips using dask, which is generally faster for
-      small arrays.
+    - ``chunks=None`` skips using dask. This uses xarray's internally private
+      :ref:`lazy indexing classes <internal design.lazy indexing>`,
+      but data is eagerly loaded into memory as numpy arrays when accessed.
+      This can be more efficient for smaller arrays, though results may vary.
     - ``chunks=-1`` loads the data with dask using a single chunk for all arrays.
     - ``chunks={}`` loads the data with dask using the engine's preferred chunk
       size, generally identical to the format's chunk size. If not available, a
@@ -1288,8 +1294,10 @@ def open_groups(
 
     - ``chunks="auto"`` will use dask ``auto`` chunking taking into account the
       engine preferred chunks.
-    - ``chunks=None`` skips using dask, which is generally faster for
-      small arrays.
+    - ``chunks=None`` skips using dask. This uses xarray's internally private
+      :ref:`lazy indexing classes <internal design.lazy indexing>`,
+      but data is eagerly loaded into memory as numpy arrays when accessed.
+      This can be more efficient for smaller arrays, though results may vary.
     - ``chunks=-1`` loads the data with dask using a single chunk for all arrays.
     - ``chunks={}`` loads the data with dask using the engine's preferred chunk
       size, generally identical to the format's chunk size. If not available, a
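The behaviour the updated docstrings describe — indexing is deferred, but the data becomes an in-memory numpy array as soon as it is accessed — can be illustrated with a minimal stand-in for xarray's lazy indexing classes. The `LazyBackendArray` class and `fake_read` loader below are hypothetical sketches for illustration, not xarray's actual implementation:

```python
class LazyBackendArray:
    """Minimal analogue of a lazily-indexed backend array: opening it is
    free, and the backend is only read when a slice is accessed."""

    def __init__(self, load, shape):
        self._load = load  # callable that reads the underlying data
        self.shape = shape
        self.loads = 0  # count how often the backend is actually hit

    def __getitem__(self, key):
        # Eager materialisation happens here, only for the requested slice.
        self.loads += 1
        return self._load()[key]


def fake_read():
    """Stand-in for a backend read of a 10-element array."""
    return list(range(10))


arr = LazyBackendArray(fake_read, shape=(10,))
assert arr.loads == 0  # "opening the dataset" reads nothing
first_half = arr[slice(0, 5)]
assert first_half == [0, 1, 2, 3, 4]
assert arr.loads == 1  # data materialised only on access
```

This is why ``chunks=None`` can beat dask when a large array is sliced before computation: only the accessed region is ever read, without dask's task-graph overhead.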

xarray/backends/common.py

Lines changed: 20 additions & 2 deletions

@@ -256,6 +256,20 @@ def find_root_and_group(ds):
     return ds, group
 
 
+def collect_ancestor_dimensions(group) -> dict[str, int]:
+    """Returns dimensions defined in parent groups.
+
+    If dimensions are defined in multiple ancestors, use the size of the closest
+    ancestor.
+    """
+    dims = {}
+    while (group := group.parent) is not None:
+        for k, v in group.dimensions.items():
+            if k not in dims:
+                dims[k] = len(v)
+    return dims
+
+
 def datatree_from_dict_with_io_cleanup(groups_dict: Mapping[str, Dataset]) -> DataTree:
     """DataTree.from_dict with file clean-up."""
     try:
@@ -308,6 +322,9 @@ class AbstractDataStore:
     def get_dimensions(self):  # pragma: no cover
         raise NotImplementedError()
 
+    def get_parent_dimensions(self):  # pragma: no cover
+        return {}
+
     def get_attrs(self):  # pragma: no cover
         raise NotImplementedError()
 
@@ -563,21 +580,22 @@ def set_dimensions(self, variables, unlimited_dims=None):
         if unlimited_dims is None:
             unlimited_dims = set()
 
+        parent_dims = self.get_parent_dimensions()
         existing_dims = self.get_dimensions()
 
         dims = {}
         for v in unlimited_dims:  # put unlimited_dims first
             dims[v] = None
         for v in variables.values():
-            dims.update(dict(zip(v.dims, v.shape, strict=True)))
+            dims |= v.sizes
 
         for dim, length in dims.items():
             if dim in existing_dims and length != existing_dims[dim]:
                 raise ValueError(
                     "Unable to update size for existing dimension"
                     f"{dim!r} ({length} != {existing_dims[dim]})"
                 )
-            elif dim not in existing_dims:
+            elif dim not in existing_dims and length != parent_dims.get(dim):
                 is_unlimited = dim in unlimited_dims
                 self.set_dimension(dim, length, is_unlimited)

xarray/backends/h5netcdf_.py

Lines changed: 4 additions & 0 deletions

@@ -16,6 +16,7 @@
     WritableCFDataStore,
     _normalize_path,
     _open_remote_file,
+    collect_ancestor_dimensions,
     datatree_from_dict_with_io_cleanup,
     find_root_and_group,
 )
@@ -287,6 +288,9 @@ def get_attrs(self):
     def get_dimensions(self):
         return FrozenDict((k, len(v)) for k, v in self.ds.dimensions.items())
 
+    def get_parent_dimensions(self):
+        return FrozenDict(collect_ancestor_dimensions(self.ds))
+
     def get_encoding(self):
         return {
             "unlimited_dims": {

xarray/backends/netCDF4_.py

Lines changed: 4 additions & 0 deletions

@@ -16,6 +16,7 @@
     T_PathFileOrDataStore,
     WritableCFDataStore,
     _normalize_path,
+    collect_ancestor_dimensions,
     datatree_from_dict_with_io_cleanup,
     find_root_and_group,
     robust_getitem,
@@ -518,6 +519,9 @@ def get_attrs(self):
     def get_dimensions(self):
         return FrozenDict((k, len(v)) for k, v in self.ds.dimensions.items())
 
+    def get_parent_dimensions(self):
+        return FrozenDict(collect_ancestor_dimensions(self.ds))
+
     def get_encoding(self):
         return {
             "unlimited_dims": {

xarray/backends/zarr.py

Lines changed: 4 additions & 2 deletions

@@ -1370,8 +1370,10 @@ def open_zarr(
 
     - ``chunks='auto'`` will use dask ``auto`` chunking taking into account the
       engine preferred chunks.
-    - ``chunks=None`` skips using dask, which is generally faster for
-      small arrays.
+    - ``chunks=None`` skips using dask. This uses xarray's internally private
+      :ref:`lazy indexing classes <internal design.lazy indexing>`,
+      but data is eagerly loaded into memory as numpy arrays when accessed.
+      This can be more efficient for smaller arrays, though results may vary.
     - ``chunks=-1`` loads the data with dask using a single chunk for all arrays.
     - ``chunks={}`` loads the data with dask using engine preferred chunks if
       exposed by the backend, otherwise with a single chunk for all arrays.

xarray/core/coordinates.py

Lines changed: 1 addition & 1 deletion

@@ -177,7 +177,7 @@ def to_index(self, ordered_dims: Sequence[Hashable] | None = None) -> pd.Index:
 
         # compute the cartesian product
         code_list += [
-            np.tile(np.repeat(code, repeat_counts[i]), tile_counts[i]).tolist()
+            np.tile(np.repeat(code, repeat_counts[i]), tile_counts[i])
             for code in codes
         ]
         level_list += levels
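The one-line change above keeps the tiled codes as numpy arrays instead of converting them to Python lists, avoiding a needless copy. The repeat/tile pattern it relies on is the standard cartesian-product construction for MultiIndex codes: factor i's codes are each repeated by the product of the later level sizes and the result is tiled by the product of the earlier level sizes. A pure-Python sketch of the same pattern (`cartesian_codes` is a hypothetical helper for illustration, not xarray's function):

```python
def cartesian_codes(levels):
    """Cartesian-product integer codes for a list of levels.

    For factor i: repeat each code by prod(sizes after i), then tile the
    whole sequence prod(sizes before i) times.
    """
    sizes = [len(lv) for lv in levels]
    out = []
    for i, lv in enumerate(levels):
        repeat = 1
        for s in sizes[i + 1:]:
            repeat *= s
        tile = 1
        for s in sizes[:i]:
            tile *= s
        # pure-Python equivalent of np.tile(np.repeat(codes, repeat), tile)
        codes = [c for c in range(len(lv)) for _ in range(repeat)] * tile
        out.append(codes)
    return out


codes = cartesian_codes([["a", "b"], ["x", "y", "z"]])
assert codes[0] == [0, 0, 0, 1, 1, 1]  # "a" paired with all of x, y, z, then "b"
assert codes[1] == [0, 1, 2, 0, 1, 2]
```

Every column position then names one element of the product `("a","x"), ("a","y"), …, ("b","z")`, which is exactly what `pd.MultiIndex` expects as codes.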

xarray/core/datatree.py

Lines changed: 67 additions & 0 deletions

@@ -1450,6 +1450,73 @@ def filter_like(self, other: DataTree) -> DataTree:
         other_keys = {key for key, _ in other.subtree_with_keys}
         return self.filter(lambda node: node.relative_to(self) in other_keys)
 
+    def prune(self, drop_size_zero_vars: bool = False) -> DataTree:
+        """
+        Remove empty nodes from the tree.
+
+        Returns a new tree containing only nodes that contain data variables with actual data.
+        Intermediate nodes are kept if they are required to support non-empty children.
+
+        Parameters
+        ----------
+        drop_size_zero_vars : bool, default False
+            If True, also considers variables with zero size as empty.
+            If False, keeps nodes with data variables even if they have zero size.
+
+        Returns
+        -------
+        DataTree
+            A new tree with empty nodes removed.
+
+        See Also
+        --------
+        filter
+
+        Examples
+        --------
+        >>> dt = xr.DataTree.from_dict(
+        ...     {
+        ...         "/a": xr.Dataset({"foo": ("x", [1, 2])}),
+        ...         "/b": xr.Dataset({"bar": ("x", [])}),
+        ...         "/c": xr.Dataset(),
+        ...     }
+        ... )
+        >>> dt.prune()  # doctest: +ELLIPSIS,+NORMALIZE_WHITESPACE
+        <xarray.DataTree>
+        Group: /
+        ├── Group: /a
+        │       Dimensions:  (x: 2)
+        │       Dimensions without coordinates: x
+        │       Data variables:
+        │           foo      (x) int64 16B 1 2
+        └── Group: /b
+                Dimensions:  (x: 0)
+                Dimensions without coordinates: x
+                Data variables:
+                    bar      (x) float64 0B...
+
+        The ``drop_size_zero_vars`` parameter controls whether variables
+        with zero size are considered empty:
+
+        >>> dt.prune(drop_size_zero_vars=True)
+        <xarray.DataTree>
+        Group: /
+        └── Group: /a
+                Dimensions:  (x: 2)
+                Dimensions without coordinates: x
+                Data variables:
+                    foo      (x) int64 16B 1 2
+        """
+        non_empty_cond: Callable[[DataTree], bool]
+        if drop_size_zero_vars:
+            non_empty_cond = lambda node: len(node.data_vars) > 0 and any(
+                var.size > 0 for var in node.data_vars.values()
+            )
+        else:
+            non_empty_cond = lambda node: len(node.data_vars) > 0
+
+        return self.filter(non_empty_cond)
+
     def match(self, pattern: str) -> DataTree:
         """
         Return nodes with paths matching pattern.
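Stripped of the DataTree machinery, the prune semantics reduce to a predicate over nodes: a node survives if it has data variables, and with ``drop_size_zero_vars=True`` at least one of them must be non-empty. The dict-based sketch below (hypothetical `prune` helper over a flat path-to-variables mapping, for illustration; it elides the real method's retention of intermediate nodes that support non-empty children) mirrors the docstring example:

```python
def prune(tree, drop_size_zero_vars=False):
    """Keep paths whose node has data variables, optionally requiring at
    least one variable with nonzero size.

    `tree` maps path -> dict of variable name -> list of values.
    """

    def non_empty(ds):
        if not ds:
            return False  # no data variables at all: always pruned
        if drop_size_zero_vars:
            return any(len(v) > 0 for v in ds.values())
        return True  # has variables, zero-size ones still count

    return {path: ds for path, ds in tree.items() if non_empty(ds)}


dt = {
    "/a": {"foo": [1, 2]},  # real data
    "/b": {"bar": []},      # variable present but size zero
    "/c": {},               # no variables
}
assert prune(dt) == {"/a": {"foo": [1, 2]}, "/b": {"bar": []}}
assert prune(dt, drop_size_zero_vars=True) == {"/a": {"foo": [1, 2]}}
```

The two assertions correspond to the two doctest outputs above: the default keeps `/b` because `bar` exists, while `drop_size_zero_vars=True` also discards it.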
