Skip to content

Conversation

@TomNicholas
Copy link
Member

@TomNicholas TomNicholas commented Mar 6, 2025

This is a second attempt at addressing #335, being more brutal about removing options that aren't used. It is also intended to make implementing #473 easier.

The idea is that no-one really cares about all the complexity of distinguishing between 1D coordinate variables with and without indexes. Instead we should just default to the same index-creation behaviour as xarray uses, and the easiest way to do that is just to use xr.open_dataset and drop variables the user didn't actually want to load.

This will be inefficient right now (in the same way that the current implementation is inefficient) because we fully scan over the whole file twice. But this sets up for #473, which will avoid scanning over the file more than once.

  • Closes Add loadable_variables='coords' option #335
  • Tests added
  • Tests passing
  • Full type hint coverage
  • Changes are documented in docs/releases.rst
  • New functions/methods are listed in api.rst
  • New functionality has documentation

@TomNicholas
Copy link
Member Author

I just removed indexes={} everywhere in the tests, but maybe that should actually be done in a follow-up PR...

@TomNicholas
Copy link
Member Author

FYI @maxrjones @sharkinsspatial this PR has got to the point where I think the only failing tests are those which use a kerchunk-based reader, as I haven't ported the kerchunk translation code yet. So you could maybe build off this branch already...

from virtualizarr import open_virtual_dataset

with open_virtual_dataset(netcdf4_file, indexes={}) as ds:
with open_virtual_dataset(netcdf4_file, loadable_variables=[]) as ds:
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Required otherwise we get inlined variables in the kerchunk file which we don't know how to read (#489)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh I also moved this to a new api.py file.

Copy link
Member

@maxrjones maxrjones left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good, thank you @TomNicholas! Just had a few nits

@TomNicholas TomNicholas merged commit b9b6903 into zarr-developers:develop Mar 24, 2025
10 checks passed
@TomNicholas TomNicholas deleted the refactor_loadable_variables branch March 24, 2025 20:52
@TomNicholas TomNicholas mentioned this pull request Mar 25, 2025
8 tasks
TomNicholas added a commit that referenced this pull request Mar 25, 2025
* need latest version of xarray to import internals correctly

* Fix metadata equality for nan fill value (#502)

* add check that works for fill_values too

* note about removing once merged upstream

* type hint

* regression test

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Remove accidental changes to pyproject.toml

* Update pyproject.toml

* ignore mypy

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

* Setup intersphinx mapping for docs (#503)

* Setup intersphinx mapping for docs

---------

Co-authored-by: Kyle Barron <[email protected]>

* Change default loadable_variables (and indexes) to match xarray's behaviour (#477)

* draft refactor

* sketch of simplified handling of loadable_variables

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* get at least some tests working

* separate VirtualBackend api definition from common utilities

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* remove indexes={} everywhere in tests

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* stop passing through loadable_variables to where it isn't used

* implement logic to load 1D dimension coords by default

* remove more instances of indexes={}

* remove more indexes={}

* refactor logic for choosing loadable_variables

* fix more tets

* xfail Aimee's test that I don't understand

* xfail test that explicitly specifies no indexes

* made a bunch more stuff pass

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* fix netcdf3 reader

* fix bad import in FITS reader

* fix import in tiff reader

* fix import in icechunk test

* release note

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* update docstring

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* fix fits reader

* xfail on empty dict for indexes

* linting

* actually test new expected behaviour

* fix logic for setting loadable_variables

* update docs page to reflect new behaviour

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* fix expected behaviour in another tests

* additional assert

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* use encode_dataset_coordinates in kerchunk writer

* Encode zarr vars

* fix some mypy errors

* move drop_variables implmentation to the end of every reader

* override loadable_variables and raise warning

* fix failing test by not creating loadable variables that would get inlined by default

* improve error message

* remove some more occurrences of indexes={}

* skip slow test

* slay mypy errors

* docs typos

* should fix dmrpp test

* Delete commented-out code

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* remove unecessary test skip

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Max Jones <[email protected]>

* Update pyproject.toml deps (#504)

* re-add icechunk to upstream tests

* add pytest-asyncio to test envs

* passing serial open_virtual_mfdataset test

* passes with lithops but only for the HDF backend

* add test for dask

* refactored serial and lithops codepaths to use an executor pattern

* xfail lithops

* consolidate tests by parametrizing over parallel kwarg

* re-enable lithops test

* remove unneeded get_executor function

* add test for using dask distributed to parallelize

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Max Jones <[email protected]>
Co-authored-by: Kyle Barron <[email protected]>
TomNicholas added a commit that referenced this pull request Mar 29, 2025
* copy implementation from xarray

* sketch idea for lithops parallelization

* standardize naming of variables

* add to public API

* fix errors caused by trying to import xarray types

* start writing tests

* passing test for combining in serial

* requires_kerchunk

* test for lithops with default LocalHost executor

* notes on confusing AssertionError

* ensure lithops is installed

* remove uneeded fixture

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Additions to `open_virtual_mfdataset` (#508)

* need latest version of xarray to import internals correctly

* Fix metadata equality for nan fill value (#502)

* add check that works for fill_values too

* note about removing once merged upstream

* type hint

* regression test

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Remove accidental changes to pyproject.toml

* Update pyproject.toml

* ignore mypy

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

* Setup intersphinx mapping for docs (#503)

* Setup intersphinx mapping for docs

---------

Co-authored-by: Kyle Barron <[email protected]>

* Change default loadable_variables (and indexes) to match xarray's behaviour (#477)

* draft refactor

* sketch of simplified handling of loadable_variables

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* get at least some tests working

* separate VirtualBackend api definition from common utilities

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* remove indexes={} everywhere in tests

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* stop passing through loadable_variables to where it isn't used

* implement logic to load 1D dimension coords by default

* remove more instances of indexes={}

* remove more indexes={}

* refactor logic for choosing loadable_variables

* fix more tets

* xfail Aimee's test that I don't understand

* xfail test that explicitly specifies no indexes

* made a bunch more stuff pass

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* fix netcdf3 reader

* fix bad import in FITS reader

* fix import in tiff reader

* fix import in icechunk test

* release note

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* update docstring

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* fix fits reader

* xfail on empty dict for indexes

* linting

* actually test new expected behaviour

* fix logic for setting loadable_variables

* update docs page to reflect new behaviour

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* fix expected behaviour in another tests

* additional assert

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* use encode_dataset_coordinates in kerchunk writer

* Encode zarr vars

* fix some mypy errors

* move drop_variables implmentation to the end of every reader

* override loadable_variables and raise warning

* fix failing test by not creating loadable variables that would get inlined by default

* improve error message

* remove some more occurrences of indexes={}

* skip slow test

* slay mypy errors

* docs typos

* should fix dmrpp test

* Delete commented-out code

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* remove unecessary test skip

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Max Jones <[email protected]>

* Update pyproject.toml deps (#504)

* re-add icechunk to upstream tests

* add pytest-asyncio to test envs

* passing serial open_virtual_mfdataset test

* passes with lithops but only for the HDF backend

* add test for dask

* refactored serial and lithops codepaths to use an executor pattern

* xfail lithops

* consolidate tests by parametrizing over parallel kwarg

* re-enable lithops test

* remove unneeded get_executor function

* add test for using dask distributed to parallelize

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Max Jones <[email protected]>
Co-authored-by: Kyle Barron <[email protected]>

* Additions to `open_virtual_mfdataset` (#509)

* need latest version of xarray to import internals correctly

* passing serial open_virtual_mfdataset test

* passes with lithops but only for the HDF backend

* add test for dask

* refactored serial and lithops codepaths to use an executor pattern

* xfail lithops

* consolidate tests by parametrizing over parallel kwarg

* re-enable lithops test

* remove unneeded get_executor function

* add test for using dask distributed to parallelize

* Add ManifestStore for loading data from ManifestArrays (#490)

* Draft ManifestStore implementation

---------

Co-authored-by: Tom Nicholas <[email protected]>
Co-authored-by: Kyle Barron <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

* make it work for dask delayed

* correct docstring

---------

Co-authored-by: Max Jones <[email protected]>
Co-authored-by: Kyle Barron <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

* More open_virtual_mfdataset (#510)

* need latest version of xarray to import internals correctly

* passing serial open_virtual_mfdataset test

* passes with lithops but only for the HDF backend

* add test for dask

* refactored serial and lithops codepaths to use an executor pattern

* xfail lithops

* consolidate tests by parametrizing over parallel kwarg

* re-enable lithops test

* remove unneeded get_executor function

* add test for using dask distributed to parallelize

* make it work for dask delayed

* correct docstring

* added compliant executor for lithops

* add links to lithops issues

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

* Final fixes for open_virtual_mfdataset (#517)

* need latest version of xarray to import internals correctly

* passing serial open_virtual_mfdataset test

* passes with lithops but only for the HDF backend

* add test for dask

* refactored serial and lithops codepaths to use an executor pattern

* xfail lithops

* consolidate tests by parametrizing over parallel kwarg

* re-enable lithops test

* remove unneeded get_executor function

* add test for using dask distributed to parallelize

* make it work for dask delayed

* correct docstring

* added compliant executor for lithops

* add links to lithops issues

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* specify dask and lithops executors with a string again

* fix easy typing stuff

* fix typing errors by aligning executor signatures

* remove open_virtual_mfdataset from public API for now

* release note

* refactor construction of expected result

* implement preprocess arg, and dodge lithops bug

* update comment

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

* Apply suggestions from code reviewRemRemove new deps

* remove rogue print statement

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Max Jones <[email protected]>
Co-authored-by: Kyle Barron <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Add loadable_variables='coords' option

2 participants