[BUG] Allow skip_variables keyword argument in open_virtual_mfdataset #1225

@jackiryan

Is this issue already tracked somewhere, or is this a new report?

  • I've reviewed existing issues and couldn't find a duplicate for this problem.

Have you checked the status of Earthdata services?

  • I've executed earthaccess.status() and both CMR and EDL returned 'OK'.

Current Behavior

I am trying to virtualize a collection of HDF5 granules that are on the MODIS sinusoidal grid. Doing this raises several issues, one of which appears to be in virtualizarr itself, but this ticket covers an issue in earthaccess. There was also an issue with remote_protocol not being passed correctly, but that appears to be fixed in 0.16.0.

Aside from that, the collection I'm working with has a "Projection" variable that doesn't play nicely with the DMRPP parser. So if I try to open it like this:

# 1. some earthaccess search on the VNP43MA4 collection
# 2. Some filtering code to only get one row of granules from the SIN grid

vds = earthaccess.open_virtual_mfdataset(
    row_results,
    group="/HDFEOS/GRIDS/VIIRS_Grid_BRDF/Data_Fields",
    access="indirect",
    concat_dim="XDim",
    loadable_variables=["XDim", "YDim"]
)

I get a long error but the relevant part is at the end:

  File "/app/venv/lib/python3.13/site-packages/virtualizarr/parsers/dmrpp.py", line 74, in __call__
    manifest_store = parser.parse_dataset(object_store=store, group=self.group)
  File "/app/venv/lib/python3.13/site-packages/virtualizarr/parsers/dmrpp.py", line 181, in parse_dataset
    manifest_group = self._parse_dataset(dataset_element)
  File "/app/venv/lib/python3.13/site-packages/virtualizarr/parsers/dmrpp.py", line 281, in _parse_dataset
    variable = self._parse_variable(var_tag)
  File "/app/venv/lib/python3.13/site-packages/virtualizarr/parsers/dmrpp.py", line 391, in _parse_variable
    dimension_tags = self._find_dimension_tags(var_tag)
  File "/app/venv/lib/python3.13/site-packages/virtualizarr/parsers/dmrpp.py", line 370, in _find_dimension_tags
    dimension_tag = self.find_node_fqn(d.attrib["name"])
                                       ~~~~~~~~^^^^^^^^
KeyError: 'name'
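
The last frame indexes `d.attrib["name"]` on a dimension element, so the error suggests that at least one dimension element in the DMR++ carries no `name` attribute. Below is a minimal sketch of that failure mode using only the standard library. The XML fragment is made up for illustration (it is simplified, has no namespaces, and is not copied from a real VNP43MA4 DMR++ file), and `variables_with_unnamed_dims` is a hypothetical helper, not part of virtualizarr.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified DMR++-like fragment (assumption: real files use
# XML namespaces and more attributes; values here are invented).
FRAGMENT = """
<Dataset>
  <Dimension name="XDim" size="2400"/>
  <Float32 name="Data">
    <Dim name="/XDim"/>
  </Float32>
  <String name="Projection">
    <Dim size="1"/>
  </String>
</Dataset>
"""


def variables_with_unnamed_dims(xml_text):
    """Return names of elements that have a <Dim> child lacking a 'name' attribute."""
    root = ET.fromstring(xml_text)
    flagged = []
    for var in root:
        if any(d.tag == "Dim" and "name" not in d.attrib for d in var):
            flagged.append(var.attrib.get("name"))
    return flagged


flagged = variables_with_unnamed_dims(FRAGMENT)
print(flagged)
```

Under these assumptions, a check like this could help identify which variables are candidates for skipping before virtualizing a group.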

Expected Behavior

I reverse-engineered the earthaccess.open_virtual_mfdataset function and found that the dataset can be virtualized if the skip_variables kwarg is passed to the DMRPPParser. earthaccess.open_virtual_mfdataset should therefore accept skip_variables and forward it to the parser:

import virtualizarr as vz
from obstore.store import HTTPStore
from virtualizarr.parsers import DMRPPParser
from virtualizarr.registry import ObjectStoreRegistry

# Assume the domain, token, and tile_urls variables come from some parsing code applied to the earthaccess results
http_store = HTTPStore.from_url(
    f"https://{domain}",
    client_options={
        "default_headers": {
            "Authorization": f"Bearer {token}",
        },
    },
)
obstore_registry = ObjectStoreRegistry({f"https://{domain}": http_store})

vds = vz.open_virtual_mfdataset(
    urls=tile_urls,
    registry=obstore_registry,
    parser=DMRPPParser(
        group="/HDFEOS/GRIDS/VIIRS_Grid_BRDF/Data_Fields",
        # Skip the variable that the DMRPP parser cannot handle
        skip_variables=["Projection"],
    ),
    combine="nested",
    concat_dim="XDim",
    parallel="dask",
    loadable_variables=["XDim", "YDim"]
)

This code runs with no issues.
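
One way the requested behavior could be wired up is to separate parser-only options from the remaining keyword arguments before constructing the parser. The sketch below is only an illustration of that idea: `split_parser_kwargs` and the `parser_keys` default are hypothetical names, and this is not how earthaccess is actually implemented.

```python
def split_parser_kwargs(kwargs, parser_keys=("skip_variables",)):
    """Separate options meant for the parser from options meant for the opener.

    Hypothetical helper: `parser_keys` lists keyword names that would be
    forwarded to the parser constructor instead of the open call.
    """
    parser_kwargs = {k: v for k, v in kwargs.items() if k in parser_keys}
    open_kwargs = {k: v for k, v in kwargs.items() if k not in parser_keys}
    return parser_kwargs, open_kwargs


# Example with the keyword arguments used in this report.
user_kwargs = {
    "concat_dim": "XDim",
    "loadable_variables": ["XDim", "YDim"],
    "skip_variables": ["Projection"],
}
parser_kwargs, open_kwargs = split_parser_kwargs(user_kwargs)
print(parser_kwargs)
print(open_kwargs)
```

With a split like this, `parser_kwargs` could be passed to `DMRPPParser(group=..., **parser_kwargs)` and `open_kwargs` to the underlying open call, so callers who do not pass `skip_variables` would see no change in behavior.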

Steps To Reproduce

import earthaccess
import re

earthaccess.login()
results = earthaccess.search_data(
    short_name="VNP43MA4",
    temporal="2026-01-27",
)
tile_dict = {}
for res in results:
    url = res.data_links(access="indirect")[0]
    match = re.search(r'\.h(\d{2})v(\d{2})\.', url)
    if match:
        h, v = int(match.group(1)), int(match.group(2))
        tile_dict[(h, v)] = res

row = 2
# This list is unsorted, so strictly speaking the virtual dataset should also be combined "by coords",
# but it is enough to reproduce the issue.
row_results = [tile_dict[(h, v)] for h, v in tile_dict.keys() if v == row]
vds = earthaccess.open_virtual_mfdataset(
    row_results,
    group="/HDFEOS/GRIDS/VIIRS_Grid_BRDF/Data_Fields",
    access="indirect",
    concat_dim="XDim",
    loadable_variables=["XDim", "YDim"]
)
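
The comment in the reproduction notes that the selected tiles are unsorted. If a deterministic order is wanted for a nested concatenation along XDim, the tiles in a row can be ordered by their horizontal index. The sketch below works on plain URL strings so that it runs without network access; `tiles_in_row` is a hypothetical helper and the example URLs are invented, reusing only the `.hNNvNN.` pattern from the regex above.

```python
import re

# Same tile pattern as in the reproduction script: ".hNNvNN."
TILE_RE = re.compile(r"\.h(\d{2})v(\d{2})\.")


def tiles_in_row(urls, row):
    """Return URLs whose vertical index equals `row`, in ascending horizontal index."""
    matched = []
    for url in urls:
        m = TILE_RE.search(url)
        if m and int(m.group(2)) == row:
            matched.append((int(m.group(1)), url))
    return [url for _, url in sorted(matched)]


# Invented URLs for illustration only.
example_urls = [
    "https://example.invalid/VNP43MA4.A2026027.h12v02.002.h5",
    "https://example.invalid/VNP43MA4.A2026027.h10v02.002.h5",
    "https://example.invalid/VNP43MA4.A2026027.h11v03.002.h5",
]
ordered = tiles_in_row(example_urls, row=2)
print(ordered)
```

Sorting does not affect the KeyError reported here; it only makes the concatenation order predictable once the dataset can be opened.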

Environment

- OS: macOS Tahoe 26.3
- Python: 3.13.9
- earthaccess: 0.16.0

Additional Context

The Projection variable does not have the same fields that the data variables in the group I am virtualizing do, and should be skipped.
