Skip to content

refactor(datasets): use the shared scverse-misc dataset registry + downloader#1213

Open
timtreis wants to merge 8 commits into
scverse:mainfrom
timtreis:feat/datasets-via-scverse-misc
Open

refactor(datasets): use the shared scverse-misc dataset registry + downloader#1213
timtreis wants to merge 8 commits into
scverse:mainfrom
timtreis:feat/datasets-via-scverse-misc

Conversation

@timtreis

@timtreis timtreis commented Jun 15, 2026

Copy link
Copy Markdown
Member

Replaces squidpy's internal pooch-based dataset registry/downloader with the shared infrastructure from scverse-misc (scverse-misc[datasets]>=0.1.1).

  • Public API unchanged: sq.datasets.* keep the same signatures and behavior.
  • ~750 lines of duplicated registry/download code deleted; pooch is now transitive.
  • datasets.yaml stays in squidpy; the anndata / image / visium_10x loaders register via register_loader.

timtreis and others added 5 commits June 24, 2026 01:01
Replace squidpy's internal pooch-based registry/downloader with the shared
scverse_misc.datasets system (scverse-misc[datasets]):

- _registry.py: build a scverse_misc DatasetRegistry from datasets.yaml,
  folding squidpy-specific shape/library_id into the generic metadata mapping.
  Drops squidpy's duplicated FileEntry/DatasetEntry/DatasetRegistry/DatasetType.
- _downloader.py: register squidpy's domain loaders (image -> ImageContainer,
  visium_10x -> read.visium, spatialdata -> read_zarr) via register_loader and
  override the built-in anndata loader for the shape warning. The pooch
  download/verify/extract machinery now lives in scverse-misc.
- _datasets.py: public API unchanged; type dispatch uses plain strings.
- pyproject: drop direct pooch dep (now via scverse-misc[datasets]).

Net ~750 lines deleted. Public API (sq.datasets.*) is unchanged.

Depends on scverse/scverse-misc#40.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
scverse-misc now ships a generic spatialdata loader, so squidpy no longer needs
its own; it registers only its domain loaders (image, visium_10x) plus the
anndata shape-warning override.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Drop the redundant 'visium' prefix in visium() so downloads land in
<datasetdir>/visium_10x/<sample>/ like every other type (was doubly nested
under visium/visium_10x). Update the hires-image path assertion accordingly.

Verified: all @internet datasets tests pass (8 passed).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
registry.{anndata,image,spatialdata}_datasets were squidpy's old registry
properties, removed in the migration. Use dataset_names(type) instead. Visium
samples now cache to <datasetdir>/visium_10x/<sample>/ via the public API.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
scverse-misc dropped its DatasetRegistry/Fetcher/FetchContext classes for a typed
data model + functions. Adapt:
- _registry: parse_registry() -> (base_url, dict[str, DatasetEntry]); get_registry()
  returns the dict, get_base_url() the base. shape/library_id/doc_header now live in
  entry.metadata.
- _downloader: loaders are (entry, target, download, **kwargs); DatasetDownloader wraps
  fetch(); visium uses pooch.Untar instead of a manual tarfile loop.
- _datasets/tests updated to the dict + metadata shape.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@timtreis timtreis force-pushed the feat/datasets-via-scverse-misc branch from fd53f80 to 947e68e Compare June 23, 2026 23:02
@codecov

codecov Bot commented Jun 23, 2026

Copy link
Copy Markdown

Codecov Report

❌ Patch coverage is 82.81250% with 11 lines in your changes missing coverage. Please review.
✅ Project coverage is 76.82%. Comparing base (ae5ceb4) to head (c517bbd).

Files with missing lines Patch % Lines
src/squidpy/datasets/_downloader.py 80.95% 6 Missing and 2 partials ⚠️
src/squidpy/datasets/_datasets.py 77.77% 1 Missing and 1 partial ⚠️
src/squidpy/datasets/_registry.py 92.30% 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1213      +/-   ##
==========================================
+ Coverage   76.81%   76.82%   +0.01%     
==========================================
  Files          63       63              
  Lines        9267     9088     -179     
  Branches     1565     1536      -29     
==========================================
- Hits         7118     6982     -136     
+ Misses       1547     1521      -26     
+ Partials      602      585      -17     
Files with missing lines Coverage Δ
src/squidpy/datasets/_registry.py 88.88% <92.30%> (+2.65%) ⬆️
src/squidpy/datasets/_datasets.py 84.21% <77.77%> (+2.67%) ⬆️
src/squidpy/datasets/_downloader.py 80.39% <80.95%> (+8.63%) ⬆️
🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

timtreis and others added 2 commits June 24, 2026 01:45
- replace DatasetDownloader class + get_downloader singleton + module
  wrapper with a single download() over scverse_misc.datasets.fetch
- restore pooch as a direct dependency (used via pooch.Untar in the
  visium loader); drop now-unused direct pyyaml dependency
- drop redundant visium() name validation, keeping its specific message
@timtreis timtreis marked this pull request as ready for review June 23, 2026 23:45
@timtreis timtreis requested a review from selmanozleyen June 24, 2026 08:52
raise ValueError(f"Unknown Visium sample: {sample_id}. Available samples: {dataset_names('visium_10x')}")

return downloader.download(sample_id, base_dir, include_hires_tiff=include_hires_tiff)
# downloads land in <datasetdir>/visium_10x/<sample_id>/

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this used to be /x/visium/x/ on old code but it's fine I guess to move the cache path. It's also seen in

-    expected_image_path = (Path(settings.datasetdir) / "visium" / sample / "image.tif").resolve()
+    expected_image_path = (Path(settings.datasetdir) / "visium_10x" / sample / "image.tif").resolve()

but good to document

def visium_datasets(self) -> list[str]:
"""Return names of all Visium datasets (alias for visium_10x_datasets)."""
return self.visium_10x_datasets
def get_registry() -> dict[str, DatasetEntry]:

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

now we can have a prettier name for _parsed() and get rid of these two functions. No need to have two separate functions get_registry() and get_base_url()

def visium_datasets(self) -> list[str]:
"""Return names of all Visium datasets (alias for visium_10x_datasets)."""
return self.visium_10x_datasets
def get_registry() -> dict[str, DatasetEntry]:

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

dict[str, DatasetEntry]: should be MappingProxy because we don't want it to be mutable

Comment thread pyproject.toml
"spatialdata>=0.7.2", # 0.7.2 dropped xarray-schema (pkg_resources break, #1115)
# dataset registry + downloader now live in scverse-misc
"scverse-misc[datasets]>=0.1.1",
"spatialdata>=0.7.2", # 0.7.2 dropped xarray-schema (pkg_resources break, #1115)

@selmanozleyen selmanozleyen Jun 26, 2026

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why the diff here, can you undo it?

"spatialdata>=0.7.2",            # 0.7.2 dropped xarray-schema (pkg_resources break, #1115)

Comment on lines +123 to +124
if sample_id not in get_registry():
raise ValueError(f"Unknown Visium sample: {sample_id}. Available samples: {dataset_names('visium_10x')}")

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

valid-but-wrong-type name (e.g. imc, an AnnData dataset) passes the guard, then fails deep inside the anndata loader with a confusing TypeError: read_h5ad() got an unexpected keyword argument 'include_hires_tiff' instead of a clear error.

Suggested change
if sample_id not in get_registry():
raise ValueError(f"Unknown Visium sample: {sample_id}. Available samples: {dataset_names('visium_10x')}")
visium_samples = dataset_names("visium_10x")
if sample_id not in visium_samples:
raise ValueError(f"Unknown Visium sample: {sample_id}. Available samples: {visium_samples}")


def test_unknown_dataset(self):
assert "nonexistent" not in get_registry()

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe let's add a test like this given that it fails

class TestVisiumValidation:
    """Offline validation of the ``visium()`` sample-name guard (no network)."""

    def test_unknown_sample_raises(self):
        with pytest.raises(ValueError, match="Unknown Visium sample"):
            visium("definitely_not_a_sample")

    def test_valid_but_wrong_type_rejected(self):
        # `imc` is a valid registry entry but an AnnData dataset, not a Visium 10x
        # sample. It must be rejected before any download is attempted.
        with pytest.raises(ValueError, match="Unknown Visium sample"):
            visium("imc")

@selmanozleyen selmanozleyen left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See my comments. But this is a nice win!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants