refactor(datasets): use the shared scverse-misc dataset registry + downloader#1213
refactor(datasets): use the shared scverse-misc dataset registry + downloader#1213timtreis wants to merge 8 commits into
Conversation
Replace squidpy's internal pooch-based registry/downloader with the shared scverse_misc.datasets system (scverse-misc[datasets]): - _registry.py: build a scverse_misc DatasetRegistry from datasets.yaml, folding squidpy-specific shape/library_id into the generic metadata mapping. Drops squidpy's duplicated FileEntry/DatasetEntry/DatasetRegistry/DatasetType. - _downloader.py: register squidpy's domain loaders (image -> ImageContainer, visium_10x -> read.visium, spatialdata -> read_zarr) via register_loader and override the built-in anndata loader for the shape warning. The pooch download/verify/extract machinery now lives in scverse-misc. - _datasets.py: public API unchanged; type dispatch uses plain strings. - pyproject: drop direct pooch dep (now via scverse-misc[datasets]). Net ~750 lines deleted. Public API (sq.datasets.*) is unchanged. Depends on scverse/scverse-misc#40. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
scverse-misc now ships a generic spatialdata loader, so squidpy no longer needs its own; it registers only its domain loaders (image, visium_10x) plus the anndata shape-warning override. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Drop the redundant 'visium' prefix in visium() so downloads land in <datasetdir>/visium_10x/<sample>/ like every other type (was doubly nested under visium/visium_10x). Update the hires-image path assertion accordingly. Verified: all @internet datasets tests pass (8 passed). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
registry.{anndata,image,spatialdata}_datasets were squidpy's old registry
properties, removed in the migration. Use dataset_names(type) instead. Visium
samples now cache to <datasetdir>/visium_10x/<sample>/ via the public API.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
scverse-misc dropped its DatasetRegistry/Fetcher/FetchContext classes for a typed data model + functions. Adapt: - _registry: parse_registry() -> (base_url, dict[str, DatasetEntry]); get_registry() returns the dict, get_base_url() the base. shape/library_id/doc_header now live in entry.metadata. - _downloader: loaders are (entry, target, download, **kwargs); DatasetDownloader wraps fetch(); visium uses pooch.Untar instead of a manual tarfile loop. - _datasets/tests updated to the dict + metadata shape. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
fd53f80 to
947e68e
Compare
for more information, see https://pre-commit.ci
Codecov Report❌ Patch coverage is Additional details and impacted files@@ Coverage Diff @@
## main #1213 +/- ##
==========================================
+ Coverage 76.81% 76.82% +0.01%
==========================================
Files 63 63
Lines 9267 9088 -179
Branches 1565 1536 -29
==========================================
- Hits 7118 6982 -136
+ Misses 1547 1521 -26
+ Partials 602 585 -17
🚀 New features to boost your workflow:
|
- replace DatasetDownloader class + get_downloader singleton + module wrapper with a single download() over scverse_misc.datasets.fetch - restore pooch as a direct dependency (used via pooch.Untar in the visium loader); drop now-unused direct pyyaml dependency - drop redundant visium() name validation, keeping its specific message
for more information, see https://pre-commit.ci
| raise ValueError(f"Unknown Visium sample: {sample_id}. Available samples: {dataset_names('visium_10x')}") | ||
|
|
||
| return downloader.download(sample_id, base_dir, include_hires_tiff=include_hires_tiff) | ||
| # downloads land in <datasetdir>/visium_10x/<sample_id>/ |
There was a problem hiding this comment.
this used to be /x/visium/x/ on old code but it's fine I guess to move the cache path. It's also seen in
- expected_image_path = (Path(settings.datasetdir) / "visium" / sample / "image.tif").resolve()
+ expected_image_path = (Path(settings.datasetdir) / "visium_10x" / sample / "image.tif").resolve()
but good to document
| def visium_datasets(self) -> list[str]: | ||
| """Return names of all Visium datasets (alias for visium_10x_datasets).""" | ||
| return self.visium_10x_datasets | ||
| def get_registry() -> dict[str, DatasetEntry]: |
There was a problem hiding this comment.
now we can have a prettier name for _parsed() and get rid of these two functions. No need to have two separate functions get_registry() and get_base_url()
| def visium_datasets(self) -> list[str]: | ||
| """Return names of all Visium datasets (alias for visium_10x_datasets).""" | ||
| return self.visium_10x_datasets | ||
| def get_registry() -> dict[str, DatasetEntry]: |
There was a problem hiding this comment.
dict[str, DatasetEntry]: should be MappingProxy because we don't want it to be mutable
| "spatialdata>=0.7.2", # 0.7.2 dropped xarray-schema (pkg_resources break, #1115) | ||
| # dataset registry + downloader now live in scverse-misc | ||
| "scverse-misc[datasets]>=0.1.1", | ||
| "spatialdata>=0.7.2", # 0.7.2 dropped xarray-schema (pkg_resources break, #1115) |
There was a problem hiding this comment.
why the diff here, can you undo it?
"spatialdata>=0.7.2", # 0.7.2 dropped xarray-schema (pkg_resources break, #1115)
| if sample_id not in get_registry(): | ||
| raise ValueError(f"Unknown Visium sample: {sample_id}. Available samples: {dataset_names('visium_10x')}") |
There was a problem hiding this comment.
valid-but-wrong-type name (e.g. imc, an AnnData dataset) passes the guard, then fails deep inside the anndata loader with a confusing TypeError: read_h5ad() got an unexpected keyword argument 'include_hires_tiff' instead of a clear error.
| if sample_id not in get_registry(): | |
| raise ValueError(f"Unknown Visium sample: {sample_id}. Available samples: {dataset_names('visium_10x')}") | |
| visium_samples = dataset_names("visium_10x") | |
| if sample_id not in visium_samples: | |
| raise ValueError(f"Unknown Visium sample: {sample_id}. Available samples: {visium_samples}") |
|
|
||
| def test_unknown_dataset(self): | ||
| assert "nonexistent" not in get_registry() | ||
|
|
There was a problem hiding this comment.
Maybe let's add a test like this given that it fails
class TestVisiumValidation:
"""Offline validation of the ``visium()`` sample-name guard (no network)."""
def test_unknown_sample_raises(self):
with pytest.raises(ValueError, match="Unknown Visium sample"):
visium("definitely_not_a_sample")
def test_valid_but_wrong_type_rejected(self):
# `imc` is a valid registry entry but an AnnData dataset, not a Visium 10x
# sample. It must be rejected before any download is attempted.
with pytest.raises(ValueError, match="Unknown Visium sample"):
visium("imc")
selmanozleyen
left a comment
There was a problem hiding this comment.
See my comments. But this is a nice win!
Replaces squidpy's internal pooch-based dataset registry/downloader with the shared infrastructure from scverse-misc (
scverse-misc[datasets]>=0.1.1).sq.datasets.*keep the same signatures and behavior.poochis now transitive.datasets.yamlstays in squidpy; theanndata/image/visium_10xloaders register viaregister_loader.