
Conversation

@selmanozleyen
Member

@selmanozleyen selmanozleyen commented Dec 6, 2025

Again a continuation of #1072: adds hashes and uses the uploaded datasets.

Changes made:

  • Unit tests no longer download the datasets, since that is costly on S3 and we know the links should be live, since we host them.
  • Some datasets used the scanpy cache defaults and others used DEFAULT_CACHE_DIR = '~/.cache/squidpy'; now all of them use DEFAULT_CACHE_DIR, since it is a global path.
  • Updated .scripts/ci/download_data.py to download all datasets.
  • One unified DatasetDownloader class replaces the multiple ways squidpy previously downloaded datasets.
  • Everything hard-coded (links, dataset names, hashes) now lives only in the datasets.yaml registry, and loaders are built from registry entries, e.g. visium_hne_image = _make_image_loader("visium_hne_image").
  • We no longer need to redownload everything when the script is updated: an update only re-triggers the download script, and files downloaded by the old script remain.

@codecov

codecov bot commented Dec 7, 2025

Codecov Report

❌ Patch coverage is 78.23344% with 69 lines in your changes missing coverage. Please review.
✅ Project coverage is 66.46%. Comparing base (975861d) to head (d38caf2).

Files with missing lines | Patch % | Lines
src/squidpy/datasets/_downloader.py | 66.14% | 32 Missing and 11 partials ⚠️
src/squidpy/datasets/_registry.py | 88.00% | 9 Missing and 6 partials ⚠️
src/squidpy/datasets/_datasets.py | 81.96% | 6 Missing and 5 partials ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1076      +/-   ##
==========================================
- Coverage   66.47%   66.46%   -0.01%     
==========================================
  Files          45       44       -1     
  Lines        7015     7124     +109     
  Branches     1184     1199      +15     
==========================================
+ Hits         4663     4735      +72     
- Misses       1890     1919      +29     
- Partials      462      470       +8     
Files with missing lines | Coverage Δ
src/squidpy/read/_read.py | 35.16% <100.00%> (-0.71%) ⬇️
src/squidpy/read/_utils.py | 76.19% <100.00%> (+0.58%) ⬆️
src/squidpy/datasets/_datasets.py | 81.96% <81.96%> (ø)
src/squidpy/datasets/_registry.py | 88.00% <88.00%> (ø)
src/squidpy/datasets/_downloader.py | 66.14% <66.14%> (ø)

@selmanozleyen selmanozleyen self-assigned this Dec 8, 2025
# Image datasets
"visium_fluo_image_crop",
"visium_hne_image_crop",
"visium_hne_image",

Member

Nit, but I don't fully understand the comments here. Everything visium_x is e.g. "10x Genomics". Either remove them or split them in a more semantically useful way.


Member Author

I tried to split it based on the old files. Do you have any suggestions?

# =============================================================================

# 10x Genomics Visium datasets (adata_with_image type)
VisiumDatasets = Literal[

Member

Where is this list coming from? Are we loading all of them?


Member Author

It is from the old API (main branch, under _10x.py). I didn't want to remove it because it was public.

"""
Download Visium `datasets <https://support.10xgenomics.com/spatial-gene-expression/datasets>`_ from *10x Genomics*.
Uses the unified downloader which supports S3 with fallback to 10x Genomics servers.

Member

I'm not sure we should have that fallback - the reason for S3 was control over the data's existence. It feels overengineered here; I don't expect AWS to fail much.


Contributor

I agree, we could then get rid of that agent-spoofing


Member Author

I basically wanted to preserve the original links, but I can get rid of them.

self,
entry: DatasetEntry,
path: Path | str | None = None,
) -> Any: # Returns SpatialData

Member

Why are you returning Any but saying SpatialData?
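
One common way to resolve this is a `TYPE_CHECKING`-guarded import, which keeps the precise `SpatialData` annotation without requiring spatialdata at runtime. This is a minimal sketch, not the PR's actual implementation; the class body and `entry` type are placeholders.

```python
from __future__ import annotations  # annotations become strings (PEP 563)

from pathlib import Path
from typing import TYPE_CHECKING, Any

if TYPE_CHECKING:
    # Imported only for static type checkers; never executed at runtime.
    from spatialdata import SpatialData


class DatasetDownloader:
    def load(
        self,
        entry: Any,  # DatasetEntry in the real code (placeholder here)
        path: Path | str | None = None,
    ) -> SpatialData:
        """Download (if needed) and return the dataset as SpatialData."""
        raise NotImplementedError  # sketch only
```

With the `__future__` import, the return annotation is stored as the string `"SpatialData"`, so importing the module works even when spatialdata is not installed, while mypy/pyright still see the real type.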

from typing import TYPE_CHECKING, Any

import pooch
from scanpy import logging as logg

Contributor

Not part of the scanpy public API, please remove


Member Author

OK, but why is it not marked with an underscore then? Our notebooks also use it.


Member

We're also moving Squidpy to use the spatialdata logger in the other functions, feel free to use that one instead
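
A stdlib-based sketch of the replacement: the exact spatialdata logger import is not shown in this thread, so this stand-in uses `logging.getLogger` with a package-namespaced logger, which is the conventional pattern either way.

```python
import logging

# Stand-in for the suggested spatialdata logger: a module-level logger
# namespaced under the package, using only the standard library.
logger = logging.getLogger("squidpy.datasets")

# Configure once, guarding against duplicate handlers on re-import.
if not logger.handlers:
    handler = logging.StreamHandler()
    handler.setFormatter(
        logging.Formatter("%(name)s - %(levelname)s - %(message)s")
    )
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)

# Hypothetical usage inside the downloader.
logger.info("Cache hit for %s; skipping download", "visium_hne_image")
```

Using a namespaced logger (instead of `scanpy.logging`, which is private) lets downstream users control squidpy's verbosity via the standard `logging` hierarchy.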
