
Conversation

@selmanozleyen
Member

@selmanozleyen selmanozleyen commented Dec 6, 2025

Again a continuation of #1072: adds hashes and uses the uploaded datasets.

Changes made:

  • Unit tests no longer download the datasets, since that is costly on S3 and we know the links should be live, since we host them.
  • Some datasets used the scanpy cache defaults and others used DEFAULT_CACHE_DIR = '~/.cache/squidpy'; now all of them use DEFAULT_CACHE_DIR, since it is a global path.
  • Updated .scripts/ci/download_data.py to download all datasets.
  • One unified DatasetDownloader class replaces the multiple ways squidpy previously downloaded datasets.
  • Everything hard-coded (links, dataset names, hashes) now lives only in the datasets.yaml registry, and loaders are built from registry entries, e.g. visium_hne_image = _make_image_loader("visium_hne_image").
  • We no longer need to redownload everything when the script is updated: an update only re-triggers the download script, and files downloaded by the old script remain.

@codecov

codecov bot commented Dec 7, 2025

Codecov Report

❌ Patch coverage is 78.23344% with 69 lines in your changes missing coverage. Please review.
✅ Project coverage is 66.46%. Comparing base (975861d) to head (d38caf2).

Files with missing lines | Patch % | Lines
src/squidpy/datasets/_downloader.py | 66.14% | 32 Missing and 11 partials ⚠️
src/squidpy/datasets/_registry.py | 88.00% | 9 Missing and 6 partials ⚠️
src/squidpy/datasets/_datasets.py | 81.96% | 6 Missing and 5 partials ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1076      +/-   ##
==========================================
- Coverage   66.47%   66.46%   -0.01%     
==========================================
  Files          45       44       -1     
  Lines        7015     7124     +109     
  Branches     1184     1199      +15     
==========================================
+ Hits         4663     4735      +72     
- Misses       1890     1919      +29     
- Partials      462      470       +8     
Files with missing lines | Coverage Δ
src/squidpy/read/_read.py | 35.16% <100.00%> (-0.71%) ⬇️
src/squidpy/read/_utils.py | 76.19% <100.00%> (+0.58%) ⬆️
src/squidpy/datasets/_datasets.py | 81.96% <81.96%> (ø)
src/squidpy/datasets/_registry.py | 88.00% <88.00%> (ø)
src/squidpy/datasets/_downloader.py | 66.14% <66.14%> (ø)

@selmanozleyen selmanozleyen self-assigned this Dec 8, 2025
# Image datasets
"visium_fluo_image_crop",
"visium_hne_image_crop",
"visium_hne_image",

Member

Nit, but I don't fully understand the comments here. Everything visium_x is e.g. "10x Genomics". Either remove them or split them in a more semantically useful way.


Member Author

I tried to split it based on the old files. Do you have any suggestions?

# =============================================================================

# 10x Genomics Visium datasets (adata_with_image type)
VisiumDatasets = Literal[

Member

Where is this list coming from? Are we loading all of them?


Member Author

It is from the old API (main branch, under _10x.py). I didn't want to remove it because it was public.

"""
Download Visium `datasets <https://support.10xgenomics.com/spatial-gene-expression/datasets>`_ from *10x Genomics*.
Uses the unified downloader which supports S3 with fallback to 10x Genomics servers.

Member

I'm not sure we should have that fallback - the reason for S3 was control over the data's existence. It feels overengineered here; I don't expect AWS to fail much.


Contributor

I agree, we could then get rid of that agent-spoofing


Member Author

I basically wanted to preserve the original links, but I can get rid of them.

self,
entry: DatasetEntry,
path: Path | str | None = None,
) -> Any: # Returns SpatialData

Member

Why are you returning Any but saying SpatialData?
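
One common way to resolve this is a `TYPE_CHECKING`-guarded import, which keeps the precise `SpatialData` annotation without requiring spatialdata at runtime. This is a minimal sketch, not the PR's actual implementation; the class body and `entry` type are placeholders.

```python
from __future__ import annotations  # annotations become strings (PEP 563)

from pathlib import Path
from typing import TYPE_CHECKING, Any

if TYPE_CHECKING:
    # Imported only for static type checkers; never executed at runtime.
    from spatialdata import SpatialData


class DatasetDownloader:
    def load(
        self,
        entry: Any,  # DatasetEntry in the real code (placeholder here)
        path: Path | str | None = None,
    ) -> SpatialData:
        """Download (if needed) and return the dataset as SpatialData."""
        raise NotImplementedError  # sketch only
```

With the `__future__` import, the return annotation is stored as the string `"SpatialData"`, so importing the module works even when spatialdata is not installed, while mypy/pyright still see the real type.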

from typing import TYPE_CHECKING, Any

import pooch
from scanpy import logging as logg

Contributor

Not part of the scanpy public API, please remove


Member Author

OK, but why is it not marked with an underscore then? Our notebooks also use it.


Member

We're also moving Squidpy to use the spatialdata logger in the other functions, feel free to use that one instead
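
A stdlib-based sketch of the replacement: the exact spatialdata logger import is not shown in this thread, so this stand-in uses `logging.getLogger` with a package-namespaced logger, which is the conventional pattern either way.

```python
import logging

# Stand-in for the suggested spatialdata logger: a module-level logger
# namespaced under the package, using only the standard library.
logger = logging.getLogger("squidpy.datasets")

# Configure once, guarding against duplicate handlers on re-import.
if not logger.handlers:
    handler = logging.StreamHandler()
    handler.setFormatter(
        logging.Formatter("%(name)s - %(levelname)s - %(message)s")
    )
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)

# Hypothetical usage inside the downloader.
logger.info("Cache hit for %s; skipping download", "visium_hne_image")
```

Using a namespaced logger (instead of `scanpy.logging`, which is private) lets downstream users control squidpy's verbosity via the standard `logging` hierarchy.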
