-
Notifications
You must be signed in to change notification settings - Fork 95
Feat: Add support for loading datasets from URLs #782
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Feat: Add support for loading datasets from URLs #782
Conversation
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #782 +/- ##
===========================================
- Coverage 100.00% 99.81% -0.19%
===========================================
Files 34 34
Lines 2111 2147 +36
===========================================
+ Hits 2111 2143 +32
- Misses 0 4 +4 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull request overview
This pull request adds two distinct features to the movement library: (1) URL loading support for pose and bounding box datasets, and (2) ROI (Region of Interest) save/load functionality. The URL loading feature allows users to load datasets directly from public URLs without manual downloading, addressing Issue #340. The ROI feature enables serialization and deserialization of ROI objects to/from JSON files.
Changes:
- Added URL detection and downloading capability to
load_poses.from_file()andload_bboxes.from_file()using pooch for caching - Implemented
to_file()andfrom_file()methods for ROI objects (BaseRegionOfInterest, PolygonOfInterest, LineOfInterest) - Added comprehensive tests for ROI save/load functionality
- Updated documentation with examples and added shapely to intersphinx mapping
Reviewed changes
Copilot reviewed 9 out of 9 changed files in this pull request and generated 17 comments.
Show a summary per file
| File | Description |
|---|---|
| movement/io/load_poses.py | Added URL detection and automatic downloading via pooch for pose datasets |
| movement/io/load_bboxes.py | Added URL detection and automatic downloading via pooch for bounding box datasets |
| movement/roi/base.py | Implemented to_file() and from_file() methods for serializing/deserializing ROIs; updated docstring type annotations |
| movement/roi/polygon.py | Added _from_geometry() classmethod to construct PolygonOfInterest from shapely geometry |
| movement/roi/line.py | Added _from_geometry() classmethod to construct LineOfInterest from shapely geometry |
| tests/test_unit/test_roi/test_save_load.py | Added comprehensive tests for ROI save/load functionality including edge cases |
| docs/source/conf.py | Added shapely to intersphinx mapping for cross-documentation linking |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| # Download file if it is a URL | ||
| if str(file_path).startswith(("http://", "https://")): | ||
| with hide_pooch_hash_logs(): | ||
| file_path = pooch.retrieve( | ||
| url=file_path, | ||
| known_hash=None, | ||
| path=Path( | ||
| "~", ".movement", "data", "public_datasets" | ||
| ).expanduser(), | ||
| progressbar=True, |
Copilot
AI
Jan 27, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This code block for URL downloading is duplicated in both load_poses.py and load_bboxes.py. Consider extracting this logic into a helper function to reduce duplication and improve maintainability. The helper function could be placed in a utility module or in a common location accessible to both modules.
| data = json.loads(file_path.read_text()) | ||
| geometry = shapely.from_wkt(data["geometry_wkt"]) |
Copilot
AI
Jan 27, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The from_file method lacks error handling for malformed JSON or invalid WKT geometry. If the JSON is malformed or the WKT string is invalid, the error messages from json.loads or shapely.from_wkt may not be clear to users. Consider adding try-except blocks with more informative error messages, such as "Failed to parse ROI file: invalid JSON format" or "Failed to parse geometry from file: invalid WKT string".
| with hide_pooch_hash_logs(): | ||
| file_path = pooch.retrieve( | ||
| url=file_path, | ||
| known_hash=None, | ||
| path=Path( | ||
| "~", ".movement", "data", "public_datasets" |
Copilot
AI
Jan 27, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The caching directory ~/.movement/data/public_datasets is created by pooch automatically, but the PR description mentions this is "separate from the internal test data cache". However, the sample_data.py module uses ~/.movement/data as its DATA_DIR. Consider documenting why this separation is needed or if there's a benefit to using the same parent directory but different subdirectories. Also, consider checking if the directory should be created explicitly with proper permissions.
| with hide_pooch_hash_logs(): | |
| file_path = pooch.retrieve( | |
| url=file_path, | |
| known_hash=None, | |
| path=Path( | |
| "~", ".movement", "data", "public_datasets" | |
| # Cache public example datasets under the main movement data directory, | |
| # in a dedicated subdirectory separate from any internal test data cache. | |
| public_cache_dir = Path("~", ".movement", "data", "public_datasets").expanduser() | |
| # Explicitly create the cache directory with user-only permissions where supported. | |
| public_cache_dir.mkdir(parents=True, exist_ok=True, mode=0o700) | |
| file_path = pooch.retrieve( | |
| url=file_path, | |
| known_hash=None, | |
| path=public_cache_dir, |
| @classmethod | ||
| def _from_geometry( | ||
| cls, | ||
| geometry: shapely.Polygon, | ||
| name: str | None = None, | ||
| ) -> PolygonOfInterest: | ||
| """Construct a PolygonOfInterest from a shapely geometry. | ||
|
|
||
| Parameters | ||
| ---------- | ||
| geometry : shapely.Polygon | ||
| The shapely geometry to construct from. | ||
| name : str, optional | ||
| Name for the PolygonOfInterest. | ||
|
|
||
| Returns | ||
| ------- | ||
| PolygonOfInterest | ||
| A new PolygonOfInterest instance. | ||
|
|
||
| """ | ||
| exterior = geometry.exterior.coords | ||
| holes = ( | ||
| [interior.coords for interior in geometry.interiors] | ||
| if geometry.interiors | ||
| else None | ||
| ) | ||
| return cls(exterior_boundary=exterior, holes=holes, name=name) |
Copilot
AI
Jan 27, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The _from_geometry method is marked as a classmethod with a leading underscore (indicating it's internal), but it's being called from the base class's from_file method. While this works, consider whether this method should be part of a documented API or if it should remain internal. If it's intended to be internal, the current implementation is acceptable. If users might need to construct ROIs from shapely geometries directly, consider making it public by removing the underscore prefix and adding it to the public API documentation.
| # Download file if it is a URL | ||
| if str(file_path).startswith(("http://", "https://")): | ||
| with hide_pooch_hash_logs(): | ||
| file_path = pooch.retrieve( | ||
| url=file_path, | ||
| known_hash=None, | ||
| path=Path( | ||
| "~", ".movement", "data", "public_datasets" | ||
| ).expanduser(), | ||
| progressbar=True, |
Copilot
AI
Jan 27, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This code block is duplicated from load_poses.py (lines 161-170). Consider extracting this URL downloading logic into a shared helper function to avoid code duplication.
|
|
||
| """ | ||
| # Download file if it is a URL | ||
| if str(file_path).startswith(("http://", "https://")): |
Copilot
AI
Jan 27, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The URL validation only checks if the path starts with "http://" or "https://". Consider adding additional validation to ensure the URL is well-formed and potentially restricting to trusted domains if appropriate. While pooch.retrieve is secure by default, adding URL validation could prevent potential issues with malformed URLs.
|
|
||
| """ | ||
| # Download file if it is a URL | ||
| if str(file_path).startswith(("http://", "https://")): |
Copilot
AI
Jan 27, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The URL validation only checks if the path starts with "http://" or "https://". Consider adding additional validation to ensure the URL is well-formed and potentially restricting to trusted domains if appropriate. While pooch.retrieve is secure by default, adding URL validation could prevent potential issues with malformed URLs.
| with hide_pooch_hash_logs(): | ||
| file_path = pooch.retrieve( | ||
| url=file_path, | ||
| known_hash=None, | ||
| path=Path( | ||
| "~", ".movement", "data", "public_datasets" | ||
| ).expanduser(), | ||
| progressbar=True, |
Copilot
AI
Jan 27, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is no error handling for potential download failures. If pooch.retrieve fails (e.g., due to network issues, invalid URL, or server errors), the error will propagate without context. Consider wrapping the pooch.retrieve call in a try-except block to provide more informative error messages to users, such as "Failed to download file from URL: {url}. Please check your internet connection and verify the URL is accessible."
| with hide_pooch_hash_logs(): | |
| file_path = pooch.retrieve( | |
| url=file_path, | |
| known_hash=None, | |
| path=Path( | |
| "~", ".movement", "data", "public_datasets" | |
| ).expanduser(), | |
| progressbar=True, | |
| try: | |
| file_path = pooch.retrieve( | |
| url=file_path, | |
| known_hash=None, | |
| path=Path( | |
| "~", ".movement", "data", "public_datasets" | |
| ).expanduser(), | |
| progressbar=True, | |
| ) | |
| except Exception as exc: | |
| error_msg = ( | |
| f"Failed to download file from URL: {file_path}. " | |
| "Please check your internet connection and verify the URL is accessible." | |
| ) | |
| logger.error(error_msg) | |
| raise RuntimeError(error_msg) from exc |
| with hide_pooch_hash_logs(): | ||
| file_path = pooch.retrieve( | ||
| url=file_path, | ||
| known_hash=None, | ||
| path=Path( | ||
| "~", ".movement", "data", "public_datasets" | ||
| ).expanduser(), |
Copilot
AI
Jan 27, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The pooch.retrieve function returns a string path, but the file_path variable is typed as Path | str in the function signature. After the URL download block, file_path will be a string (returned by pooch), which may cause type inconsistencies downstream if the code expects a Path object. Consider wrapping the result in Path() to maintain type consistency: file_path = Path(pooch.retrieve(...)).
| with hide_pooch_hash_logs(): | |
| file_path = pooch.retrieve( | |
| url=file_path, | |
| known_hash=None, | |
| path=Path( | |
| "~", ".movement", "data", "public_datasets" | |
| ).expanduser(), | |
| file_path = Path( | |
| pooch.retrieve( | |
| url=file_path, | |
| known_hash=None, | |
| path=Path( | |
| "~", ".movement", "data", "public_datasets" | |
| ).expanduser(), | |
| progressbar=True, | |
| ) |
| with hide_pooch_hash_logs(): | ||
| file_path = pooch.retrieve( | ||
| url=file_path, | ||
| known_hash=None, | ||
| path=Path( | ||
| "~", ".movement", "data", "public_datasets" | ||
| ).expanduser(), | ||
| progressbar=True, |
Copilot
AI
Jan 27, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Consider using the hide_pooch_hash_logs() context manager (from movement.sample_data) when downloading files to suppress SHA256 hash messages, which may not be useful for end users. The existing sample_data.py module uses this pattern to provide a cleaner user experience. Example: with hide_pooch_hash_logs(): file_path = pooch.retrieve(...)


Description
What is this PR
Why is this PR needed?
This PR addresses Issue #340, which requested the ability to load datasets directly from public URLs. Previously, users had to manually download files before loading them into
movement. This feature streamlines workflows by allowing direct ingestion of remote data sources.What does this PR do?
It updates the from_file functions in users' primary entry points (
load_posesandload_bboxes) to automatically detect if a file path is a URL.http://orhttps://), the file is downloaded using the existing pooch dependency.~/.movement/data/public_datasets/(separate from the internal test data cache) to ensure efficient re-use.References
Closes #340
How has this PR been tested?
I created a local verification script that successfully loaded sample datasets from the GIN repository for both supported types:
single-mouse_EPM.predictions.h5(DeepLabCut) via URL.VIA_multiple-crabs_5-frames_labels.csv(VIA) via URL.In both cases, pooch successfully downloaded the files to the new cache directory, and
movementreturned validxarray.Datasetobjects.Is this a breaking change?
No. This is a purely additive change. Existing local file loading works exactly as before.
Does this PR require an update to the documentation?
Yes, I updated the docstrings for
load_poses.from_fileandload_bboxes.from_fileto include Examples of how to use the new URL loading functionality.Checklist: