Feat: Add support for loading datasets from URLs #782

Tushar7012 · 2026-01-27T11:24:32Z

Description

What is this PR

Bug fix
Addition of a new feature
Other

Why is this PR needed?
This PR addresses Issue #340, which requested the ability to load datasets directly from public URLs. Previously, users had to manually download files before loading them into movement. This feature streamlines workflows by allowing direct ingestion of remote data sources.

What does this PR do?
It updates the from_file functions in users' primary entry points (load_poses and load_bboxes) to automatically detect if a file path is a URL.

Automatic Download: If a URL is detected (starting with http:// or https://), the file is downloaded using the existing pooch dependency.
Caching: Files are cached in ~/.movement/data/public_datasets/ (separate from the internal test data cache) to ensure efficient re-use.
Seamless Integration: The downloaded local path is then passed to the existing loading logic, identifying the file type and preserving all downstream functionality.

References

Closes #340

How has this PR been tested?

I created a local verification script that successfully loaded sample datasets from the GIN repository for both supported types:

Poses: Loaded single-mouse_EPM.predictions.h5 (DeepLabCut) via URL.
Bounding Boxes: Loaded VIA_multiple-crabs_5-frames_labels.csv (VIA) via URL.

In both cases, pooch successfully downloaded the files to the new cache directory, and movement returned valid xarray.Dataset objects.

Is this a breaking change?

No. This is a purely additive change. Existing local file loading works exactly as before.

Does this PR require an update to the documentation?

Yes, I updated the docstrings for load_poses.from_file and load_bboxes.from_file to include Examples of how to use the new URL loading functionality.

Checklist:

The code has been tested locally
Tests have been added to cover all new functionality
The documentation has been updated to reflect any changes
The code has been formatted with pre-commit

…it#340)

codecov · 2026-01-27T11:29:10Z

Codecov Report

❌ Patch coverage is 92.45283% with 4 lines in your changes missing coverage. Please review.
✅ Project coverage is 99.81%. Comparing base (665412f) to head (af6d4cf).
⚠️ Report is 3 commits behind head on main.

Files with missing lines	Patch %	Lines
movement/io/load_bboxes.py	60.00%	2 Missing ⚠️
movement/io/load_poses.py	60.00%	2 Missing ⚠️

Additional details and impacted files

@@             Coverage Diff             @@
##              main     #782      +/-   ##
===========================================
- Coverage   100.00%   99.81%   -0.19%     
===========================================
  Files           34       34              
  Lines         2111     2147      +36     
===========================================
+ Hits          2111     2143      +32     
- Misses           0        4       +4

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

…euroinformatics-unit#782)

sonarqubecloud · 2026-01-27T11:31:01Z

Quality Gate failed

Failed conditions
2 Security Hotspots

See analysis details on SonarQube Cloud

Copilot

Pull request overview

This pull request adds two distinct features to the movement library: (1) URL loading support for pose and bounding box datasets, and (2) ROI (Region of Interest) save/load functionality. The URL loading feature allows users to load datasets directly from public URLs without manual downloading, addressing Issue #340. The ROI feature enables serialization and deserialization of ROI objects to/from JSON files.

Changes:

Added URL detection and downloading capability to load_poses.from_file() and load_bboxes.from_file() using pooch for caching
Implemented to_file() and from_file() methods for ROI objects (BaseRegionOfInterest, PolygonOfInterest, LineOfInterest)
Added comprehensive tests for ROI save/load functionality
Updated documentation with examples and added shapely to intersphinx mapping

Reviewed changes

Copilot reviewed 9 out of 9 changed files in this pull request and generated 17 comments.

Show a summary per file

File	Description
movement/io/load_poses.py	Added URL detection and automatic downloading via pooch for pose datasets
movement/io/load_bboxes.py	Added URL detection and automatic downloading via pooch for bounding box datasets
movement/roi/base.py	Implemented `to_file()` and `from_file()` methods for serializing/deserializing ROIs; updated docstring type annotations
movement/roi/polygon.py	Added `_from_geometry()` classmethod to construct PolygonOfInterest from shapely geometry
movement/roi/line.py	Added `_from_geometry()` classmethod to construct LineOfInterest from shapely geometry
tests/test_unit/test_roi/test_save_load.py	Added comprehensive tests for ROI save/load functionality including edge cases
docs/source/conf.py	Added shapely to intersphinx mapping for cross-documentation linking

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Copilot · 2026-01-27T11:31:44Z

movement/io/load_poses.py

+    # Download file if it is a URL
+    if str(file_path).startswith(("http://", "https://")):
+        with hide_pooch_hash_logs():
+            file_path = pooch.retrieve(
+                url=file_path,
+                known_hash=None,
+                path=Path(
+                    "~", ".movement", "data", "public_datasets"
+                ).expanduser(),
+                progressbar=True,


This code block for URL downloading is duplicated in both load_poses.py and load_bboxes.py. Consider extracting this logic into a helper function to reduce duplication and improve maintainability. The helper function could be placed in a utility module or in a common location accessible to both modules.

Copilot · 2026-01-27T11:31:45Z

movement/roi/base.py

+        data = json.loads(file_path.read_text())
+        geometry = shapely.from_wkt(data["geometry_wkt"])


The from_file method lacks error handling for malformed JSON or invalid WKT geometry. If the JSON is malformed or the WKT string is invalid, the error messages from json.loads or shapely.from_wkt may not be clear to users. Consider adding try-except blocks with more informative error messages, such as "Failed to parse ROI file: invalid JSON format" or "Failed to parse geometry from file: invalid WKT string".

Copilot · 2026-01-27T11:31:45Z

movement/io/load_poses.py

+        with hide_pooch_hash_logs():
+            file_path = pooch.retrieve(
+                url=file_path,
+                known_hash=None,
+                path=Path(
+                    "~", ".movement", "data", "public_datasets"


The caching directory ~/.movement/data/public_datasets is created by pooch automatically, but the PR description mentions this is "separate from the internal test data cache". However, the sample_data.py module uses ~/.movement/data as its DATA_DIR. Consider documenting why this separation is needed or if there's a benefit to using the same parent directory but different subdirectories. Also, consider checking if the directory should be created explicitly with proper permissions.

Suggested change

with hide_pooch_hash_logs():

file_path = pooch.retrieve(

url=file_path,

known_hash=None,

path=Path(

"~", ".movement", "data", "public_datasets"

# Cache public example datasets under the main movement data directory,

# in a dedicated subdirectory separate from any internal test data cache.

public_cache_dir = Path("~", ".movement", "data", "public_datasets").expanduser()

# Explicitly create the cache directory with user-only permissions where supported.

public_cache_dir.mkdir(parents=True, exist_ok=True, mode=0o700)

file_path = pooch.retrieve(

url=file_path,

known_hash=None,

path=public_cache_dir,

Copilot · 2026-01-27T11:31:46Z

movement/roi/polygon.py

+    @classmethod
+    def _from_geometry(
+        cls,
+        geometry: shapely.Polygon,
+        name: str | None = None,
+    ) -> PolygonOfInterest:
+        """Construct a PolygonOfInterest from a shapely geometry.
+
+        Parameters
+        ----------
+        geometry : shapely.Polygon
+            The shapely geometry to construct from.
+        name : str, optional
+            Name for the PolygonOfInterest.
+
+        Returns
+        -------
+        PolygonOfInterest
+            A new PolygonOfInterest instance.
+
+        """
+        exterior = geometry.exterior.coords
+        holes = (
+            [interior.coords for interior in geometry.interiors]
+            if geometry.interiors
+            else None
+        )
+        return cls(exterior_boundary=exterior, holes=holes, name=name)


The _from_geometry method is marked as a classmethod with a leading underscore (indicating it's internal), but it's being called from the base class's from_file method. While this works, consider whether this method should be part of a documented API or if it should remain internal. If it's intended to be internal, the current implementation is acceptable. If users might need to construct ROIs from shapely geometries directly, consider making it public by removing the underscore prefix and adding it to the public API documentation.

Copilot · 2026-01-27T11:31:46Z

movement/io/load_bboxes.py

+    # Download file if it is a URL
+    if str(file_path).startswith(("http://", "https://")):
+        with hide_pooch_hash_logs():
+            file_path = pooch.retrieve(
+                url=file_path,
+                known_hash=None,
+                path=Path(
+                    "~", ".movement", "data", "public_datasets"
+                ).expanduser(),
+                progressbar=True,


This code block is duplicated from load_poses.py (lines 161-170). Consider extracting this URL downloading logic into a shared helper function to avoid code duplication.

Copilot · 2026-01-27T11:31:48Z

movement/io/load_poses.py


    """
+    # Download file if it is a URL
+    if str(file_path).startswith(("http://", "https://")):


The URL validation only checks if the path starts with "http://" or "https://". Consider adding additional validation to ensure the URL is well-formed and potentially restricting to trusted domains if appropriate. While pooch.retrieve is secure by default, adding URL validation could prevent potential issues with malformed URLs.

Copilot · 2026-01-27T11:31:48Z

movement/io/load_bboxes.py


    """
+    # Download file if it is a URL
+    if str(file_path).startswith(("http://", "https://")):


The URL validation only checks if the path starts with "http://" or "https://". Consider adding additional validation to ensure the URL is well-formed and potentially restricting to trusted domains if appropriate. While pooch.retrieve is secure by default, adding URL validation could prevent potential issues with malformed URLs.

Copilot · 2026-01-27T11:31:48Z

movement/io/load_bboxes.py

+        with hide_pooch_hash_logs():
+            file_path = pooch.retrieve(
+                url=file_path,
+                known_hash=None,
+                path=Path(
+                    "~", ".movement", "data", "public_datasets"
+                ).expanduser(),
+                progressbar=True,


There is no error handling for potential download failures. If pooch.retrieve fails (e.g., due to network issues, invalid URL, or server errors), the error will propagate without context. Consider wrapping the pooch.retrieve call in a try-except block to provide more informative error messages to users, such as "Failed to download file from URL: {url}. Please check your internet connection and verify the URL is accessible."

Suggested change

with hide_pooch_hash_logs():

file_path = pooch.retrieve(

url=file_path,

known_hash=None,

path=Path(

"~", ".movement", "data", "public_datasets"

).expanduser(),

progressbar=True,

try:

file_path = pooch.retrieve(

url=file_path,

known_hash=None,

path=Path(

"~", ".movement", "data", "public_datasets"

).expanduser(),

progressbar=True,

)

except Exception as exc:

error_msg = (

f"Failed to download file from URL: {file_path}. "

"Please check your internet connection and verify the URL is accessible."

)

logger.error(error_msg)

raise RuntimeError(error_msg) from exc

Copilot · 2026-01-27T11:31:49Z

movement/io/load_poses.py

+        with hide_pooch_hash_logs():
+            file_path = pooch.retrieve(
+                url=file_path,
+                known_hash=None,
+                path=Path(
+                    "~", ".movement", "data", "public_datasets"
+                ).expanduser(),


The pooch.retrieve function returns a string path, but the file_path variable is typed as Path | str in the function signature. After the URL download block, file_path will be a string (returned by pooch), which may cause type inconsistencies downstream if the code expects a Path object. Consider wrapping the result in Path() to maintain type consistency: file_path = Path(pooch.retrieve(...)).

Suggested change

with hide_pooch_hash_logs():

file_path = pooch.retrieve(

url=file_path,

known_hash=None,

path=Path(

"~", ".movement", "data", "public_datasets"

).expanduser(),

file_path = Path(

pooch.retrieve(

url=file_path,

known_hash=None,

path=Path(

"~", ".movement", "data", "public_datasets"

).expanduser(),

progressbar=True,

)

Copilot · 2026-01-27T11:31:49Z

movement/io/load_poses.py

+        with hide_pooch_hash_logs():
+            file_path = pooch.retrieve(
+                url=file_path,
+                known_hash=None,
+                path=Path(
+                    "~", ".movement", "data", "public_datasets"
+                ).expanduser(),
+                progressbar=True,


Consider using the hide_pooch_hash_logs() context manager (from movement.sample_data) when downloading files to suppress SHA256 hash messages, which may not be useful for end users. The existing sample_data.py module uses this pattern to provide a cleaner user experience. Example: with hide_pooch_hash_logs(): file_path = pooch.retrieve(...)

Tushar7012 added 3 commits January 20, 2026 20:18

Fix unclickable external types in API docs

278d69b

Implement save/load methods for RegionOfInterest objects

56c3eab

Feat: Add support for loading datasets from URLs (neuroinformatics-un…

1affbde

…it#340)

Copilot AI review requested due to automatic review settings January 27, 2026 11:24

Copilot started reviewing on behalf of Tushar7012 January 27, 2026 11:24 View session

Refactor: Move hash suppression to logging.py and use in IO loaders (n…

af6d4cf

…euroinformatics-unit#782)

Copilot AI reviewed Jan 27, 2026

View reviewed changes

		data = json.loads(file_path.read_text())
		geometry = shapely.from_wkt(data["geometry_wkt"])

Feat: Add support for loading datasets from URLs #782

Are you sure you want to change the base?

Feat: Add support for loading datasets from URLs #782

Uh oh!

Conversation

Tushar7012 commented Jan 27, 2026

Description

References

How has this PR been tested?

Is this a breaking change?

Does this PR require an update to the documentation?

Checklist:

Uh oh!

codecov bot commented Jan 27, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

sonarqubecloud bot commented Jan 27, 2026

Quality Gate failed

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Copilot AI Jan 27, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Jan 27, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Jan 27, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Jan 27, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Jan 27, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Jan 27, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Jan 27, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Jan 27, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Jan 27, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Jan 27, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

codecov bot commented Jan 27, 2026 •

edited

Loading