Skip to content

Conversation

@Tushar7012
Copy link
Contributor

Description

What is this PR

  • Bug fix
  • Addition of a new feature
  • Other

Why is this PR needed?
This PR addresses Issue #340, which requested the ability to load datasets directly from public URLs. Previously, users had to manually download files before loading them into movement. This feature streamlines workflows by allowing direct ingestion of remote data sources.

What does this PR do?
It updates the from_file functions in users' primary entry points (load_poses and load_bboxes) to automatically detect if a file path is a URL.

  • Automatic Download: If a URL is detected (starting with http:// or https://), the file is downloaded using the existing pooch dependency.
  • Caching: Files are cached in ~/.movement/data/public_datasets/ (separate from the internal test data cache) to ensure efficient re-use.
  • Seamless Integration: The downloaded local path is then passed to the existing loading logic, identifying the file type and preserving all downstream functionality.

References

Closes #340

How has this PR been tested?

I created a local verification script that successfully loaded sample datasets from the GIN repository for both supported types:

  1. Poses: Loaded single-mouse_EPM.predictions.h5 (DeepLabCut) via URL.
  2. Bounding Boxes: Loaded VIA_multiple-crabs_5-frames_labels.csv (VIA) via URL.

In both cases, pooch successfully downloaded the files to the new cache directory, and movement returned valid xarray.Dataset objects.

Is this a breaking change?

No. This is a purely additive change. Existing local file loading works exactly as before.

Does this PR require an update to the documentation?

Yes, I updated the docstrings for load_poses.from_file and load_bboxes.from_file to include Examples of how to use the new URL loading functionality.

Checklist:

  • The code has been tested locally
  • Tests have been added to cover all new functionality
  • The documentation has been updated to reflect any changes
  • The code has been formatted with pre-commit

Copilot AI review requested due to automatic review settings January 27, 2026 11:24
@codecov
Copy link

codecov bot commented Jan 27, 2026

Codecov Report

❌ Patch coverage is 92.45283% with 4 lines in your changes missing coverage. Please review.
✅ Project coverage is 99.81%. Comparing base (665412f) to head (af6d4cf).
⚠️ Report is 3 commits behind head on main.

Files with missing lines Patch % Lines
movement/io/load_bboxes.py 60.00% 2 Missing ⚠️
movement/io/load_poses.py 60.00% 2 Missing ⚠️
Additional details and impacted files
@@             Coverage Diff             @@
##              main     #782      +/-   ##
===========================================
- Coverage   100.00%   99.81%   -0.19%     
===========================================
  Files           34       34              
  Lines         2111     2147      +36     
===========================================
+ Hits          2111     2143      +32     
- Misses           0        4       +4     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@sonarqubecloud
Copy link

Quality Gate Failed Quality Gate failed

Failed conditions
2 Security Hotspots

See analysis details on SonarQube Cloud

Copy link

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This pull request adds two distinct features to the movement library: (1) URL loading support for pose and bounding box datasets, and (2) ROI (Region of Interest) save/load functionality. The URL loading feature allows users to load datasets directly from public URLs without manual downloading, addressing Issue #340. The ROI feature enables serialization and deserialization of ROI objects to/from JSON files.

Changes:

  • Added URL detection and downloading capability to load_poses.from_file() and load_bboxes.from_file() using pooch for caching
  • Implemented to_file() and from_file() methods for ROI objects (BaseRegionOfInterest, PolygonOfInterest, LineOfInterest)
  • Added comprehensive tests for ROI save/load functionality
  • Updated documentation with examples and added shapely to intersphinx mapping

Reviewed changes

Copilot reviewed 9 out of 9 changed files in this pull request and generated 17 comments.

Show a summary per file
File Description
movement/io/load_poses.py Added URL detection and automatic downloading via pooch for pose datasets
movement/io/load_bboxes.py Added URL detection and automatic downloading via pooch for bounding box datasets
movement/roi/base.py Implemented to_file() and from_file() methods for serializing/deserializing ROIs; updated docstring type annotations
movement/roi/polygon.py Added _from_geometry() classmethod to construct PolygonOfInterest from shapely geometry
movement/roi/line.py Added _from_geometry() classmethod to construct LineOfInterest from shapely geometry
tests/test_unit/test_roi/test_save_load.py Added comprehensive tests for ROI save/load functionality including edge cases
docs/source/conf.py Added shapely to intersphinx mapping for cross-documentation linking

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +161 to +170
# Download file if it is a URL
if str(file_path).startswith(("http://", "https://")):
with hide_pooch_hash_logs():
file_path = pooch.retrieve(
url=file_path,
known_hash=None,
path=Path(
"~", ".movement", "data", "public_datasets"
).expanduser(),
progressbar=True,
Copy link

Copilot AI Jan 27, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This code block for URL downloading is duplicated in both load_poses.py and load_bboxes.py. Consider extracting this logic into a helper function to reduce duplication and improve maintainability. The helper function could be placed in a utility module or in a common location accessible to both modules.

Copilot uses AI. Check for mistakes.
Comment on lines +628 to +629
data = json.loads(file_path.read_text())
geometry = shapely.from_wkt(data["geometry_wkt"])
Copy link

Copilot AI Jan 27, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The from_file method lacks error handling for malformed JSON or invalid WKT geometry. If the JSON is malformed or the WKT string is invalid, the error messages from json.loads or shapely.from_wkt may not be clear to users. Consider adding try-except blocks with more informative error messages, such as "Failed to parse ROI file: invalid JSON format" or "Failed to parse geometry from file: invalid WKT string".

Copilot uses AI. Check for mistakes.
Comment on lines +163 to +168
with hide_pooch_hash_logs():
file_path = pooch.retrieve(
url=file_path,
known_hash=None,
path=Path(
"~", ".movement", "data", "public_datasets"
Copy link

Copilot AI Jan 27, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The caching directory ~/.movement/data/public_datasets is created by pooch automatically, but the PR description mentions this is "separate from the internal test data cache". However, the sample_data.py module uses ~/.movement/data as its DATA_DIR. Consider documenting why this separation is needed or if there's a benefit to using the same parent directory but different subdirectories. Also, consider checking if the directory should be created explicitly with proper permissions.

Suggested change
with hide_pooch_hash_logs():
file_path = pooch.retrieve(
url=file_path,
known_hash=None,
path=Path(
"~", ".movement", "data", "public_datasets"
# Cache public example datasets under the main movement data directory,
# in a dedicated subdirectory separate from any internal test data cache.
public_cache_dir = Path("~", ".movement", "data", "public_datasets").expanduser()
# Explicitly create the cache directory with user-only permissions where supported.
public_cache_dir.mkdir(parents=True, exist_ok=True, mode=0o700)
file_path = pooch.retrieve(
url=file_path,
known_hash=None,
path=public_cache_dir,

Copilot uses AI. Check for mistakes.
Comment on lines +88 to +115
@classmethod
def _from_geometry(
cls,
geometry: shapely.Polygon,
name: str | None = None,
) -> PolygonOfInterest:
"""Construct a PolygonOfInterest from a shapely geometry.

Parameters
----------
geometry : shapely.Polygon
The shapely geometry to construct from.
name : str, optional
Name for the PolygonOfInterest.

Returns
-------
PolygonOfInterest
A new PolygonOfInterest instance.

"""
exterior = geometry.exterior.coords
holes = (
[interior.coords for interior in geometry.interiors]
if geometry.interiors
else None
)
return cls(exterior_boundary=exterior, holes=holes, name=name)
Copy link

Copilot AI Jan 27, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The _from_geometry method is marked as a classmethod with a leading underscore (indicating it's internal), but it's being called from the base class's from_file method. While this works, consider whether this method should be part of a documented API or if it should remain internal. If it's intended to be internal, the current implementation is acceptable. If users might need to construct ROIs from shapely geometries directly, consider making it public by removing the underscore prefix and adding it to the public API documentation.

Copilot uses AI. Check for mistakes.
Comment on lines +229 to +238
# Download file if it is a URL
if str(file_path).startswith(("http://", "https://")):
with hide_pooch_hash_logs():
file_path = pooch.retrieve(
url=file_path,
known_hash=None,
path=Path(
"~", ".movement", "data", "public_datasets"
).expanduser(),
progressbar=True,
Copy link

Copilot AI Jan 27, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This code block is duplicated from load_poses.py (lines 161-170). Consider extracting this URL downloading logic into a shared helper function to avoid code duplication.

Copilot uses AI. Check for mistakes.

"""
# Download file if it is a URL
if str(file_path).startswith(("http://", "https://")):
Copy link

Copilot AI Jan 27, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The URL validation only checks if the path starts with "http://" or "https://". Consider adding additional validation to ensure the URL is well-formed and potentially restricting to trusted domains if appropriate. While pooch.retrieve is secure by default, adding URL validation could prevent potential issues with malformed URLs.

Copilot uses AI. Check for mistakes.

"""
# Download file if it is a URL
if str(file_path).startswith(("http://", "https://")):
Copy link

Copilot AI Jan 27, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The URL validation only checks if the path starts with "http://" or "https://". Consider adding additional validation to ensure the URL is well-formed and potentially restricting to trusted domains if appropriate. While pooch.retrieve is secure by default, adding URL validation could prevent potential issues with malformed URLs.

Copilot uses AI. Check for mistakes.
Comment on lines +231 to +238
with hide_pooch_hash_logs():
file_path = pooch.retrieve(
url=file_path,
known_hash=None,
path=Path(
"~", ".movement", "data", "public_datasets"
).expanduser(),
progressbar=True,
Copy link

Copilot AI Jan 27, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is no error handling for potential download failures. If pooch.retrieve fails (e.g., due to network issues, invalid URL, or server errors), the error will propagate without context. Consider wrapping the pooch.retrieve call in a try-except block to provide more informative error messages to users, such as "Failed to download file from URL: {url}. Please check your internet connection and verify the URL is accessible."

Suggested change
with hide_pooch_hash_logs():
file_path = pooch.retrieve(
url=file_path,
known_hash=None,
path=Path(
"~", ".movement", "data", "public_datasets"
).expanduser(),
progressbar=True,
try:
file_path = pooch.retrieve(
url=file_path,
known_hash=None,
path=Path(
"~", ".movement", "data", "public_datasets"
).expanduser(),
progressbar=True,
)
except Exception as exc:
error_msg = (
f"Failed to download file from URL: {file_path}. "
"Please check your internet connection and verify the URL is accessible."
)
logger.error(error_msg)
raise RuntimeError(error_msg) from exc

Copilot uses AI. Check for mistakes.
Comment on lines +163 to +169
with hide_pooch_hash_logs():
file_path = pooch.retrieve(
url=file_path,
known_hash=None,
path=Path(
"~", ".movement", "data", "public_datasets"
).expanduser(),
Copy link

Copilot AI Jan 27, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The pooch.retrieve function returns a string path, but the file_path variable is typed as Path | str in the function signature. After the URL download block, file_path will be a string (returned by pooch), which may cause type inconsistencies downstream if the code expects a Path object. Consider wrapping the result in Path() to maintain type consistency: file_path = Path(pooch.retrieve(...)).

Suggested change
with hide_pooch_hash_logs():
file_path = pooch.retrieve(
url=file_path,
known_hash=None,
path=Path(
"~", ".movement", "data", "public_datasets"
).expanduser(),
file_path = Path(
pooch.retrieve(
url=file_path,
known_hash=None,
path=Path(
"~", ".movement", "data", "public_datasets"
).expanduser(),
progressbar=True,
)

Copilot uses AI. Check for mistakes.
Comment on lines +163 to +170
with hide_pooch_hash_logs():
file_path = pooch.retrieve(
url=file_path,
known_hash=None,
path=Path(
"~", ".movement", "data", "public_datasets"
).expanduser(),
progressbar=True,
Copy link

Copilot AI Jan 27, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Consider using the hide_pooch_hash_logs() context manager (from movement.sample_data) when downloading files to suppress SHA256 hash messages, which may not be useful for end users. The existing sample_data.py module uses this pattern to provide a cleaner user experience. Example: with hide_pooch_hash_logs(): file_path = pooch.retrieve(...)

Copilot uses AI. Check for mistakes.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Add support for downloading publicly available datasets

1 participant