Copilot AI commented Oct 16, 2025

Overview

This PR adds three new methods to the HealpixDataset class to provide visibility into memory usage and file paths for catalog partitions, addressing issue #1083.

Problem

When preparing to pull data from a catalog, especially a remotely located one, users need to:

  1. Estimate how much data will be transferred before computing
  2. Discover the underlying Parquet file paths for each partition
  3. Understand the impact of column and region selection on data size

Previously, this information was not accessible via the Catalog objects.

Solution

New Methods

1. get_partition_file_paths()

Returns a dictionary mapping HEALPix pixels to their underlying Parquet file paths.

file_paths = catalog.get_partition_file_paths()
for pixel, path in file_paths.items():
    print(f"{pixel}: {path}")
# Output: Order: 1, Pixel: 44: /path/to/Norder=1/Dir=0/Npix=44.parquet
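
The returned paths can also be handed to standard Parquet tooling. A minimal sketch (not part of this PR), assuming the paths are locally accessible, uses pyarrow to read only the file footer:

import pyarrow.parquet as pq

file_paths = catalog.get_partition_file_paths()
pixel, path = next(iter(file_paths.items()))
parquet_file = pq.ParquetFile(path)            # opens only the footer metadata
print(pixel, parquet_file.metadata.num_rows)   # row count without loading data pages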

2. get_partition_metadata()

Returns a DataFrame with detailed metadata for each partition including file path, total compressed size, and per-column sizes.

metadata = catalog.get_partition_metadata()
print(metadata[["pixel", "total_size_bytes"]])
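
Assuming only the columns shown above ("pixel" and "total_size_bytes"), the returned DataFrame can be aggregated with ordinary pandas operations, for example:

metadata = catalog.get_partition_metadata()

# Total compressed footprint across all partitions
total_bytes = metadata["total_size_bytes"].sum()
print(f"Catalog footprint on disk: {total_bytes / 1024**2:.2f} MB")

# The five largest partitions, useful for spotting skew
largest = metadata.sort_values("total_size_bytes", ascending=False).head(5)
print(largest[["pixel", "total_size_bytes"]])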

3. get_memory_estimate(include_index=False)

Provides an upper-bound estimate of memory usage for the currently loaded columns and partitions.

import lsdb

# Load catalog with selected columns
catalog = lsdb.open_catalog("my_catalog", columns=["ra", "dec"])

# Get memory estimate
estimate = catalog.get_memory_estimate()
print(f"Estimated data size: {estimate['total_mb']:.2f} MB")
print(f"Columns: {estimate['columns']}")
print("Per-column breakdown:")
for col, size in estimate['per_column_bytes'].items():
    print(f"  {col}: {size} bytes")

The estimate:

  • Uses compressed Parquet metadata for realistic size calculations
  • Respects column filtering (only counts selected columns)
  • Respects spatial filtering (only counts selected partitions)
  • Optionally includes/excludes the HEALPix index column (compared in the sketch after this list)
  • Returns sizes in multiple units (bytes, KB, MB, GB) plus per-column breakdown
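
For example, the index toggle can be used to quantify the overhead of the HEALPix index column; a short sketch using only the keys shown in the examples above:

without_index = catalog.get_memory_estimate(include_index=False)
with_index = catalog.get_memory_estimate(include_index=True)
overhead_bytes = with_index["total_bytes"] - without_index["total_bytes"]
print(f"HEALPix index column adds ~{overhead_bytes / 1024**2:.2f} MB to the transfer")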

Use Case Example

import lsdb

# Load catalog with column selection
catalog = lsdb.open_catalog("s3://my-bucket/catalog", columns=["ra", "dec", "mag"])

# Check how much data will be transferred
estimate = catalog.get_memory_estimate()
print(f"Will transfer approximately {estimate['total_mb']:.2f} MB")

# Further filter by region if needed
filtered = catalog.cone_search(ra=180, dec=0, radius_arcsec=3600)
filtered_estimate = filtered.get_memory_estimate()
print(f"After filtering: {filtered_estimate['total_mb']:.2f} MB")
print(f"Reduction: {100*(1-filtered_estimate['total_bytes']/estimate['total_bytes']):.1f}%")

Implementation Details

  • Methods read Parquet file metadata (footers) without loading any actual data (illustrated in the sketch after this list)
  • Compressed sizes provide realistic upper bounds for data transfer estimates
  • Actual transfer and memory usage may differ due to row filtering, compression ratios, and Dask's lazy evaluation
  • Graceful error handling for inaccessible files (FileNotFoundError, PermissionError, OSError)
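
The underlying idea can be illustrated with pyarrow: per-column compressed sizes are stored in the Parquet footer and can be summed per row group without touching any data pages. The sketch below is illustrative only and is not the code added in this PR.

import pyarrow.parquet as pq

def compressed_column_sizes(path, columns=None):
    """Sum compressed byte sizes per column from a Parquet file's footer."""
    meta = pq.ParquetFile(path).metadata
    sizes = {}
    for rg in range(meta.num_row_groups):
        row_group = meta.row_group(rg)
        for c in range(row_group.num_columns):
            col = row_group.column(c)
            name = col.path_in_schema
            if columns is None or name in columns:
                sizes[name] = sizes.get(name, 0) + col.total_compressed_size
    return sizes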

Testing

  • 8 comprehensive new tests in tests/lsdb/catalog/test_memory_estimate.py
  • All 236 existing catalog tests still pass
  • Test coverage includes:
    • File path retrieval and validation
    • Metadata consistency
    • Memory estimation with various column selections
    • Index inclusion/exclusion (see the sketch after this list)
    • Spatial filtering effects
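
For illustration, the index inclusion/exclusion case could be covered by a test along these lines (hypothetical sketch; the fixture name and assertions are assumptions, not copied from the PR):

def test_memory_estimate_index_toggle(small_sky_order1_catalog):
    # Hypothetical: the estimate including the index column should never be smaller
    without_index = small_sky_order1_catalog.get_memory_estimate(include_index=False)
    with_index = small_sky_order1_catalog.get_memory_estimate(include_index=True)
    assert without_index["total_bytes"] > 0
    assert with_index["total_bytes"] >= without_index["total_bytes"]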

Documentation

  • Added to API reference: docs/reference/catalog_inspection.rst
  • Usage guide with examples: docs/tutorials/performance.rst
  • Comprehensive docstrings with examples for all methods

Benefits

  • Transparency: Users can see exactly which files will be accessed
  • Cost Awareness: Estimate data transfer before expensive operations
  • Optimization: Make informed decisions about column/region selection
  • Remote Catalogs: Especially valuable for cloud-hosted data where transfer costs matter
  • No Breaking Changes: All additions are new methods; existing code unaffected

Closes #1083

Original prompt

In the issue astronomy-commons/lsdb#1083, it is claimed that the memory usage of the selected columns and partitions isn't available via the objects that are accessible from Catalog. Is this the case? If not, how could that upper limit be retrieved? Is there even a way to discover the underlying path to each Parquet file behind each partition?



Copilot AI changed the title [WIP] Fix issue with memory usage in selected columns and partitions Add methods to retrieve partition file paths and memory estimates Oct 16, 2025
Copilot AI requested a review from gitosaurus October 16, 2025 16:07
Copilot finished work on behalf of gitosaurus October 16, 2025 16:07