Copilot AI commented Oct 16, 2025

Overview

This PR adds three new methods to the HealpixDataset class to provide visibility into memory usage and file paths for catalog partitions, addressing issue #1083.

Problem

When preparing to pull data from a catalog, especially a remotely located one, users need to:

  1. Estimate how much data will be transferred before computing
  2. Discover the underlying Parquet file paths for each partition
  3. Understand the impact of column and region selection on data size

Previously, this information was not accessible via the Catalog objects.

Solution

New Methods

1. get_partition_file_paths()

Returns a dictionary mapping HEALPix pixels to their underlying Parquet file paths.

file_paths = catalog.get_partition_file_paths()
for pixel, path in file_paths.items():
    print(f"{pixel}: {path}")
# Output: Order: 1, Pixel: 44: /path/to/Norder=1/Dir=0/Npix=44.parquet
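
The returned paths can also be handed to standard Parquet tooling. A minimal sketch (not part of this PR), assuming the paths are locally accessible, uses pyarrow to read only the file footer:

import pyarrow.parquet as pq

file_paths = catalog.get_partition_file_paths()
pixel, path = next(iter(file_paths.items()))
parquet_file = pq.ParquetFile(path)            # opens only the footer metadata
print(pixel, parquet_file.metadata.num_rows)   # row count without loading data pages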

2. get_partition_metadata()

Returns a DataFrame with detailed metadata for each partition including file path, total compressed size, and per-column sizes.

metadata = catalog.get_partition_metadata()
print(metadata[["pixel", "total_size_bytes"]])
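
Assuming only the columns shown above ("pixel" and "total_size_bytes"), the returned DataFrame can be aggregated with ordinary pandas operations, for example:

metadata = catalog.get_partition_metadata()

# Total compressed footprint across all partitions
total_bytes = metadata["total_size_bytes"].sum()
print(f"Catalog footprint on disk: {total_bytes / 1024**2:.2f} MB")

# The five largest partitions, useful for spotting skew
largest = metadata.sort_values("total_size_bytes", ascending=False).head(5)
print(largest[["pixel", "total_size_bytes"]])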

3. get_memory_estimate(include_index=False)

Provides an upper-bound estimate of memory usage for the currently loaded columns and partitions.

import lsdb

# Load catalog with selected columns
catalog = lsdb.open_catalog("my_catalog", columns=["ra", "dec"])

# Get memory estimate
estimate = catalog.get_memory_estimate()
print(f"Estimated data size: {estimate['total_mb']:.2f} MB")
print(f"Columns: {estimate['columns']}")
print("Per-column breakdown:")
for col, size in estimate['per_column_bytes'].items():
    print(f"  {col}: {size} bytes")

The estimate:

  • Uses compressed Parquet metadata for realistic size calculations
  • Respects column filtering (only counts selected columns)
  • Respects spatial filtering (only counts selected partitions)
  • Optionally includes/excludes the HEALPix index column (compared in the sketch after this list)
  • Returns sizes in multiple units (bytes, KB, MB, GB) plus per-column breakdown
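
For example, the index toggle can be used to quantify the overhead of the HEALPix index column; a short sketch using only the keys shown in the examples above:

without_index = catalog.get_memory_estimate(include_index=False)
with_index = catalog.get_memory_estimate(include_index=True)
overhead_bytes = with_index["total_bytes"] - without_index["total_bytes"]
print(f"HEALPix index column adds ~{overhead_bytes / 1024**2:.2f} MB to the transfer")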

Use Case Example

import lsdb

# Load catalog with column selection
catalog = lsdb.open_catalog("s3://my-bucket/catalog", columns=["ra", "dec", "mag"])

# Check how much data will be transferred
estimate = catalog.get_memory_estimate()
print(f"Will transfer approximately {estimate['total_mb']:.2f} MB")

# Further filter by region if needed
filtered = catalog.cone_search(ra=180, dec=0, radius_arcsec=3600)
filtered_estimate = filtered.get_memory_estimate()
print(f"After filtering: {filtered_estimate['total_mb']:.2f} MB")
print(f"Reduction: {100*(1-filtered_estimate['total_bytes']/estimate['total_bytes']):.1f}%")

Implementation Details

  • Methods read Parquet file metadata (footers) without loading any actual data (illustrated in the sketch after this list)
  • Compressed sizes provide realistic upper bounds for data transfer estimates
  • Actual transfer and memory usage may differ due to row filtering, compression ratios, and Dask's lazy evaluation
  • Graceful error handling for inaccessible files (FileNotFoundError, PermissionError, OSError)
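
The underlying idea can be illustrated with pyarrow: per-column compressed sizes are stored in the Parquet footer and can be summed per row group without touching any data pages. The sketch below is illustrative only and is not the code added in this PR.

import pyarrow.parquet as pq

def compressed_column_sizes(path, columns=None):
    """Sum compressed byte sizes per column from a Parquet file's footer."""
    meta = pq.ParquetFile(path).metadata
    sizes = {}
    for rg in range(meta.num_row_groups):
        row_group = meta.row_group(rg)
        for c in range(row_group.num_columns):
            col = row_group.column(c)
            name = col.path_in_schema
            if columns is None or name in columns:
                sizes[name] = sizes.get(name, 0) + col.total_compressed_size
    return sizes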

Testing

  • 8 comprehensive new tests in tests/lsdb/catalog/test_memory_estimate.py
  • All 236 existing catalog tests still pass
  • Test coverage includes:
    • File path retrieval and validation
    • Metadata consistency
    • Memory estimation with various column selections
    • Index inclusion/exclusion (see the sketch after this list)
    • Spatial filtering effects
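
For illustration, the index inclusion/exclusion case could be covered by a test along these lines (hypothetical sketch; the fixture name and assertions are assumptions, not copied from the PR):

def test_memory_estimate_index_toggle(small_sky_order1_catalog):
    # Hypothetical: the estimate including the index column should never be smaller
    without_index = small_sky_order1_catalog.get_memory_estimate(include_index=False)
    with_index = small_sky_order1_catalog.get_memory_estimate(include_index=True)
    assert without_index["total_bytes"] > 0
    assert with_index["total_bytes"] >= without_index["total_bytes"]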

Documentation

  • Added to API reference: docs/reference/catalog_inspection.rst
  • Usage guide with examples: docs/tutorials/performance.rst
  • Comprehensive docstrings with examples for all methods

Benefits

  • Transparency: Users can see exactly which files will be accessed
  • Cost Awareness: Estimate data transfer before expensive operations
  • Optimization: Make informed decisions about column/region selection
  • Remote Catalogs: Especially valuable for cloud-hosted data where transfer costs matter
  • No Breaking Changes: All additions are new methods; existing code unaffected

Closes #1083

Original prompt

In the issue astronomy-commons/lsdb#1083, it is claimed that the memory usage of the selected columns and partitions isn't available via the objects that are accessible from Catalog. Is this the case? If not, how could that upper limit be retrieved? Is there even a way to discover the underlying path to each Parquet file behind each partition?



Copilot AI changed the title [WIP] Fix issue with memory usage in selected columns and partitions Add methods to retrieve partition file paths and memory estimates Oct 16, 2025
Copilot AI requested a review from gitosaurus October 16, 2025 16:07
Copilot finished work on behalf of gitosaurus October 16, 2025 16:07