HATS Catalog Structure and Performance
========================================================================================

This page explains how HATS catalogs are laid out on disk and how that structure
influences performance. It is a practical summary of the HATS technical
note for users who want to understand how catalogs are organized and why certain
operations are fast or slow.

For the full technical description, see the `IVOA HATS note <https://www.ivoa.net/documents/Notes/HATS/20250822/NOTE-hats-ivoa-1.0-20250822.html>`_.

Catalog Layout at a Glance
----------------------------------------------------------------------------------------

A HATS catalog is a directory with:

- a hierarchical spatial partitioning based on HEALPix orders
- Parquet data files for leaf partitions
- optional supplemental tables, e.g., for cross-matching and indexing
- metadata files that describe the catalog and its partitions

This layout lets LSDB read only the partitions that overlap your query, which is the
main driver of performance.

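As a minimal sketch of how this plays out in practice, the example below loads a catalog
with LSDB and runs a cone search, so only the partitions overlapping the cone (and only
the requested columns) are touched. The URL and column names are placeholders, and the
exact function names and signatures depend on your LSDB version.

.. code-block:: python

    import lsdb

    # Hypothetical catalog URL and column names -- replace with a real HATS catalog.
    catalog = lsdb.read_hats(
        "https://example.org/hats/my_survey",
        columns=["ra", "dec", "mag"],
    )

    # Only partitions overlapping this cone are read when the result is computed.
    cone = catalog.cone_search(ra=45.0, dec=-30.0, radius_arcsec=3600.0)
    df = cone.compute()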

Catalog Directory Structure
----------------------------------------------------------------------------------------

HATS partitions the sky into a hierarchy of HEALPix pixels. Each pixel is mapped to
a directory or file path that encodes its order and pixel index. Each leaf contains
the Parquet data files. The directory structure is designed to:

- keep file sizes roughly uniform (adaptive tiling)
- support parallel reads of independent pixels

Unlike a fixed grid, HATS adapts the pixel depth to local density. Dense regions
are subdivided more deeply, while sparse regions stay at coarser orders. This
balance keeps partitions at a manageable size and helps avoid hot spots during
queries or cross-matches.

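To make the layout concrete, here is a small sketch of the kind of path a leaf partition
maps to. The ``Norder=K/Dir=D/Npix=N.parquet`` pattern, with ``Dir`` grouping pixels in
blocks of 10,000, follows the convention described in the HATS note; check your catalog
or the note itself for the authoritative naming.

.. code-block:: python

    def leaf_path(order: int, pixel: int) -> str:
        """Sketch of the partition path for a HEALPix (order, pixel) pair."""
        directory = (pixel // 10_000) * 10_000  # pixels grouped in blocks of 10,000
        return f"dataset/Norder={order}/Dir={directory}/Npix={pixel}.parquet"

    # e.g. a dense region subdivided down to order 7
    print(leaf_path(7, 19_300))  # dataset/Norder=7/Dir=10000/Npix=19300.parquet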

Data Files
----------------------------------------------------------------------------------------

Leaf partitions contain Parquet files with catalog rows. The main advantages of Parquet
storage (illustrated in the sketch after this list) are:

- column pruning (read only what you select)
- predicate pushdown (filter rows without full scans)
- efficient compression for large catalogs

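As a minimal sketch of the first two points, the snippet below reads a single leaf file
with pyarrow, selecting two columns and pushing a row filter down to the reader. The file
path and the ``ra``/``dec`` column names are placeholders.

.. code-block:: python

    import pyarrow.parquet as pq

    table = pq.read_table(
        "my_catalog/dataset/Norder=6/Dir=0/Npix=4801.parquet",  # placeholder leaf file
        columns=["ra", "dec"],         # column pruning: only these columns are read
        filters=[("dec", ">", 30.0)],  # predicate pushdown: skip row groups that cannot match
    )
    print(table.num_rows)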

Catalog Collections
----------------------------------------------------------------------------------------

A catalog collection is a grouping of related datasets, typically a main catalog together
with its supplemental tables. Collections provide a consistent entry point for discovery
and make it easy to access the supplemental tables, some of which are described below.
Collection metadata describes the members and any shared properties.

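Depending on your LSDB version, a collection can often be opened directly, in which case
the default margin cache (and index, when present) is picked up automatically. The sketch
below assumes an ``open_catalog``-style entry point and uses a hypothetical URL; check the
LSDB documentation for the exact call in your release.

.. code-block:: python

    import lsdb

    # Hypothetical collection URL; opening the collection root (rather than an
    # individual member) lets LSDB attach the default supplemental tables.
    catalog = lsdb.open_catalog("https://example.org/hats/my_survey_collection")
    print(catalog)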

Supplemental Tables
----------------------------------------------------------------------------------------

These additional tables can be used to improve performance:

- **Margin cache:** buffers pixel boundaries so spatial operations (especially
  cross-matching) do not miss sources near edges. If your dataset is not a catalog
  collection, you will need to provide a margin cache separately. See the
  :doc:`Margins documentation page </tutorials/margins>` for more details on why
  margin caches are important for cross-matching (a cross-match sketch follows this
  list).
- **Index tables:** map values of a column (typically an object ID) to the partitions
  that contain them, so a lookup such as "find the object with this ID" only loads the
  relevant partitions. Without an index table, such a lookup requires a full scan of
  the dataset.
- **Association tables:** precomputed links between related catalogs to speed up
  multi-survey joins.

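As a minimal sketch of the margin cache in action, the example below loads two catalogs
and cross-matches them, attaching a margin cache to the right-hand catalog so sources near
pixel edges are not missed. The URLs are placeholders, and whether ``margin_cache`` accepts
a path or a pre-loaded margin catalog depends on your LSDB version.

.. code-block:: python

    import lsdb

    left = lsdb.read_hats("https://example.org/hats/survey_a")  # placeholder URLs
    right = lsdb.read_hats(
        "https://example.org/hats/survey_b",
        margin_cache="https://example.org/hats/survey_b_10arcs",
    )

    # The margin cache lets the cross-match find counterparts that fall just
    # outside a pixel boundary of the right-hand catalog.
    matched = left.crossmatch(right, radius_arcsec=1.0)
    result = matched.compute()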

Skymaps and Coverage Files
----------------------------------------------------------------------------------------

HATS catalogs may include sky coverage maps and other summary assets. These are used to
quickly estimate coverage, data density, or overlap before reading data from the leaf
partitions.

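As a rough illustration of the kind of estimate these assets enable, the sketch below
computes the approximate sky fraction covered by a catalog from a made-up list of its
(order, pixel) partitions, without reading any data.

.. code-block:: python

    import math

    import healpy as hp

    # Made-up partition list: (HEALPix order, pixel index) pairs.
    partitions = [(5, 1200), (6, 4801), (6, 4802), (7, 19300)]

    covered_sr = sum(hp.nside2pixarea(hp.order2nside(order)) for order, _ in partitions)
    fraction = covered_sr / (4 * math.pi)
    print(f"Approximate coverage: {100 * fraction:.4f}% of the sky")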

Metadata and Auxiliary Files
----------------------------------------------------------------------------------------

Metadata files describe the catalog and its partitions. Common files include:

- ``hats.properties``: key/value fields describing the catalog and its version
- ``partition_info.csv``: partition list with sizes and spatial info
- ``dataset/_metadata`` and ``dataset/_common_metadata``: Parquet dataset-level metadata files.
  ``_common_metadata`` typically contains only the shared schema (column names, dtypes, and logical types) for the dataset,
  while ``_metadata`` usually aggregates per-file / per-row-group metadata (e.g., statistics, row group locations, and encodings).
- ``dataset/data_thumbnail.parquet``: small sample of data for quick inspection
- ``collection.properties``: metadata for catalog collections

LSDB uses these files to plan queries, estimate cost, and decide which partitions need to be loaded.

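These dataset-level files can also be inspected directly. The sketch below uses pyarrow to
read the shared schema and the aggregated row-group metadata without touching any data
rows; the base path is a placeholder.

.. code-block:: python

    import pyarrow.parquet as pq

    base = "path/to/my_catalog"  # placeholder catalog location

    schema = pq.read_schema(f"{base}/dataset/_common_metadata")  # shared column schema
    metadata = pq.read_metadata(f"{base}/dataset/_metadata")     # aggregated row-group metadata

    print(schema.names)
    print(f"{metadata.num_row_groups} row groups across {metadata.num_rows} rows")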

Performance Considerations
----------------------------------------------------------------------------------------

- **Partition count matters:** selecting and operating on larger parts of the sky means
  that more pixels need to be opened. If possible, apply spatial filters early to reduce
  the number of selected pixels.
- **True random access is expensive:** random access to many rows scattered across the
  sky is slow, especially over a network, because even if only one row is needed from a
  given pixel, the entire pixel still has to be downloaded and opened. Work on local data
  and/or design your access patterns to be as spatially coherent as possible (see the
  sketch after this list).
- **Column selection is critical:** Parquet column pruning is one of the biggest
  performance wins. Select only what you need.
  Column pruning is most effective when the storage backend supports efficient random reads (HTTP ``Range`` requests or S3 ranged ``GET``).
  If an HTTP endpoint does not support range reads, Parquet readers may be forced to download much larger parts of each file (up to the full file),
  reducing or eliminating the benefit of selecting a small subset of columns. Even when range reads are supported, many small range requests can be
  latency-bound; in practice S3 backends often sustain higher concurrency and throughput than generic HTTP servers.
- **Metadata scans are not free:** even though initial catalog access does not load the
  actual data, it does read the metadata files, which can be slow over a network,
  especially for catalogs with many partitions. The size of the metadata scales with the
  number of partitions, so catalogs with many small partitions have larger metadata
  overhead. A local cache should reduce repeated downloads of metadata.

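As a small sketch of what "spatially coherent" means in practice, the snippet below groups
target positions by the HEALPix pixel they fall in (here at an arbitrary order 6, NESTED
scheme), so each partition is visited once instead of repeatedly. The coordinates are made
up for the example.

.. code-block:: python

    import healpy as hp
    import numpy as np

    # Made-up target positions, in degrees.
    ra = np.array([10.1, 250.3, 10.2, 250.4])
    dec = np.array([-5.0, 30.0, -5.1, 30.1])

    order = 6
    pixels = hp.ang2pix(hp.order2nside(order), ra, dec, nest=True, lonlat=True)

    # Processing targets in pixel order keeps consecutive lookups inside the
    # same partition, instead of bouncing between distant parts of the sky.
    coherent = np.argsort(pixels)
    for i in coherent:
        print(f"target {i}: order {order}, pixel {pixels[i]}")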