Commit c35a003

authored
Added init data access readme (#1198)
* Added init data access readme * Add top level section link * Improve clarity, add top link * Update links, improve clarity * Fix links for TAP and server * Add remote_data page, apply review suggestions * Add VO link, further cleaning * Update link to remote access pages
1 parent 3124747 commit c35a003

21 files changed

+236
-20
lines changed
555 KB

docs/data-access.rst

Lines changed: 15 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,15 @@
1+
Data Access
2+
========================================================================================
3+
4+
This section provides a high-level overview of how LSDB catalogs are hosted, served, and accessed across different providers and protocols.
5+
6+
.. toctree::
7+
:maxdepth: 2
8+
9+
1. data.lsdb.io <data-access/datalsdb>
10+
2. External Data Centers <data-access/external>
11+
3. Finding HATS Catalogs with VO <data-access/VO>
12+
4. HATS Catalog Structure <data-access/hats>
13+
5. Access with http/cloud <data-access/remote_data>
14+
6. Access with TAP <data-access/tap-lsdb>
15+
7. Server-side filtering <data-access/server-lsdb>

docs/data-access/VO.rst

Lines changed: 11 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,11 @@
1+
[Under construction] Finding HATS catalogs via VO
2+
========================================================================================
3+
4+
To discover additional HATS-related services registered across the community, use the
5+
`Virtual Observatory Registry search <https://vao.stsci.edu/registry-search/?f%5BcapabilityType_facet%5D%5B%5D=Custom+Service&per_page=100&q=hats&search_field=all_fields/>`__
6+
7+
This can help you identify:
8+
9+
- other institutions serving HATS/LSDB-style catalogs
10+
- endpoints and access metadata (when provided)
11+
- service descriptions and maintainers

docs/data-access/datalsdb.rst

Lines changed: 46 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,46 @@
1+
Data Access via data.lsdb.io
2+
========================================================================================
3+
4+
5+
This page walks through the `data.lsdb.io <https://data.lsdb.io/>`__ site and explains how the information on that page maps to where catalogs live and how you can access them.
6+
7+
At `data.lsdb.io <https://data.lsdb.io/>`__ we provide information about all catalogs served by LINCC Frameworks, Space Telescope Science Institute (STScI), and IPAC/IRSA, as well as select catalogs from Centre de Données astronomiques de Strasbourg (CDS). Further public catalogs are available; see the :doc:`external data centers <external>` page for more information. Additionally, HATS catalogs for data products from Rubin Observatory are available via multiple access points described in the :doc:`Rubin LSDB Access </tutorials/pre_executed/rubin_dp1>` page.
8+
9+
data.lsdb.io Layout
10+
----------------------------------------------------------------------------------------
11+
12+
.. figure:: ../_static/data-access-overview-annotated.png
13+
:alt: Annotated data.lsdb.io catalog browser with numbered callouts.
14+
:width: 100%
15+
16+
data.lsdb.io page layout with numbered callouts for key sections.
17+
18+
Consider the example figure above. Its numbered callouts map to these sections:
19+
20+
1. **Catalog list:** the left sidebar that lets you browse catalogs and releases. If there are multiple providers of a given catalog, they will be grouped under the same catalog name, with different hosting regions and access methods indicated.
21+
2. **Catalog overview:** the title and description for the selected catalog, including the simplest way to access the data through LSDB and how to download it directly.
22+
3. **Region description:** describes the hosting region (for example, ``US-EAST``). See the discussion below for details of the various options.
23+
4. **Access type:** describes the access path, which is either an ``S3`` or an ``HTTP`` endpoint. See the discussion below for details; more information is available in the :doc:`remote data access page <remote_data>`.
24+
5. **Catalog metadata:** summary table showing the number of rows, columns, and partitions, the size on disk, and which version of the HATS builder or pipeline was used to create the catalog. See the discussion about versioning below for details.
25+
26+
27+
Region description and access type
28+
----------------------------------------------------------------------------------------
29+
30+
The region tabs (for example, ``US-EAST`` or ``Europe``) indicate the hosting region for that catalog copy. Which provider is fastest for you will depend on a number of factors, including your geographic location, network conditions, and whether you are accessing the data via the ``HTTP`` or ``S3`` protocol.
31+
32+
Below are details about each provider:
33+
34+
``US-WEST & HTTP``: These datasets are hosted at the University of Washington (UW) on static HTTP/S endpoints. They are accessible globally, but speed may be limited by the bandwidth available from UW servers.
35+
36+
``US-EAST & S3``: These datasets are hosted in Amazon Web Services (AWS) S3 buckets in the US East region, provided by Space Telescope Science Institute (STScI). They can be accessed via ``s3://`` URLs, and should provide better performance due to robust cloud infrastructure.
37+
38+
``US-WEST & S3``: These datasets are hosted in Amazon Web Services (AWS) S3 buckets in the US West region, provided by IPAC/IRSA. They can also be accessed via ``s3://`` URLs, and should provide better performance due to robust cloud infrastructure.
39+
40+
``Europe & HTTP``: These datasets are hosted at the Centre de Données astronomiques de Strasbourg (CDS) in Europe on static HTTP/S endpoints. They use an experimental HATS-on-the-fly serving infrastructure. More information is available on the :doc:`External providers page <external>`.
41+
42+
43+
Catalog Metadata - Version
44+
----------------------------------------------------------------------------------------
45+
46+
The version indicates which version of the HATS builder or pipeline was used to create the catalog. This should help you understand which features are available in the catalog. In particular, HATS builder versions >= 0.6.0 support catalog collections, which changes how the auxiliary data is accessed. See the page describing the :doc:`format <hats>` for more details.
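
As a rough illustration of the version check described above, the sketch below compares a builder version string against ``0.6.0``. The function name and the sample version strings are hypothetical, and the sketch assumes a plain dotted version string; how the version is recorded for a particular catalog is not assumed here.

```python
def supports_collections(builder_version: str) -> bool:
    """Return True if a dotted version string is at least 0.6.0.

    Assumes a "major.minor.patch" string with an optional leading "v".
    Pre-release suffixes (e.g. "0.6.0rc1") are not handled in this sketch.
    """
    parts = builder_version.lstrip("v").split(".")
    numeric = [int(p) for p in parts[:3]]
    # Pad short versions such as "0.6" so that tuple comparison is fair.
    numeric += [0] * (3 - len(numeric))
    return tuple(numeric) >= (0, 6, 0)


print(supports_collections("0.5.3"))
print(supports_collections("0.6.0"))
```

Comparing integer tuples rather than raw strings avoids the usual pitfall where ``"0.10.0"`` sorts before ``"0.6.0"`` lexicographically.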

docs/data-access/external.rst

Lines changed: 16 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,16 @@
1+
Data Access via External Data Centers
2+
========================================================================================
3+
4+
Below are the access pages for external data centers that provide a large number of public LSDB catalogs. These links have more information about how to access the data, possible constraints, and contact information.
5+
6+
7+
Links
8+
----------------------------------------------------------------------------------------
9+
10+
- LIneA (Brazil) LSDB access guide: https://data.linea.org.br/en/index.html
11+
- CDS (France) LSDB access guide: https://vizcat.cds.unistra.fr/hats/
12+
13+
Notes
14+
----------------------------------------------------------------------------------------
15+
16+
- CDS uses an experimental HATS-on-the-fly serving infrastructure. It may not support all LSDB features and may be somewhat slower than other providers; however, eventually all catalogs served via VizieR should be available! More technical information can be found in `this presentation <http://wiki.ivoa.net/internal/IVOA/InterOpJune2025Apps/OnTheFlyHATS-FXPIneau-v1.pdf>`_.

docs/data-access/hats.rst

Lines changed: 99 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,99 @@
1+
HATS Catalog Structure and Performance
2+
========================================================================================
3+
4+
This page explains how HATS catalogs are laid out on disk and how that structure
5+
influences performance. It is a practical summary of the HATS technical
6+
note for users who want to understand how catalogs are organized and why certain
7+
operations are fast or slow.
8+
9+
For the full technical description, see the `IVOA HATS note <https://www.ivoa.net/documents/Notes/HATS/20250822/NOTE-hats-ivoa-1.0-20250822.html>`_.
10+
11+
Catalog Layout at a Glance
12+
----------------------------------------------------------------------------------------
13+
14+
A HATS catalog is a directory with:
15+
16+
- a hierarchical spatial partitioning based on HEALPix orders
17+
- Parquet data files for leaf partitions
18+
- optional supplemental tables, e.g., for cross-matching and indexing
19+
- metadata files that describe the catalog and its partitions
20+
21+
This layout lets LSDB read only the partitions that overlap your query, which is the
22+
main driver of performance.
23+
24+
Catalog Directory Structure
25+
----------------------------------------------------------------------------------------
26+
27+
HATS partitions the sky into a hierarchy of HEALPix pixels. Each pixel is mapped to
28+
a directory or file path that encodes its order and pixel index. Each leaf contains the Parquet data files. The directory structure is designed to:
29+
30+
- keep file sizes roughly uniform (adaptive tiling)
31+
- support parallel reads of independent pixels
32+
33+
Unlike a fixed grid, HATS adapts the pixel depth to local density. Dense regions
34+
are subdivided more deeply, while sparse regions stay at coarser orders. This
35+
balance keeps partitions at a manageable size and helps avoid hot spots during
36+
queries or cross-matches.
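
The idea of adaptive depth can be sketched with a simple top-down subdivision. This is a simplified illustration, not the algorithm used by any HATS import pipeline: it takes hypothetical row counts per pixel at a fine order and splits a pixel into its four children (``4 * p`` to ``4 * p + 3`` in the HEALPix nested scheme) whenever it holds more rows than a threshold.

```python
def adaptive_partitions(fine_counts, fine_order, threshold):
    """Return sorted (order, pixel, rows) leaves from per-pixel row counts.

    fine_counts maps a nested-scheme pixel index at fine_order to a row count.
    """

    def rows_in(order, pixel):
        # In the nested scheme, dropping two bits per level gives the ancestor.
        shift = 2 * (fine_order - order)
        return sum(n for p, n in fine_counts.items() if p >> shift == pixel)

    leaves = []
    stack = [(0, p) for p in range(12)]  # the 12 base pixels at order 0
    while stack:
        order, pixel = stack.pop()
        rows = rows_in(order, pixel)
        if rows == 0:
            continue  # empty sky: no partition is written
        if rows > threshold and order < fine_order:
            stack.extend((order + 1, 4 * pixel + i) for i in range(4))
        else:
            leaves.append((order, pixel, rows))
    return sorted(leaves)


# Made-up counts at order 2: a dense patch (pixels 0-2) and a sparse one.
counts = {0: 50, 1: 60, 2: 5, 100: 3}
print(adaptive_partitions(counts, fine_order=2, threshold=100))
```

In this toy input the dense patch ends up as order-2 partitions, while the sparse rows stay in a single order-0 partition.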
37+
38+
Data Files
39+
----------------------------------------------------------------------------------------
40+
41+
Leaf partitions contain Parquet files with catalog rows. The main advantages of Parquet storage are:
42+
43+
- column pruning (read only what you select)
44+
- predicate pushdown (filter rows without full scans)
45+
- efficient compression for large catalogs
46+
47+
Catalog Collections
48+
----------------------------------------------------------------------------------------
49+
50+
A catalog collection is a grouping of related datasets, typically the main catalog and its supplemental tables. Collections provide a consistent
51+
entry point for discovery and help a user to access these supplemental tables, some of which are described below. Collection metadata describes the members and any
52+
shared properties.
53+
54+
Supplemental Tables
55+
----------------------------------------------------------------------------------------
56+
57+
These additional tables can be used to improve performance:
58+
59+
- **Margin cache:** buffers pixel boundaries so spatial operations (especially
60+
cross-matching) do not miss sources near edges. If your dataset is not a catalog collection, you will need to provide a margin cache separately. See the :doc:`Margins documentation page </tutorials/margins>` for more details on why margin caches are important for cross-matching.
61+
- **Index tables:** allow quick lookups by an index column. A typical example is finding an object by its Object ID. Without an index table, these lookups are slow because they require a full scan of the dataset. An index table records which partitions contain which Object IDs, so only the relevant partitions need to be loaded.
62+
- **Association tables:** precomputed links between related catalogs to speed up
63+
multi-survey joins.
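
The role of the margin cache can be shown with a one-dimensional toy model. This is only an analogy under simplifying assumptions: positions lie on a line instead of the sphere, partitions are fixed-width bins, and all function names are hypothetical. Matching is done partition by partition, so a pair that straddles a boundary is found only if the neighbouring partition's edge objects are also available.

```python
def partition(points, width):
    """Assign each point to exactly one fixed-width bin."""
    parts = {}
    for x in points:
        parts.setdefault(int(x // width), []).append(x)
    return parts


def partition_with_margin(points, width, margin):
    """Like partition(), but also copy points into adjacent bins' margins."""
    parts = {}
    for x in points:
        home = int(x // width)
        for idx in (home - 1, home, home + 1):
            low, high = idx * width, (idx + 1) * width
            if low - margin <= x < high + margin:
                parts.setdefault(idx, []).append(x)
    return parts


def crossmatch(left_parts, right_parts, radius):
    """Match points within radius, looking only inside the same bin."""
    matches = []
    for idx, left in left_parts.items():
        for a in left:
            for b in right_parts.get(idx, []):
                if abs(a - b) <= radius:
                    matches.append((a, b))
    return sorted(matches)


left, right = [9.8, 15.0], [10.1, 15.2]
width, radius = 10.0, 0.5
print(crossmatch(partition(left, width), partition(right, width), radius))
print(crossmatch(partition(left, width),
                 partition_with_margin(right, width, radius), radius))
```

Without the margin, the pair ``(9.8, 10.1)`` is lost because its two members fall in different bins; with a margin at least as wide as the match radius, it is recovered.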
64+
65+
Skymaps and Coverage Files
66+
----------------------------------------------------------------------------------------
67+
68+
HATS catalogs may include sky coverage maps and other summary assets. These are
69+
used to quickly estimate coverage, data density, or overlap before reading data from the data leaves.
70+
71+
Metadata and Auxiliary Files
72+
----------------------------------------------------------------------------------------
73+
74+
Metadata files describe the catalog and its partitions. Common files include:
75+
76+
- ``hats.properties``: key/value fields describing the catalog and its version
77+
- ``partition_info.csv``: partition list with sizes and spatial info
78+
- ``dataset/_metadata`` and ``dataset/_common_metadata``: Parquet dataset-level metadata files.
79+
``_common_metadata`` typically contains only the shared schema (column names, dtypes, and logical types) for the dataset,
80+
while ``_metadata`` usually aggregates per-file / per-row-group metadata (e.g., statistics, row group locations, and encodings).
81+
- ``dataset/data_thumbnail.parquet``: small sample of data for quick inspection
82+
- ``collection.properties``: metadata for catalog collections
83+
84+
LSDB uses these files to plan queries, estimate cost, and decide which partitions need to be loaded.
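
For a feel of what these files contain, the sketch below parses a key/value properties file and a partition list using only the standard library. The file contents are made up for illustration: the property keys are hypothetical, and the ``Norder``/``Npix`` column names are an assumption about ``partition_info.csv``. In practice LSDB reads these files for you.

```python
import csv
import io

# Made-up file contents for illustration; real catalogs will differ.
SAMPLE_PROPERTIES = """\
# comment lines start with '#'
catalog_name=toy_catalog
total_rows=118
"""

SAMPLE_PARTITION_INFO = """\
Norder,Npix
0,6
2,0
2,1
2,2
"""


def parse_properties(text):
    """Parse simple key=value lines, skipping blanks and comments."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        props[key.strip()] = value.strip()
    return props


def read_partitions(text):
    """Return a list of (order, pixel) tuples from the partition list."""
    reader = csv.DictReader(io.StringIO(text))
    return [(int(row["Norder"]), int(row["Npix"])) for row in reader]


props = parse_properties(SAMPLE_PROPERTIES)
partitions = read_partitions(SAMPLE_PARTITION_INFO)
print(props["catalog_name"], len(partitions))
```

Both files are small compared with the data itself, which is why a reader can plan which partitions to open before touching any Parquet data.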
85+
86+
Performance Considerations
87+
----------------------------------------------------------------------------------------
88+
89+
90+
- **Partition count matters:** selecting and operating on larger parts of the sky means that more pixels need to be opened. If possible, use spatial filters to reduce pixel selection early.
91+
- **True random access is expensive:** random access to many rows that are scattered across the sky will be slow, especially over a network. This is because, even if only one row is needed from a given pixel, the entire pixel still needs to be downloaded and opened. Therefore, work on local data and/or try to design your access patterns to be as spatially coherent as possible.
92+
- **Column selection is critical:** Parquet column pruning is one of the biggest
93+
performance wins. Select only what you need.
94+
Column pruning is most effective when the storage backend supports efficient random reads (HTTP ``Range`` requests or S3 ranged ``GET``).
95+
If an HTTP endpoint does not support range reads, Parquet readers may be forced to download much larger parts of each file (up to the full file),
96+
reducing or eliminating the benefit of selecting a small subset of columns. Even when range reads are supported, many small range requests can be
97+
latency-bound; in practice S3 backends often sustain higher concurrency and throughput than generic HTTP servers.
98+
- **Metadata scans are not free:** even though initial catalog access does not load the actual data, it does read the metadata files and can be slow over a network, especially for catalogs with many partitions. The size of the metadata scales with the number of partitions, so catalogs with many small partitions will have larger metadata overhead. A local cache should reduce repeated downloads of metadata.
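
The range reads mentioned under column selection can be illustrated with the standard library. The sketch below only builds a request and never sends it; the URL and function names are hypothetical. It asks for the last bytes of a file, which is where a Parquet file keeps its footer metadata, using an HTTP suffix byte range.

```python
import urllib.request


def footer_range_request(url: str, n_bytes: int) -> urllib.request.Request:
    """Build (but do not send) a GET request for the last n_bytes of a file."""
    if n_bytes <= 0:
        raise ValueError("n_bytes must be positive")
    # "bytes=-N" is a suffix range: the final N bytes of the resource.
    return urllib.request.Request(url, headers={"Range": f"bytes=-{n_bytes}"})


def range_was_honored(status_code: int) -> bool:
    """206 Partial Content means the server returned only the requested range.

    A server that ignores the Range header typically answers 200 with the
    whole file, which is the slow case described above.
    """
    return status_code == 206


# Hypothetical URL, used only to show the shape of the request.
req = footer_range_request("https://example.org/catalog/part.parquet", 65536)
print(req.get_header("Range"))
```

Checking the status code of a small request like this is one way to find out whether an endpoint supports partial reads before planning a large download.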
99+
Lines changed: 0 additions & 8 deletions
Original file line numberDiff line numberDiff line change
@@ -6,14 +6,6 @@
66
"source": [
77
"# Accessing remote data\n",
88
"\n",
9-
"## Learning Objectives\n",
10-
"\n",
11-
"By the end of this tutorial, you will:\n",
12-
"\n",
13-
"* Be able to load catalogs from HTTP sources\n",
14-
"* Understand how to connect to other kinds of file systems\n",
15-
"\n",
16-
"## Introduction\n",
179
"\n",
1810
"If you're accessing HATS catalogs on a local file system, a typical path string like `\"/path/to/catalogs\"` will be sufficient. This tutorial will help you get started if you need to access data over HTTP/S, cloud storage, or have some additional parameters for connecting to your data.\n",
1911
"\n",

docs/data-access/server-lsdb.rst

Lines changed: 12 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,12 @@
1+
[Under Construction] server.lsdb.io Experimental Service
2+
========================================================================================
3+
4+
The LSDB server service provides server-side filtering access to public HATS catalogs.
5+
6+
Link
7+
----------------------------------------------------------------------------------------
8+
9+
- Server endpoint: https://server.lsdb.io
10+
11+
12+

docs/data-access/tap-lsdb.rst

Lines changed: 13 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,13 @@
1+
[Under Construction] tap.data.lsdb.io Experimental Service
2+
========================================================================================
3+
4+
The LSDB TAP service provides a TAP-compatible endpoint for catalog discovery and querying.
5+
6+
Link
7+
----------------------------------------------------------------------------------------
8+
9+
- TAP endpoint: https://tap.data.lsdb.io
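
As a sketch of what a TAP query looks like on the wire, the snippet below builds a synchronous query URL following the general TAP convention (a ``sync`` resource with ``REQUEST``, ``LANG``, and ``QUERY`` parameters). It does not contact the service. Since this service is still under construction, treating the endpoint above as the TAP base URL is an assumption, and the function name is hypothetical.

```python
from urllib.parse import urlencode


def tap_sync_url(base_url: str, adql: str) -> str:
    """Build a synchronous TAP query URL (not sent anywhere in this sketch)."""
    params = {"REQUEST": "doQuery", "LANG": "ADQL", "QUERY": adql}
    return f"{base_url.rstrip('/')}/sync?{urlencode(params)}"


# TAP_SCHEMA.tables is the standard TAP table that lists available tables.
url = tap_sync_url(
    "https://tap.data.lsdb.io",
    "SELECT table_name FROM TAP_SCHEMA.tables",
)
print(url)
```

In practice a client library would normally handle the request and parse the response; the URL form is shown here only to make the protocol less opaque.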
10+
11+
12+
13+

docs/index.rst

Lines changed: 9 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -39,6 +39,14 @@ Using this Guide
3939

4040
Learn the LSDB features by working through our guides
4141

42+
.. grid:: 1 1 1 1
43+
44+
.. grid-item-card:: Data Access
45+
:link: data-access
46+
:link-type: doc
47+
48+
How LSDB catalogs are served and accessed across providers
49+
4250
.. grid:: 1 1 2 2
4351

4452
.. grid-item-card:: API Reference
@@ -60,6 +68,7 @@ Using this Guide
6068
Home page <self>
6169
Getting Started <getting-started>
6270
Tutorials <tutorials>
71+
Data Access <data-access>
6372
API Reference <reference>
6473

6574
.. toctree::
