HATS Catalog Structure and Performance
========================================================================================

This page explains how HATS catalogs are laid out on disk and how that structure
influences performance. It is a practical summary of the HATS technical
note for users who want to understand how catalogs are organized and why certain
operations are fast or slow.

For the full technical description, see the `IVOA HATS note <https://www.ivoa.net/documents/Notes/HATS/20250822/NOTE-hats-ivoa-1.0-20250822.html>`_.

Catalog Layout at a Glance
----------------------------------------------------------------------------------------

A HATS catalog is a directory with:

- a hierarchical spatial partitioning based on HEALPix orders
- Parquet data files for leaf partitions
- optional supplemental tables, e.g., for cross-matching and indexing
- metadata files that describe the catalog and its partitions

This layout lets LSDB read only the partitions that overlap your query, which is the
main driver of performance.

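As a minimal sketch of how this plays out in practice, the example below loads a catalog
with LSDB and runs a cone search, so only the partitions overlapping the cone (and only
the requested columns) are touched. The URL and column names are placeholders, and the
exact function names and signatures depend on your LSDB version.

.. code-block:: python

    import lsdb

    # Hypothetical catalog URL and column names -- replace with a real HATS catalog.
    catalog = lsdb.read_hats(
        "https://example.org/hats/my_survey",
        columns=["ra", "dec", "mag"],
    )

    # Only partitions overlapping this cone are read when the result is computed.
    cone = catalog.cone_search(ra=45.0, dec=-30.0, radius_arcsec=3600.0)
    df = cone.compute()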

Catalog Directory Structure
----------------------------------------------------------------------------------------

HATS partitions the sky into a hierarchy of HEALPix pixels. Each pixel is mapped to
a directory or file path that encodes its order and pixel index. Each leaf contains
the Parquet data files. The directory structure is designed to:

- keep file sizes roughly uniform (adaptive tiling)
- support parallel reads of independent pixels

Unlike a fixed grid, HATS adapts the pixel depth to local density. Dense regions
are subdivided more deeply, while sparse regions stay at coarser orders. This
balance keeps partitions at a manageable size and helps avoid hot spots during
queries or cross-matches.

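To make the layout concrete, here is a small sketch of the kind of path a leaf partition
maps to. The ``Norder=K/Dir=D/Npix=N.parquet`` pattern, with ``Dir`` grouping pixels in
blocks of 10,000, follows the convention described in the HATS note; check your catalog
or the note itself for the authoritative naming.

.. code-block:: python

    def leaf_path(order: int, pixel: int) -> str:
        """Sketch of the partition path for a HEALPix (order, pixel) pair."""
        directory = (pixel // 10_000) * 10_000  # pixels grouped in blocks of 10,000
        return f"dataset/Norder={order}/Dir={directory}/Npix={pixel}.parquet"

    # e.g. a dense region subdivided down to order 7
    print(leaf_path(7, 19_300))  # dataset/Norder=7/Dir=10000/Npix=19300.parquet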

Data Files
----------------------------------------------------------------------------------------

Leaf partitions contain Parquet files with catalog rows. The main advantages of Parquet
storage (illustrated in the sketch after this list) are:

- column pruning (read only what you select)
- predicate pushdown (filter rows without full scans)
- efficient compression for large catalogs

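As a minimal sketch of the first two points, the snippet below reads a single leaf file
with pyarrow, selecting two columns and pushing a row filter down to the reader. The file
path and the ``ra``/``dec`` column names are placeholders.

.. code-block:: python

    import pyarrow.parquet as pq

    table = pq.read_table(
        "my_catalog/dataset/Norder=6/Dir=0/Npix=4801.parquet",  # placeholder leaf file
        columns=["ra", "dec"],         # column pruning: only these columns are read
        filters=[("dec", ">", 30.0)],  # predicate pushdown: skip row groups that cannot match
    )
    print(table.num_rows)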

Catalog Collections
----------------------------------------------------------------------------------------

A catalog collection is a grouping of related datasets, typically a main catalog together
with its supplemental tables. Collections provide a consistent entry point for discovery
and make it easy to access the supplemental tables, some of which are described below.
Collection metadata describes the members and any shared properties.

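Depending on your LSDB version, a collection can often be opened directly, in which case
the default margin cache (and index, when present) is picked up automatically. The sketch
below assumes an ``open_catalog``-style entry point and uses a hypothetical URL; check the
LSDB documentation for the exact call in your release.

.. code-block:: python

    import lsdb

    # Hypothetical collection URL; opening the collection root (rather than an
    # individual member) lets LSDB attach the default supplemental tables.
    catalog = lsdb.open_catalog("https://example.org/hats/my_survey_collection")
    print(catalog)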

Supplemental Tables
----------------------------------------------------------------------------------------

These additional tables can be used to improve performance:

- **Margin cache:** buffers pixel boundaries so spatial operations (especially
  cross-matching) do not miss sources near edges. If your dataset is not a catalog
  collection, you will need to provide a margin cache separately. See the
  :doc:`Margins documentation page </tutorials/margins>` for more details on why
  margin caches are important for cross-matching (a cross-match sketch follows this
  list).
- **Index tables:** map values of a column (typically an object ID) to the partitions
  that contain them, so a lookup such as "find the object with this ID" only loads the
  relevant partitions. Without an index table, such a lookup requires a full scan of
  the dataset.
- **Association tables:** precomputed links between related catalogs to speed up
  multi-survey joins.

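As a minimal sketch of the margin cache in action, the example below loads two catalogs
and cross-matches them, attaching a margin cache to the right-hand catalog so sources near
pixel edges are not missed. The URLs are placeholders, and whether ``margin_cache`` accepts
a path or a pre-loaded margin catalog depends on your LSDB version.

.. code-block:: python

    import lsdb

    left = lsdb.read_hats("https://example.org/hats/survey_a")  # placeholder URLs
    right = lsdb.read_hats(
        "https://example.org/hats/survey_b",
        margin_cache="https://example.org/hats/survey_b_10arcs",
    )

    # The margin cache lets the cross-match find counterparts that fall just
    # outside a pixel boundary of the right-hand catalog.
    matched = left.crossmatch(right, radius_arcsec=1.0)
    result = matched.compute()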

Skymaps and Coverage Files
----------------------------------------------------------------------------------------

HATS catalogs may include sky coverage maps and other summary assets. These are used to
quickly estimate coverage, data density, or overlap before reading data from the leaf
partitions.

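As a rough illustration of the kind of estimate these assets enable, the sketch below
computes the approximate sky fraction covered by a catalog from a made-up list of its
(order, pixel) partitions, without reading any data.

.. code-block:: python

    import math

    import healpy as hp

    # Made-up partition list: (HEALPix order, pixel index) pairs.
    partitions = [(5, 1200), (6, 4801), (6, 4802), (7, 19300)]

    covered_sr = sum(hp.nside2pixarea(hp.order2nside(order)) for order, _ in partitions)
    fraction = covered_sr / (4 * math.pi)
    print(f"Approximate coverage: {100 * fraction:.4f}% of the sky")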

Metadata and Auxiliary Files
----------------------------------------------------------------------------------------

Metadata files describe the catalog and its partitions. Common files include:

- ``hats.properties``: key/value fields describing the catalog and its version
- ``partition_info.csv``: partition list with sizes and spatial info
- ``dataset/_metadata`` and ``dataset/_common_metadata``: Parquet dataset-level metadata files.
  ``_common_metadata`` typically contains only the shared schema (column names, dtypes, and logical types) for the dataset,
  while ``_metadata`` usually aggregates per-file / per-row-group metadata (e.g., statistics, row group locations, and encodings).
- ``dataset/data_thumbnail.parquet``: small sample of data for quick inspection
- ``collection.properties``: metadata for catalog collections

LSDB uses these files to plan queries, estimate cost, and decide which partitions need to be loaded.

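These dataset-level files can also be inspected directly. The sketch below uses pyarrow to
read the shared schema and the aggregated row-group metadata without touching any data
rows; the base path is a placeholder.

.. code-block:: python

    import pyarrow.parquet as pq

    base = "path/to/my_catalog"  # placeholder catalog location

    schema = pq.read_schema(f"{base}/dataset/_common_metadata")  # shared column schema
    metadata = pq.read_metadata(f"{base}/dataset/_metadata")     # aggregated row-group metadata

    print(schema.names)
    print(f"{metadata.num_row_groups} row groups across {metadata.num_rows} rows")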

Performance Considerations
----------------------------------------------------------------------------------------

- **Partition count matters:** selecting and operating on larger parts of the sky means
  that more pixels need to be opened. If possible, apply spatial filters early to reduce
  the number of selected pixels.
- **True random access is expensive:** random access to many rows scattered across the
  sky is slow, especially over a network, because even if only one row is needed from a
  given pixel, the entire pixel still has to be downloaded and opened. Work on local data
  and/or design your access patterns to be as spatially coherent as possible (see the
  sketch after this list).
- **Column selection is critical:** Parquet column pruning is one of the biggest
  performance wins. Select only what you need.
  Column pruning is most effective when the storage backend supports efficient random reads (HTTP ``Range`` requests or S3 ranged ``GET``).
  If an HTTP endpoint does not support range reads, Parquet readers may be forced to download much larger parts of each file (up to the full file),
  reducing or eliminating the benefit of selecting a small subset of columns. Even when range reads are supported, many small range requests can be
  latency-bound; in practice S3 backends often sustain higher concurrency and throughput than generic HTTP servers.
- **Metadata scans are not free:** even though initial catalog access does not load the
  actual data, it does read the metadata files, which can be slow over a network,
  especially for catalogs with many partitions. The size of the metadata scales with the
  number of partitions, so catalogs with many small partitions have larger metadata
  overhead. A local cache should reduce repeated downloads of metadata.

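As a small sketch of what "spatially coherent" means in practice, the snippet below groups
target positions by the HEALPix pixel they fall in (here at an arbitrary order 6, NESTED
scheme), so each partition is visited once instead of repeatedly. The coordinates are made
up for the example.

.. code-block:: python

    import healpy as hp
    import numpy as np

    # Made-up target positions, in degrees.
    ra = np.array([10.1, 250.3, 10.2, 250.4])
    dec = np.array([-5.0, 30.0, -5.1, 30.1])

    order = 6
    pixels = hp.ang2pix(hp.order2nside(order), ra, dec, nest=True, lonlat=True)

    # Processing targets in pixel order keeps consecutive lookups inside the
    # same partition, instead of bouncing between distant parts of the sky.
    coherent = np.argsort(pixels)
    for i in coherent:
        print(f"target {i}: order {order}, pixel {pixels[i]}")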