
Commit 7b62495

Add Lines and Models to schema info
1 parent ed7e4c4 commit 7b62495


1 file changed

+37
-20
lines changed


tutorials/parquet-catalog-demos/euclid-hats-parquet.md

Lines changed: 37 additions & 20 deletions
@@ -30,19 +30,20 @@ By the end of this tutorial, you will:
3030
## 1. Introduction
3131

3232
The Collection includes a HATS Catalog (main data product), Margin Cache (10 arcsec), and Index Table (object_id).
33-
The Catalog includes the twelve Euclid Q1 tables listed below, joined on the column 'object_id' into a single Parquet dataset with 1,329 columns (one row per Euclid MER Object).
34-
Among them, Euclid has provided several different redshift measurements, several flux measurements for each Euclid band, and flux measurements for bands from several ground-based observatories -- in addition to morphological and other measurements.
35-
These were produced for different science goals using different algorithms and/or configurations.
33+
The Catalog includes the 14 Euclid Q1 tables listed below, joined on the column 'object_id' into a single Parquet dataset with 1,594 columns.
34+
There are 29,953,430 rows, one per Euclid MER Object, and the total data volume is 400 GB.
35+
The data includes several different redshift measurements, several flux measurements for each Euclid band, and flux measurements for bands from several ground-based observatories -- in addition to morphological and other measurements.
36+
Each was produced for different science goals using different algorithms and/or configurations.
3637

3738
Having all columns in the same dataset makes access convenient because the user doesn't have to make separate calls for data from different tables and/or join the results.
38-
However, figuring out which, e.g., flux measurements to use amongst so many can be challenging.
39+
However, figuring out which columns to use amongst so many can be challenging.
3940
In the sections below, we look at some of their distributions and reproduce figures from several papers in order to highlight some of the options and point out their differences.
4041
The Appendix explains how the columns in this Parquet dataset are named and organized.
4142
For more information about the meaning and provenance of a column, refer to the links provided with the list of tables below.
4243

4344
### 1.1 Euclid Q1 tables and docs
4445

45-
The Euclid Q1 HATS Catalog includes the following twelve Q1 tables, which are organized underneath the Euclid processing function (MER, PHZ, or SPE) that created it.
46+
The Euclid Q1 HATS Catalog includes the following 14 Q1 tables, which are organized underneath the Euclid processing function (MER, PHZ, or SPE) that created them.
4647
Links to the Euclid papers describing the processing functions are provided, as well as pointers for each table.
4748
Table names are linked to their original schemas.
4849

@@ -53,14 +54,16 @@ Table names are linked to their original schemas.
5354
- PHZ - [Euclid Collaboration: Tucci et al., 2025](https://arxiv.org/pdf/2503.15306) (hereafter, Tucci)
5455
- [phz](http://st-dm.pages.euclid-sgs.uk/data-product-doc/dmq1/phzdpd/dpcards/phz_phzpfoutputcatalog.html#photo-z-catalog) - Sec. 5 (phz_photo_z)
5556
- [class](http://st-dm.pages.euclid-sgs.uk/data-product-doc/dmq1/phzdpd/dpcards/phz_phzpfoutputforl3.html#classification-catalog) - Sec. 4 (phz_classification)
56-
- [physparam](http://st-dm.pages.euclid-sgs.uk/data-product-doc/dmq1/phzdpd/dpcards/phz_phzpfoutputforl3.html#physical-parameters-catalog) - Sec. 6 (6.1; phz_physical_parameters) _Notice that this is **galaxies** and uses a different algorithm._
57+
- [physparam](http://st-dm.pages.euclid-sgs.uk/data-product-doc/dmq1/phzdpd/dpcards/phz_phzpfoutputforl3.html#physical-parameters-catalog) - Sec. 6 (6.1; phz_physical_parameters) _Notice that this is for **galaxies**._
5758
- [galaxysed](http://st-dm.pages.euclid-sgs.uk/data-product-doc/dmq1/phzdpd/dpcards/phz_phzpfoutputcatalog.html#galaxy-sed-catalog) - App. B (B.1 phz_galaxy_sed)
5859
- [physparamqso](http://st-dm.pages.euclid-sgs.uk/data-product-doc/dmq1/phzdpd/dpcards/phz_phzpfoutputforl3.html#qso-physical-parameters-catalog) - Sec. 6 (6.2; phz_qso_physical_parameters)
5960
- [starclass](http://st-dm.pages.euclid-sgs.uk/data-product-doc/dmq1/phzdpd/dpcards/phz_phzpfoutputforl3.html#star-template) - Sec. 6 (6.3; phz_star_template)
6061
- [starsed](http://st-dm.pages.euclid-sgs.uk/data-product-doc/dmq1/phzdpd/dpcards/phz_phzpfoutputcatalog.html#star-sed-catalog) - App. B (B.1 phz_star_sed)
6162
- [physparamnir](http://st-dm.pages.euclid-sgs.uk/data-product-doc/dmq1/phzdpd/dpcards/phz_phzpfoutputforl3.html#nir-physical-parameters-catalog) - Sec. 6 (6.4; phz_nir_physical_parameters)
6263
- SPE - [Euclid Collaboration: Le Brun et al., 2025](https://arxiv.org/pdf/2503.15308) (hereafter, Le Brun)
6364
- [z](http://st-dm.pages.euclid-sgs.uk/data-product-doc/dmq1/spedpd/dpcards/spe_spepfoutputcatalog.html#redshift-catalog) - Sec. 2 (spectro_zcatalog_spe_quality, spectro_zcatalog_spe_classification, spectro_zcatalog_spe_galaxy_candidates, spectro_zcatalog_spe_star_candidates, and spectro_zcatalog_spe_qso_candidates)
65+
- [lines](http://st-dm.pages.euclid-sgs.uk/data-product-doc/dmq1/spedpd/dpcards/spe_spepfoutputcatalog.html#lines-catalog) HDU1 rows with SPE_LINE_NAME == Halpha only - Sec. 5 (spectro_line_features_catalog_spe_line_features_cat) _Notice that lines were identified assuming **galaxy** regardless of the classification._
66+
- [models](http://st-dm.pages.euclid-sgs.uk/data-product-doc/dmq1/spedpd/dpcards/spe_spepfoutputcatalog.html#models-catalog) HDU2 only - Sec. 5 (spectro_model_catalog_spe_lines_catalog)
6467

6568
See also:
6669

@@ -1035,19 +1038,32 @@ In the right panel (Galaxy), we see good agreement between the PDFs except at z=
10351038

10361039
## Appendix: Schema details
10371040

1038-
This Euclid Q1 HATS Catalog contains the twelve Euclid tables listed in the introduction, joined on 'object_id' into a single dataset.
1039-
In addition, the Euclid 'TILEID' for each object has been added, as well as a few HATS- and HEALPix-related columns.
1040-
All Euclid column names other than 'object_id' and 'tileid' have the table name prepended (e.g., 'DECLINATION' -> 'MER_DECLINATION').
1041-
In addition, all non-alphanumeric characters have been replaced with an underscore for compatibility with various libraries and services (e.g., 'E(B-V)' -> 'PHYSPARAMQSO_E_B_V_').
1042-
Finally, the (SPE) Z table required special handling, as follows:
1043-
1044-
The original FITS files for the Z table contain the spectroscopic redshift estimates for GALAXY_CANDIDATES, STAR_CANDIDATES, and QSO_CANDIDATES (in HDUs 3, 4, and 5 respectively) which required special handling to be included in this Parquet product.
1045-
There are up to 5 redshift estimates per 'object_id', per HDU.
1046-
For the Parquet, these were pivoted so that there is one row per 'object_id' in order to facilitate the table joins.
1047-
The resulting columns were named by combining the table name (Z), the HDU name, the original column name, and the rank of the given redshift estimate (i.e., the value in the original 'SPE_RANK' column).
1048-
For example, the 'SPE_PDF' column for the highest ranked redshift estimate in the 'GALAXY_CANDIDATES' table is called 'Z_GALAXY_CANDIDATES_SPE_PDF_RANK0'.
1049-
1050-
Here, we follow IRSA's
1041+
This Euclid Q1 HATS Catalog contains the 14 Euclid tables listed in the introduction, joined on 'object_id' into a single parquet dataset.
1042+
In addition, the Euclid 'tileid' for each object has been added, as well as a few HATS- and HEALPix-related columns.
1043+
All Euclid column names have been lower-cased and the table name has been prepended (e.g., 'FLUX_H_TEMPLFIT' -> 'mer_flux_h_templfit'), except for the following:
1044+
1045+
- object_id : Euclid MER Object ID. Unique identifier of a row in this dataset.
1046+
- tileid : ID of the Euclid Tile the object was detected in.
1047+
- ra : Right ascension. This is 'RIGHT_ASCENSION' from the 'mer' table. Name shortened to match other IRSA services.
1048+
- dec : Declination. This is 'DECLINATION' from the 'mer' table. Name shortened to match other IRSA services.
1049+
1050+
In addition to the above changes, all non-alphanumeric characters in column names have been replaced with an underscore for compatibility with various libraries and services (e.g., 'E(B-V)' -> 'physparamqso_e_b_v_').
1051+
Finally, the SPE tables 'z', 'lines', and 'models' required special handling as follows:
1052+
1053+
- z : The original FITS files contain the spectroscopic redshift estimates for GALAXY_CANDIDATES, STAR_CANDIDATES, and QSO_CANDIDATES (HDUs 3, 4, and 5 respectively) with up to 5 estimates per 'object_id', per HDU.
1054+
For the parquet dataset, these were pivoted so that there is one row per 'object_id' in order to facilitate the table joins.
1055+
The resulting columns were named by combining the table name (z), the HDU name, the original column name, and the rank of the given redshift estimate (i.e., the value in the original 'SPE_RANK' column).
1056+
For example, the 'SPE_PDF' column for the highest ranked redshift estimate in the 'GALAXY_CANDIDATES' table is called 'z_galaxy_candidates_spe_pdf_rank0'.
1057+
- lines : The parquet dataset only includes the rows from HDU1 with 'SPE_LINE_NAME' == 'Halpha'.
1058+
Similar to above, there are up to 5 sets of columns per 'object_id', one per redshift estimate.
1059+
Both the rank and the line name have been appended to the column names.
1060+
For example, the column originally called 'SPE_LINE_FLUX_GF' is named 'lines_spe_line_flux_gf_rank0_halpha' for the Halpha line identified with the highest ranked redshift estimate.
1061+
- models : The parquet dataset only includes HDU2 -- the model parameters for the galaxy solutions.
1062+
This table has the same structure as 'z'.
1063+
The column names are prefixed with the table name followed by 'galaxy', and the rank is appended.
1064+
For example, the column originally called 'SPE_VEL_DISP_E' is named 'models_galaxy_spe_vel_disp_e_rank0' for the velocity dispersion of emission lines needed to fit the highest ranked galaxy redshift estimate.
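
The naming rules described above can be illustrated with a minimal sketch. This is not the code used to build the catalog; the function names are hypothetical and the logic is inferred only from the examples given in this appendix. It does not cover the special cases 'object_id', 'tileid', 'ra', and 'dec'.

```python
import re


def parquet_column_name(table: str, original: str) -> str:
    """General rule (assumed): prepend the table name, lower-case,
    then replace every non-alphanumeric character with an underscore."""
    name = f"{table}_{original}".lower()
    # Underscores are kept; anything else that is not a-z or 0-9 becomes "_".
    return re.sub(r"[^0-9a-z_]", "_", name)


def ranked_column_name(table: str, hdu: str, original: str, rank: int) -> str:
    """Pattern seen for 'z' and 'models': table, HDU label, column, rank."""
    return parquet_column_name(table, f"{hdu}_{original}_rank{rank}")


def line_column_name(table: str, original: str, rank: int, line: str) -> str:
    """Pattern seen for 'lines': table, column, rank, line name."""
    return parquet_column_name(table, f"{original}_rank{rank}_{line}")


print(parquet_column_name("mer", "FLUX_H_TEMPLFIT"))
print(parquet_column_name("physparamqso", "E(B-V)"))
print(ranked_column_name("z", "GALAXY_CANDIDATES", "SPE_PDF", 0))
print(ranked_column_name("models", "galaxy", "SPE_VEL_DISP_E", 0))
print(line_column_name("lines", "SPE_LINE_FLUX_GF", 0, "Halpha"))
```

Each printed name reproduces one of the examples quoted in the text, which is a quick way to check a guessed column name before querying the dataset.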
1065+
1066+
Below, we follow IRSA's
10511067
[Cloud Access notebook](https://caltech-ipac.github.io/irsa-tutorials/tutorials/cloud_access/cloud-access-intro.html#navigate-a-catalog-and-perform-a-basic-query)
10521068
to inspect the parquet schema.
10531069

@@ -1066,6 +1082,7 @@ print(f"{len(schema)} columns total")
10661082
+++
10671083

10681084
To find all columns from a given table, search for column names that start with the table name followed by an underscore.
1085+
Table names are given in section 1.1.
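
The trailing underscore matters because some table names are prefixes of others (e.g., 'physparam' and 'physparamqso'). The following sketch shows the difference on a small, hand-written list of column names; 'physparam_example_col' is a hypothetical name used only for illustration, and `columns_from_table` is a hypothetical helper, not part of any library.

```python
# Toy stand-in for schema.names; most entries come from examples in this appendix.
columns = [
    "object_id",
    "physparam_example_col",  # hypothetical column name, for illustration only
    "physparamqso_e_b_v_",
    "z_galaxy_candidates_spe_pdf_rank0",
    "models_galaxy_spe_vel_disp_e_rank0",
]


def columns_from_table(names, table):
    """Return the names that start with the table name followed by an underscore."""
    prefix = f"{table}_"
    return [name for name in names if name.startswith(prefix)]


# With the underscore, only the physparam table is matched.
physparam_cols = columns_from_table(columns, "physparam")
# Without it, columns from physparamqso are picked up as well.
loose_match = [name for name in columns if name.startswith("physparam")]

print(physparam_cols)
print(loose_match)
```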
10691086

10701087
```{code-cell}
10711088
# Find all column names from the phz table.
@@ -1128,6 +1145,6 @@ schema.names[-5:]
11281145

11291146
**Authors:** Troy Raen (Developer; Caltech/IPAC-IRSA) and the IRSA Data Science Team.
11301147

1131-
**Updated:** 2025-06-29
1148+
**Updated:** 2025-06-30
11321149

11331150
**Contact:** [IRSA Helpdesk](https://irsa.ipac.caltech.edu/docs/help_desk.html) with questions or problems.
