Merge pull request #37 from Caltech-IPAC/raen/patch/neowise-add-yr11-and-addendum

bsipocz · web-flow · commit 17a8578812b3 · 2025-03-13T11:29:53.000-07:00
IRSA-6436 Add NEOWISE year11 and addendum
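
For reviewers, here is a minimal sketch of the labelling change this PR makes. `YEARS` used to hold integers and the path helper added the `year` prefix itself; `YEARS` now holds full directory labels (`year1` ... `year11`, plus `addendum`) and the helper interpolates them unchanged. The bucket, prefix, and `neowise_path` are copied from the diff below. The upper-case constant names and the final `print` calls are illustrative only, and the sketch builds path strings without contacting S3.

```python
# Bucket and prefix as they appear in the notebooks touched by this PR.
BUCKET = "nasa-irsa-wise"
BASE_PREFIX = "wise/neowiser/catalogs/p1bs_psd/healpix_k5"

# Labels are now full directory names rather than bare integers.
YEARS = [f"year{yr}" for yr in range(1, 12)] + ["addendum"]


def neowise_path(year, file="_metadata"):
    """Return the bucket-relative path of `file` within one yearly Parquet dataset."""
    root_dir = f"{BUCKET}/{BASE_PREFIX}/{year}/neowiser-healpix_k5-{year}.parquet"
    return f"{root_dir}/{file}"


# One _metadata path per dataset: eleven yearly datasets plus the addendum.
metadata_paths = [neowise_path(year) for year in YEARS]
print(len(metadata_paths))
print(metadata_paths[-1])
```

Because the helper no longer prepends `year`, any subset of `YEARS` chosen by the user has to use the same string labels, for example `["year1", "year8"]`.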
diff --git a/tutorials/parquet-catalog-demos/neowise-source-table-lightcurves.md b/tutorials/parquet-catalog-demos/neowise-source-table-lightcurves.md
@@ -11,6 +11,11 @@ kernelspec:
   name: python3
 ---
 
+An executed version of this notebook can be seen on
+[IRSA's website](https://irsa.ipac.caltech.edu/docs/notebooks/neowise-source-table-lightcurves.html).
+
++++
+
 # Make Light Curves from NEOWISE Single-exposure Source Table
 
 +++
@@ -31,7 +36,7 @@ Learning Goals:
 This notebook loads light curves from the
 [NEOWISE](https://irsa.ipac.caltech.edu/Missions/wise.html) Single-exposure Source Table
 for a sample of about 2000 cataclysmic variables from [Downes et al. (2001)](https://doi.org/10.1086/320802).
-The NEOWISE Single-exposure Source Table is a very large catalog -- 10 years and 40 terabytes in total
+The NEOWISE Single-exposure Source Table is a very large catalog -- 11 years and 42 terabytes in total
 with 145 columns and 200 billion rows.
 When searching this catalog, it is important to consider the requirements of your use case and
 the format of this dataset.
@@ -57,7 +62,7 @@ The specific strategy we employ is:
 
 The efficiency of this method will increase with the number of rows needed from each partition.
 For example, a cone search radius of 1 arcsec will require about 10 CPUs, 65G RAM, and
-50 minutes to load the data from all 10 NEOWISE years.
+50 minutes to load the data from all 11 NEOWISE years.
 Increasing the radius to 10 arcsec will return about 2.5x more rows using roughly the same resources.
 Increasing the target sample size can result in similar efficiency gains.
 To try out this notebook with fewer resources, use a subset of NEOWISE years.
@@ -104,7 +109,8 @@ Real use cases are likely to require all ten years but it can be helpful to star
 fewer while exploring to make things run faster.
 
 ```{code-cell} ipython3
-YEARS = list(range(1, 11))  # all years => about 11 CPU, 65G RAM, and 50 minutes runtime
+# all years => about 11 CPU, 65G RAM, and 50 minutes runtime
+YEARS = [f"year{yr}" for yr in range(1, 12)] + ["addendum"]
 
 # To try out a smaller version of the notebook,
 # uncomment the next line and choose your own subset of years.
@@ -136,7 +142,7 @@ We'll load it as a pyarrow dataset.
 bucket = "nasa-irsa-wise"
 base_prefix = "wise/neowiser/catalogs/p1bs_psd/healpix_k5"
 metadata_path = (
-    lambda yr: f"{bucket}/{base_prefix}/year{yr}/neowiser-healpix_k5-year{yr}.parquet/_metadata"
+    lambda yr: f"{bucket}/{base_prefix}/{yr}/neowiser-healpix_k5-{yr}.parquet/_metadata"
 )
 fs = pyarrow.fs.S3FileSystem(region="us-west-2", anonymous=True)
 
@@ -461,6 +467,6 @@ This has to do with differences in what does / does not get copied into the chil
 
 **Author:** Troy Raen (IRSA Developer) and the IPAC Science Platform team
 
-**Updated:** 2024-08-08
+**Updated:** 2025-03-07
 
 **Contact:** [the IRSA Helpdesk](https://irsa.ipac.caltech.edu/docs/help_desk.html) with questions or reporting problems.
diff --git a/tutorials/parquet-catalog-demos/neowise-source-table-strategies.md b/tutorials/parquet-catalog-demos/neowise-source-table-strategies.md
@@ -11,14 +11,19 @@ kernelspec:
   name: python3
 ---
 
+An executed version of this notebook can be seen on
+[IRSA's website](https://irsa.ipac.caltech.edu/docs/notebooks/neowise-source-table-strategies.html).
+
++++
+
 # Strategies to Efficiently Work with NEOWISE Single-exposure Source Table in Parquet
 
 +++
 
 This notebook discusses strategies for working with the Apache Parquet version of the
 [NEOWISE](https://irsa.ipac.caltech.edu/Missions/wise.html) Single-exposure Source Table
 and provides the basic code needed for each approach.
-This is a very large catalog -- 10 years and 40 terabytes in total with 145 columns and 200 billion rows.
+This is a very large catalog -- 11 years and 42 terabytes in total with 145 columns and 200 billion rows.
 Most of the work shown in this notebook is how to efficiently deal with so much data.
 
 Learning Goals:
@@ -34,7 +39,7 @@ Learning Goals:
 
 +++
 
-The NEOWISE Single-exposure Source Table comprises 10 years of data.
+The NEOWISE Single-exposure Source Table comprises 11 years of data.
 Each year on its own would be considered "large" compared to astronomy catalogs produced
 contemporaneously, so working with the full dataset requires extra consideration.
 In this Parquet version, each year is stored as an independent Parquet dataset.
@@ -139,11 +144,11 @@ Expect the notebook to require about 4G RAM and 1 minute of runtime per year.
 
 ```{code-cell} ipython3
 # All NEOWISE years => about 40G RAM and 10 minutes runtime
-YEARS = list(range(1, 11))
+YEARS = [f"year{yr}" for yr in range(1, 12)] + ["addendum"]
 
 # To reduce the needed RAM or runtime, uncomment the next line and choose your own years.
-# Years 1 and 9 are needed for the median_file and biggest_file (defined below).
-# YEARS = [1, 9]
+# Years 1 and 8 are needed for the median_file and biggest_file (defined below).
+# YEARS = ["year1", "year8"]
 ```
 
 Column and partition variables:
@@ -181,20 +186,20 @@ def neowise_path(year, file="_metadata"):
     # This information can be found at https://irsa.ipac.caltech.edu/cloud_access/.
     bucket = "nasa-irsa-wise"
     base_prefix = "wise/neowiser/catalogs/p1bs_psd/healpix_k5"
-    root_dir = f"{bucket}/{base_prefix}/year{year}/neowiser-healpix_k5-year{year}.parquet"
+    root_dir = f"{bucket}/{base_prefix}/{year}/neowiser-healpix_k5-{year}.parquet"
     return f"{root_dir}/{file}"
 ```
 
 Some representative partitions and files (see dataset stats in the Appendix for how we determine these values):
 
 ```{code-cell} ipython3
 # pixel index of the median partition and the biggest partition by number of rows
-median_part = 10_936
+median_part = 11_831
 biggest_part = 8_277
 
 # path to the median file and the biggest file by file size on disk (see Appendix)
-median_file = neowise_path(9, "healpix_k0=3/healpix_k5=3420/part0.snappy.parquet")
-biggest_file = neowise_path(1, "healpix_k0=2/healpix_k5=2551/part0.snappy.parquet")
+median_file = neowise_path("year8", "healpix_k0=1/healpix_k5=1986/part0.snappy.parquet")
+biggest_file = neowise_path("year1", "healpix_k0=2/healpix_k5=2551/part0.snappy.parquet")
 ```
 
 Convenience function for displaying a table size:
@@ -400,7 +405,7 @@ for year, year_ds in zip(YEARS, neowise_ds.children):
     # we'll just look at some basic metadata.
     num_rows = sum(frag.metadata.num_rows for frag in year_ds.get_fragments())
     num_files = len(year_ds.files)
-    print(f"NEOWISE year {year} dataset: {num_rows:,} rows in {num_files:,} files")
+    print(f"NEOWISE {year} dataset: {num_rows:,} rows in {num_files:,} files")
 ```
 
 ## Appendix
@@ -559,6 +564,6 @@ per_part.sort_values("numrows").iloc[len(per_part.index) // 2]
 
 **Author:** Troy Raen (IRSA Developer) and the IPAC Science Platform team
 
-**Updated:** 2024-08-08
+**Updated:** 2025-03-07
 
 **Contact:** [the IRSA Helpdesk](https://irsa.ipac.caltech.edu/docs/help_desk.html) with questions or reporting problems.