@@ -11,14 +11,19 @@ kernelspec:
   name: python3
 ---

+An executed version of this notebook can be seen on
+[IRSA's website](https://irsa.ipac.caltech.edu/docs/notebooks/neowise-source-table-strategies.html).
+
++++
+
 # Strategies to Efficiently Work with NEOWISE Single-exposure Source Table in Parquet

 +++

 This notebook discusses strategies for working with the Apache Parquet version of the
 [NEOWISE](https://irsa.ipac.caltech.edu/Missions/wise.html) Single-exposure Source Table
 and provides the basic code needed for each approach.
-This is a very large catalog -- 10 years and 40 terabytes in total with 145 columns and 200 billion rows.
+This is a very large catalog -- 11 years and 42 terabytes in total with 145 columns and 200 billion rows.
 Most of the work shown in this notebook is how to efficiently deal with so much data.

 Learning Goals:
@@ -34,7 +39,7 @@ Learning Goals:

 +++

-The NEOWISE Single-exposure Source Table comprises 10 years of data.
+The NEOWISE Single-exposure Source Table comprises 11 years of data.
 Each year on its own would be considered "large" compared to astronomy catalogs produced
 contemporaneously, so working with the full dataset requires extra consideration.
 In this Parquet version, each year is stored as an independent Parquet dataset.
@@ -139,11 +144,11 @@ Expect the notebook to require about 4G RAM and 1 minute of runtime per year.

 ```{code-cell} ipython3
 # All NEOWISE years => about 40G RAM and 10 minutes runtime
-YEARS = list(range(1, 11))
+YEARS = [f"year{yr}" for yr in range(1, 12)] + ["addendum"]

 # To reduce the needed RAM or runtime, uncomment the next line and choose your own years.
-# Years 1 and 9 are needed for the median_file and biggest_file (defined below).
-# YEARS = [1, 9]
+# Years 1 and 8 are needed for the median_file and biggest_file (defined below).
+# YEARS = ["year1", "year8"]
 ```

 Column and partition variables:
@@ -181,20 +186,20 @@ def neowise_path(year, file="_metadata"):
     # This information can be found at https://irsa.ipac.caltech.edu/cloud_access/.
     bucket = "nasa-irsa-wise"
     base_prefix = "wise/neowiser/catalogs/p1bs_psd/healpix_k5"
-    root_dir = f"{bucket}/{base_prefix}/year{year}/neowiser-healpix_k5-year{year}.parquet"
+    root_dir = f"{bucket}/{base_prefix}/{year}/neowiser-healpix_k5-{year}.parquet"
     return f"{root_dir}/{file}"
 ```

 Some representative partitions and files (see dataset stats in the Appendix for how we determine these values):

 ```{code-cell} ipython3
 # pixel index of the median partition and the biggest partition by number of rows
-median_part = 10_936
+median_part = 11_831
 biggest_part = 8_277

 # path to the median file and the biggest file by file size on disk (see Appendix)
-median_file = neowise_path(9, "healpix_k0=3/healpix_k5=3420/part0.snappy.parquet")
-biggest_file = neowise_path(1, "healpix_k0=2/healpix_k5=2551/part0.snappy.parquet")
+median_file = neowise_path("year8", "healpix_k0=1/healpix_k5=1986/part0.snappy.parquet")
+biggest_file = neowise_path("year1", "healpix_k0=2/healpix_k5=2551/part0.snappy.parquet")
 ```

 Convenience function for displaying a table size:
@@ -400,7 +405,7 @@ for year, year_ds in zip(YEARS, neowise_ds.children):
     # we'll just look at some basic metadata.
     num_rows = sum(frag.metadata.num_rows for frag in year_ds.get_fragments())
     num_files = len(year_ds.files)
-    print(f"NEOWISE year {year} dataset: {num_rows:,} rows in {num_files:,} files")
+    print(f"NEOWISE {year} dataset: {num_rows:,} rows in {num_files:,} files")
 ```

 ## Appendix
@@ -559,6 +564,6 @@ per_part.sort_values("numrows").iloc[len(per_part.index) // 2]

 **Author:** Troy Raen (IRSA Developer) and the IPAC Science Platform team

-**Updated:** 2024-08-08
+**Updated:** 2025-03-07

 **Contact:** [the IRSA Helpdesk](https://irsa.ipac.caltech.edu/docs/help_desk.html) with questions or reporting problems.
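
Review note: the path-related changes in this diff can be exercised in isolation. The sketch below copies `YEARS` and `neowise_path` from the added lines and only formats strings; it makes no request to the bucket, so it does not confirm that the objects exist. It is meant to illustrate one consequence of the change: since `neowise_path` no longer prepends `year`, each entry of `YEARS` presumably has to be the full directory label (`"year8"`, `"addendum"`) rather than a bare integer.

```python
# Copied from the added lines of the diff; string formatting only, no network access.
YEARS = [f"year{yr}" for yr in range(1, 12)] + ["addendum"]


def neowise_path(year, file="_metadata"):
    bucket = "nasa-irsa-wise"
    base_prefix = "wise/neowiser/catalogs/p1bs_psd/healpix_k5"
    root_dir = f"{bucket}/{base_prefix}/{year}/neowiser-healpix_k5-{year}.parquet"
    return f"{root_dir}/{file}"


# Twelve datasets in total: year1 through year11, plus the addendum.
print(len(YEARS), YEARS[0], YEARS[-1])

# A string label produces the directory layout that the diff encodes.
print(neowise_path("year8"))

# A bare integer (the pre-change convention) no longer yields a "year8" directory.
print(neowise_path(8))
```

Comparing the last two printed paths shows the difference: the integer form yields `.../healpix_k5/8/neowiser-healpix_k5-8.parquet/_metadata`, which does not match the `year8` naming used elsewhere in the diff.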