@@ -11,14 +11,19 @@ kernelspec:
    name: python3
  ---

+ An executed version of this notebook can be seen on
+ [IRSA's website](https://irsa.ipac.caltech.edu/docs/notebooks/neowise-source-table-strategies.html).
+
+ +++
+
  # Strategies to Efficiently Work with NEOWISE Single-exposure Source Table in Parquet

  +++

  This notebook discusses strategies for working with the Apache Parquet version of the
  [NEOWISE](https://irsa.ipac.caltech.edu/Missions/wise.html) Single-exposure Source Table
  and provides the basic code needed for each approach.
- This is a very large catalog -- 10 years and 40 terabytes in total with 145 columns and 200 billion rows.
+ This is a very large catalog -- 11 years and 42 terabytes in total with 145 columns and 200 billion rows.
  Most of the work shown in this notebook is how to efficiently deal with so much data.

  Learning Goals:
@@ -34,7 +39,7 @@ Learning Goals:

  +++

- The NEOWISE Single-exposure Source Table comprises 10 years of data.
+ The NEOWISE Single-exposure Source Table comprises 11 years of data.
  Each year on its own would be considered "large" compared to astronomy catalogs produced
  contemporaneously, so working with the full dataset requires extra consideration.
  In this Parquet version, each year is stored as an independent Parquet dataset.
@@ -139,11 +144,11 @@ Expect the notebook to require about 4G RAM and 1 minute of runtime per year.

  ```{code-cell} ipython3
  # All NEOWISE years => about 40G RAM and 10 minutes runtime
- YEARS = list(range(1, 11))
+ YEARS = [f"year{yr}" for yr in range(1, 12)] + ["addendum"]

  # To reduce the needed RAM or runtime, uncomment the next line and choose your own years.
- # Years 1 and 9 are needed for the median_file and biggest_file (defined below).
- # YEARS = [1, 9]
+ # Years 1 and 8 are needed for the median_file and biggest_file (defined below).
+ # YEARS = ["year1", "year8"]
  ```

  Column and partition variables:
@@ -181,20 +186,20 @@ def neowise_path(year, file="_metadata"):
      # This information can be found at https://irsa.ipac.caltech.edu/cloud_access/.
      bucket = "nasa-irsa-wise"
      base_prefix = "wise/neowiser/catalogs/p1bs_psd/healpix_k5"
-     root_dir = f"{bucket}/{base_prefix}/year{year}/neowiser-healpix_k5-year{year}.parquet"
+     root_dir = f"{bucket}/{base_prefix}/{year}/neowiser-healpix_k5-{year}.parquet"
      return f"{root_dir}/{file}"
  ```

  Some representative partitions and files (see dataset stats in the Appendix for how we determine these values):

  ```{code-cell} ipython3
  # pixel index of the median partition and the biggest partition by number of rows
- median_part = 10_936
+ median_part = 11_831
  biggest_part = 8_277

  # path to the median file and the biggest file by file size on disk (see Appendix)
- median_file = neowise_path(9, "healpix_k0=3/healpix_k5=3420/part0.snappy.parquet")
- biggest_file = neowise_path(1, "healpix_k0=2/healpix_k5=2551/part0.snappy.parquet")
+ median_file = neowise_path("year8", "healpix_k0=1/healpix_k5=1986/part0.snappy.parquet")
+ biggest_file = neowise_path("year1", "healpix_k0=2/healpix_k5=2551/part0.snappy.parquet")
  ```

  Convenience function for displaying a table size:
@@ -400,7 +405,7 @@ for year, year_ds in zip(YEARS, neowise_ds.children):
      # we'll just look at some basic metadata.
      num_rows = sum(frag.metadata.num_rows for frag in year_ds.get_fragments())
      num_files = len(year_ds.files)
-     print(f"NEOWISE year {year} dataset: {num_rows:,} rows in {num_files:,} files")
+     print(f"NEOWISE {year} dataset: {num_rows:,} rows in {num_files:,} files")
  ```

  ## Appendix
@@ -559,6 +564,6 @@ per_part.sort_values("numrows").iloc[len(per_part.index) // 2]

  **Author:** Troy Raen (IRSA Developer) and the IPAC Science Platform team

- **Updated:** 2024-08-08
+ **Updated:** 2025-03-07

  **Contact:** [the IRSA Helpdesk](https://irsa.ipac.caltech.edu/docs/help_desk.html) with questions or to report problems.
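
One way to sanity-check the path change in this diff without touching the bucket is to rebuild the helper locally and print what it produces. The sketch below copies the post-change `neowise_path` body and `YEARS` definition from the hunks above. It assumes the "addendum" dataset follows the same directory naming as the yearly ones, which the diff implies but does not show. The `paths` variable is introduced here for illustration only.

```python
# Minimal sketch: mirrors the post-change helper from the diff; no network access needed.
def neowise_path(year, file="_metadata"):
    bucket = "nasa-irsa-wise"
    base_prefix = "wise/neowiser/catalogs/p1bs_psd/healpix_k5"
    # `year` is now the full directory label (e.g. "year1" or "addendum"), not an integer.
    root_dir = f"{bucket}/{base_prefix}/{year}/neowiser-healpix_k5-{year}.parquet"
    return f"{root_dir}/{file}"


YEARS = [f"year{yr}" for yr in range(1, 12)] + ["addendum"]

# Hypothetical check: build the _metadata path for every dataset label.
paths = [neowise_path(year) for year in YEARS]
print(len(paths), "datasets")
print(paths[0])
print(paths[-1])
```

Because the helper now interpolates the label directly, any caller that overrides `YEARS` needs to pass strings such as `"year1"` rather than bare integers. An integer would yield a path segment like `/1/`, which does not match the `year1` naming used by `median_file` and `biggest_file`.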